We use cookies to ensure that we give you the best experience on our website. By continuing to browse this repository, you give consent for essential cookies to be used. You can read more about our Privacy and Cookie Policy.

Durham Research Online
You are in:

Entropy-based automated wrapper generation for weblog data extraction.

Gkotsis, George and Stepanyan, Karen and Cristea, A. I. and Joy, Mike (2013) 'Entropy-based automated wrapper generation for weblog data extraction.', World Wide Web., 17 (4). 827-846 .


This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a collection of weblogs reporting a prediction accuracy of 89 %. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.

Item Type:Article
Full text:(AM) Accepted Manuscript
Download PDF
Publisher Web site:
Publisher statement:The final publication is available at Springer via
Date accepted:04 November 2013
Date deposited:31 July 2018
Date of first online publication:21 November 2013
Date first made open access:No date available

Save or Share this output

Look up in GoogleScholar