Entropy-based automated wrapper generation for weblog data extraction

Gkotsis, George; Stepanyan, Karen; Cristea, A.I.; Joy, Mike

Authors

George Gkotsis

Karen Stepanyan

A.I. Cristea

Mike Joy



Abstract

This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model, conducted on a collection of weblogs, reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
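
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the paper's entropy-based model: a simple match frequency across posts stands in for the probabilistic scoring, and a "rule" is reduced to a tag name plus its class attribute. The names `candidate_rules` and `score_rules`, the rule format, and the sample posts are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped so they
# do not capture the text that follows them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _TextByElement(HTMLParser):
    """Collect (rule, text) pairs; a rule is 'tag' or 'tag.class' (hypothetical format)."""

    def __init__(self):
        super().__init__()
        self._stack = []  # open elements as (tag, rule)
        self.found = []   # (rule, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class") or ""
        self._stack.append((tag, f"{tag}.{cls}" if cls else tag))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates unclosed tags).
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._stack:
            self.found.append((self._stack[-1][1], text))


def candidate_rules(html, feed_value):
    """Return the rules whose element text equals the value taken from the web feed."""
    parser = _TextByElement()
    parser.feed(html)
    target = " ".join(feed_value.split())
    return {rule for rule, text in parser.found if text == target}


def score_rules(posts):
    """posts: list of (html, feed_value). Returns {rule: fraction of posts it matched}."""
    hits = Counter()
    for html, value in posts:
        hits.update(candidate_rules(html, value))
    return {rule: n / len(posts) for rule, n in hits.items()}


# Illustrative data only: two posts and the title each one has in the feed.
posts = [
    ('<html><body><h1 class="entry-title">First post</h1>'
     '<p>First post</p><p>Body text</p></body></html>', "First post"),
    ('<html><body><h1 class="entry-title">Second post</h1>'
     '<p>Other text</p></body></html>', "Second post"),
]
scores = score_rules(posts)
best_rule = max(scores, key=scores.get)
```

In this toy data the title also appears in a paragraph of the first post, so a single post would leave the rule ambiguous. Scoring across several posts favours the rule that matches consistently, which is the intuition behind matching feed values against multiple posts rather than comparing posts pairwise.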

Citation

Gkotsis, G., Stepanyan, K., Cristea, A., & Joy, M. (2013). Entropy-based automated wrapper generation for weblog data extraction. World Wide Web, 17(4), 827-846. https://doi.org/10.1007/s11280-013-0269-6

Journal Article Type: Article
Acceptance Date: Nov 4, 2013
Online Publication Date: Nov 21, 2013
Publication Date: Nov 21, 2013
Deposit Date: Jul 11, 2018
Publicly Available Date: Jul 31, 2018
Journal: World Wide Web
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
Publisher: Springer
Peer Reviewed: Peer Reviewed
Volume: 17
Issue: 4
Pages: 827-846
DOI: https://doi.org/10.1007/s11280-013-0269-6
Related Public URLs: http://wrap.warwick.ac.uk/61827/
