Gkotsis, George and Stepanyan, Karen and Cristea, A. I. and Joy, Mike (2013) 'Entropy-based automated wrapper generation for weblog data extraction', World Wide Web, 17 (4), 827-846.
Abstract
This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and the processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
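The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper scores candidate rules with an entropy-based, probabilistic model, whereas this example simply counts how often each element path holds the feed value across posts. The names `PathTextCollector`, `paths_and_text`, `derive_rule`, and `example_posts` are hypothetical, and a path is assumed to be the chain of tag names (with the `class` attribute appended where present).

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Record an (element path, text) pair for every text node in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.pairs.append(("/".join(self.stack), text))


def paths_and_text(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


def derive_rule(posts):
    """Given (feed_value, html) pairs, return the element path that most
    often contains the feed value, plus the fraction of posts supporting it.
    Assumption: a plain vote count stands in for the paper's scoring."""
    votes = Counter()
    total = 0
    for feed_value, html in posts:
        total += 1
        target = " ".join(feed_value.split()).lower()
        matched = {p for p, t in paths_and_text(html) if t.lower() == target}
        votes.update(matched)  # each path votes at most once per post
    if not votes:
        return None, 0.0
    path, count = votes.most_common(1)[0]
    return path, count / total


# Hypothetical data: post titles as they would appear in a web feed,
# paired with the HTML of the corresponding post pages.
example_posts = [
    ("Hello World",
     "<html><head><title>Hello World | My Blog</title></head><body>"
     "<div class='post'><h2 class='entry-title'>Hello World</h2>"
     "<p>Body text</p></div></body></html>"),
    ("Second Post",
     "<html><body><div class='post'><h2 class='entry-title'>Second Post</h2>"
     "<p>Second Post</p></div></body></html>"),
    ("Third",
     "<html><body><div class='post'><h2 class='entry-title'>Third</h2>"
     "<p>More text</p></div></body></html>"),
]

rule, support = derive_rule(example_posts)
```

Aggregating over several posts is what makes the rule robust here: in the second post the title also appears in a paragraph, but that path is outvoted by the heading path that matches in every post.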
Item Type: | Article |
---|---|
Full text: | (AM) Accepted Manuscript, PDF (1641Kb) |
Status: | Peer-reviewed |
Publisher Web site: | http://dx.doi.org/10.1007/s11280-013-0269-6 |
Publisher statement: | The final publication is available at Springer via https://doi.org/10.1007/s11280-013-0269-6 |
Date accepted: | 04 November 2013 |
Date deposited: | 31 July 2018 |
Date of first online publication: | 21 November 2013 |
Date first made open access: | No date available |