Gkotsis, George and Stepanyan, Karen and Cristea, A. I. and Joy, Mike (2013) 'Entropy-based automated wrapper generation for weblog data extraction', World Wide Web, 17 (4), 827-846.
This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and the processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
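The feed-to-HTML matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: simple majority voting over element paths stands in for the paper's probabilistic, entropy-based rule scoring, which the abstract does not detail. The names `PathCollector`, `derive_rule`, `apply_rule`, and the sample posts are all hypothetical, and a "rule" is assumed here to be just a tuple of `tag.class` steps from the document root.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record (path, text) for every text node; path is a tuple of
    'tag.class' steps from the root down to the enclosing element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void elements never enclose text
            return
        cls = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append((tuple(self.stack), text))


def collect_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def derive_rule(posts, field):
    """posts: list of (feed_entry, html) pairs. Each post votes once for every
    path whose text equals the feed value; return the top path and its support."""
    votes = Counter()
    for entry, html in posts:
        target = " ".join(entry[field].split())
        votes.update({path for path, text in collect_nodes(html) if text == target})
    if not votes:
        return None, 0.0
    path, count = votes.most_common(1)[0]
    return path, count / len(posts)


def apply_rule(html, path):
    """Return the text of the first node found at the given path, or None."""
    for node_path, text in collect_nodes(html):
        if node_path == path:
            return text
    return None


def page(title, extra=""):
    # Hypothetical weblog template used only to produce sample data.
    return (f"<html><head><title>{title} | My Blog</title></head><body>"
            f'<h1 class="entry-title">{title}</h1>'
            f'<div class="content">Body text<br>more text</div>{extra}'
            "</body></html>")


# Feed entries paired with the HTML of the posts they describe (sample data).
POSTS = [
    ({"title": "First post"}, page("First post")),
    ({"title": "Second post"}, page("Second post")),
    ({"title": "Third post"},
     page("Third post", '<ul class="recent"><li>Third post</li></ul>')),
]

rule, support = derive_rule(POSTS, "title")
extracted = apply_rule(page("Unseen post"), rule)
print(rule, support, extracted)
```

Voting across several posts is what lets the sketch discard coincidental matches, such as the sidebar entry in the third sample post, without comparing posts pairwise. Once derived, the rule can be applied to posts that no longer appear in the feed.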
|Full text:||(AM) Accepted Manuscript|
|Publisher Web site:||http://dx.doi.org/10.1007/s11280-013-0269-6|
|Publisher statement:||The final publication is available at Springer via https://doi.org/10.1007/s11280-013-0269-6|
|Date accepted:||04 November 2013|
|Date deposited:||31 July 2018|
|Date of first online publication:||21 November 2013|
|Date first made open access:||No date available|