Web content extraction based on maximum continuous sum of text density

Kai Sun, Miao Li, Sha Fu, Lei Chen, Zhengxin Yang, Jinhua Du, Yi Gao

doi:10.1109/ialp.2016.7875988

Abstract

Different websites generally have different page structures, which heavily affects extraction quality when web content is collected automatically. Based on a statistical analysis of the content features and structural characteristics of news-domain web pages, this paper proposes a maximum continuous sum of text density (MCSTD) method to extract web content efficiently and effectively from different web pages. First, the web pages are preprocessed, and the text density of the texts is then calculated. Finally, the web content is extracted using the proposed MCSTD method. Experimental results show that the extraction precision is over 95%, and that the proposed approach is more efficient and easier to implement than traditional models. The method has also been applied to the construction of comparable corpora from the extracted web resources.
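The abstract names the idea but not its details, so the following is only a minimal sketch of what a "maximum continuous sum" over per-line density scores might look like, not the authors' implementation. It assumes the page has already been preprocessed into text lines, uses the count of non-whitespace characters as a stand-in for text density, and subtracts a hypothetical `threshold` so that short navigation or advertisement lines score negative. The names `max_continuous_sum`, `extract_content`, and `threshold` are illustrative and do not come from the paper.

```python
from typing import List, Tuple


def max_continuous_sum(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) of the contiguous run with the largest sum.

    `end` is inclusive. This is the standard maximum-subarray scan
    (Kadane's algorithm), used here as an assumed stand-in for MCSTD.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    best_start = best_end = cur_start = 0
    best_sum = cur_sum = scores[0]
    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running sum can only hurt; restart the run here.
            cur_start = i
            cur_sum = scores[i]
        else:
            cur_sum += scores[i]
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i
    return best_start, best_end, best_sum


def extract_content(lines: List[str], threshold: float = 20.0) -> List[str]:
    """Keep the contiguous block of lines whose shifted density sum is largest.

    Density proxy (assumption): non-whitespace characters per line,
    minus a hypothetical threshold.
    """
    scores = [len("".join(line.split())) - threshold for line in lines]
    start, end, _ = max_continuous_sum(scores)
    return lines[start:end + 1]
```

Under these assumptions, a short low-density line sandwiched between two long paragraphs is kept as part of the content block, because the run is chosen by its total sum rather than line by line.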

Full Text