Abstract

Different websites generally use different page structures, which strongly affects extraction quality when web content is collected automatically. Based on a statistical analysis of the content features and structural characteristics of news-domain web pages, this paper proposes a maximum continuous sum of text density (MCSTD) method to extract web content efficiently and effectively from differently structured pages. First, web pages are preprocessed; then the text density of the page text is calculated; finally, the web content is extracted using the proposed MCSTD method. Experimental results show that extraction precision exceeds 95%, and that the proposed approach is more efficient and easier to implement than traditional models. The method has also been applied to the construction of comparable corpora from the extracted web resources.
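
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not define text density, so the sketch assumes a per-line density equal to the share of non-markup characters, subtracts a hypothetical `threshold` so that markup-heavy lines score negative, and finds the best contiguous run of lines with the standard maximum-subarray (Kadane) scan. The function names `text_density`, `max_continuous_sum`, and `extract_content` are invented for this example.

```python
import re

# Naive tag matcher; adequate for a sketch, not for arbitrary HTML.
TAG_RE = re.compile(r"<[^>]+>")


def text_density(line: str) -> float:
    """Assumed definition: fraction of a line's characters that are not markup."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def max_continuous_sum(scores):
    """Kadane's scan: return (start, end, total) of the best contiguous run.

    `end` is exclusive, so the run is scores[start:end].
    """
    best_sum = float("-inf")
    best_start, best_end = 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


def extract_content(html: str, threshold: float = 0.5) -> str:
    """Keep the contiguous block of lines with the largest summed density.

    `threshold` is a hypothetical tuning parameter: it is subtracted from each
    density so that markup-heavy lines contribute negative scores.
    """
    lines = html.splitlines()
    if not lines:
        return ""
    scores = [text_density(line) - threshold for line in lines]
    start, end, _ = max_continuous_sum(scores)
    kept = (TAG_RE.sub("", line).strip() for line in lines[start:end])
    return "\n".join(text for text in kept if text)


if __name__ == "__main__":
    page = "\n".join([
        '<div class="nav"><a href="/">Home</a><a href="/news">News</a></div>',
        "<p>The council approved the new budget on Monday after a long debate.</p>",
        "<p>Officials said the plan will fund road repairs across the city.</p>",
        '<div class="footer"><a href="/about">About</a></div>',
    ])
    print(extract_content(page))
```

In this toy page the navigation and footer lines are mostly markup and score below the threshold, so the scan selects only the two paragraph lines. A real preprocessing step would also need to handle scripts, styles, comments, and tags that span several lines, which this sketch leaves out.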
