Abstract

Content extraction from Web news pages is a basic task for many Web applications and needs to be solved well. This paper presents a new method for extracting the content of Web news pages. The method first parses the HTML code in a simple and convenient way that does not rely on a third-party toolkit, turning the HTML structure into a more easily manipulated DOM (Document Object Model) tree. On this basis, it selects the candidate sub-trees that may contain the main content of the page. For these candidates, which are Element nodes of the DOM tree, four specific attributes defined in this paper are computed, and a decision tree is trained on these attributes. Identifying the news body sub-tree among the many sub-trees of a page can thus be regarded as a classification problem, in which a well-trained decision tree is used for learning and prediction.
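
The pipeline described above (parse without a third-party toolkit, pick candidate sub-trees, compute per-node attributes) might be sketched roughly as follows, using only Python's standard library. This is an illustrative sketch, not the paper's implementation: the abstract does not name the four attributes, so `text_len`, `link_ratio`, `p_count`, and `depth` are hypothetical stand-ins, and the `CANDIDATE_TAGS` set and all class and function names are likewise assumptions.

```python
from html.parser import HTMLParser

# Assumption: elements that never have children, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "area", "base", "col", "wbr"}
# Assumption: container elements considered as candidate sub-tree roots.
CANDIDATE_TAGS = {"div", "td", "article", "section"}


class Node:
    """A minimal Element node; children are Node objects or text strings."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def iter_nodes(node):
    """Yield all descendant Element nodes in document order."""
    for child in node.children:
        if isinstance(child, Node):
            yield child
            yield from iter_nodes(child)


def text_len(node, links_only=False, in_link=False):
    """Count text characters under node, optionally only those inside <a>."""
    if node.tag in ("script", "style"):
        return 0
    in_link = in_link or node.tag == "a"
    total = 0
    for child in node.children:
        if isinstance(child, Node):
            total += text_len(child, links_only, in_link)
        elif in_link or not links_only:
            total += len(child)
    return total


def features(node):
    """Hypothetical four-attribute vector for one candidate sub-tree."""
    total = text_len(node)
    depth = 0
    parent = node.parent
    while parent is not None:
        depth += 1
        parent = parent.parent
    return {
        "text_len": total,
        "link_ratio": text_len(node, links_only=True) / total if total else 0.0,
        "p_count": sum(1 for n in iter_nodes(node) if n.tag == "p"),
        "depth": depth,
    }


def candidate_features(html):
    """Parse html and return one feature dict per candidate sub-tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [features(n) for n in iter_nodes(builder.root) if n.tag in CANDIDATE_TAGS]


if __name__ == "__main__":
    page = (
        '<html><body><div id="nav"><a href="/">Home</a></div>'
        '<div id="story"><p>First paragraph.</p><p>Second one.</p></div>'
        "</body></html>"
    )
    for row in candidate_features(page):
        print(row)
```

With labelled pages, vectors like these would serve as the training rows for a decision-tree learner, and the trained tree would then classify each candidate sub-tree of a new page as news body or not.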
