Abstract
Web page content extraction is a fundamental step in data mining applications, since it supplies a clean data source with little noise. An original web page is mixed with noise because it embeds content-irrelevant information such as JavaScript and advertisements, and the purity of the data affects downstream applications. This paper therefore proposes a web information extraction model based on the statistical and positional relationships between the title and the content. Locating the title precisely improves the precision of content extraction; conversely, accurately extracted content provides positive feedback that helps confirm the right title was extracted. First, each text node is compared with the text taken from the title tag to obtain a similarity score. The final score of each node is obtained by adding its node attribute score to the similarity score, and the node with the highest score is taken as the current title. Based on the position of the title, we narrow the scope of the main content, which is located after the title. Using statistical information about the web page, we then traverse the DOM tree to obtain the content contained in the node with the maximal weight. Experimental results show that the algorithm outperforms previous rule-based extraction methods and is applicable to extracting main content from web pages.
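The two-stage procedure described above can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the names `TextNodeCollector`, `extract`, and `TAG_BONUS` are hypothetical, the heading bonuses are invented values standing in for the node attribute score, and plain text length stands in for the statistical node weight, which the abstract does not specify.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical attribute bonuses; the abstract does not give actual weights.
TAG_BONUS = {"h1": 0.3, "h2": 0.2, "h3": 0.1}
SKIP = {"script", "style"}  # noise containers whose text is ignored
VOID = {"br", "img", "meta", "link", "hr", "input"}  # tags with no end tag


class TextNodeCollector(HTMLParser):
    """Collects (tag, text) pairs in document order, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags
        self.title_tag_text = ""  # text found inside <title>
        self.nodes = []           # visible text nodes as (parent tag, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.stack:
            return
        if any(t in SKIP for t in self.stack):
            return
        if self.stack[-1] == "title":
            self.title_tag_text += text
        else:
            self.nodes.append((self.stack[-1], text))


def extract(html):
    """Return (title, content) guessed from an HTML string."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    nodes, ref = parser.nodes, parser.title_tag_text.lower()
    if not nodes:
        return None, None

    def title_score(node):
        tag, text = node
        # Similarity to the <title> text plus the assumed attribute bonus.
        similarity = SequenceMatcher(None, ref, text.lower()).ratio()
        return similarity + TAG_BONUS.get(tag, 0.0)

    title_idx = max(range(len(nodes)), key=lambda i: title_score(nodes[i]))
    # Only nodes after the title are candidates for the main content.
    candidates = nodes[title_idx + 1:]
    # Stand-in weight: the longest text node wins.
    content = max(candidates, key=lambda n: len(n[1]))[1] if candidates else None
    return nodes[title_idx][1], content
```

Calling `extract(page_source)` on a page returns the guessed title and main content as strings, or `(None, None)` when the page has no visible text. A fuller version would weight whole DOM subtrees rather than single text nodes, which is closer to the traversal the abstract describes.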