Abstract

Web pages typically contain noise such as navigational elements, sidebars, and advertisements in addition to the main content. This noise is mostly unrelated to the main content, and it degrades data mining and information retrieval tasks. Separating relevant information from noise is challenging because the structure of web pages is so diverse. Researchers have built algorithms such as Boilerpipe and jusText to detect the noise present in web articles. In this paper, we present an algorithm, "Web-AM", that removes noise from web articles using the HTML tree structure by extending the Boilerpipe Article Extractor algorithm. Although Boilerpipe performs very well at extracting the main content, it fails to filter out the noise present inside the main article. We make the initial selection of the main content using Boilerpipe and then remove noise from it using that tree structure. The main content is separated from the noise on the basis of text length and formatting. To evaluate Web-AM, we build our own corpus of Urdu-language web articles (CURWEB). In addition, we use L3S-GN1 and CleanPortalEval for the evaluation. Our results show a 3-21% improvement in precision for Web-AM compared to the Boilerpipe and jusText algorithms. Web-AM can be used effectively for information retrieval, content summarization, and web-page classification tasks.
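
The idea of filtering an already-selected article by text length and formatting can be illustrated with a minimal sketch. It is not the Web-AM algorithm itself: the input is assumed to be HTML that an extractor such as Boilerpipe has already selected as main content, link density is used here as a stand-in for "formatting", and the names `BlockCollector` and `filter_blocks` and the thresholds `min_len` and `max_link_ratio` are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockCollector(HTMLParser):
    """Collects (text, characters inside <a>) for each block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks
        self._text = []        # text pieces of the block being read
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0      # nesting depth of open <a> tags

    def flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def filter_blocks(html, min_len=40, max_link_ratio=0.5):
    """Keep blocks that are long enough and not dominated by link text.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_len and link_chars / len(text) <= max_link_ratio
    ]
```

In this sketch, short blocks (for example a stray caption or button label) and blocks made up mostly of link text (for example "related stories" lists) are dropped, while longer running text is kept.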
