Abstract

In general, web-pages contain noise such as navigational elements, sidebars, and advertisements in addition to the main content. This noise is mostly unrelated to the main content, and it degrades data mining and information retrieval tasks. Separating relevant information from noise is challenging because of the diversity in the structure of web-pages. Researchers have built algorithms such as Boilerpipe and JustText to detect the noise present in web articles. In this paper, we present an algorithm, "Web-AM", that removes noise from web articles using the HTML tree structure by extending the Boilerpipe Article Extractor algorithm. Although Boilerpipe performs very well at extracting the main content, it fails to filter the noise present inside the main article. We make the initial selection of the main content using Boilerpipe and then remove noise using that structure. Main content and noise are separated on the basis of text length and formatting. To evaluate Web-AM, we build our own corpus of Urdu-language web articles (CURWEB). In addition, we use L3S-GN1 and CleanPortalEval for the evaluation. Our results show a 3-21% improvement in Precision for Web-AM compared to the Boilerpipe and JustText algorithms. Web-AM can be used effectively for information retrieval, content summarization, and web-page classification tasks.
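To make the idea of filtering by text length and formatting concrete, the following is a minimal sketch and not the Web-AM algorithm itself. It assumes the article has already been selected by an extractor and split into blocks. The `Block` type, the thresholds `MIN_WORDS` and `MAX_LINK_DENSITY`, and the rule that headings are always kept are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One block of an already-extracted article (hypothetical structure)."""
    text: str            # visible text of the block
    tag: str             # HTML tag wrapping the block, e.g. "p", "div", "h1"
    link_chars: int = 0  # number of characters that sit inside <a> elements


# Hypothetical thresholds, chosen only for illustration.
MIN_WORDS = 10
MAX_LINK_DENSITY = 0.5
HEADING_TAGS = {"h1", "h2", "h3"}


def is_noise(block: Block) -> bool:
    """Flag a block as noise based on its text length and formatting."""
    if block.tag in HEADING_TAGS:
        return False  # assumption: short headings still belong to the article
    words = len(block.text.split())
    link_density = block.link_chars / max(len(block.text), 1)
    return words < MIN_WORDS or link_density > MAX_LINK_DENSITY


def filter_article(blocks: list[Block]) -> list[Block]:
    """Keep only the blocks that are not flagged as noise."""
    return [b for b in blocks if not is_noise(b)]


if __name__ == "__main__":
    article = [
        Block("Council approves budget", "h1"),
        Block("Related stories", "div", link_chars=15),
        Block("The council approved the new budget on Monday after a long "
              "debate over school funding.", "p"),
    ]
    for block in filter_article(article):
        print(block.tag, "|", block.text)
```

In this sketch, the short, fully linked "Related stories" block is dropped while the heading and the body paragraph are kept. A real system would derive the blocks and their formatting from the HTML tree rather than from hand-built objects.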
