Web-AM: An Efficient Boilerplate Removal Algorithm for Web Articles

Naseer Aslam, Muhammad Amir Mehmood, Hafiz Muhammad Shafiq, Bilal Tahir

doi:10.1109/fit47737.2019.00061

https://doi.org/10.1109/fit47737.2019.00061

Abstract

In general, web-pages contain extra information in the form of noise, such as navigational elements, sidebars, and advertisements, in addition to the main content. This noise is largely unrelated to the main content, and it affects data mining and information retrieval tasks. Distinguishing relevant information from noise is challenging due to the diversity in the structure of web-pages. Researchers have built algorithms such as Boilerpipe and JustText to detect the noise present in web articles. In this paper, we present an algorithm, "Web-AM", that removes noise from web articles using the HTML tree structure by extending the Boilerpipe Article Extractor algorithm. Although Boilerpipe performs very well at extracting the main content, it fails to filter the noise present inside the main article. We make the initial selection of the main content using Boilerpipe and then remove noise using the HTML tree structure of that selection. The separation of main content from noise is performed on the basis of text length and formatting. For the evaluation of Web-AM, we build our own corpus of Urdu-language web articles (CURWEB). In addition, we use the L3S-GN1 and CleanPortalEval datasets for the evaluation. Our results show a 3-21% improvement in Precision for Web-AM compared to the Boilerpipe and JustText algorithms. Web-AM can be used effectively for information retrieval, content summarization, and web-page classification tasks.
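
The idea of filtering blocks by text length and formatting can be illustrated with a minimal sketch. This is not the Web-AM algorithm itself, whose exact rules are not given in the abstract. It assumes that "formatting" can be approximated by link density (the share of a block's words that sit inside anchor tags), and the names `BlockCollector`, `filter_blocks`, `min_words`, and `max_link_density`, along with the threshold values, are hypothetical choices made for illustration. It also skips the Boilerpipe pre-selection step and works directly on an HTML fragment.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks; the paper may segment differently.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td"}


class BlockCollector(HTMLParser):
    """Splits HTML into text blocks and counts total words and anchor-text words."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # each entry: {"text", "words", "link_words"}
        self._words = []       # words of the block currently being built
        self._link_words = 0   # how many of those words are inside <a>
        self._in_link = 0      # nesting depth of open <a> tags

    def flush(self):
        """Closes the current block, if it contains any text."""
        if self._words:
            self.blocks.append({
                "text": " ".join(self._words),
                "words": len(self._words),
                "link_words": self._link_words,
            })
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        tokens = data.split()
        self._words.extend(tokens)
        if self._in_link:
            self._link_words += len(tokens)


def filter_blocks(html, min_words=10, max_link_density=0.5):
    """Keeps blocks that are long enough and not dominated by link text.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text that was not closed by a block tag
    kept = []
    for block in parser.blocks:
        link_density = block["link_words"] / block["words"]
        if block["words"] >= min_words and link_density <= max_link_density:
            kept.append(block["text"])
    return kept
```

Applied to a fragment that contains one long paragraph, a row of share links, and a short "Read more" teaser, `filter_blocks` under these assumed thresholds would keep only the long paragraph. A real implementation would tune the thresholds on a labeled corpus and operate on the subtree selected by the initial extractor.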
