Automatic Web News Content Extraction

Gusti Lanang Putra Eka Prismana

doi:10.36418/jrssem.v1i7.107

Abstract

Main-content extraction from web pages is widely used in search engines, but web pages also contain a great deal of irrelevant information, such as advertisements, navigation elements, and other junk content. Such irrelevant information reduces the efficiency of web content processing in content-based applications. This study aimed to extract web page content by segmenting the DOM Tree according to the information entropy of its nodes, with attention to both the rationality of the segmentation results and efficiency. The first step was to classify web page tags and process only those tags that affect the structure of the page. The second step was to consider the content features and structural features of each DOM Tree node together. The third step was to perform node fusion to obtain the segmentation results. Segmentation tests on several web pages with different structures showed that the proposed method segmented pages and removed noise from their content accurately and quickly. After the DOM Tree was formed, it was matched against a database to eliminate information noise using the Firefly Optimization algorithm. Finally, the Firefly Optimization method was tested and evaluated for its effectiveness in detecting and eliminating web page noise and producing clean documents.
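
To make the entropy-based idea more concrete, the following is a minimal Python sketch, not the method used in this study. It assumes a hypothetical set of structural tags (STRUCTURAL_TAGS), builds a simplified DOM Tree from those tags only, and scores each node by the Shannon entropy of the words beneath it. The abstract does not define the entropy measure, the tag classes, the node fusion rule, or the Firefly Optimization step, so all names and choices below are illustrative assumptions, and the fusion and optimization steps are omitted.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Assumption: a hypothetical set of tags that shape the page layout.
STRUCTURAL_TAGS = {"html", "body", "div", "section", "article",
                   "nav", "header", "footer", "aside", "main"}
# Tags whose text is never treated as page content.
SKIPPED_TAGS = {"script", "style"}


class Node:
    """One node of the simplified, structure-only DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.words = []  # words attached directly to this node

    def all_words(self):
        """Return the words of this node and all of its descendants."""
        words = list(self.words)
        for child in self.children:
            words.extend(child.all_words())
        return words


class StructuralTreeBuilder(HTMLParser):
    """Builds a tree from structural tags only (assumes well-nested HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in STRUCTURAL_TAGS and self.skip_depth == 0:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in STRUCTURAL_TAGS and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        # Text inside non-structural tags is attached to the nearest
        # structural ancestor.
        if self.skip_depth == 0:
            self.current.words.extend(data.lower().split())


def entropy(words):
    """Shannon entropy (in bits) of the word distribution."""
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def node_entropies(html):
    """Return (path, entropy) pairs for every structural node."""
    builder = StructuralTreeBuilder()
    builder.feed(html)
    builder.close()
    scores = []

    def visit(node, path):
        for i, child in enumerate(node.children):
            child_path = f"{path}/{child.tag}[{i}]"
            scores.append((child_path, entropy(child.all_words())))
            visit(child, child_path)

    visit(builder.root, "")
    return scores
```

In this sketch, a block that repeats the same few words, as navigation bars often do, receives a low score, while a block of varied text receives a higher one. How such scores would be combined with structural features, used for node fusion, or passed to Firefly Optimization is left open here.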
