Abstract

The rapid advancement of Internet technology over the last two decades has produced a huge number of web pages containing massive amounts of information in every domain, and the volume of available information continues to expand every minute. Analyzing and extracting information from web pages is therefore becoming increasingly crucial; in addition, information that appears on web pages in an unstructured or semi-structured format needs to be transformed into a structured format.
Since it is hard to collect this information manually, researchers have devised a variety of methods to extract information from different domains automatically. The main information in a web page is mixed with a significant amount of unrelated information (noise), such as advertisements, boxes of links to related material, boxes of photos or other media, top and/or side navigation bars, and animated commercials, which degrades the performance of information extraction and web content analysis technologies. This noise can be eliminated by using the Document Object Model (DOM), which can easily reach every tag in the structure of a web page to extract the information or delete the noise.
This article explores DOM tree-based approaches in depth, covering both HTML tags and the DOM tree, by reviewing works from 2011 to 2021 and comprehensively comparing them across several aspects, including classification methods, contributions, limitations, and evaluation metrics.
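As a brief illustration of the DOM-based noise removal described above, the sketch below parses a page into a DOM tree and deletes typical noise subtrees. It assumes Python with BeautifulSoup (>= 4.9) as the parser; the tag names and class-name hints are illustrative assumptions, not the specific rules of any surveyed work.

# Minimal sketch of DOM-based noise removal (assumed tools: Python + BeautifulSoup >= 4.9).
# The noise tags and class-name hints below are illustrative, not taken from the reviewed papers.
from bs4 import BeautifulSoup

NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "iframe", "form"}
NOISE_CLASS_HINTS = ("ad", "banner", "sidebar", "promo")  # hypothetical class-name cues


def is_noise(tag) -> bool:
    """Heuristically decide whether a DOM node is noise rather than main content."""
    if tag.name in NOISE_TAGS:
        return True
    classes = " ".join(tag.get("class", [])).lower()
    return any(hint in classes for hint in NOISE_CLASS_HINTS)


def strip_noise(html: str) -> str:
    """Parse the page into a DOM tree, delete noise subtrees, and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):      # every element, in document order
        if tag.decomposed:               # already removed as part of a noise ancestor
            continue
        if is_noise(tag):
            tag.decompose()              # drop the whole subtree from the DOM
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    sample = (
        "<html><body><nav>Home | About</nav>"
        "<div class='ad-banner'>Buy now!</div>"
        "<p>Main article text.</p></body></html>"
    )
    print(strip_noise(sample))  # -> Main article text.

Such rule-based subtree removal is the simplest DOM-based strategy; the works reviewed in the article also compare classifier-based methods for deciding which DOM nodes to keep.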
