With the rapid growth of the internet, a vast amount of information has become readily available, and extracting data from the web is crucial for applications such as meta-querying and comparison shopping. The heterogeneous nature of web information, however, makes extraction challenging. The web can be divided into the surface (visible) web and the deep (invisible) web. Conventional search engines can index the surface web but not the deep web: to reach deep-web content, users must submit queries to web databases, which return their results as data records embedded in dynamically generated pages. Because traditional search engines struggle to index these dynamic pages, a specialized program is needed to extract information from them automatically for use in downstream applications. We propose a data extraction method called Effective Data Extraction using Preprocessing (EDEP). EDEP parses the input HTML page, constructs a tag tree, and then eliminates irrelevant tags from the tree. The system handles cases where auxiliary information, such as recommendations or comments, is interleaved with the query result records (QRRs) and makes them non-contiguous, and it also handles result pages that contain a single QRR. Experimental results show that EDEP outperforms existing data extraction methods.
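
The preprocessing stage described above (parse the HTML, build a tag tree, drop irrelevant tags) can be illustrated with a minimal sketch. This is not the EDEP implementation: the abstract does not say which tags are treated as irrelevant, so the `IRRELEVANT_TAGS` set, the `Node` class, and the helper names `build_tag_tree`, `prune`, and `outline` are all assumptions made for illustration, using only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: the abstract does not list the irrelevant tags;
# this set is a hypothetical choice for illustration only.
IRRELEVANT_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements never get a closing tag, so they must not become
# the "current" open node while parsing.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of the tag tree (hypothetical structure)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic 'root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if any;
        # unmatched closing tags are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def prune(node):
    """Remove subtrees rooted at irrelevant tags, recursively."""
    node.children = [c for c in node.children if c.tag not in IRRELEVANT_TAGS]
    for child in node.children:
        prune(child)
    return node


def build_tag_tree(html):
    """Parse an HTML string and return the pruned tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return prune(builder.root)


def outline(node):
    """Compact textual view of the tree, e.g. 'div(p,p)'."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(outline(c) for c in node.children) + ")"


page = "<html><body><script>var x=1;</script><div><p>A</p><br><p>B</p></div></body></html>"
print(outline(build_tag_tree(page)))  # root(html(body(div(p,br,p))))
```

In this sketch the `script` subtree is dropped before any record detection would run, which is the general idea of preprocessing; identifying the QRRs themselves, including non-contiguous ones, is a separate step that the sketch does not attempt.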