Automatic Batch Extraction of Specific Content of HTML Based on Tag Locations

Yi Tao Chi, Hao Jiang Gao, Ping Guo, Xiu Bao Zhang, Zhi Guang Zhang

doi:10.4028/www.scientific.net/amm.602-605.3826

Abstract

HTML is widely used to describe and present information on the web. Although new technologies have continued to appear throughout the history of HTML, its basic structure and principles remain the same, and HTML is still an important part of tasks such as web development and even dynamic page presentation. Two main types of HTML parser are currently available: SAX and DOM. The former is driven by parsing events but can only access nodes sequentially, which is slow; the latter must load the whole document into memory, which consumes a lot of space. To address this problem, we propose an automatic batch extraction method for specific HTML content based on tag locations. The extraction process is divided into two main steps: the first locates the start and end positions of HTML tags, and the second finds the desired content based on the tag locations and the corresponding attribute information. The first step is the core of the whole process. An example that extracts specific content from a search result page verifies the proposed algorithm. The proposed algorithm can be further used for advanced tasks such as data mining and knowledge base construction.
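
The two steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the function names `locate_tags` and `extract` are hypothetical, tags are located with a regular expression, and attributes are matched by simple substring comparison. It should not be read as the authors' implementation.

```python
import re

# Assumed tag pattern: optional "/", a tag name, then attribute text in which
# quoted values may contain ">" without ending the tag.
TAG_RE = re.compile(r'''<(/?)([A-Za-z][A-Za-z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>''')


def locate_tags(html):
    """Step 1: record the start and end offsets of every tag in the text."""
    tags = []
    for m in TAG_RE.finditer(html):
        name = m.group(2).lower()
        is_closing = m.group(1) == "/"
        attrs = m.group(3)
        tags.append((name, is_closing, attrs, m.start(), m.end()))
    return tags


def extract(html, name, attr_substring):
    """Step 2: return the inner content of each <name> element whose
    attribute text contains attr_substring (a hypothetical matching rule)."""
    results = []
    stack = []  # one entry per currently open <name> tag
    for tag_name, is_closing, attrs, start, end in locate_tags(html):
        if tag_name != name:
            continue
        if not is_closing:
            if attrs.rstrip().endswith("/"):
                continue  # self-closing tag has no content
            stack.append((end, attr_substring in attrs))
        elif stack:
            content_start, wanted = stack.pop()
            if wanted:
                results.append(html[content_start:start])
    return results


if __name__ == "__main__":
    page = (
        '<div class="result"><a href="x">One</a></div>'
        '<div class="ad">Skip</div>'
        '<div class="result">Two <div class="inner">nested</div></div>'
    )
    for item in extract(page, "div", 'class="result"'):
        print(item)
    # prints:
    # <a href="x">One</a>
    # Two <div class="inner">nested</div>
```

Because only tag offsets are recorded, the sketch never builds a document tree, and content is sliced directly from the original string. A stack of open tags keeps nested elements of the same name paired with the correct closing tag.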
