Abstract

HTML is widely used to describe and present information on the web. Although new technologies have continued to appear throughout the history of HTML, its basic structure and principles remain the same, and HTML is still an important part of tasks such as web development and even dynamic page presentation. There are currently two main types of HTML parser, SAX and DOM. The former is driven by parsing events but can only access nodes sequentially, which is slow; the latter must load the whole document into memory, which consumes a lot of space. To address this problem, we propose an automatic batch extraction method for specific HTML content based on tag locations. The extraction process is divided into two main steps: first, locating the start and end positions of HTML tags; second, finding the desired content based on the tag locations and the corresponding attribute information. The first step is the core of the whole process. An example that extracts specific content from a search result page verifies the proposed algorithm. The proposed algorithm can be further used for advanced tasks such as data mining and knowledge base construction.
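The two steps described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `TagSpan`, `locate_tags`, and `extract`, the regular expression, and the sample markup are all assumptions made for illustration. The sketch also assumes reasonably well-formed markup and does not handle comments or script bodies that contain `<`.

```python
import re
from typing import List, NamedTuple

# Elements that never have a closing tag (HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Matches an opening or closing tag; group 3 is the raw attribute text.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


class TagSpan(NamedTuple):
    name: str           # lower-cased tag name
    attrs: str          # raw attribute text of the opening tag
    start: int          # offset of '<' in the opening tag
    content_start: int  # offset just past '>' of the opening tag
    content_end: int    # offset of '<' in the closing tag
    end: int            # offset just past '>' of the closing tag


def locate_tags(html: str) -> List[TagSpan]:
    """Step 1 (hypothetical): record start/end offsets of each paired tag."""
    stack = []  # open tags awaiting their closing tag
    spans = []
    for m in TAG_RE.finditer(html):
        closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3)
        if name in VOID_TAGS or attrs.rstrip().endswith("/"):
            continue  # void or self-closing tag: nothing to pair
        if not closing:
            stack.append((name, attrs.strip(), m.start(), m.end()))
            continue
        # Closing tag: pair it with the nearest open tag of the same name.
        for i in range(len(stack) - 1, -1, -1):
            if stack[i][0] == name:
                oname, oattrs, ostart, cstart = stack[i]
                del stack[i:]  # drop it and any unclosed tags nested inside
                spans.append(TagSpan(oname, oattrs, ostart, cstart,
                                     m.start(), m.end()))
                break
    spans.sort(key=lambda s: s.start)  # restore document order
    return spans


def extract(html: str, spans: List[TagSpan], name: str,
            attr_substring: str) -> List[str]:
    """Step 2 (hypothetical): slice out content of tags matching name/attrs."""
    return [html[s.content_start:s.content_end]
            for s in spans
            if s.name == name and attr_substring in s.attrs]


# Made-up search-result markup, used only to exercise the sketch.
SAMPLE = ('<div class="result"><a href="/a">First</a></div>'
          '<div class="ad"><a href="/x">Ad</a></div>'
          '<div class="result"><a href="/b">Second</a><br></div>')

spans = locate_tags(SAMPLE)
results = extract(SAMPLE, spans, "div", 'class="result"')
```

In this sketch a single linear scan stores only character offsets, not a node tree, and the second step reduces to filtering the recorded spans and slicing the original string. Whether the paper's method records locations in the same way is not stated in the abstract.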
