Abstract

With the rapid development of Internet technology, the Deep Web has come to hold vast amounts of data and, as the Web continues to grow, has become a huge data source. On script-generated websites, many documents share a common HTML tree structure, which allows users to extract information of interest from deep web pages effectively by means of wrappers. However, because the tree structure evolves over time, wrappers break frequently and need to be re-learned. In this paper, we explore the problem of constructing adaptive wrappers for deep web pages. To keep web extraction robust when web pages change, we propose a minimum-cost edit script model based on machine learning techniques. The method considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting nodes' labels. An extraction model is first obtained for the 51job site; pages are then randomly sampled from the zhaopin site, and this extraction model is used on them to train the new wrapper. The resulting wrapper is highly versatile and achieves adaptive extraction. Experimental results show that the proposed approach improves the extraction accuracy of target data and effectively addresses the adaptive wrapper problem for massive Deep Web data.
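
The three edit operations named in the abstract (node insertion, node deletion, label substitution) can be illustrated with a classic ordered tree edit distance. The sketch below is not the paper's learned model: it assumes fixed unit costs, whereas the abstract describes costs obtained with machine learning techniques, and the names `tree`, `forest_distance`, `tree_edit_distance` and the cost constants are hypothetical, introduced only for illustration.

```python
from functools import lru_cache

# Assumed unit costs (hypothetical); a learned model would estimate these.
INSERT_COST = 1
DELETE_COST = 1
RELABEL_COST = 1


def tree(label, *children):
    """Build an immutable, hashable tree node: (label, (child, ...))."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum edit cost between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        # Only insertions remain: insert the rightmost root of f2.
        _, children = f2[-1]
        return forest_distance(f1, f2[:-1] + children) + INSERT_COST
    if not f2:
        # Only deletions remain: delete the rightmost root of f1.
        _, children = f1[-1]
        return forest_distance(f1[:-1] + children, f2) + DELETE_COST

    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children are promoted.
    delete = forest_distance(f1[:-1] + kids1, f2) + DELETE_COST
    # Insert the rightmost root of f2 above some run of nodes.
    insert = forest_distance(f1, f2[:-1] + kids2) + INSERT_COST
    # Match the two rightmost roots, substituting the label if it differs.
    match = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if label1 == label2 else RELABEL_COST)
    )
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    """Minimum cost of an edit script turning tree t1 into tree t2."""
    return forest_distance((t1,), (t2,))


# A page template before and after a wrapper <div> is added around <span>.
old_page = tree("div", tree("h1"), tree("span"))
new_page = tree("div", tree("h1"), tree("div", tree("span")))
print(tree_edit_distance(old_page, new_page))
```

For the two toy page trees above, a single node insertion suffices, so the distance is 1. A wrapper that stores the old path to `span` could, under these assumptions, follow the cheapest edit script to locate the same node in the changed page.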
