Abstract

Extracting data from Web pages using wrappers is a fundamental problem that arises in a wide variety of applications of considerable practical interest. Web data extraction raises two main issues: wrapper generation and wrapper maintenance. In this paper, we propose a novel approach to automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved, such as the syntactic patterns, annotations, and content of the extracted data items. The approach uses these preserved features to identify the locations of the desired values in the changed pages, so that the wrappers can then be repaired. Experiments on real Web sites show that the proposed approach can effectively maintain wrappers so that they continue to extract the desired data accurately.
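
As a rough illustration of the idea of relocating a value from preserved features, the following Python sketch scores the text nodes of a changed page against three kinds of evidence: a syntactic pattern, an annotation (label) that precedes the value, and previously extracted content. This is not the algorithm of the paper. The function name locate_value, the equal-weight scoring, the flat list-of-text-nodes page model, and the sample data are all assumptions made for this example.

```python
import re
from typing import List, Optional, Set


def locate_value(text_nodes: List[str], pattern: str, annotation: str,
                 known_values: Set[str]) -> Optional[int]:
    """Return the index of the text node most likely to hold the desired value.

    Hypothetical scoring: one point for each preserved feature that matches.
    Returns None when no node matches any feature.
    """
    regex = re.compile(pattern)
    best_index, best_score = None, 0
    for i, text in enumerate(text_nodes):
        value = text.strip()
        score = 0
        # Syntactic pattern: the value still has the same shape (e.g. a price).
        if regex.fullmatch(value):
            score += 1
        # Annotation: the preceding node still carries the same label.
        if i > 0 and annotation.lower() in text_nodes[i - 1].lower():
            score += 1
        # Content: the same value was extracted before the page changed.
        if value in known_values:
            score += 1
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Assumed example: text nodes of a page after a layout change.
changed_page = ["Wrapper Maintenance Handbook", "Our price:", "$24.99",
                "You save:", "$5.00"]
index = locate_value(changed_page, r"\$\d+\.\d{2}", "price", {"$24.99"})
print(changed_page[index])  # prints "$24.99"
```

Once the node holding the desired value has been located in the changed page, a wrapper could in principle be re-induced from that location; that repair step is outside the scope of this sketch.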
