Optimal schemes for robust web extraction

Aditya Parameswaran, Hector Garcia-Molina, Rajeev Rastogi, Nilesh Dalvi

doi:10.14778/3402707.3402735

Open Access

Journal: Proceedings of the VLDB Endowment
Publication Date: Aug 1, 2011
Citations: 25

Affiliation: Stanford University, Yahoo (Spain)

#Real Websites #Web Information Extraction

Abstract

In this paper, we consider the problem of constructing wrappers for web information extraction that are robust to changes in websites. We consider two models to study robustness formally: the adversarial model, where we look at the worst-case robustness of wrappers, and the probabilistic model, where we look at the expected robustness of wrappers as web pages evolve. Under both models, we present efficient algorithms for constructing the provably most robust wrapper. By evaluating on real websites, we demonstrate that in practice our algorithms are highly effective in coping with changes in websites, and reduce wrapper breakage by up to 500% over existing techniques.
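
To make the two robustness criteria concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm or its formal definitions: the page model (a flat list of tagged nodes), the three wrappers, the set of future page versions, and their probabilities are all hypothetical and chosen only for illustration. It contrasts a worst-case check (does the wrapper survive every allowed change?) with an expected-case score (the total probability of the versions on which it still works).

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical page model: a flat list of (tag, text) nodes.
Page = List[Tuple[str, str]]
Wrapper = Callable[[Page], Optional[str]]

def by_position(page: Page) -> Optional[str]:
    """Extract the text of the third node, whatever its tag is."""
    return page[2][1] if len(page) > 2 else None

def by_tag(page: Page) -> Optional[str]:
    """Extract the text of the first node tagged 'price'."""
    return next((text for tag, text in page if tag == "price"), None)

def by_pattern(page: Page) -> Optional[str]:
    """Extract the first text that starts with a dollar sign."""
    return next((text for _, text in page if text.startswith("$")), None)

TARGET = "$10"  # the value every wrapper is supposed to extract

# Assumed future versions of the page with made-up probabilities.
VARIANTS: List[Tuple[float, Page]] = [
    (0.5, [("title", "Widget"), ("img", "w.png"), ("price", "$10")]),  # unchanged
    (0.3, [("ad", "Buy now"), ("title", "Widget"), ("img", "w.png"),
           ("price", "$10")]),                                         # node inserted
    (0.2, [("title", "Widget"), ("img", "w.png"), ("cost", "$10")]),   # tag renamed
]

def survives_worst_case(wrapper: Wrapper) -> bool:
    """Adversarial-style check: the wrapper must work on every variant."""
    return all(wrapper(page) == TARGET for _, page in VARIANTS)

def expected_survival(wrapper: Wrapper) -> float:
    """Probabilistic-style score: total probability of variants it survives."""
    return sum(prob for prob, page in VARIANTS if wrapper(page) == TARGET)

WRAPPERS: Dict[str, Wrapper] = {
    "by_position": by_position,
    "by_tag": by_tag,
    "by_pattern": by_pattern,
}

# Pick the candidate with the highest expected survival.
best_expected = max(WRAPPERS, key=lambda name: expected_survival(WRAPPERS[name]))
```

In this toy setting the positional wrapper breaks when a node is inserted, the tag-based wrapper breaks when the tag is renamed, and only the pattern-based wrapper survives all three versions, so it wins under both criteria. With a different set of assumed changes the two criteria can disagree, which is why the abstract treats them as separate models.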
