Abstract

In Web database integration, crawling data pages is an important step in data extraction. Because the data returned by a query are usually spread across multiple result pages, they are harder to access for integration. It is therefore necessary to crawl query result pages from Web databases accurately and automatically. To address this problem, we propose a novel approach based on URL classification to identify result pages effectively. In our approach, we compute the similarity between the URLs of hyperlinks in result pages and classify the URLs into four categories. Each category maps to a set of similar web pages, which separates result pages from other pages. We then use a page probing method to verify the correctness of the classification and to improve the accuracy of the crawled result pages. Experimental results demonstrate that our approach is effective at identifying the collection of result pages in a Web database and can improve the quality and efficiency of data extraction.
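The abstract does not state which similarity measure is used or how the four categories are defined, so the following is only a minimal sketch of the general idea of grouping hyperlinks by URL similarity. It assumes a Jaccard similarity over structural URL tokens (host, path segments, query parameter names) and a greedy threshold-based grouping; the function names `url_features`, `url_similarity`, and `group_urls`, the threshold of 0.5, and the `example.com` URLs are all hypothetical and not taken from the paper.

```python
from urllib.parse import urlsplit, parse_qsl


def url_features(url):
    """Split a URL into structural tokens (assumed feature set, not the paper's)."""
    parts = urlsplit(url)
    features = {"host:" + parts.netloc.lower()}
    # Path segments are tagged with their depth so /a/b and /b/a differ.
    segments = [s for s in parts.path.split("/") if s]
    for depth, segment in enumerate(segments):
        features.add(f"path{depth}:{segment.lower()}")
    # Only parameter names are kept; values such as page numbers are ignored.
    for name, _value in parse_qsl(parts.query, keep_blank_values=True):
        features.add("param:" + name.lower())
    return frozenset(features)


def url_similarity(a, b):
    """Jaccard similarity between the token sets of two URLs."""
    fa, fb = url_features(a), url_features(b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0


def group_urls(urls, threshold=0.5):
    """Greedily place each URL in the first group whose representative is similar enough."""
    groups = []
    for url in urls:
        for group in groups:
            if url_similarity(url, group[0]) >= threshold:
                group.append(url)
                break
        else:
            # No existing group matched, so this URL starts a new one.
            groups.append([url])
    return groups


links = [
    "http://example.com/search?q=db&page=2",
    "http://example.com/search?q=db&page=3",
    "http://example.com/item/123",
    "http://example.com/item/456",
    "http://example.com/about",
]
groups = group_urls(links)
for group in groups:
    print(group)
```

In this sketch the two paginated search URLs share every structural token and fall into one group, while detail pages and unrelated navigation links form separate groups. A verification step such as the page probing described in the abstract would then be needed to confirm which group actually holds the result pages.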
