Abstract

Users want to query fewer data sources while retrieving higher-quality results, which makes data source selection a core technology in deep web data integration. Data source selection normally considers both the correlation of each data source with the user's query and the duplication of document content across sources. We propose a new two-step data source selection strategy that first ranks data sources by their correlation with the query and then adjusts the ranking according to document content duplication. We first compute source correlation scores from the ranks of sampled documents and improve the accuracy of these scores by modeling the deviation probability of subject content correlation; finally, we refine the data source ranking based on subject content duplication. Test results show that our method achieves better data source selection accuracy and document recall when selection is based on only a few sampled documents.
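
The two-step idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the reciprocal-rank scoring, the greedy novelty-weighted re-ranking, and the names `correlation_scores` and `select_sources` are assumptions chosen for illustration, and the deviation probability modeling step is omitted entirely. Each source is assumed to be summarized by a set of hypothetical content identifiers (for example, topic labels drawn from its sampled documents).

```python
from typing import Dict, List, Set


def correlation_scores(ranked_samples: List[str],
                       doc_source: Dict[str, str]) -> Dict[str, float]:
    """Step 1 (assumed scoring): sum reciprocal ranks of each source's sampled documents."""
    scores: Dict[str, float] = {}
    for rank, doc in enumerate(ranked_samples, start=1):
        src = doc_source[doc]
        scores[src] = scores.get(src, 0.0) + 1.0 / rank
    return scores


def select_sources(scores: Dict[str, float],
                   source_content: Dict[str, Set[str]],
                   k: int) -> List[str]:
    """Step 2 (assumed adjustment): greedily pick sources, discounting duplicated content."""
    selected: List[str] = []
    covered: Set[str] = set()
    candidates = set(scores)

    def adjusted(src: str) -> float:
        content = source_content.get(src, set())
        if not content:
            return 0.0
        # Fraction of this source's content not already covered by selected sources.
        novelty = len(content - covered) / len(content)
        return scores[src] * novelty

    while candidates and len(selected) < k:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates), key=adjusted)
        selected.append(best)
        covered |= source_content.get(best, set())
        candidates.remove(best)
    return selected


# Hypothetical data: sampled documents ranked against one query.
ranked = ["a1", "b1", "a2", "c1"]
owner = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
content = {"A": {"x", "y"}, "B": {"x", "y"}, "C": {"z"}}

scores = correlation_scores(ranked, owner)
chosen = select_sources(scores, content, k=2)
print(chosen)
```

In this toy data, source B scores higher than C on correlation alone, but its content fully duplicates that of A, so the adjusted ranking prefers C as the second source.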
