Deep web data source selection based on subject and probability model

Song Deng

doi:10.1109/imcec.2016.7867557

Abstract

Users want to query fewer data sources while retrieving higher-quality results, so data source selection has become a core technology in deep web data integration. Data source selection normally considers both the correlation of each data source to the user's query and the duplication of document content. We propose a new two-step data source selection strategy that first ranks data sources by correlation and then adjusts the ranking by document content duplication. We first compute source correlation scores from the ranks of sampled documents, then improve the accuracy of those scores with a probability model of the subject content correlation deviation, and finally refine the data source ranking based on subject content duplication. Test results show that our method achieves better data source selection accuracy and document recall when selection is based on only a few sampled documents.
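
The two-step idea summarized above (rank by correlation, then adjust for duplication) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the reciprocal-rank weighting, the greedy overlap penalty, and the names `correlation_scores`, `select_sources`, `source_docs`, and `penalty` are all assumptions made for illustration, and the probability model of the correlation deviation is omitted because the abstract does not specify it.

```python
from collections import defaultdict


def correlation_scores(ranked_samples):
    """Score each source from the ranks of its sampled documents.

    ranked_samples: list of (doc_id, source) pairs, best match first.
    Assumption: a reciprocal-rank weight stands in for the paper's scoring.
    """
    scores = defaultdict(float)
    for rank, (_doc_id, source) in enumerate(ranked_samples, start=1):
        scores[source] += 1.0 / rank
    return dict(scores)


def select_sources(ranked_samples, source_docs, k, penalty=1.0):
    """Greedily pick up to k sources, discounting duplicated content.

    source_docs: dict mapping source -> set of sampled content fingerprints.
    penalty: hypothetical weight in [0, 1] on the duplicated fraction.
    """
    scores = correlation_scores(ranked_samples)
    selected, covered = [], set()
    candidates = set(scores)
    while candidates and len(selected) < k:
        def adjusted(src):
            docs = source_docs.get(src, set())
            # Fraction of this source's samples already covered by earlier picks.
            overlap = len(docs & covered) / len(docs) if docs else 0.0
            return scores[src] * (1.0 - penalty * overlap)

        # sorted() makes tie-breaking deterministic (alphabetical by source name).
        best = max(sorted(candidates), key=adjusted)
        selected.append(best)
        covered |= source_docs.get(best, set())
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    ranked = [("d1", "A"), ("d2", "B"), ("d3", "A"), ("d4", "C")]
    docs = {"A": {"x", "y"}, "B": {"x", "y"}, "C": {"z"}}
    print(select_sources(ranked, docs, k=2))
```

In this toy input, source B scores higher than C on correlation alone, but its sampled content fully duplicates A, so the duplication step promotes C instead. Setting `penalty=0.0` disables the adjustment and reduces the procedure to plain correlation ranking.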

Full Text