Abstract

Duplicate record detection is a key step in Deep Web data integration, but existing approaches do not scale to its large-scale nature. In this paper, a three-step automatic approach is proposed for duplicate record detection in the Deep Web. First, it uses a cluster ensemble to select initial training instances. Then it applies tri-training to construct classification models. Finally, it uses evidence theory to combine the results of multiple classification models into a domain-level duplicate record detection model, which can be applied to large-scale duplicate record detection within the same domain. Experimental results show that the proposed approach outperforms previous work and that the domain-level duplicate record detection model achieves high performance.
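The final step, combining the outputs of several classification models with evidence theory, is commonly done via Dempster's rule of combination. The sketch below is an illustrative minimal implementation, not the paper's code: it assumes each classifier emits a mass function over the frame {'M', 'N'} (match / non-match), with mass on the full frame expressing uncertainty, and fuses two such mass functions.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination over the frame {'M', 'N'}.

    m1, m2: dicts mapping frozenset subsets of the frame to mass values
    (each dict's masses sum to 1). Returns the normalized combined masses.
    """
    combined = {}
    conflict = 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        inter = a & b          # intersection of the two focal elements
        w = pa * pb            # product of the two masses
        if inter:
            combined[inter] = combined.get(inter, 0.0) + w
        else:
            conflict += w      # mass assigned to conflicting evidence
    norm = 1.0 - conflict      # renormalize away the conflict
    return {k: v / norm for k, v in combined.items()}

# Hypothetical example: two classifiers judging one record pair.
# Classifier 1 leans "match" with some uncertainty; classifier 2 agrees
# more weakly. Combination sharpens the belief in "match".
m1 = {frozenset({'M'}): 0.7, frozenset({'M', 'N'}): 0.3}
m2 = {frozenset({'M'}): 0.6, frozenset({'N'}): 0.3, frozenset({'M', 'N'}): 0.1}
result = combine(m1, m2)
```

After combination, the pair would be labeled a duplicate when the belief in {'M'} exceeds a decision threshold; the exact labeling rule and mass assignments used in the paper are not specified in the abstract.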
