Abstract

Record linkage, also known as duplicate detection, is a key process for ensuring the quality of data stored for Web services. Given two lists of records, record linkage consists of determining all pairs of records that are similar to each other, where the overall similarity between two records is defined in terms of domain-specific similarities over the individual attributes that make up a record. In this paper, we present a unified framework for recognizing clusters of near-duplicate records in multi-language data, especially Chinese/English mixed Web data. The key ideas are: (1) pre-processing multi-language data using Chinese word segmentation and Chinese named entity recognition techniques; (2) a pair-wise comparison method based on domain-specific similarities, in particular the string kernel method; and (3) a priority queue of duplicate clusters together with a representative-records strategy that adapts to the scale of the data. Experiments on real databases show that the proposed record linkage strategy is efficient and effective.
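To make ideas (2) and (3) concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes a character-level n-gram (spectrum) kernel as one simple member of the string kernel family, and it uses a bounded, recency-ordered queue of clusters as a stand-in for the priority queue, with the first record of each cluster serving as its representative. The names `spectrum_kernel`, `similarity`, `Cluster`, and `link_records`, as well as the threshold and queue size, are hypothetical. Chinese word segmentation and named entity recognition are omitted; the kernel here works directly on characters.

```python
from collections import Counter, deque
from math import sqrt


def spectrum_kernel(s, t, n=2):
    """Inner product of character n-gram counts (assumed kernel choice)."""
    a = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    b = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(count * b[gram] for gram, count in a.items())


def similarity(s, t, n=2):
    """Kernel value normalized to [0, 1]; 0.0 if either string is shorter than n."""
    denom = sqrt(spectrum_kernel(s, s, n) * spectrum_kernel(t, t, n))
    return spectrum_kernel(s, t, n) / denom if denom else 0.0


class Cluster:
    """A duplicate cluster; its first record acts as the representative."""

    def __init__(self, record):
        self.rep = record
        self.members = [record]


def link_records(records, threshold=0.6, max_clusters=4):
    """Single pass: compare each record only to representatives in the queue."""
    queue = deque()   # most recently matched cluster sits at the front
    clusters = []     # every cluster ever created, in creation order
    for rec in records:
        match = next(
            (c for c in queue if similarity(rec, c.rep) >= threshold), None
        )
        if match is not None:
            match.members.append(rec)
            queue.remove(match)
        else:
            match = Cluster(rec)
            clusters.append(match)
        queue.appendleft(match)
        if len(queue) > max_clusters:
            queue.pop()  # evict the least recently matched cluster
    return [c.members for c in clusters]


records = [
    "Lenovo 联想集团",
    "Huawei 华为技术",
    "Lenovo 联想 集团",
    "huawei 华为技术",
]
clusters = link_records(records)
```

Comparing each incoming record against a bounded set of representatives, rather than against every earlier record, keeps the number of comparisons per record constant as the data grows. The trade-off is that a duplicate arriving after its cluster has been evicted from the queue starts a new cluster.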
