Abstract

Record de-duplication is the process of identifying and removing duplicate records from a dataset in a data warehouse environment. The term record linkage is used in the same context. The difference is that de-duplication is used when duplicates are to be removed from a single dataset, while record linkage is used when duplicates that refer to the same entity are to be removed across several different datasets. Both processes are important during the data profiling stage of a data warehouse: they assure data quality by eliminating repetition, which in turn leads to better decision making. This research focuses on record de-duplication. The efficiency of record de-duplication depends on several criteria, such as the number of comparisons needed, the time and cost of each comparison, the accuracy of de-duplication, and the time and space complexity of identifying true duplicates. In this paper we explore several indexing techniques that aim to reduce the number of comparisons needed to identify duplicates in a given dataset. Peter Christen has surveyed and experimentally evaluated six indexing techniques [1]: Sorted Neighborhood indexing, Suffix Array indexing, Q-Gram based indexing, Canopy Clustering, Threshold based indexing, and String Map based indexing. We study and implement Sorted Neighborhood based de-duplication techniques in detail. As part of this implementation, adaptive and non-adaptive Sorted Neighborhood Methods (SNM) are tested and validated. Accumulative Adaptive SNM (AASNM) and Incrementally Adaptive SNM (IASNM) [16] are adaptive versions of SNM, while Duplicate Count Strategy (DCS) [4] is a non-adaptive SNM. A Group based Accumulative Adaptive Method (GAASNM) is proposed to minimize the number of record comparisons.
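To illustrate why Sorted Neighborhood indexing reduces the number of comparisons, the following is a minimal sketch of the basic fixed-window SNM only. It is not the AASNM, IASNM, DCS, or GAASNM procedure studied in the paper. The function names, the sorting key, the window size of 3, the similarity measure (Python's `difflib.SequenceMatcher`), and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def sorted_neighborhood_pairs(records, key, window=3):
    """Yield candidate index pairs (i, j), i < j, from a fixed-size sliding window.

    Records are sorted by `key`; each record is paired only with the
    `window - 1` records that follow it in sorted order.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            yield (i, j) if i < j else (j, i)


def similarity(a, b):
    """Illustrative string similarity in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()


def find_duplicates(records, key, window=3, threshold=0.85):
    """Compare only the windowed candidate pairs and keep those above the threshold."""
    matches = []
    for i, j in sorted_neighborhood_pairs(records, key, window):
        if similarity(records[i], records[j]) >= threshold:
            matches.append((i, j))
    return sorted(matches)


# Hypothetical sample data, for illustration only.
records = [
    "john smith",
    "jon smith",
    "mary jones",
    "marie jones",
    "peter christen",
    "john smyth",
]

candidates = list(sorted_neighborhood_pairs(records, key=lambda r: r, window=3))
all_pairs = list(combinations(range(len(records)), 2))
print(len(candidates), "windowed comparisons vs", len(all_pairs), "exhaustive comparisons")
print(find_duplicates(records, key=lambda r: r))
```

With six records and a window of three, the sketch makes 9 candidate comparisons instead of the 15 required by exhaustive pairwise comparison. The trade-off is that true duplicates falling outside the window after sorting are missed, which is the limitation that adaptive window strategies aim to address.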
