Data De-duplication: A Review

Gianni Costa, Riccardo Ortale, Giuseppe Manco, Alfredo Cuzzocrea

doi:10.1007/978-3-642-22913-8_18

Abstract

The wide exploitation of new techniques and systems for generating, collecting and storing data has made growing volumes of information available. Large quantities of this information are stored as free text. The lack of explicit structure in free text is a major obstacle to categorizing this kind of data for more effective and efficient information retrieval, search and filtering. The abundance of structured data is problematic too. Several available databases contain data of the same type. Unfortunately, they often conform to different schemas, which prevents the unified management of even structured information. The Entity Resolution process plays a fundamental role in information integration and management: it aims to infer a uniform and common structure from various large-scale data collections, with which to suitably organize, match and consolidate the information of the individual repositories into one data set. De-duplication is a key step of the Entity Resolution process, whose goal is to discover duplicates within the integrated data, i.e., different tuples that, as a matter of fact, refer to the same real-world entity. This reduces the redundancy of the integrated data and also enables more effective information handling and knowledge extraction through unified access to reconciled and de-duplicated data. Duplicate detection is an active research area that benefits from contributions from diverse research fields, such as machine learning, data mining and knowledge discovery, databases, and information retrieval and extraction. This chapter presents an overview of research on data de-duplication, with the goal of providing a general understanding of, and useful references to, fundamental concepts concerning the recognition of similarities in very large data collections. For this purpose, a variety of state-of-the-art approaches to de-duplication are reviewed.
The discussion of the state of the art follows a taxonomy that, at the highest level, divides the existing approaches into two broad classes, i.e., unsupervised and supervised approaches. Both classes are further divided into sub-classes according to the characteristics shared by the approaches they contain. The strengths and weaknesses of each group of approaches are presented. Meaningful research developments that could further advance the current state of the art are covered as well.
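To make the notion of a duplicate concrete, the following is a minimal sketch of one simple unsupervised scheme: two tuples are flagged as duplicates when the Jaccard similarity of their word tokens reaches a threshold. It is not taken from the chapter; the function names, the sample records and the threshold value of 0.6 are all assumptions chosen for illustration.

```python
from itertools import combinations


def tokens(record):
    """Lower-case every field, treat commas as spaces, and collect word tokens."""
    return {tok for field in record for tok in field.lower().replace(",", " ").split()}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(records, threshold=0.6):
    """Return index pairs whose token-set similarity reaches the threshold.

    The threshold is a hypothetical value picked for this example.
    """
    token_sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


# Hypothetical tuples: the first two describe the same person and address.
records = [
    ("John Smith", "12 Main Street, Springfield"),
    ("Smith, John", "12 Main Street Springfield"),
    ("Mary Jones", "4 Oak Avenue, Shelbyville"),
]
pairs = find_duplicates(records)
```

The sketch compares every pair of tuples, so its cost grows quadratically with the number of records; it is meant only to illustrate the matching step on a small input, not to handle the very large collections the abstract refers to.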

