FDup: a framework for general-purpose and efficient entity deduplication of record collections.

Michele De Bonis, Claudio Atzori, Paolo Manghi

doi:10.7717/peerj-cs.1058

Open Access

https://doi.org/10.7717/peerj-cs.1058

Abstract

Deduplication is a technique that aims to identify and resolve duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework supporting a complete deduplication workflow for big data record collections: metadata record data model definition, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and similarity matching via an intuitive configuration file. Second, it improves performance beyond the known techniques of “blocking” and “sliding window” by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparisons of the fields of two records as branches of predicates and allows for successful or unsuccessful early-exit strategies. The efficacy of the approach is demonstrated by experiments performed over big data collections of metadata records in the OpenAIRE Research Graph, a well-known open access knowledge base in scholarly communication.
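As a rough illustration of the three ideas named in the abstract (blocking, sliding window, and a decision-tree match with early exits), the following Python sketch may help. It is not FDup's code or configuration format: the record fields (`title`, `year`, `doi`), the three-character blocking key, the window size, and the 0.9 title threshold are all assumptions chosen for the example, and it runs on a single machine rather than on Spark.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    # Hypothetical key: first 3 alphanumeric characters of the lowercased title.
    cleaned = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return cleaned[:3]


def candidate_pairs(records, window=3):
    # Blocking: only records that share a key are ever compared.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        # Sliding window: sort the block, then pair each record with
        # at most (window - 1) of the records that follow it.
        block.sort(key=lambda r: r["title"].lower())
        for i, left in enumerate(block):
            for right in block[i + 1 : i + window]:
                yield left, right


def tree_match(a, b, title_threshold=0.9):
    # Node 1: identical non-empty identifiers, positive early exit.
    if a.get("doi") and a.get("doi") == b.get("doi"):
        return True
    # Node 2: both years known and different, negative early exit.
    if a.get("year") and b.get("year") and a["year"] != b["year"]:
        return False
    # Node 3: the costly string comparison runs only if no early exit fired.
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= title_threshold


# Made-up records, for illustration only.
records = [
    {"id": 1, "title": "Entity deduplication at scale", "year": 2022, "doi": "10.0000/example"},
    {"id": 2, "title": "Entity deduplication at scale.", "year": 2022, "doi": None},
    {"id": 3, "title": "Entity deduplication at scale", "year": 2019, "doi": None},
    {"id": 4, "title": "Graph databases in practice", "year": 2022, "doi": None},
]

duplicates = [(a["id"], b["id"]) for a, b in candidate_pairs(records) if tree_match(a, b)]
print(duplicates)
```

In this sketch, record 4 is never compared with the others because it falls in a different block, and the pairs involving record 3 are rejected at the year check before any string similarity is computed. That ordering of cheap, decisive predicates before expensive ones is the kind of saving the early-exit strategy is meant to provide.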

Journal: PeerJ. Computer science
Publication Date: Sep 6, 2022
Citations: 2
License type: CC BY 4.0

Similar Papers

An open access medical knowledge base for community driven diagnostic decision support system development
Lars Müller ... Sanjay Mehta
BMC Medical Informatics and Decision Making | VOL. 19
27 Apr 2019

CamurWeb: a classification software and a large knowledge base for gene expression data of cancer
Emanuel Weitschek ... Paola Bertolazzi
BMC bioinformatics | VOL. 19
01 Oct 2018

TracerDB: a crowdsourced fluorescent tracer database for target engagement analysis
Johannes Dopfer ... Martin P Schwalm
Nature Communications | VOL. 15
05 Jul 2024

Data Science in Healthcare: Implications for Early Career Investigators.
Sanjeev P Bhavnani ... Daniel Muñoz
Circulation: Cardiovascular Quality and Outcomes | VOL. 9
01 Nov 2016