Abstract

A deduplication process uses a similarity function to decide whether two entries are duplicates by comparing their similarity score against a threshold. Setting this threshold is critical to accuracy, and it usually relies on human intervention. Swarm intelligence algorithms such as Particle Swarm Optimization (PSO) and Artificial Bee Colony (ABC) have been used to determine the threshold automatically when searching for duplicate records. Although these algorithms perform well, the solution search equation, which generates new candidate solutions from information about previous solutions, remains a weakness. The proposed work addresses two problems: first, it finds an optimal solution search equation using a Genetic Algorithm (GA); next, it adopts a modified ABC to obtain the optimal threshold, so that duplicate records are detected more accurately and with less human intervention. The CORA dataset is used to analyze the proposed algorithm.
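To make the term "solution search equation" concrete, the sketch below shows the standard ABC update, v_ij = x_ij + phi * (x_ij - x_kj), with phi drawn uniformly from [-1, 1] and k a randomly chosen food source other than i. This is the textbook form only; the GA-derived equation and the modified ABC of the proposed work are not given in this summary. The function name, the bounds, and the clamping step are assumptions made for illustration.

```python
import random


def abc_candidate(population, i, rng, lower=0.0, upper=1.0):
    """Generate one candidate from food source i using the standard ABC equation.

    population: list of solutions, each a list of floats (e.g. a threshold).
    lower/upper: assumed search bounds; a similarity threshold lies in [0, 1].
    """
    # Pick a partner k different from i.
    k = rng.choice([idx for idx in range(len(population)) if idx != i])
    # Pick one dimension j to perturb.
    j = rng.randrange(len(population[i]))
    # phi is uniform in [-1, 1].
    phi = rng.uniform(-1.0, 1.0)

    candidate = list(population[i])  # copy so the original is not modified
    candidate[j] = population[i][j] + phi * (population[i][j] - population[k][j])
    # Clamp to the search bounds (an assumption, not part of the equation itself).
    candidate[j] = min(upper, max(lower, candidate[j]))
    return candidate


if __name__ == "__main__":
    rng = random.Random(0)
    thresholds = [[0.2], [0.8], [0.5]]  # hypothetical one-dimensional solutions
    print(abc_candidate(thresholds, 0, rng))
```

In a full ABC loop the candidate would replace the current food source only if its fitness (for example, detection accuracy at that threshold) is better.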

Highlights

  • Knowledge Discovery in Databases (KDD) is the process of identifying valid, useful, and understandable patterns from large datasets [20]

  • The proposed approach uses the CORA dataset, which is commonly employed for evaluating duplicate record detection approaches

  • Cora Bibliographic: This dataset contains 864 entries, including 112 duplicates, taken from the RIDDLE repository


Summary

INTRODUCTION

Knowledge Discovery in Databases (KDD) is the process of identifying valid, useful, and understandable patterns from large datasets [20]. All records that have exactly or approximately the same data in one or more fields are identified as duplicates. A Genetic Algorithm (GA) combines different pieces of evidence to find a duplicate record detection function [11], which makes it possible to identify whether two entries in a repository are duplicates. Because duplicate detection is time-consuming, the aim is to recommend a method that finds a proper combination of the best pieces of evidence, yielding a function that maximizes performance for training purposes [27]. To find the optimal threshold and reduce human intervention, the proposed work uses a swarm intelligence algorithm, a modified ABC.
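The threshold-based decision described above can be illustrated with a minimal sketch. It assumes a simple character-level similarity from Python's standard difflib module, averaged over the fields of a record; the field names, the averaging rule, and the helper names are hypothetical and are not the detection function evolved by the GA in the proposed work.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(rec_a: dict, rec_b: dict, fields, threshold: float) -> bool:
    """Flag two records as duplicates when their mean field similarity
    reaches the threshold. Averaging is an illustrative choice."""
    scores = [field_similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores) >= threshold


if __name__ == "__main__":
    # Hypothetical bibliographic records in the spirit of a citation dataset.
    r1 = {"author": "J. Smith", "title": "Learning to detect duplicates"}
    r2 = {"author": "John Smith", "title": "Learning to detect duplicate records"}
    for t in (0.5, 0.7, 0.9):
        print(t, is_duplicate(r1, r2, ["author", "title"], t))
```

The outcome depends directly on the threshold passed in, which is why choosing it automatically rather than by hand is the focus of the threshold search.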

REVIEW OF RELATED WORK
Genetic Algorithm
Simple Genetic Algorithm procedure
Exploration and Exploitation
Optimal threshold using modified ABC
Dataset Description
Results
CONCLUSION
