Abstract

Introduction: The need for efficient search engines has grown with ever-increasing technological advancement and the rapidly expanding demand for data on the web. Method: Automating duplicate detection over query results involves identifying records from multiple web databases that refer to the same real-world entity and returning only the non-duplicate records to end-users. The algorithm proposed in this paper is based on an unsupervised approach that applies classifiers over heterogeneous web databases, returning more accurate results with high precision, recall, and F-measure. Several assessments were also carried out to analyze the efficacy of the proposed algorithm in identifying duplicates. Result: Results show that the proposed algorithm achieves higher precision and F-measure than standard UDD, with equal recall. Conclusion: This paper introduces an algorithm that automates duplicate detection for lexically heterogeneous web databases. Discussion: The proposed algorithm outperforms standard UDD.
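The core matching step described above, deciding whether two records from different web databases refer to the same real-world entity, can be illustrated with a minimal sketch. This is not the paper's algorithm (which uses iteratively trained classifiers in an unsupervised setting, as in UDD); the record values, field names, and the fixed similarity threshold here are illustrative assumptions only.

```python
import re

def tokens(s: str) -> set:
    # Lowercase and split on non-word characters so that minor
    # formatting differences between databases do not matter.
    return set(re.findall(r"\w+", s.lower()))

def jaccard(a: str, b: str) -> float:
    # Token-level Jaccard similarity between two field values.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def is_duplicate(rec1: dict, rec2: dict, threshold: float = 0.75) -> bool:
    # Average the per-field similarities over the shared fields and
    # flag the pair as duplicates if the score clears the threshold.
    fields = rec1.keys() & rec2.keys()
    score = sum(jaccard(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

# Hypothetical records from two web databases describing the same book.
r1 = {"title": "Data Mining: Concepts and Techniques", "author": "Jiawei Han"}
r2 = {"title": "Data mining concepts and techniques", "author": "Han, Jiawei"}
r3 = {"title": "Introduction to Algorithms", "author": "T. Cormen"}

print(is_duplicate(r1, r2))  # True: same entity, different formatting
print(is_duplicate(r1, r3))  # False: distinct entities
```

An unsupervised classifier as in UDD would, in effect, learn per-field weights and an adaptive decision boundary instead of the fixed uniform average and threshold used here.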
