Abstract

The rapid growth in data volumes and the need to integrate data from heterogeneous sources bring to the fore the challenge of efficiently detecting duplicate records in databases. Various techniques have been proposed in recent years to address the problem; however, each has one or more flaws that prevent it from being applied successfully. For this task, we offer an online machine learning method that incrementally learns a composite similarity function based on a linear combination of basic functions. The proposed work suggests an approach to improve the accuracy of duplicate record detection which, when combined with two other concepts, text similarity and edit distance, yields well-filtered data. This paper advocates the use of XQuery extension functions for XML data cleaning scenarios and details an implementation on top of an existing XQuery engine. Quantitative measures of common elements in medical records are considered in this article, and fuzzy logic is used to link them to linguistic notions. A Duplication Detection and Incompleteness Resolution (DDIR) approach is proposed to improve the quality of end users' data; Record Linkage and Weighted Component Similarity Summing (WCSS) are used to detect and remove the duplicate records. The fuzzy logic framework is a powerful tool for dealing with linkage issues, and the described multiple-valued logic method can be applied to similar problems in other databases. Finally, normalized URLs are tokenized and a pattern tree is constructed.
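The composite similarity function described above can be sketched as a weighted linear combination of basic functions, here normalized edit distance and token overlap. This is a minimal illustration only: the weights, the choice of basic functions, and the 0.5 duplicate threshold are assumptions for the example, not values from the paper (an online learner would adjust the weights from labeled record pairs).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def token_jaccard(a: str, b: str) -> float:
    """Basic text similarity: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def composite_similarity(a: str, b: str, weights=(0.6, 0.4)) -> float:
    """Linear combination of the basic similarity functions.

    The weights here are illustrative; the method in the abstract would
    learn them incrementally in an online fashion.
    """
    w_edit, w_token = weights
    return w_edit * edit_similarity(a, b) + w_token * token_jaccard(a, b)

# A record pair is flagged as a duplicate when the score exceeds a
# (hypothetical) threshold of 0.5.
score = composite_similarity("Jon Smith", "John Smith")
print(score, score > 0.5)
```

In a full pipeline, each basic function would compare one record field (name, address, and so on), and the weighted per-field scores would be summed, which is also the idea behind the WCSS approach mentioned in the abstract.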
