Abstract

Data reconciliation is the process of matching records across different databases. It requires “joining” on fields that have traditionally been non-key fields. Operational databases are generally of sufficient quality for the purposes for which they were originally designed, but because the data in the different databases lack a canonical structure and may contain errors, approximate matching algorithms are required. Such algorithms can have many different parameter settings, and the number of parameters affects the complexity of the algorithm through the number of comparisons needed to identify matching records across datasets. For the large datasets prevalent in data warehouses, this increased complexity may make solutions impractical. In this paper, we describe an efficient method for data reconciliation. Our main contribution is the incorporation of machine learning and statistical techniques to reduce the complexity of the matching algorithms by identifying and eliminating redundant or useless parameters. We have conducted experiments on actual data that demonstrate the validity of our techniques; the techniques reduced complexity by 50% while significantly increasing matching accuracy.
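To make the idea of approximate matching on non-key fields concrete, the following is a minimal illustrative sketch (not the method proposed in the paper): each field comparison acts as one "parameter" of the matcher, and dropping a field found to be redundant reduces the work done per record pair. The field names, sample records, and the 0.8 threshold are hypothetical.

```python
# Illustrative sketch only -- not the algorithm described in the paper.
# Approximate matching on non-key fields: each field comparison is one
# matcher "parameter"; removing an uninformative field cuts per-pair cost.
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Normalized similarity between two field values, in [0.0, 1.0]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_score(rec_a: dict, rec_b: dict, fields: list[str]) -> float:
    """Average field similarity over the selected comparison fields."""
    return sum(field_similarity(str(rec_a[f]), str(rec_b[f])) for f in fields) / len(fields)


# Hypothetical records from two operational databases with no shared key.
db_a = [{"name": "Jon Smith", "street": "12 Oak Ave", "city": "Springfield"}]
db_b = [{"name": "John Smyth", "street": "12 Oak Avenue", "city": "Springfield"}]

# Full field set versus a reduced set, e.g. after "city" is judged to add
# little discriminating power for this pair of databases.
all_fields = ["name", "street", "city"]
reduced_fields = ["name", "street"]

for a in db_a:
    for b in db_b:
        if record_score(a, b, reduced_fields) >= 0.8:  # matching threshold (assumed)
            print("candidate match:", a["name"], "~", b["name"])
```

In this toy setup the reduced field set still flags "Jon Smith" and "John Smyth" as a candidate match while performing one fewer string comparison per pair; the paper's contribution is a principled way of deciding which such parameters can be eliminated.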
