Abstract

Data reconciliation is the process of matching records across different databases. It requires “joining” on fields that have traditionally been non-key fields. Operational databases are generally of sufficient quality for the purposes for which they were originally designed, but because the data in the different databases lack a canonical structure and may contain errors, approximate matching algorithms are required. Such algorithms can have many different parameter settings, and the number of parameters affects the complexity of the algorithm through the number of comparisons needed to identify matching records across datasets. For the large datasets prevalent in data warehouses, this increased complexity may make solutions impractical. In this paper, we describe an efficient method for data reconciliation. Our main contribution is the incorporation of machine learning and statistical techniques to reduce the complexity of the matching algorithms by identifying and eliminating redundant or useless parameters. We have conducted experiments on actual data that demonstrate the validity of our techniques; the techniques reduced complexity by 50% while significantly increasing matching accuracy.
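To make the idea of approximate matching on non-key fields concrete, the following is a minimal illustrative sketch (not the method proposed in the paper): each field comparison acts as one "parameter" of the matcher, and dropping a field found to be redundant reduces the work done per record pair. The field names, sample records, and the 0.8 threshold are hypothetical.

```python
# Illustrative sketch only -- not the algorithm described in the paper.
# Approximate matching on non-key fields: each field comparison is one
# matcher "parameter"; removing an uninformative field cuts per-pair cost.
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Normalized similarity between two field values, in [0.0, 1.0]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_score(rec_a: dict, rec_b: dict, fields: list[str]) -> float:
    """Average field similarity over the selected comparison fields."""
    return sum(field_similarity(str(rec_a[f]), str(rec_b[f])) for f in fields) / len(fields)


# Hypothetical records from two operational databases with no shared key.
db_a = [{"name": "Jon Smith", "street": "12 Oak Ave", "city": "Springfield"}]
db_b = [{"name": "John Smyth", "street": "12 Oak Avenue", "city": "Springfield"}]

# Full field set versus a reduced set, e.g. after "city" is judged to add
# little discriminating power for this pair of databases.
all_fields = ["name", "street", "city"]
reduced_fields = ["name", "street"]

for a in db_a:
    for b in db_b:
        if record_score(a, b, reduced_fields) >= 0.8:  # matching threshold (assumed)
            print("candidate match:", a["name"], "~", b["name"])
```

In this toy setup the reduced field set still flags "Jon Smith" and "John Smyth" as a candidate match while performing one fewer string comparison per pair; the paper's contribution is a principled way of deciding which such parameters can be eliminated.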
