Abstract

The data preparation and cleansing tasks necessary to ensure high quality data are among the most difficult challenges faced in data warehousing and data mining projects. The extraction of source data, transformation into new forms, and loading into a data warehouse environment are all time consuming tasks that can be supported by methodologies and tools. This paper focuses on the problem of record linkage or entity matching, tasks that can be very important in providing high quality data. Merging two or more large databases into a single integrated system is a difficult problem in many industries, especially in the wake of acquisitions. For example, managing customer lists can be challenging when duplicate entries, data entry problems, and changing information conspire to make data quality an elusive target. Common tasks with regard to customer lists include customer matching to reduce duplicate entries and household matching to group customers. These often O(n 2 ) problems can consume significant resources, both in computing infrastructure and human oversight, and the goal of high accuracy in the final integrated database can be difficult to assure. This paper distinguishes between attribute corruption and entity corruption, discussing the various impacts on quality. A metajoin operator is proposed and used to organize past and current entity matching techniques. Finally, a logistic regression approach to implementing the metajoin operator is discussed and illustrated with an example. The metajoin can be used to determine whether two records match, don't match, or require further evaluation by human experts. Properly implemented, the metajoin operator could allow the integration of individual databases with greater accuracy and lower cost.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.