A machine learning approach to create blocking criteria for record linkage.

Phan H Giang

doi:10.1007/s10729-014-9276-0

Abstract

Record linkage, a part of data cleaning, is recognized as one of most expensive steps in data warehousing. Most record linkage (RL) systems employ a strategy of using blocking filters to reduce the number of pairs to be matched. A blocking filter consists of a number of blocking criteria. Until recently, blocking criteria are selected manually by domain experts. This paper proposes a new method to automatically learn efficient blocking criteria for record linkage. Our method addresses the lack of sufficient labeled data for training. Unlike previous works, we do not consider a blocking filter in isolation but in the context of an accompanying matcher which is employed after the blocking filter. We show that given such a matcher, the labels (assigned to record pairs) that are relevant for learning are the labels assigned by the matcher (link/nonlink), not the labels assigned objectively (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a Disjunctive Normal Form (DNF) learning problem and use the Probably Approximately Correct (PAC) learning theory to guide the development of algorithm to search for blocking filters. We test the algorithm on a real patient master file of 2.18 million records. The experimental results show that compared with filters obtained by educated guess, the optimal learned filters have comparable recall but reduce throughput (runtime) by an order-of-magnitude factor.

Full Text