Entity Resolution: Overview and Challenges

Hector Garcia-Molina

doi:10.1007/978-3-540-30464-7_1

Abstract

AbstractEntity resolution is a problem that arises in many information integration scenarios: We have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us what records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may have differing information, e.g., one record may have the address misspelled, another record may be missing some fields. An entity resolution algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity), and merges the matching records as best it can. Entity resolution algorithms typically rely on user-defined functions that (a) compare fields or records to determine if they match (are likely to represent the same real world entity), and (b) merge matching records into one, and in the process perhaps combine fields (e.g., creating a new name based on two slightly different versions of the name).In this talk I will give an overview of the Stanford SERF Project, that is building a framework to describe and evaluate entity resolution schemes. In particular, I will give an overview of some of the different entity resolution settings: De-duplication versus fidelity enhancement. In the de-duplication problem, we have a single set of records, and we try to merge the ones representing the same real world entity. In the fidelity enhancement problem, we have two sets of records: a base set of records of interest, and a new set of acquired information. The goal is to coalesce the new information into the base records. Clustering versus snapping. With snapping, we examine records pair-wise and decide if they represent the same entity. If they do, we merge the records into one, and continue the process of pair-wise comparisons. With clustering, we analyze all records and partition them into groups we believe represent the same real world entity. At the end, each partition is merged into one record. Confidences. In some entity resolution scenarios we must manage confidences. For example, input records may have a confidence value representing how likely it is they are true. Snap rules (that tells us when two records match) may also have confidences representing how likely it is that two records actually represent the same real world entity. As we merge records, we must track their confidences. Schema Mismatches. In some entity resolution scenarios we must deal, not just with resolving information on entities, but also with resolving discrepancies among the schemas of the different sources. For example, the attribute names and formats from one source may not match those of other sources.

Full Text