Abstract

Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F 1 , cluster F 1 ) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an analysis on existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or GMD ) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined GMD measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable GMD measure, and the widely used pairwise F 1 measure can be directly computed using GMD . We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call