Abstract

Entity Resolution (ER) is a data quality challenge that arises when references in data are ambiguous; the task is to identify all references that co-refer, that is, refer to the same real-world entity. Due to the practical significance of the ER problem, many ER techniques have been proposed, including those that analyze the relationships that exist among entities in the data. Such approaches view the database as an entity-relationship graph, in which direct and indirect relationships correspond to paths. They rely on a connection strength (CS) model to measure how strongly nodes in the graph are connected. While such approaches have demonstrated a significant advantage over traditional ER techniques, they currently have a significant limitation: the CS models they use are fixed, intuition-based models that behave well in general but are generic and not tuned to a specific domain, which leads to suboptimal result quality. In this article we therefore propose an approach that employs supervised learning to adapt the connection strength measure to a given domain using available past (training) data. The adaptive approach has several advantages: it increases both the quality and the efficiency of ER, and it minimizes the domain analyst's involvement needed to tune the CS model to the domain. An extensive empirical evaluation demonstrates that the proposed approach reaches up to 8% higher accuracy than graph-based ER methods that use fixed, intuition-based CS models.
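
To make the idea of a path-based connection strength concrete, the following is a minimal sketch and not the model proposed in the article. It assumes a toy entity-relationship graph, a hypothetical function `connection_strength`, and one weight per edge type; the strength between two nodes is taken to be the sum, over simple paths of bounded length, of the product of the edge-type weights along each path. In an adaptive setting such weights would presumably be fitted from training data, whereas here they are simply given by hand.

```python
from typing import Dict, List, Tuple

# Hypothetical toy representation: node -> list of (neighbor, edge_type).
Graph = Dict[str, List[Tuple[str, str]]]


def connection_strength(
    graph: Graph,
    source: str,
    target: str,
    type_weights: Dict[str, float],
    max_len: int = 3,
) -> float:
    """Sum over simple paths (at most max_len edges) of the product of
    per-edge-type weights. An illustrative CS measure, assumed for this sketch."""
    total = 0.0
    # Each stack entry: (current node, path weight so far, visited nodes, edges used).
    stack = [(source, 1.0, {source}, 0)]
    while stack:
        node, weight, visited, depth = stack.pop()
        if node == target and depth > 0:
            total += weight  # a complete path contributes its weight
            continue
        if depth == max_len:
            continue  # do not extend paths beyond the length bound
        for neighbor, edge_type in graph.get(node, []):
            if neighbor not in visited:  # keep paths simple (no repeated nodes)
                stack.append(
                    (neighbor, weight * type_weights[edge_type],
                     visited | {neighbor}, depth + 1)
                )
    return total


# Made-up example data: an ambiguous reference "ref" and two candidate authors.
graph: Graph = {
    "ref": [("paper1", "writes")],
    "paper1": [("ref", "writes"), ("alice", "writes"), ("venueX", "published_in")],
    "alice": [("paper1", "writes")],
    "venueX": [("paper1", "published_in"), ("paper2", "published_in")],
    "paper2": [("venueX", "published_in"), ("bob", "writes")],
    "bob": [("paper2", "writes")],
}
weights = {"writes": 0.5, "published_in": 0.1}  # hand-picked, not learned

cs_alice = connection_strength(graph, "ref", "alice", weights)
cs_bob = connection_strength(graph, "ref", "bob", weights, max_len=4)
```

In this toy graph the reference is connected to `alice` through a shared paper and to `bob` only through a longer path via a shared venue, so `cs_alice` comes out larger than `cs_bob`. Changing the entries of `weights` changes that ranking, which is the kind of domain-specific tuning the abstract argues should be learned rather than fixed by intuition.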
