Quality-aware similarity assessment for entity matching in Web data

Surender Reddy Yerva,Zoltán Miklós,Karl Aberer

doi:10.1016/j.is.2011.09.007

Surender Reddy Yerva, Zoltán Miklós + Show 1 more

Open Access

PDF Available

https://doi.org/10.1016/j.is.2011.09.007

Copy DOI

Export

Save

Cite

Abstract
Full-Text PDF
Similar Papers

Abstract

Listen

One of the key challenges to realize automated processing of the information on the Web, which is the central goal of the Semantic Web, is related to the entity matching problem. There are a number of tools that reliably recognize named entities, such as persons, companies, geographic locations, in Web documents. The names of these extracted entities are, however, non-unique; the same name on different Web pages might or might not refer to the same entity. The entity matching problem concerns of identifying the entities, which are referring to the same real-world entity. This problem is very similar to the entity resolution problem studied in relational databases, however, there are also several differences. Most importantly Web pages often only contain partial or incomplete information about the entities. Similarity functions try to capture the degree of belief about the equivalence of two entities, thus they play a crucial role in entity matching. The accuracy of the similarity functions highly depends on the applied assessment techniques, but also on some specific features of the entities. We propose systematic design strategies for combined similarity functions in this context. Our method relies on the combination of multiple evidences, with the help of estimated quality of the individual similarity values and with particular attention to missing information that is common in Web context. We study the effectiveness of our method in two specific instances of the general entity matching problem, namely the person name disambiguation and the Twitter message classification problem. In both cases, using our techniques in a very simple algorithmic framework we obtained better results than the state-of-the-art methods.

Full Text