Abstract

Duplicate entities can seriously degrade data quality. Despite remarkable recent progress, existing methods still produce a large number of false positives (i.e., entities determined to be duplicates when they are not), which impair accuracy. To address this challenge, we propose a novel node resistance-based probability model in which we view a given data set as a graph of entities linked to each other via relationships, and compute a probability value between two entities to measure how similar they are. Specifically, each node in the graph has its own resistance value, defined as $1 - \mathrm{confidence}$ (normalized to $[0, 1]$), and the quantity $\mathrm{resistance} \cdot \mathrm{probability}$ is filtered out at each node while the probability value is computed. To evaluate the proposed model, we conducted extensive experiments on several data sets, including ACM ( https://dl.acm.org ), DBLP ( https://dblp.uni-trier.de ), and IMDB ( https://imdb.com ). Our experimental results show that the proposed probability model outperforms the existing probability model, improving average F1 scores by up to 14% and never worsening them.

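The abstract only sketches the model, so the following is a minimal, hypothetical Python illustration of one plausible reading of the resistance-based filtering idea: probability mass flows between entities along relationship edges, and at each node a fraction equal to its resistance ($1 - \mathrm{confidence}$) is filtered out. All names (`EntityGraph`, `similarity_probability`, `max_hops`) and the specific propagation rule are assumptions, not the paper's actual algorithm.

```python
from collections import defaultdict

class EntityGraph:
    """A toy entity graph; nodes carry a confidence in [0, 1]."""

    def __init__(self):
        self.edges = defaultdict(list)   # node -> list of (neighbor, edge weight)
        self.confidence = {}             # node -> confidence in [0, 1]

    def add_node(self, node, confidence):
        self.confidence[node] = confidence

    def add_edge(self, u, v, weight=1.0):
        self.edges[u].append((v, weight))
        self.edges[v].append((u, weight))

    def resistance(self, node):
        # Per the abstract, each node's resistance is 1 - confidence.
        return 1.0 - self.confidence[node]

def similarity_probability(graph, source, target, max_hops=3):
    """Estimate how similar two entities are by propagating probability
    mass from `source` toward `target`, filtering out
    resistance * probability at every node it passes through."""
    frontier = {source: 1.0}  # node -> probability mass currently at it
    reached = 0.0
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, p in frontier.items():
            # Filter out resistance * p at this node; the rest passes on.
            p_pass = p * (1.0 - graph.resistance(node))
            total_w = sum(w for _, w in graph.edges[node]) or 1.0
            for neighbor, w in graph.edges[node]:
                share = p_pass * w / total_w
                if neighbor == target:
                    reached += share
                else:
                    next_frontier[neighbor] += share
        frontier = next_frontier
    return reached

# Usage: two paper entities linked through a shared author node.
g = EntityGraph()
g.add_node("paper_a", confidence=0.9)
g.add_node("author_x", confidence=0.8)
g.add_node("paper_b", confidence=0.9)
g.add_edge("paper_a", "author_x")
g.add_edge("author_x", "paper_b")
print(similarity_probability(g, "paper_a", "paper_b"))
```

Under this reading, low-confidence nodes attenuate any probability routed through them, which is what suppresses false positives: a match supported only by unreliable intermediate entities ends up with a low similarity value.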