Abstract

Duplicate entities can seriously degrade data quality. Despite remarkable recent achievements, existing methods still produce a large number of false positives (i.e., entities determined to be duplicates when they are not), which impair accuracy. To address this challenge, we propose a novel node resistance-based probability model in which we view a given data set as a graph of entities linked to each other via relationships, and compute a probability value between two entities to measure how similar they are. In particular, each node in the graph has its own resistance value, equal to $$1 - \text{confidence}$$ (normalized to $$[0, 1]$$), and the quantity $$\text{resistance} \cdot \text{probability}$$ is filtered out at each node while the probability value is computed. To evaluate the proposed model, we performed extensive experiments on several data sets, including ACM (https://dl.acm.org), DBLP (https://dblp.uni-trier.de), and IMDB (https://imdb.com). Our experimental results show that the proposed probability model outperforms the existing probability model, improving average F1 scores by up to 14% and never worsening them.
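
To make the mechanism concrete, the following is a minimal sketch of one way such a model could be computed. It is an illustrative reading under stated assumptions, not the paper's implementation: it assumes the entity graph is given as adjacency lists, that each node carries a confidence score in [0, 1] with resistance = 1 - confidence, and that the similarity between two entities is the best path probability, where each visited node filters out resistance · probability from the value passing through it. The function name `node_resistance_similarity` and the Dijkstra-style best-first search are hypothetical choices for the sketch.

```python
from heapq import heappop, heappush

def node_resistance_similarity(graph, confidence, source, target):
    """Best path probability between two entities (illustrative sketch).

    graph:      {node: [neighbor, ...]} adjacency lists of the entity graph
    confidence: {node: float in [0, 1]}; resistance(node) = 1 - confidence(node)
    Returns the highest probability over all paths from source to target,
    where each visited node filters out resistance * probability.
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]          # max-heap via negated probabilities
    while heap:
        neg_p, node = heappop(heap)
        p = -neg_p
        if p < best.get(node, 0.0):  # stale heap entry, already improved
            continue
        if node == target:
            return p
        for nbr in graph.get(node, []):
            resistance = 1.0 - confidence[nbr]
            # Filter out resistance * probability at this node:
            # p_next = p - resistance * p = p * confidence[nbr]
            p_next = p - resistance * p
            if p_next > best.get(nbr, 0.0):
                best[nbr] = p_next
                heappush(heap, (-p_next, nbr))
    return 0.0                        # target unreachable

# Toy usage: e1 -- e2 -- e3 with per-node confidences.
graph = {"e1": ["e2"], "e2": ["e1", "e3"], "e3": ["e2"]}
confidence = {"e1": 0.9, "e2": 0.8, "e3": 0.95}
print(node_resistance_similarity(graph, confidence, "e1", "e3"))  # ≈ 0.76 (0.8 * 0.95)
```

Because each node only attenuates the value passing through it (multiplying by a confidence no greater than 1), probabilities are non-increasing along any path, so the best-first search is guaranteed to return the maximum-probability path.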
