Crowdsourcing algorithms for entity resolution

Norases Vesdapunt, Kedar Bellare, Nilesh Dalvi

doi:10.14778/2732977.2732982

Abstract

In this paper, we study a hybrid human-machine approach for solving the problem of Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity, and are therefore duplicates of each other. Our input is a graph over all the records in a database, where each edge has a probability denoting our prior belief (based on machine learning models) that the pair of records represented by the given edge are duplicates. Our objective is to resolve all the duplicates by asking humans to verify the equality of a subset of edges, leveraging the transitivity of the equality relation to infer the remaining edges (e.g., a = c can be inferred given a = b and b = c). We consider the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked. Using our theoretical framework, we analyze several strategies, and show that a strategy claimed as "optimal" for this problem in a recent work can perform arbitrarily badly in theory. We propose alternate strategies with theoretical guarantees. Using both public datasets and the production system at Facebook, we show that our techniques are effective in practice.
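The role of transitivity can be illustrated with a minimal Python sketch. It is not one of the strategies analyzed in the paper: the function name `resolve`, the `ask` callback standing in for a human answer, and the choice to ask about pairs in order of decreasing prior probability are all assumptions made for illustration. The sketch only shows how a positive answer merges clusters, how a negative answer separates them, and how both kinds of answer let later pairs be inferred without a question.

```python
from itertools import combinations


def resolve(records, prob, ask):
    """Cluster records, calling ask(a, b) only when the answer cannot be inferred.

    records: list of hashable record ids
    prob:    dict mapping frozenset({a, b}) to a prior match probability
    ask:     hypothetical oracle returning True if a and b are the same entity
    """
    parent = {r: r for r in records}   # union-find forest over records
    non_matches = set()                # pairs of cluster roots known to differ

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Illustrative ordering (an assumption): most likely duplicates first.
    pairs = sorted(combinations(records, 2),
                   key=lambda p: prob.get(frozenset(p), 0.0),
                   reverse=True)

    questions = 0
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # inferred equal: a = b follows from earlier "yes" answers
        if frozenset((ra, rb)) in non_matches:
            continue  # inferred different: the two clusters were already separated
        questions += 1
        if ask(a, b):
            parent[rb] = ra
            # Re-key the negative facts that mentioned the absorbed root.
            non_matches = {frozenset(ra if r == rb else r for r in pair)
                           for pair in non_matches}
        else:
            non_matches.add(frozenset((ra, rb)))

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return sorted(clusters.values()), questions


# Toy data (made up): a, b, c are one entity; d is another.
truth = {"a": 1, "b": 1, "c": 1, "d": 2}
prior = {frozenset("ab"): 0.9, frozenset("bc"): 0.8, frozenset("ac"): 0.7,
         frozenset("cd"): 0.4, frozenset("ad"): 0.2, frozenset("bd"): 0.1}
clusters, asked = resolve(list(truth), prior, lambda x, y: truth[x] == truth[y])
print(clusters, asked)  # [['a', 'b', 'c'], ['d']] 3
```

In this toy run only three of the six candidate pairs are put to the oracle: a = c is inferred from a = b and b = c, and once c and d are found to differ, the pairs (a, d) and (b, d) are resolved without further questions.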

Journal: Proceedings of the VLDB Endowment
Publication Date: Aug 1, 2014
Citations: 177