Entity Matching on Unstructured Data: An Active Learning Approach

Ursin Brunner, Kurt Stockinger

doi:10.1109/sds.2019.00006

Abstract

With the growing number of data sources in enterprises, entity matching becomes a crucial part of every data integration project. To reduce the human effort involved in identifying matching entities between different database tables, machine learning algorithms are typically applied. Moreover, active learning is often combined with supervised machine learning methods to further reduce the effort of labeling entities as true or false matches. However, while state-of-the-art active learning algorithms have proven to work well on structured data sets, unstructured data still poses a challenge in entity matching. This paper proposes an end-to-end entity matching pipeline that minimizes the human labeling effort for entity matching on unstructured data sets. We use several natural language processing techniques, such as soft tf-idf, to pre-process the record pairs before we classify them using a novel Active Learning with Uncertainty Sampling (ALWUS) algorithm. We designed our algorithm as a plugin system that works with any state-of-the-art classifier, such as support vector machines, random forests, or deep neural networks. Detailed experimental results demonstrate that our end-to-end entity matching pipeline clearly outperforms comparable entity matching approaches on an unstructured real-world data set. Our approach achieves significantly better F1-scores while requiring one to two orders of magnitude less human labeling effort than existing state-of-the-art algorithms.
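
The abstract does not spell out the ALWUS algorithm, so the following is only a minimal sketch of generic pool-based active learning with uncertainty sampling, not the authors' implementation. It assumes each record pair has already been reduced to a vector of similarity scores (for example by soft tf-idf, which is not implemented here). The names `TinyLogisticRegression`, `active_learn`, `pool`, and `truth` are hypothetical, the data is synthetic, and the small logistic regression merely stands in for any pluggable classifier that exposes `fit` and `predict_proba`.

```python
import math
import random


class TinyLogisticRegression:
    """Stand-in classifier; any model with fit/predict_proba could be swapped in."""

    def __init__(self, lr=0.5, epochs=200):
        self.lr, self.epochs = lr, epochs
        self.w, self.b = [], 0.0

    def predict_proba(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                err = self.predict_proba(xi) - yi  # gradient of the log loss
                self.w = [w - self.lr * err * v for w, v in zip(self.w, xi)]
                self.b -= self.lr * err
        return self


def active_learn(pool, oracle, model, seed_idx, budget):
    """Ask the oracle for the label of the most uncertain pair, `budget` times."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(budget):
        idx = sorted(labeled)
        model.fit([pool[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: pick the pair whose match probability is closest to 0.5.
        query = min(unlabeled, key=lambda i: abs(model.predict_proba(pool[i]) - 0.5))
        labeled[query] = oracle(query)  # one simulated human label
    idx = sorted(labeled)
    model.fit([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


# Synthetic pool (assumption): each record pair is two similarity scores in [0, 1].
rng = random.Random(0)
pool = [[rng.random(), rng.random()] for _ in range(200)]
truth = [1 if 0.6 * a + 0.4 * b > 0.5 else 0 for a, b in pool]

# Seed the loop with the highest- and lowest-similarity pairs.
seeds = [max(range(200), key=lambda i: sum(pool[i])),
         min(range(200), key=lambda i: sum(pool[i]))]

model, labeled = active_learn(pool, truth.__getitem__, TinyLogisticRegression(),
                              seeds, budget=20)
accuracy = sum((model.predict_proba(x) > 0.5) == bool(t)
               for x, t in zip(pool, truth)) / len(pool)
print(f"labels used: {len(labeled)}, accuracy on pool: {accuracy:.2f}")
```

Because the loop only calls `fit` and `predict_proba`, the classifier can be exchanged without touching the sampling logic, which mirrors the plugin idea described in the abstract. In a real setting, the `oracle` callable would be a human annotator rather than a lookup into known labels.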


Publication Date: Jun 1, 2019
Citations: 24
License type: other-oa

