Analytics on Non-Normalized Data Sources: More Learning, Rather Than More Cleaning

Alexis Cvetkov-Iliev, Gaël Varoquaux, Alexandre Allauzen

doi:10.1109/access.2022.3168013

Open Access

Abstract

Data analysis is increasingly performed over data assembled from uncontrolled sources, which follow inconsistent knowledge-representation conventions. The typical practice is to create “clean” data for analysis, matching entities and merging variants to overcome differences in knowledge representation. Despite progress in data management techniques to automate this process, it still needs labor-intensive supervision from the analyst. In this paper, we evaluate the benefit of advanced statistical tools that address many analytic tasks directly across data sources, without such entity-matching cleaning. Reframing analytical questions as machine-learning tasks makes it possible to replace exact matching of entities with continuous descriptions (vectorial embeddings) that expose similarities between entries. But are analyses with less cleaning trustworthy? We answer this question with a thorough benchmark on questions typical of socio-economic studies across 14 employee databases: we compare the approaches based on machine learning to manual data cleaning (entity matching). The benchmark reveals that using embeddings and machine learning improves the validity of results (smaller estimation error) more than manual cleaning, with considerably less human labor. While machine learning is often combined with data management for the purpose of cleaning, our study suggests that using it directly for analysis is beneficial because it captures ambiguities that are hard to represent during curation.
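
As an illustration of the idea in the abstract, the sketch below estimates a numeric quantity for a job title from records whose titles are spelled inconsistently, without matching entities first. It is not the authors' method: the character n-gram embedding, the similarity-weighted average, the function names, and the salary records are all assumptions made up for this example.

```python
from collections import Counter
from math import sqrt


def ngram_embedding(text, n=3):
    """Map a string to a sparse vector of character n-gram counts."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def estimate(query, records, n=3):
    """Similarity-weighted average of the target value over all records.

    Every record contributes in proportion to how close its title is to
    the query, so spelling variants are pooled without exact matching.
    """
    q = ngram_embedding(query, n)
    weights = [cosine(q, ngram_embedding(title, n)) for title, _ in records]
    total = sum(weights)
    if total == 0:
        raise ValueError("query shares no n-gram with any record")
    return sum(w * y for w, (_, y) in zip(weights, records)) / total


# Hypothetical records: the same job written differently across sources.
records = [
    ("Police Officer", 52000),
    ("POLICE OFFICER II", 58000),
    ("police offcr", 50000),
    ("Librarian", 41000),
    ("Library Assistant", 35000),
]

print(estimate("Police Officer", records))
print(estimate("Librarian", records))
```

Exact matching on "Police Officer" would use a single record here; the similarity weights also draw on the two variant spellings, while the unrelated library titles receive zero weight because they share no trigram with the query.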

Journal: IEEE Access	Publication Date: Jan 1, 2022
Citations: 2	License type: CC BY 4.0
