Abstract

Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study on how exactly cleaning affects ML — the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations, and we put forward multiple research directions for future work.
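The Benjamini-Yekutieli procedure mentioned above can be sketched as follows. This is a minimal illustrative implementation, not CleanML's actual code; the function name and NumPy usage are assumptions for the sake of the example:

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the
    Benjamini-Yekutieli procedure, which controls the false
    discovery rate at level alpha under arbitrary dependence."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Harmonic correction c(m) = sum_{i=1..m} 1/i is what makes BY
    # valid even when the tests are arbitrarily dependent.
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    thresholds = np.arange(1, m + 1) * alpha / (m * c_m)
    below = sorted_p <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject all hypotheses up to the largest i with
        # p_(i) <= i * alpha / (m * c(m)).
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

For example, with p-values `[0.001, 0.2, 0.8]` and `alpha=0.05`, only the first hypothesis is rejected; the harmonic correction makes BY more conservative than the plain Benjamini-Hochberg procedure.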

Highlights

  • The quality of machine learning (ML) applications is only as good as the quality of the data they are trained on, and data cleaning has been a cornerstone of building high-quality ML models

  • Upon examining the detected duplicates, we find that duplicate detection algorithms may produce many false positives, where non-duplicate records are incorrectly identified as duplicates

  • Upon examining the cleaning results, we find this is because ZeroER is more aggressive and produces more false positives than key collision detection on these datasets. (Q4.2): Q4.2 is not applicable to duplicates because there is only one repair method. (Q5): We only show the result of Q5 issued against R1, as the results on R2 and R3 reveal similar findings
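The key collision detection referenced above groups records whose normalized keys coincide. Here is a minimal sketch of that idea, assuming simple string records; the `normalize` key and function names are hypothetical and not taken from the study:

```python
from collections import defaultdict

def normalize(name):
    # Simple normalization key: lowercase and keep only
    # alphanumeric characters, so minor formatting differences collide.
    return "".join(ch for ch in name.lower() if ch.isalnum())

def key_collision_duplicates(records, key_func):
    """Bucket records by normalized key; every bucket with more than
    one record is flagged as a candidate duplicate cluster."""
    buckets = defaultdict(list)
    for i, rec in enumerate(records):
        buckets[key_func(rec)].append(i)
    return [idxs for idxs in buckets.values() if len(idxs) > 1]
```

Because the key is an exact-match signature, this approach is conservative: it misses duplicates whose differences survive normalization, whereas a learned matcher such as ZeroER can pair more records at the cost of more false positives.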


Introduction

The quality of machine learning (ML) applications is only as good as the quality of the data they are trained on, and data cleaning has been a cornerstone of building high-quality ML models. Not surprisingly, both the ML and database (DB) communities have been working on problems associated with dirty data. Our goal is to (1) conduct the first systematic empirical study of the impact of data cleaning on downstream ML classification models, across different error types, cleaning methods, and ML models; and (2) given our empirical findings, provide a starting point for future research to advance the field of cleaning for ML.
