Abstract

Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can degrade a learner's classification performance. In this paper, we detect label noise with three unsupervised learners, namely principal component analysis (PCA), independent component analysis (ICA), and autoencoders. We evaluate these three learners on a credit card fraud dataset using multiple noise levels, and then compare results to the traditional Tomek links noise filter. Our binary classification approach, which treats label noise instances as anomalies, uniquely uses reconstruction errors on noisy data to identify and filter label noise. For detecting noisy instances, we found that the autoencoder was the top performer (highest recall score of 0.90), while Tomek links performed the worst (highest recall score of 0.62).

Highlights

  • Classification involves predicting the class of a new sample by using a model derived from training data

  • Tomek links are excluded for these figures because this algorithm does not rely on reconstruction error calculations for label noise detection

  • In this paper, we propose a novel and effective method to deal with the label noise problem

Introduction

Classification involves predicting the class of a new sample by using a model derived from training data. Each sample (known as an instance) is associated with an observed label. Models trained on datasets with high levels of label noise will not generalize well to new data [2, 3]. The subspace method works by dividing the principal axes into two sets representing normal and anomalous data variations. Any data instance y, represented by a row in the dataset, can be decomposed as y = ŷ + ỹ, where ŷ is its projection onto the normal subspace and ỹ is its projection onto the anomalous subspace. To determine the magnitude of the projection of each instance onto the anomalous subspace, we first arrange the principal components spanning the normal subspace as the columns of a matrix P of size m × r.
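The decomposition above can be sketched in a few lines of numpy. This is a minimal illustration of the subspace method, not the authors' implementation: the toy data, the choice of r, and the use of the squared residual norm as the reconstruction error are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n instances in m dimensions, with most variance confined
# to a 2-dimensional "normal" subspace plus small isotropic noise.
n, m, r = 500, 5, 2
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, m)) \
    + 0.05 * rng.normal(size=(n, m))
X = X - X.mean(axis=0)  # center the data before PCA

# Columns of P (m x r) are the top-r principal components,
# spanning the normal subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:r].T

C = P @ P.T                    # projector onto the normal subspace
X_hat = X @ C                  # normal components (y-hat, one per row)
X_tilde = X @ (np.eye(m) - C)  # anomalous components (y-tilde)

# y = y_hat + y_tilde holds exactly for every instance.
assert np.allclose(X_hat + X_tilde, X)

# The squared norm of the anomalous component serves as a
# reconstruction error for ranking instances as potential anomalies.
errors = np.sum(X_tilde**2, axis=1)
```

Instances with the largest `errors` values project most strongly onto the anomalous subspace, which is the signal the reconstruction-error approach uses to flag candidate label noise.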

