Abstract

Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can degrade a learner's classification performance. In this paper, we detect label noise with three unsupervised learners, namely principal component analysis (PCA), independent component analysis (ICA), and autoencoders. We evaluate these three learners on a credit card fraud dataset using multiple noise levels, and then compare results to the traditional Tomek links noise filter. Our binary classification approach, which treats label noise instances as anomalies, uniquely uses reconstruction errors on noisy data to identify and filter label noise. For detecting noisy instances, we found that the autoencoder was the top performer (highest recall score of 0.90), while Tomek links performed the worst (highest recall score of 0.62).

Highlights

  • Classification involves predicting the class of a new sample by using a model derived from training data

  • Tomek links are excluded for these figures because this algorithm does not rely on reconstruction error calculations for label noise detection

  • In this paper, we propose a novel and effective method to deal with the label noise problem

Introduction

Classification involves predicting the class of a new sample by using a model derived from training data. Each sample (known as an instance) is associated with an observed label. Models trained on datasets with high levels of label noise will not generalize well to new data [2, 3]. The subspace method works by dividing the principal axes into two sets representing normal and anomalous data variations. Any data instance y, represented by a row in the dataset, can be decomposed as y = ŷ + ỹ, where ŷ is its projection onto the normal subspace and ỹ is its projection onto the anomalous subspace. To determine the magnitude of the projection of each instance onto the anomalous subspace, we first arrange the principal components spanning the normal subspace as the columns of a matrix P of size m × r.
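The decomposition above can be sketched in a few lines of numpy. This is a minimal illustration of the subspace method, not the authors' implementation: the toy data, the choice of r, and the use of the squared residual norm as the reconstruction error are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n instances in m dimensions, with most variance confined
# to a 2-dimensional "normal" subspace plus small isotropic noise.
n, m, r = 500, 5, 2
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, m)) \
    + 0.05 * rng.normal(size=(n, m))
X = X - X.mean(axis=0)  # center the data before PCA

# Columns of P (m x r) are the top-r principal components,
# spanning the normal subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:r].T

C = P @ P.T                    # projector onto the normal subspace
X_hat = X @ C                  # normal components (y-hat, one per row)
X_tilde = X @ (np.eye(m) - C)  # anomalous components (y-tilde)

# y = y_hat + y_tilde holds exactly for every instance.
assert np.allclose(X_hat + X_tilde, X)

# The squared norm of the anomalous component serves as a
# reconstruction error for ranking instances as potential anomalies.
errors = np.sum(X_tilde**2, axis=1)
```

Instances with the largest `errors` values project most strongly onto the anomalous subspace, which is the signal the reconstruction-error approach uses to flag candidate label noise.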

