Abstract

Because data gathering and preparation procedures are typically imprecise, erroneous and unnecessary information is a common issue in real-world applications. In this context, feature selection methods are used to reduce the harmful impact of such information on data analysis by removing irrelevant features from datasets. This research presents a novel feature selection method for unsupervised learning, where the difficulty arises from the fact that class labels cannot be used to select the most discriminative features, as is traditionally done in supervised learning. The proposed technique, called Kolmogorov-Smirnov test-based Unsupervised Feature Selection (KSUFS), computes estimated feature distributions and then compares them to the original ones using non-parametric statistical tests in order to identify the most representative input variables. Two versions of KSUFS are presented in this study: one is designed for standard data, where the accuracy of the method takes precedence over its other aspects; the other is designed for big data problems, where the computational complexity is reduced in view of the characteristics of this type of dataset. KSUFS is compared favorably with other state-of-the-art unsupervised feature selection techniques in a thorough experimental study that considers both standard and big data problems. The results show that the proposed method outperforms the reference unsupervised feature selection methods considered in the comparisons, selecting the most influential features for standard datasets and standing out particularly on big data problems.
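The abstract describes comparing an estimated feature distribution with the original one by means of a non-parametric test. As a minimal sketch of the comparison that gives the method its name, the following computes the two-sample Kolmogorov-Smirnov statistic, that is, the largest gap between two empirical distribution functions. It is not the KSUFS algorithm: the abstract does not say how the estimated distributions are obtained or how features are ranked, so the function name `ks_statistic` and the toy values `original`, `close_estimate` and `far_estimate` are assumptions made purely for illustration.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # The empirical CDFs are step functions, so the largest gap
    # is always reached at one of the observed values.
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / n  # fraction of a that is <= x
        f_b = bisect_right(b_sorted, x) / m  # fraction of b that is <= x
        d = max(d, abs(f_a - f_b))
    return d


# Hypothetical values of one feature and two hypothetical estimates of it.
original = [1.0, 2.0, 3.0, 4.0, 5.0]
close_estimate = [1.1, 2.1, 2.9, 4.2, 5.1]
far_estimate = [3.0, 3.0, 3.0, 3.0, 3.0]

d_close = ks_statistic(original, close_estimate)
d_far = ks_statistic(original, far_estimate)
print(d_close < d_far)
```

A smaller statistic means the two samples are distributed more alike; in this toy example the estimate that tracks the original values yields the smaller statistic. How KSUFS turns such comparisons into a feature ranking is specified in the full paper, not in this abstract.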

Highlights

  • The complexity of data preparation processes in real-world applications, such as those related to medicine [26] or big data processing [4], usually produces datasets containing unnecessary and erroneous information [30], [39], [45].

  • The results for the control point with the best performance on each dataset clearly show the benefits of feature selection in unsupervised problems: Kolmogorov-Smirnov test-based Unsupervised Feature Selection (KSUFS) performs better than applying no feature selection (None) on almost all the datasets, and the differences found are significant, as shown by the low p-value obtained (2.67E-04). Applying unsupervised feature selection is therefore preferable to not preprocessing the data.

  • Considering the results on each dataset at the control point with the best performance for KSUFS, the results clearly show the advantage of applying feature selection: KSUFS performs better than None on almost all the datasets (18 out of 20), and the differences found are significant.
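To give a feel for why 18 wins out of 20 datasets counts as significant, the sketch below runs an exact two-sided sign test on that win count. This is a deliberately simpler, swapped-in test: the text above does not name the test behind the reported p-value of 2.67E-04, so the figure computed here is not expected to match it. The function name `sign_test_p_value` is hypothetical, and treating the two remaining datasets as losses rather than ties is an assumption.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test p-value (ties assumed already dropped)."""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = max(wins, losses)
    # Probability of a split at least this lopsided if both methods
    # were equally likely to win on any given dataset.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumption: 18 wins for KSUFS and 2 losses (no ties) over 20 datasets.
p = sign_test_p_value(18, 2)
print(f"{p:.2E}")
```

Under these assumptions the p-value is 422/1048576, roughly 4.0E-04, so even this coarse test, which ignores the size of the differences, rejects the hypothesis that both options perform equally at the usual 0.05 level.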


Summary

Introduction

The complexity of data preparation processes in real-world applications, such as those related to medicine [26] or big data processing [4], usually produces datasets containing unnecessary and erroneous information [30], [39], [45]. Features incorporating such harmful information may cause serious drawbacks in data analysis [27]. Feature selection [6], [34] chooses a subset of features from a given dataset, removing its irrelevant and noisy features, in order to represent the original data. Between these two approaches, this research focuses on feature selection methods because many applications require building highly interpretable models [26], [35], and the meaning of the original variables in the data must therefore be retained.

