Abstract

Feature selection is a challenging problem that arises in the high-dimensional data analysis of many major applications. It addresses the curse of dimensionality by determining a small set of features that represents high-dimensional data without significant loss of information. The purpose of this study is to develop and investigate a new unsupervised feature selection method that uses the k-influence space concept and subspace learning to map features onto a weighted graph and rank them by importance according to the PageRank graph centrality measure. The graph design in this method promotes feature relevance, downgrades redundancy, and is robust to outliers and cluster imbalances. In K-Means clustering experiments on the ASU feature selection benchmark datasets, the method produces better accuracy and normalized mutual information results than state-of-the-art unsupervised feature selection algorithms. In a further evaluation on a dataset of over 14,000 tweets, conventional classification of the features selected by the method gave better sentiment analysis results than deep learning feature selection and classification.
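The pipeline described above (build a weighted feature graph, then rank features by PageRank centrality) can be illustrated with a minimal Python sketch. The graph construction below uses a simple absolute-correlation affinity and a hypothetical threshold parameter as a stand-in for the paper's influence-space and subspace-learning design, which is not reproduced here.

import numpy as np
import networkx as nx

def rank_features_by_pagerank(X, threshold=0.1):
    """Rank the columns (features) of X by PageRank centrality on a
    feature-feature similarity graph. Illustrative only: the affinity
    and threshold are assumptions, not the paper's graph construction."""
    # Feature-feature affinity: absolute Pearson correlation between columns.
    affinity = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(affinity, 0.0)

    # Keep only sufficiently strong edges to obtain a sparse weighted graph.
    adjacency = np.where(affinity >= threshold, affinity, 0.0)
    graph = nx.from_numpy_array(adjacency)

    # PageRank scores serve as feature importance; higher means more central.
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)

# Example: keep the 10 highest-ranked features of a random data matrix.
X = np.random.rand(200, 50)
selected = rank_features_by_pagerank(X)[:10]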

Highlights

  • Progress in science and technology has allowed the development of applications that use very large data sets of high-dimensional data

  • The purpose of this study is to investigate a new unsupervised feature selection method, called Influence Space and Graph-based Feature Selection (ISGFS), which uses the k-influence space concept [22]–[24] and subspace learning to describe feature relationships and subsequently design a feature selection graph (a rough sketch of the influence-space computation follows this list)

  • Although the Unsupervised Graph-based Feature Selection (UGFS) method performed significantly better than other methods, three drawbacks have been noted: (i) the elements in a k-nearest neighbors set may belong to different clusters, so the search for cluster-discriminating features may be unduly affected; (ii) considering all data points for feature combination may unduly change the results, especially if the data set contains outliers; and (iii) the correlation between features is not exploited as a means to disfavor redundant features [15]

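As a rough illustration of the influence-space idea mentioned above, the sketch below computes, for each data point, the intersection of its k-nearest neighbors and its reverse k-nearest neighbors. This follows the definition commonly used in the influence-space literature; the exact variant adopted in [22]–[24], and the way it feeds into the feature selection graph, are not reproduced here.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_influence_spaces(X, k=5):
    """Return, for every point, the set of indices forming its k-influence
    space, taken here as kNN(p) intersected with reverse-kNN(p)."""
    # k+1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    knn = [set(row[1:]) for row in idx]          # drop the point itself

    # Reverse k-NN: q is in RkNN(p) exactly when p is in kNN(q).
    rknn = [set() for _ in range(len(X))]
    for q, neighbours in enumerate(knn):
        for p in neighbours:
            rknn[p].add(q)

    # k-influence space of p: neighbors of p that also count p among their own.
    return [knn[p] & rknn[p] for p in range(len(X))]

# Example: influence spaces for 100 random points in 8 dimensions.
X = np.random.rand(100, 8)
spaces = k_influence_spaces(X, k=5)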

Summary

Introduction

Progress in science and technology has allowed the development of applications that use very large data sets of high-dimensional data. These applications occur in various domains, most notably natural language processing, pattern recognition, and computer vision [1], [2]. The curse of dimensionality makes models likely to overfit the training data and fail to generalize to new, unseen data [1], [4], [5]. Recent studies have addressed these issues in various ways [6]–[8]: dimensionality reduction through feature selection and feature reduction, performed before data analysis [9]; subspace learning to determine the layout and properties of the data in order to assist clustering [7] and classification [10]; and data representation by similarity and kernel functions [6].

