Abstract

High-dimensional data pose a persistent challenge in classification. Feature selection acts as a filter that removes irrelevant or redundant features, and it has made considerable progress. However, the problem remains challenging because current methods consider only the correlation between pairs of variables, leaving correlation among multiple variables largely unaddressed, even though multivariate interactions can carry joint information that no pairwise measure can capture. Furthermore, many feature selection methods require hyperparameter settings, which demand prior knowledge and reduce interpretability. To address these problems, this paper proposes the total correlation information coefficient-based feature selection (TCIC_FS) method, which avoids hyperparameter tuning and fully accounts for correlations among multiple variables. First, based on a Gaussian copula, the total correlation information coefficient (TCIC) is proposed to evaluate correlations among multiple variables. Compared with existing multivariate correlation measures, TCIC captures a wider range of multivariate correlations, including linear, nonlinear, functional, and nonfunctional relationships. Second, a novel TCIC-based evaluation mechanism is proposed to measure both the relevance between features and classes and the redundancy between a single feature and the already selected feature subset. Finally, the TCIC_FS method is constructed from the TCIC and this evaluation mechanism. Compared with baseline methods, TCIC_FS has the lowest time complexity and yields the smallest optimal feature subset in a single selection pass. TCIC_FS is therefore well suited to processing high-dimensional data.
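To make the Gaussian-copula idea concrete, the sketch below estimates the total correlation (multi-information) of a set of variables under a Gaussian copula: each column is rank-transformed to the unit interval, mapped to normal scores, and the total correlation is then -½ log det(R) of the normal-score correlation matrix R. This is a minimal illustration of the general copula-based estimator, not the paper's exact TCIC definition; the function name and estimation steps are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist  # stdlib inverse normal CDF


def gaussian_copula_total_correlation(X):
    """Estimate total correlation of the columns of X (n_samples, n_features)
    under a Gaussian copula. Hypothetical helper, not the paper's exact TCIC.

    Under a Gaussian copula with normal-score correlation matrix R,
    total correlation = -0.5 * log(det(R)) (in nats); it is 0 iff the
    variables are independent.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    nd = NormalDist()
    Z = np.empty((n, d))
    for j in range(d):
        ranks = X[:, j].argsort().argsort() + 1  # ranks 1..n (ties broken arbitrarily)
        u = ranks / (n + 1)                      # empirical CDF values in (0, 1)
        Z[:, j] = [nd.inv_cdf(p) for p in u]     # normal scores
    R = np.corrcoef(Z, rowvar=False)             # correlation of normal scores
    _, logdet = np.linalg.slogdet(R)             # stable log-determinant
    return -0.5 * logdet


# Usage sketch: dependent columns yield a large value, independent ones near 0.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X_dep = np.column_stack([x, x + 0.1 * rng.normal(size=500), rng.normal(size=500)])
X_ind = rng.normal(size=(500, 3))
tc_dep = gaussian_copula_total_correlation(X_dep)
tc_ind = gaussian_copula_total_correlation(X_ind)
```

Because the rank transform discards the marginals, the estimate is invariant to any monotone transformation of each feature, which is one reason copula-based measures can capture nonlinear and nonfunctional dependence that linear correlation misses.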
