Abstract

The classification of high-dimensional data is a challenge in machine learning. Traditional classifier ensemble methods improve classifier diversity through either dimensionality reduction or sample selection for high-dimensional data classification. However, these methods have several limitations: 1) dimensionality reduction methods easily cause information loss, which leads to a decrease in accuracy; 2) sample selection methods are susceptible to noise and redundant features. To address these limitations, we propose a novel hybrid dimensionality reduction forest (HDRF) that increases the diversity of the integrated system in both the feature space and the sample space. First, a tree-based feature selection algorithm is employed to partition the effective features. Then the Bagging method is applied to obtain diverse training subsets. To fully retain and mine the important information in the unselected samples, a sample-feature based transformation process (SFTP) is proposed to generate extended features. Since PCA can effectively reduce dimensionality and remove noisy features, it is applied to compress the unselected features and the extended features into new features that are compact and compensatory. Further, a novel classifier ensemble pruning framework (HDRFPF) based on HDRF is designed to remove redundant and invalid classifiers. Experimental results on 23 high-dimensional datasets verify that our method outperforms mainstream classifier ensemble methods, achieving the better result on 19 of the 23 datasets.
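The pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the feature-selection threshold, the number of PCA components, the ensemble size, and the omission of the SFTP and pruning steps are all simplifying assumptions made here for illustration, using scikit-learn building blocks (tree-based importances for feature selection, bootstrap sampling for Bagging, PCA to compress the unselected features).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=10, random_state=0)

# 1) Tree-based feature selection: keep the k most important features
#    (k=20 is an arbitrary choice for this sketch).
k = 20
imp = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y).feature_importances_
selected = np.argsort(imp)[-k:]
unselected = np.setdiff1d(np.arange(X.shape[1]), selected)

# 2) PCA compresses the unselected (weak) features into a few compact,
#    compensatory components instead of discarding them outright.
pca = PCA(n_components=5, random_state=0).fit(X[:, unselected])
X_aug = np.hstack([X[:, selected], pca.transform(X[:, unselected])])

# 3) Bagging: train one decision tree per bootstrap sample of the
#    augmented data to obtain diverse base classifiers.
trees = []
for _ in range(15):
    idx = rng.randint(0, len(X_aug), len(X_aug))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_aug[idx], y[idx]))

# 4) Combine the ensemble by majority vote.
votes = np.stack([t.predict(X_aug) for t in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```

The key design point the sketch preserves is that the unselected features are compressed rather than dropped, so each base classifier sees the strong features plus a compact summary of the weak ones.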

Highlights

  • Classification is a hot topic in supervised learning, which can be realized by training a classifier or a group of classifiers

  • For high-dimensional data classification, traditional ensemble learning methods have some limitations: 1) most classifier ensemble methods improve the diversity of classifiers in either the sample space or the feature space, and the transformation from samples to features is not considered as a way to improve classifier diversity; 2) most classifier ensemble methods for high-dimensional data rely on direct dimensionality reduction, rather than increasing the number of features to improve classifier diversity before dimensionality reduction; 3) most ensemble pruning methods are optimized

  • To fully retain and mine the important information of the unselected samples, a sample-feature based transformation process (SFTP) is proposed, in which unselected samples are used as auxiliary information to construct new features


Summary

INTRODUCTION

Classification is a hot topic in supervised learning, which can be realized by training a classifier or a group of classifiers. The motivation of this paper is to improve the diversity of classifiers by combining unselected samples and unselected weak features while maintaining the classification ability of the classifiers. To achieve this, we propose a novel hybrid dimensionality reduction forest (HDRF) to increase the diversity of the integrated system in both the feature space and the sample space. The contributions of this paper are summarized as follows: 1) We propose a sample-feature based transformation process, where unselected samples are used as auxiliary information to construct effective and diverse features; 2) We design a new hybrid dimensionality reduction forest to increase the diversity of the integrated system in the feature space and the sample space; 3) Considering the influence of redundant classifiers on the integrated system, an ensemble forest pruning process is proposed to remove redundant classifiers; 4) We compare our method with mainstream ensemble learning methods on multiple high-dimensional datasets to verify its effectiveness.
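The first contribution, the sample-feature based transformation, is only described at a high level here, so the following is a hedged sketch of one plausible reading: treating a few unselected (out-of-bag) samples as landmarks and using each selected sample's distance to them as extended features. The landmark count and the use of Euclidean distance are assumptions for illustration, not the paper's specified SFTP.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(1)
X = rng.randn(120, 30)                   # full training pool (toy data)
bag = rng.choice(120, size=80, replace=False)
oob = np.setdiff1d(np.arange(120), bag)  # samples left out of this bag

# Hypothetical SFTP step: each unselected sample acts as a landmark, and the
# distance from a bagged sample to each landmark becomes one extended feature.
landmarks = X[oob[:10]]                            # a few unselected samples
extended = pairwise_distances(X[bag], landmarks)   # shape (80, 10)
X_bag_aug = np.hstack([X[bag], extended])          # original + extended features
print(X_bag_aug.shape)
```

Under this reading, the unselected samples contribute information to every base classifier without being used as training instances, which matches the stated goal of "retaining and mining" their information.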

RELATED WORK
EXPERIMENTS
EFFECT OF THE FEATURE SELECTION RATE
COMPARISON WITH DIFFERENT CLASSIFIER ENSEMBLE METHODS
Findings
CONCLUSION