Abstract

Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle feature selection even with a large number of variables. In this paper, we use three popular datasets with a large number of variables (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: first, to simplify the model by reducing the number of parameters; second, to decrease the training time; third, to reduce overfitting by enhancing generalization; and fourth, to avoid the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken as the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments present a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each dataset with and without feature selection by the RF-based methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best accuracy and kappa percentages. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
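The pipeline described above can be sketched in Python with scikit-learn (the paper itself works in R, with caret's varImp(), Boruta, and RFE); the synthetic dataset, the number of selected features, and the model parameters below are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of the paper's pipeline: rank features with a Random Forest
# (analogue of caret's varImp()), run RFE with an RF estimator, and
# compare RF/SVM/KNN/LDA accuracy with all vs. selected features.
# Synthetic data stands in for the paper's three datasets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

# Random-Forest-based feature importance ranking; keep the top 8 features
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:8]

# Recursive Feature Elimination with a Random Forest estimator
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=8).fit(X, y)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    full = cross_val_score(model, X, y, cv=5).mean()
    sel = cross_val_score(model, X[:, top], y, cv=5).mean()
    print(f"{name}: all features {full:.3f}, top-8 RF features {sel:.3f}")
```

With real data, the same comparison would be repeated per dataset, and kappa could be reported alongside accuracy (e.g. via sklearn.metrics.cohen_kappa_score).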

Highlights

  • In machine learning problems, high dimensional data, especially in terms of many features, is increasingly common these days [1]

  • The three datasets belong to classification data that have different total instances and features

  • Conclusions and future work: In this paper, we compare four classification methods: Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). We combine those classifiers with different feature selection methods, RF, Recursive Feature Elimination (RFE), and Boruta, to select the best classification method based on the accuracy of each classifier

Introduction

High dimensional data, especially in terms of many features, is increasingly common these days [1]. Many researchers have focused on experiments to solve these problems: extracting the important features from high dimensional variables and data. Statistical techniques were used to minimize noise and redundant data. We do not use all the features to train a model; we may improve our model with features that are informative and non-redundant, so feature selection plays an important role
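As a minimal illustration of the redundancy point above (a hypothetical example, not the paper's method): a near-duplicate feature can be detected and dropped with a simple pairwise-correlation filter before any model is trained. The threshold of 0.95 is an arbitrary choice for the sketch.

```python
# Drop features whose correlation with an already-kept feature
# exceeds a threshold; column 4 is constructed as a near-copy of
# column 0 and should therefore be eliminated.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + rng.normal(scale=0.01, size=200)  # redundant feature

corr = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.95 for k in keep):
        keep.append(j)
print(keep)  # → [0, 1, 2, 3]: feature 4 dropped as redundant with feature 0
```

Random-Forest-based selection goes further than this filter: it also ranks features by how much they actually contribute to classification, which is what the paper relies on.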
