Abstract

Feature selection is a pre-processing technique used to remove unnecessary features and speed up an algorithm's processing. One common approach calculates the information gain value of each feature in the dataset and then applies a threshold to those values to select features. However, the threshold is typically chosen arbitrarily or fixed at a rate of 0.05. This study therefore proposes determining the threshold from the standard deviation of the information gain values generated by the features in the dataset. The proposed threshold was tested on 10 datasets, in their original form and transformed by FFT and IFFT, and classified using Random Forest. Processing the transformed datasets with the proposed threshold resulted in lower accuracy and longer execution time than the same process with Correlation-Base Feature Selection (CBFS) and the standard 0.05 threshold; similarly, accuracy was lower when the transformed features were used. The study showed that processing the original datasets with the standard-deviation threshold resulted in better Random Forest classification accuracy after feature selection. Furthermore, using the transformed features with the proposed threshold, excluding the imaginary components, led to a faster average execution time than the three compared methods.
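As a rough sketch of the proposed idea, the snippet below scores each feature, uses the standard deviation of those scores as the cut-off, and keeps the features that reach it. It assumes scikit-learn's mutual_info_classif as a stand-in for the paper's information gain computation, so the scoring function and all names are illustrative rather than taken from the study.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def std_threshold_selection(X, y):
        # Estimate an information-gain-like score for every feature
        # (stand-in for the paper's exact information gain computation).
        gains = mutual_info_classif(X, y)
        # Proposed threshold: the standard deviation of the per-feature scores.
        threshold = np.std(gains)
        # Keep the indices of features whose score meets the threshold.
        selected = np.where(gains >= threshold)[0]
        return selected, threshold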

Highlights

  • Data growth increases dimensionality and computational cost, which are mitigated by feature selection and feature extraction, two different techniques [1, 2]

  • In contrast to previous research, this study examines whether datasets and features transformed by the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) produce better accuracy and average speed values than the Correlation-Base Feature Selection model and a threshold of 0.05

  • The proposed threshold was compared with Correlation-Base Feature Selection (CBFS); the average trial accuracy on the original datasets showed that the proposed threshold method produced higher values than the CBFS algorithm in 60% of cases, with 10% having the same rates

Introduction

Data growth increases dimensionality and computational cost, which are mitigated by feature selection and feature extraction, two different techniques [1, 2]. In contrast to previous research, this study examines whether datasets and features transformed by FFT and IFFT produce better accuracy and average speed values than the Correlation-Base Feature Selection model and threshold. This study follows previous research on the use of feature selection to improve the performance of the Random Forest method on high-dimensional data [31]. It evaluates the speed and accuracy of Random Forest when features are selected from the transformed data [8]. The 2nd, 3rd, 4th, 5th, and 6th sections present the related work, the proposed method, the research results, the discussion, and the conclusion, respectively.
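To make the pipeline under study concrete, the sketch below transforms the feature matrix with NumPy's FFT, keeps only the real part (dropping the imaginary components, as described above), and evaluates a Random Forest classifier on a set of selected features. Applying the FFT column-wise, the split ratio, and all names here are assumptions made for illustration, not details taken from the paper.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def fft_transform_features(X):
        # Apply the FFT along each feature column and keep only the real part
        # (assumed orientation; the paper may transform the data differently).
        return np.real(np.fft.fft(X, axis=0))

    def evaluate_random_forest(X, y, selected_idx):
        # Fit Random Forest on the selected features and report hold-out accuracy.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, selected_idx], y, test_size=0.3, random_state=0)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        return clf.score(X_te, y_te)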
