Hybrid distributed feature selection using particle swarm optimization-mutual information

Khumukcham Robindro, Sanasam Surjalata Devi, Urikhimbam Boby Clinton, Linthoingambi Takhellambam, Yambem Ranjan Singh, Nazrul Hoque

doi:10.1016/j.dsm.2023.10.003

Open Access

Journal: Data Science and Management
Publication Date: Oct 14, 2023
License type: CC BY-NC-ND

Affiliation: Manipur University

Abstract

Feature selection (FS) is a data preprocessing step in machine learning (ML) that selects a subset of relevant and informative features from a large feature pool. FS helps ML models improve their predictive accuracy at lower computational cost and can mitigate model overfitting on high-dimensional datasets. A major problem with filter and wrapper FS methods is that they consume a significant amount of time on high-dimensional datasets. The proposed method, “HDFS(PSO-MI): hybrid distributed feature selection using particle swarm optimization-mutual information (PSO-MI)”, is a PSO-based hybrid method that addresses this problem. It hybridizes the filter and wrapper techniques in a distributed manner. A new combiner is also introduced to merge the effective features selected from multiple data distributions. The effectiveness of the proposed HDFS(PSO-MI) method is evaluated using five ML classifiers, i.e., logistic regression (LR), k-nearest neighbors (k-NN), support vector machine (SVM), decision tree (DT), and random forest (RF), on various datasets in terms of accuracy and the Matthews correlation coefficient (MCC). In the experimental analysis, the HDFS(PSO-MI) method yielded more than 98%, 95%, 92%, 90%, and 85% accuracy for the unbalanced, kidney disease, emotions, wafer manufacturing, and breast cancer datasets, respectively. Our method shows promising results compared to other methods, such as mutual information, gain ratio, Spearman correlation, analysis of variance (ANOVA), Pearson correlation, and an ensemble feature selection with ranking method (EFSRank).
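
To make the idea of distributed filtering with a combiner concrete, the following is a minimal sketch and not the authors' implementation. It splits the rows into partitions, ranks discrete features by empirical mutual information with the label on each partition, and merges the per-partition selections with a majority vote. The PSO-based wrapper stage is omitted entirely, the vote-based combiner is a stand-in for the paper's combiner (whose details are not given in the abstract), and all function names, parameters, and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def top_k_by_mi(rows, labels, k):
    """Filter stage on one partition: indices of the k highest-MI features."""
    n_features = len(rows[0])
    scores = [
        mutual_information([r[j] for r in rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def combine_by_vote(selections, min_votes):
    """Assumed combiner: keep features chosen by at least min_votes partitions."""
    votes = Counter(j for sel in selections for j in sel)
    return sorted(j for j, v in votes.items() if v >= min_votes)


def select_features(rows, labels, n_parts, k, min_votes):
    """Run the filter on interleaved row partitions, then combine the results."""
    selections = []
    for p in range(n_parts):
        part_rows = rows[p::n_parts]      # every n_parts-th row, offset p
        part_labels = labels[p::n_parts]
        selections.append(top_k_by_mi(part_rows, part_labels, k))
    return combine_by_vote(selections, min_votes)


# Toy data (hypothetical): feature 0 copies the label, feature 1 is constant,
# feature 2 is independent of the label within each partition.
labels = [0, 0, 1, 1, 0, 0, 1, 1]
feature_2 = [0, 1, 1, 0, 1, 0, 0, 1]
rows = [[y, 0, f2] for y, f2 in zip(labels, feature_2)]

selected = select_features(rows, labels, n_parts=2, k=1, min_votes=2)
print(selected)
```

In a real distributed setting each partition would be scored on a separate worker, and a wrapper search such as PSO could then refine the filtered candidates before the combiner merges them; this sketch runs the partitions sequentially for simplicity.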

Full Text