Hybrid filtering methods for feature selection in high-dimensional cancer data

Siti Sarah Md Noh, Marina Yusoff, Nurain Ibrahim, Mahayaudin M Mansor

doi:10.11591/ijece.v13i6.pp6862-6871

Open Access

Abstract

Statisticians in both academia and industry routinely encounter problems with high-dimensional data. As the number of features grows rapidly, it often exceeds the number of instances. Several established methods exist for selecting features from massive amounts of breast cancer data, yet overfitting remains a problem. Selecting important features with minimal loss across different sample sizes is another area with room for development. Feature selection is therefore crucial for classifying high-dimensional data. This paper proposes a new architecture for high-dimensional breast cancer data that uses filtering techniques and a logistic regression model. Essential features are selected using a combination of hybrid chi-square and hybrid information gain (hybrid IG), with logistic regression as the classifier. The results show that hybrid IG performed best for high-dimensional breast and prostate cancer data. The top 50 and top 22 features outperformed the other configurations, with the highest classification accuracies of 86.96% and 82.61%, respectively, after integrating hybrid information gain with the logistic function (hybrid IG+LR) at a sample size of 75. In future work, multiclass classification of multidimensional medical data will be evaluated using data from a different domain.
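To make the filtering idea concrete, the following is a minimal sketch of plain information gain ranking followed by top-k selection. It is not the authors' hybrid IG method, whose details the abstract does not give. It assumes features have already been discretized, and the feature names and toy data are hypothetical. In a full pipeline the selected features would then be passed to a logistic regression classifier.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy reduction in `labels` after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Toy, made-up data: 6 samples, 3 binarized features (hypothetical names).
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "gene_a": [1, 1, 1, 0, 0, 0],  # perfectly separates the two classes
    "gene_b": [1, 0, 1, 0, 1, 0],  # weakly informative
    "gene_c": [1, 1, 0, 1, 1, 0],  # carries no information about the class
}
selected = top_k_features(columns, labels, k=2)
print(selected)
```

A filter of this kind scores each feature independently of the classifier, which keeps it cheap when the feature count far exceeds the sample count; the choice of k (for example 50 or 22 in the reported results) is then tuned against classification accuracy.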
