Abstract

In the medical field, identifying the genes that are relevant to a specific disease, such as colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Medical datasets usually have very high dimensionality and a considerably small sample size. Thus, for domain experts, such as biologists, the task of identifying these genes has become very challenging. Feature selection is a technique that aims to select these genes, called features in the machine learning field, with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because of the large number of features and the small sample size, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining the classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases. This implies that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which reinforces the stated stability results.
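The bagging idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the scoring rule (absolute difference of class means), the vote-based aggregation, and all names (`make_toy_data`, `select_top_k`, `bagged_select`) are assumptions chosen for illustration, and the data are synthetic.

```python
import random
from collections import Counter


def make_toy_data(n_samples=30, n_features=200, n_informative=3, seed=0):
    """Hypothetical toy data: few samples, many features, binary class."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 5.0 * label  # only the first features depend on the class
        X.append(row)
        y.append(label)
    return X, y


def mean_difference_scores(X, y):
    """Score each feature by the absolute gap between its two class means.

    This simple filter is an assumed stand-in for any base selector.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if not pos or not neg:  # a bootstrap bag may miss one class
            scores.append(0.0)
            continue
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def select_top_k(X, y, k):
    """Base selector: indices of the k highest-scoring features."""
    scores = mean_difference_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def bagged_select(X, y, k, n_bags=25, seed=0):
    """Run the base selector on bootstrap bags and keep the most-voted features."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        votes.update(select_top_k([X[i] for i in idx], [y[i] for i in idx], k))
    # Assumed aggregation: rank by vote count, break ties by feature index.
    ranked = sorted(votes, key=lambda j: (-votes[j], j))
    return set(ranked[:k])


if __name__ == "__main__":
    X, y = make_toy_data()
    print(sorted(bagged_select(X, y, k=5)))
```

Aggregating over bootstrap bags averages out the effect of any single sample on the ranking, which is the variance-reduction intuition the abstract relies on. Other aggregation rules, such as averaging ranks or scores, would fit the same skeleton.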

Highlights

  • With the growth of data mining and collection technologies, learning from and understanding data has become a tedious task because of the large number of features, also known as variables or attributes

  • The remainder of the paper is organized as follows: (i) we introduce feature selection algorithms in the "Feature selection algorithm" section, (ii) we review the literature on stability and how to evaluate it in the "Feature selection stability" section, (iii) we present the proposed ensemble method in the "Proposed method: bagging feature selection" section, (iv) we conduct an experiment on microarray datasets in the "Experiment" section, and (v) we discuss the results and conclude the paper

  • As we can see in the figures, the proposed ensemble bagging technique improves the stability in every single case
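Selection stability can be quantified by comparing the subsets selected on perturbed versions of the same dataset. This excerpt does not state which measure the paper uses, so the sketch below assumes the average pairwise Jaccard index, a common choice; the function names and the example subsets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature subsets: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Average pairwise Jaccard index over the subsets from repeated runs."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical subsets selected on three perturbed versions of a dataset.
    unstable_runs = [{1, 2, 3}, {3, 4, 5}, {5, 6, 1}]
    stable_runs = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
    print(selection_stability(unstable_runs))
    print(selection_stability(stable_runs))
```

A value of 1 means every run selected the same features, and a value near 0 means the runs barely overlap.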

Summary

Introduction

Alelyani J Big Data (2021) 8:11

With the growth of data mining and collection technologies, learning from and understanding data has become a tedious task because of the large number of features, also known as variables or attributes. Data harvesting is conducted in relation to a specific problem, such as collecting human genomes from patients for a particular disease, gathering social media data for gender identification, or crawling websites for offensive materials, to name just a few. If the collected data include the class of the dataset, the learning is called supervised learning. Otherwise, it is called unsupervised learning [1,2,3]. Most of the collected data suffer from high dimensionality, that is, they include a high number of features. Most of these features are irrelevant and noisy [4, 5]. The feature selection algorithm f() with respect to the class y can be represented by the following mathematical equation:
