Stable bagging feature selection on medical data

Salem Alelyani

doi:10.1186/s40537-020-00385-8

Abstract

In the medical field, identifying the genes that are relevant to a specific disease, for example colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Medical datasets usually have very high dimensionality and a considerably small sample size. Thus, for domain experts such as biologists, the task of identifying these genes has become very challenging. Feature selection is a technique that aims to select these genes, called features in machine learning, with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because the number of features is large and the sample size is small, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining the classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases, which implies that the likelihood of selecting the same features increases by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which supports the reported stability results.
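The idea of bagging a feature selector can be illustrated with a minimal sketch. It is not the paper's implementation: the base selector (absolute Pearson correlation with the class), the mean-rank aggregation, the Jaccard index as the stability measure, and all function and parameter names (`score_features`, `bagged_select_top_k`, `n_bags`, `seed`) are assumptions chosen for illustration.

```python
import numpy as np


def score_features(X, y):
    """Assumed base selector: absolute Pearson correlation of each feature with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom = np.where(denom == 0, np.inf, denom)  # constant columns score 0
    return np.abs(Xc.T @ yc) / denom


def select_top_k(X, y, k):
    """Plain (non-ensemble) selection: the k highest-scoring feature indices."""
    scores = score_features(X, y)
    return set(np.argsort(-scores, kind="stable")[:k].tolist())


def bagged_select_top_k(X, y, k, n_bags=25, seed=None):
    """Run the base selector on bootstrap samples and aggregate by mean rank.

    Mean-rank aggregation is an assumption; other aggregation rules are possible.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    rank_sum = np.zeros(n_features)
    for _ in range(n_bags):
        # Bootstrap: draw n_samples row indices with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        scores = score_features(X[idx], y[idx])
        # Rank 0 is the best feature in this bag.
        order = np.argsort(-scores, kind="stable")
        ranks = np.empty(n_features)
        ranks[order] = np.arange(n_features)
        rank_sum += ranks
    # Features with the lowest accumulated rank are selected.
    return set(np.argsort(rank_sum, kind="stable")[:k].tolist())


def jaccard(a, b):
    """Jaccard index between two selected subsets (1.0 means identical)."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)
```

To compare stability under these assumptions, one could run `select_top_k` and `bagged_select_top_k` on several perturbed versions of the same dataset (for example, leaving out a few samples each time) and average the pairwise `jaccard` values of the resulting subsets for each method.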

Highlights

With the growth of data mining and collection technologies, learning from and understanding data has become a tedious task because of the large number of features, also known as variables or attributes.
The remainder of the paper is organized as follows: (i) we introduce feature selection algorithms in the "Feature selection algorithm" section, (ii) we review the literature on stability and how to evaluate it in the "Feature selection stability" section, (iii) we present the proposed ensemble method in the "Proposed method: bagging feature selection" section, (iv) we conduct an experiment on microarray datasets in the "Experiment" section, and (v) we discuss the results and conclude the paper.
As the figures show, the proposed ensemble bagging technique improves stability in every case.

Summary

Introduction

With the growth of data mining and collection technologies, learning from and understanding data has become a tedious task because of the large number of features, also known as variables or attributes. Data harvesting is conducted in relation to a specific problem, such as collecting human genomes from patients for a particular disease, gathering social media data for gender identification, or crawling websites for offensive material, to name just a few. When the collected data include the class of the dataset, the learning is called supervised learning. Otherwise, it is called unsupervised learning [1,2,3]. Most of the collected data suffer from high dimensionality, that is, a high number of features. Most of these features are irrelevant and noisy [4, 5]. The feature selection algorithm f() with respect to the class y could be represented in the following mathematical equation:



Journal: Journal of Big Data	Publication Date: Jan 7, 2021
Citations: 33	License type: open-access


