Abstract

In high-dimensional data, the performance of a classifier depends largely on the selection of important features. Most individual classifiers combined with existing feature selection (FS) methods do not perform well on highly correlated data. Obtaining important features with an FS method and selecting the best performing classifier is a challenging task in high throughput data. In this article, we propose a combination of resampling-based least absolute shrinkage and selection operator (LASSO) feature selection (RLFS) and ensembles of regularized regression models (ERRM) that is capable of handling data with high correlation structures. The ERRM boosts the prediction accuracy using the top-ranked features obtained from RLFS. The RLFS uses the lasso penalty together with the sure independence screening (SIS) condition to select the top k ranked features. The ERRM includes five individual penalty-based classifiers: LASSO, adaptive LASSO (ALASSO), elastic net (ENET), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). It is built on the ideas of bagging and rank aggregation. Through simulation studies and an application to smokers’ cancer gene expression data, we demonstrate that the proposed combination of ERRM with RLFS achieves superior performance in terms of accuracy and geometric mean.
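To make the resampling idea behind RLFS concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that features are ranked by how often a lasso fit keeps them across random subsamples, and for brevity it uses a linear lasso solved by coordinate descent instead of a penalized classifier. The function names, the penalty `lam`, the subsample fraction `frac`, and the number of resamples are all hypothetical choices made for illustration.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Linear lasso without intercept, fitted by cyclic coordinate descent.

    Minimises (1/2n) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


def resampled_lasso_ranking(X, y, lam=0.1, n_resamples=50, frac=0.7, seed=0):
    """Rank features by selection frequency over random subsamples (assumed scheme)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(2, int(frac * n))
    counts = [0] * p
    for _ in range(n_resamples):
        idx = rng.sample(range(n), m)  # subsample rows without replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    freq = [c / n_resamples for c in counts]
    ranking = sorted(range(p), key=lambda j: -freq[j])  # most frequently selected first
    return ranking, freq
```

Given such a ranking, the top k features would be `ranking[:k]`, where k is set by a screening threshold such as the SIS condition mentioned above; the exact threshold is not reproduced here.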

Highlights

  • With the advances of high throughput technology in biomedical research, large volumes of high-dimensional data are being generated [1,2,3]

  • The resampling-based lasso feature selection (RLFS) method ranks the features by employing the lasso method with a resampling approach, and uses the b-sure independence screening (SIS) criterion to set the threshold for selecting the optimal number of features. The selected features are then passed to the ensemble of regularized regression models (ERRM) classifier, which uses bootstrapping and rank aggregation to select the best performing model across the bootstrapped samples and helps attain the best prediction accuracy in a high-dimensional setting

  • We proposed a combination of the ensembles of regularized regression models (ERRM) with resampling-based lasso feature selection (RLFS) for attaining better prediction accuracies in high dimensional data
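The bootstrapping and rank aggregation step in the highlights can be sketched as follows. This is an illustration under stated assumptions, not the published ERRM procedure: it assumes each candidate model is scored on the out-of-bag rows of every bootstrap sample with several metrics where higher is better, and that the metrics are combined by a simple Borda-style sum of ranks. The source does not say which aggregation rule is used. `borda_best`, `bagged_selection`, and the `fit_and_score` callback are hypothetical names.

```python
import random
from collections import Counter


def borda_best(scores):
    """Pick the model with the smallest summed rank across metrics.

    scores maps a model name to a list of metric values (higher is better),
    for example [accuracy, geometric_mean].
    """
    models = sorted(scores)
    n_metrics = len(next(iter(scores.values())))
    rank_sum = {m: 0 for m in models}
    for k in range(n_metrics):
        ordered = sorted(models, key=lambda m: -scores[m][k])
        for rank, m in enumerate(ordered):
            rank_sum[m] += rank  # rank 0 is the best model on metric k
    return min(models, key=lambda m: rank_sum[m])


def bagged_selection(n, fit_and_score, n_boot=20, seed=0):
    """Count how often each model wins across bootstrap samples.

    fit_and_score(boot_idx, oob_idx) is a user-supplied (hypothetical) callback
    that fits every candidate on the bootstrap rows and returns a dict of
    out-of-bag metric lists in the format expected by borda_best.
    """
    rng = random.Random(seed)
    winners = Counter()
    for _ in range(n_boot):
        boot = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:  # no held-out rows to score on, skip this draw
            continue
        winners[borda_best(fit_and_score(boot, oob))] += 1
    return winners
```

In a full pipeline the callback would fit LASSO, ALASSO, ENET, SCAD, and MCP on the RLFS-selected features; here it is left abstract so the aggregation logic stays visible.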

Introduction

With the advances of high throughput technology in biomedical research, large volumes of high-dimensional data are being generated [1,2,3]. Sources of such data include microarray gene expression [4,5,6], RNA sequencing (RNA-seq) [7], genome-wide association studies (GWASs) [8,9], and DNA-methylation studies [10,11]. These data are high dimensional in nature: the number of features is significantly larger than the number of samples (p >> n), a setting often termed the curse of dimensionality. This is one of the major problems; others include noise, redundancy, and over-parameterization. There is a lack of threshold to select the optimal
