Abstract

Feature selection is demanded in many modern scientific research problems that use high-dimensional data. A typical example is identifying gene signatures related to a certain disease from high-dimensional gene expression data. The expression of genes may have grouping structure; for example, a group of co-regulated genes with similar biological functions tends to have similar expression. It is therefore preferable to take the grouping structure into consideration when selecting features. In this paper, we propose a Bayesian Robit regression method with Hyper-LASSO priors (abbreviated BayesHL) for feature selection in high-dimensional genomic data with grouping structure. The main features of BayesHL are that it discards unrelated features more aggressively than LASSO, and that it performs feature selection within groups automatically, without requiring a pre-specified grouping structure. We apply BayesHL in gene expression analysis to identify subsets of genes that contribute to the 5-year survival outcome of endometrial cancer (EC) patients. Results show that BayesHL outperforms alternative methods (including LASSO, group LASSO, supervised group LASSO, penalized logistic regression, random forest, neural network, XGBoost and knockoff) in terms of predictive power, sparsity and the ability to uncover grouping structure, and provides insight into the mechanisms of multiple genetic pathways leading to different EC survival outcomes.

Highlights

  • The accelerated development of many high-throughput biotechnologies has made it affordable to collect complete sets of measurements of gene expressions

  • We proposed a feature selection method, Bayesian Robit regression with Hyper-LASSO priors (BayesHL), that employs Markov chain Monte Carlo (MCMC) to explore the posteriors of Robit classification models with heavy-tailed priors

  • We would like to improve the accuracy of our feature subset selection method and apply our Bayesian inference framework to other models and non-convex penalties
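The second highlight above describes MCMC exploration of posteriors under heavy-tailed priors. The following is a minimal illustrative sketch of that general idea, not the paper's actual sampler: a random-walk Metropolis algorithm exploring the posterior of a logistic (rather than Robit) classifier whose coefficients receive heavy-tailed standard Cauchy priors. All data and parameter choices here are synthetic assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 2 informative features out of 5 (assumed for illustration).
n, p = 200, 5
X = rng.standard_normal((n, p))
true_beta = np.array([2.0, -2.0, 0.0, 0.0, 0.0])
y = (X @ true_beta + rng.standard_normal(n) > 0).astype(float)

def log_post(beta):
    """Unnormalized log posterior: logistic likelihood + Cauchy prior."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -np.sum(np.log1p(beta ** 2))  # standard Cauchy, up to a constant
    return loglik + logprior

beta = np.zeros(p)
lp = log_post(beta)
samples = []
for it in range(8000):
    prop = beta + 0.1 * rng.standard_normal(p)   # random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:      # Metropolis accept/reject
        beta, lp = prop, lp_prop
    if it >= 2000:                               # discard burn-in
        samples.append(beta.copy())

post_mean = np.mean(samples, axis=0)
print(np.round(post_mean, 2))
```

Under the heavy-tailed prior, the posterior concentrates the informative coefficients away from zero while the noise coefficients stay near zero, which is the behavior the highlight attributes to the Hyper-LASSO setup.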


Introduction

The accelerated development of many high-throughput biotechnologies has made it affordable to collect complete sets of gene expression measurements. Researchers have developed methods that are even more aggressive, and yield even sparser solutions, than LASSO: they proposed fitting classification or regression models with continuous non-convex penalty functions to discover features related to a response. Reviews of non-convex penalty functions are provided by [7,21,22] and [23]. These non-convex penalties can shrink the coefficients of unrelated features (noise) to zero more aggressively than LASSO, while enlarging the coefficients of related features (signal). These non-convex functions work well for inducing sparsity, but their effectiveness with respect to grouping structure has not yet been explored. Moreover, optimization algorithms have great difficulty reaching a global or even a good mode, because in non-convex regions the solution paths are discontinuous and erratic.
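The contrast drawn above, that non-convex penalties zero out noise while shrinking signal less than LASSO, can be seen directly in the univariate thresholding rules. As one concrete example (SCAD, one of the non-convex penalties covered in the reviews cited above, not the paper's Hyper-LASSO prior), a sketch comparing it with LASSO's soft-thresholding:

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO rule: every surviving coefficient is shrunk toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD rule (Fan & Li): small |z| is zeroed like LASSO,
    but large |z| is left entirely unshrunk."""
    az = np.abs(z)
    return np.where(az <= 2 * lam,
                    soft_threshold(z, lam),
                    np.where(az <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))

z = np.array([0.5, 1.5, 4.0])  # hypothetical least-squares estimates
print(soft_threshold(z, 1.0))  # LASSO shrinks the large coefficient too
print(scad_threshold(z, 1.0))  # SCAD keeps the large coefficient intact
```

Both rules send the small coefficient 0.5 to exactly zero, but LASSO also shrinks the strong signal 4.0 down to 3.0, whereas SCAD returns it unchanged; this bias on large coefficients is what non-convex penalties are designed to avoid.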

