Abstract

Uncovering driver genes is crucial for understanding heterogeneity in cancer. L1-type regularization approaches have been widely used for uncovering cancer driver genes based on genome-scale data. Although the existing methods have been widely applied in the field of bioinformatics, they possess several drawbacks: subset size limitations, erroneous estimation results, multicollinearity, and heavy time consumption. We introduce a novel statistical strategy, called Recursive Random Lasso (RRLasso), for high-dimensional genomic data analysis and investigation of driver genes. For time-effective analysis, we consider a recursive bootstrap procedure in line with the random lasso. Furthermore, we introduce a parametric statistical test for driver gene selection based on bootstrap regression modeling results. The proposed RRLasso is not only rapid but also performs well for high-dimensional genomic data analysis. Monte Carlo simulations and analysis of the “Sanger Genomics of Drug Sensitivity in Cancer” dataset from the Cancer Genome Project show that the proposed RRLasso is an effective tool for high-dimensional genomic data analysis. The proposed methods provide reliable and biologically relevant results for cancer driver gene selection.

Highlights

  • Much research is currently underway to understand the complexity of the heterogeneous genetic networks underlying cancer

  • We describe a variable selection strategy built on the random lasso procedure; the accompanying parametric statistical test is a useful tool for bootstrap regression modeling

  • We used a ridge estimator for the weights in the existing adaptive lasso, set the threshold of the existing random lasso to s/n, and selected s based on the root mean squared error on the validation dataset

Summary

Introduction

Much research is currently underway to understand the complexity of the heterogeneous genetic networks underlying cancer. To identify these networks, various large-scale omics projects (e.g., The Cancer Genome Project, The Cancer Genome Atlas (TCGA), the Sanger Genomics of Drug Sensitivity in Cancer dataset from the Cancer Genome Project, and others) have been initiated and have provided large amounts of data, such as genomic and epigenomic data for cancer patients or cell lines. Although various L1-type regularization approaches, e.g., lasso [1] and elastic net [2], have been widely used to identify cancer driver genes, they possess several drawbacks as tools for driver gene identification [3]. The elastic net, which has been widely used in bioinformatics research, may provide erroneous estimation results for coefficients of highly correlated variables with different magnitudes, especially those that differ in sign, because of its “grouping effect”. Adaptive L1-type regularization methods suffer from multicollinearity, since their adaptive data-driven weights are based on ordinary least squares (OLS) estimators.
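
As a rough illustration of the adaptive weighting discussed above, the following NumPy sketch builds adaptive lasso weights from a ridge estimate instead of an OLS estimate, which is the substitution mentioned in the highlights. The function names, the coordinate-descent solver, the penalty values, and the simulated data are all assumptions made for this example. It is not the authors' implementation and it is not the RRLasso procedure itself.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge estimate (X'X + alpha*I)^{-1} X'y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)              # residual y - X @ beta, with beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]   # remove predictor j from the fit
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]   # put the updated predictor back
    return beta

def adaptive_lasso_ridge(X, y, lam=0.1, ridge_alpha=1.0, gamma=1.0):
    """Adaptive lasso with weights w_j = 1 / |ridge_j|^gamma (assumed form)."""
    b_ridge = ridge_fit(X, y, ridge_alpha)
    w = 1.0 / (np.abs(b_ridge) ** gamma + 1e-8)
    X_scaled = X / w                     # divide column j by w_j
    b_scaled = lasso_cd(X_scaled, y, lam)
    return b_scaled / w                  # map back to the original scale

# Simulated data (hypothetical): 20 predictors, 3 true signals, and two
# highly correlated predictors whose coefficients differ in sign.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X[:, 1] = 0.9 * X[:, 0] + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n)
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_hat = adaptive_lasso_ridge(X, y)
print(np.round(beta_hat, 2))
```

The ridge step is used here only because it remains well defined when predictors are strongly correlated or when there are more predictors than samples, which is the situation in which OLS-based weights break down.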
