Elastic Correlation Adjusted Regression (ECAR) scores for high dimensional variable importance measuring

Yuan Zhou,Jianle Sun,Zhangsheng Yu,Yue Zhang,Botao Fa,Ting Wei

doi:10.1038/s41598-021-02706-0

Open Access

Abstract

Investigation of the genetic basis of traits or clinical outcomes heavily relies on identifying relevant variables in molecular data. However, characteristics such as high dimensionality and complex correlation structures of these data hinder the development of related methods, resulting in the inclusion of false positives and negatives. We developed a variable importance measure method, termed the ECAR scores, that evaluates the importance of variables in the dataset. Based on this score, ranking and selection of variables can be achieved simultaneously. Unlike most current approaches, the ECAR scores aim to rank the influential variables as high as possible while maintaining the grouping property, instead of selecting the ones that are merely predictive. The ECAR scores’ performance is tested and compared to other methods on simulated, semi-synthetic, and real datasets. Results showed that the ECAR scores improve on the CAR scores in terms of accuracy of variable selection and high-rank variables’ predictive power. They also outperform other classic methods such as lasso and stability selection when there is a high degree of correlation among influential variables. As an application, we used the ECAR scores to analyze genes associated with forced expiratory volume in the first second in patients with lung cancer and reported six associated genes.

Highlights

As a result of novel biotechnology such as next-generation sequencing (NGS) technologies, genomic and clinical research have benefited dramatically from the steep increase in both quantity and quality of molecular data.
Some researchers propose to rank and select variables based on their assigned scores. An example of this is the variable importance computed by random forest[14], which has been applied in genetics[15], gene expression[16], methylation[17], proteomics[18], and metabolomics.
We showed that Elastic Correlation Adjusted Regression (ECAR) improves on CAR and rivals popular methods like the lasso in terms of the variable selection accuracy and the predictive power of high-rank variables.

Summary

Introduction

As a result of novel biotechnology such as next-generation sequencing (NGS) technologies, genomic and clinical research have benefited dramatically from the steep increase in both quantity and quality of molecular data. Adjusting the threshold with multiple comparison criteria, such as Bonferroni or False Discovery Rate (FDR) correction, will cause variables of small to moderate effects to be erroneously discarded[10], introducing many false negatives. Another class of methods is penalized regression models (e.g., lasso[11], elastic net[12], minimax concave penalty[13]), which aim to select a small set of predictors that are associated with a trait. CAR scores[21] and CARS scores[22] fall into this group; they are easy to calculate, but might not be flexible enough when the noise in the data is too small or too large. Due to these limitations of current approaches, we developed the Elastic Correlation Adjusted Regression (ECAR) score for simultaneous variable ranking and selection in high dimensional biological data. We showed that ECAR improves on CAR and rivals popular methods like the lasso in terms of the variable selection accuracy and the predictive power of high-rank variables.
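
To make the remark that CAR scores are easy to calculate concrete, the following is a minimal sketch of the plain CAR score, not of ECAR, whose formula is not given in this summary. It assumes the commonly used definition ω = P^(-1/2) P_Xy, where P is the correlation matrix of the predictors and P_Xy is the vector of marginal correlations between each predictor and the response. The function name `car_scores` is hypothetical, and the sketch uses the empirical correlation matrix, so it assumes more samples than variables; high dimensional data would need a regularized (for example, shrinkage) estimate of P instead.

```python
import numpy as np

def car_scores(X, y):
    """Plain CAR scores under the assumed definition omega = P^(-1/2) P_Xy.

    X: (n, p) predictor matrix with n > p (assumption of this sketch).
    y: (n,) response vector.
    """
    n = X.shape[0]
    # Standardize predictors and response (population standard deviation).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    # Empirical predictor correlation matrix and marginal correlations.
    P = Xs.T @ Xs / n
    P_xy = Xs.T @ ys / n
    # Inverse matrix square root of P via its eigendecomposition.
    evals, evecs = np.linalg.eigh(P)
    P_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return P_inv_sqrt @ P_xy

# Toy data (made up for illustration): two strongly correlated
# influential variables plus three noise variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
y = X[:, 0] + X[:, 1] + rng.normal(size=200)

omega = car_scores(X, y)
ranking = np.argsort(-omega ** 2)  # variables ordered by squared CAR score
print(omega, ranking)
```

Under this definition the squared scores sum to the R² of the ordinary least squares fit, which is one reason they can be read as a decomposition of explained variance across variables.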

Methods

Results

Conclusion