Variable Selection Accuracy Research Articles

Background: Pan-omics, pan-cancer analysis has advanced our understanding of the molecular heterogeneity of cancer. However, such analyses have been limited in their ability to use information from multiple sources of data (e.g., omics platforms) and multiple sample sets (e.g., cancer types) to predict clinical outcomes. We address the issue of prediction across multiple high-dimensional sources of data and sample sets by using molecular patterns identified by BIDIFAC+, a method for integrative dimension reduction of bidimensionally linked matrices, in a Bayesian hierarchical model. Our model performs variable selection through spike-and-slab priors that borrow information across clustered data. We use this model to predict overall patient survival from The Cancer Genome Atlas with data from 29 cancer types and 4 omics sources and use simulations to characterize the performance of the hierarchical spike-and-slab prior.

Results: We found that molecular patterns shared across all or most cancers were largely not predictive of survival. However, our model selected patterns unique to subsets of cancers that differentiate clinical tumor subtypes with markedly different survival outcomes. Some of these subtypes were previously established, such as subtypes of uterine corpus endometrial carcinoma, while others may be novel, such as subtypes within a set of kidney carcinomas. Through simulations, we found that the hierarchical spike-and-slab prior performs best in terms of variable selection accuracy and predictive power when borrowing information is advantageous, but also offers competitive performance when it is not.

Conclusions: We address the issue of prediction across multiple sources of data by using results from BIDIFAC+ in a Bayesian hierarchical model for overall patient survival. By incorporating spike-and-slab priors that borrow information across cancers, we identified molecular patterns that distinguish clinical tumor subtypes within a single cancer and within a group of cancers. We also corroborate the flexibility and performance of using spike-and-slab priors as a Bayesian variable selection approach.
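
To make the idea of a spike-and-slab prior that borrows information across cancers more concrete, the following is a minimal sketch that draws coefficients from one simple hierarchical form of such a prior. It is not the model from the abstract: the function name `sample_spike_and_slab`, the Beta hyperparameters `a` and `b`, the slab standard deviation `slab_sd`, and the choice of 5 patterns are all hypothetical. Only the count of 29 cancer types comes from the text.

```python
import numpy as np

def sample_spike_and_slab(n_groups, n_patterns, slab_sd=1.0, a=1.0, b=1.0, seed=0):
    """Draw coefficients from a simple hierarchical spike-and-slab prior.

    Assumed (illustrative) structure:
      pi[k]       ~ Beta(a, b)            shared inclusion probability of pattern k
      gamma[g, k] ~ Bernoulli(pi[k])      inclusion indicator for group g
      beta[g, k]  = 0                     if gamma[g, k] is False (the spike)
                  ~ Normal(0, slab_sd^2)  otherwise (the slab)
    """
    rng = np.random.default_rng(seed)
    # One inclusion probability per pattern, shared by every group.
    # Sharing pi[k] is what lets groups borrow information from each other.
    pi = rng.beta(a, b, size=n_patterns)
    # Bernoulli draws via uniform comparison; pi broadcasts across groups.
    gamma = rng.random((n_groups, n_patterns)) < pi
    # Slab draws for every entry, then zero out the excluded ones.
    slab = rng.normal(0.0, slab_sd, size=(n_groups, n_patterns))
    beta = np.where(gamma, slab, 0.0)
    return pi, gamma, beta

# 29 groups mirrors the 29 cancer types; 5 patterns is an arbitrary choice.
pi, gamma, beta = sample_spike_and_slab(n_groups=29, n_patterns=5)
```

In this sketch a pattern with a high shared `pi[k]` tends to be included for many groups at once, while a pattern with a low `pi[k]` is mostly set to exactly zero, which is the variable selection behavior the abstract refers to.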

Background: Previous studies have reported that labeling errors are not uncommon in omics data. Potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on the least trimmed square (enetLTS), and Ensemble. Ensemble is an ensemble classification method based on distinct feature selection and modeling strategies. The accuracy of biomarker selection and outlier detection of these methods needs to be evaluated and compared so that the appropriate method can be chosen.

Results: The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variables. Overall, enetLTS had the best outlier detection accuracy, with false positive rates < 0.05 and high sensitivity, and it still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers Ensemble missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.

Conclusions: When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is > 5%, Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.
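
The abstract above compares methods by positive selection rate and false discovery rate. As a rough illustration, the sketch below computes both for one simulated run, assuming the common definitions: positive selection rate is the share of truly active variables that were selected, and false discovery rate is the share of selected variables that are not truly active. The study's exact definitions and its "comprehensive index" are not given here, and the function name `selection_metrics` and the example indices are hypothetical.

```python
def selection_metrics(selected, truly_active):
    """Return (positive selection rate, false discovery rate).

    selected:      indices of variables chosen by a method
    truly_active:  indices of variables that truly affect the response
                   (known only in simulations)
    """
    selected = set(selected)
    truly_active = set(truly_active)
    true_pos = len(selected & truly_active)
    # Share of truly active variables that the method recovered.
    psr = true_pos / len(truly_active) if truly_active else float("nan")
    # Share of selected variables that are false discoveries;
    # defined as 0 when nothing was selected.
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return psr, fdr

# Hypothetical run: variables 0-3 are truly active, the method picked 0, 1, 2, 7.
psr, fdr = selection_metrics(selected=[0, 1, 2, 7], truly_active=[0, 1, 2, 3])
print(psr, fdr)  # 0.75 0.25
```

A method that misses active variables, as Ensemble reportedly does at 10% or 15% outliers, would show a lower positive selection rate under these definitions even if its false discovery rate stays low.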

Articles published on Variable Selection Accuracy

Automated Bayesian variable selection methods for binary regression models with missing covariate data

Likelihood-based surrogate dimension reduction

Robust penalized empirical likelihood in high dimensional longitudinal data analysis

A Tweedie Compound Poisson Model in Reproducing Kernel Hilbert Space

Variable Selection in Macroeconomic Forecasting with Many Predictors

Using Background Knowledge from Preceding Studies for Building a Random Forest Prediction Model: A Plasmode Simulation Study

A hierarchical spike-and-slab model for pan-cancer survival using pan-omic data

Weighted Cox regression for the prediction of heterogeneous patient subgroups

Expanding the Scope of Multivariate Regression Approaches in Cross-Omics Research

The sparse group lasso for high-dimensional integrative linear discriminant analysis with application to Alzheimer's disease prediction

Robust boosting for regression problems

Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data

Penalized linear regression with high-dimensional pairwise screening

Extending Classification Algorithms to Case-Control Studies.

Adaptive lasso for accelerated hazards models

Efficient test-based variable selection for high-dimensional linear models

On the oracle property of a generalized adaptive elastic-net for multivariate linear regression with a diverging number of parameters

Variable selection for high-dimensional genomic data with censored outcomes using group lasso prior

Randomizing outputs to increase variable selection accuracy

Mean and quantile boosting for partially linear additive models
