Abstract

Background: Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Prognostic models of clinical outcomes are usually built and validated utilizing variable selection methods and machine learning tools. The challenges in ultra-high dimensional space, however, are not only to reduce the dimensionality of the data but also to retain the important variables that predict the outcome. Screening approaches, such as the sure independence screening (SIS), iterative SIS (ISIS), and principled SIS (PSIS), have been developed to overcome the challenge of high dimensionality. We are interested in identifying important single-nucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. While the abovementioned variable selection approaches have theoretical justification in selecting SNPs, the comparative performance of combinations of these methods in predicting time-to-event outcomes has not been previously studied in ultra-high dimensional space with hundreds of thousands of variables.

Methods: We conducted a series of simulations to compare the performance of different combinations of variable selection approaches and classification trees, such as the least absolute shrinkage and selection operator (LASSO), adaptive least absolute shrinkage and selection operator (ALASSO), and random survival forest (RSF), in ultra-high dimensional data for the purpose of developing prognostic models for a time-to-event outcome that is subject to censoring. The variable selection methods were evaluated for discrimination (Harrell’s concordance statistic), calibration, and overall performance. In addition, we applied these approaches to 498,081 SNPs from 623 Caucasian patients with prostate cancer.

Results: When n = 300, ISIS-LASSO and ISIS-ALASSO chose all the informative variables, which resulted in the highest Harrell’s c-index (> 0.80). On the other hand, with a small sample size (n = 150), ALASSO performed better than any other combination, as demonstrated by the highest c-index and/or overall performance, although there was evidence of overfitting. In analyzing the prostate cancer data, the ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO combinations achieved the highest discrimination, with a c-index of 0.67.

Conclusions: Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISIS-LASSO and ISIS-ALASSO, allows for the development of prognostic models with both high predictive accuracy and a low risk of overfitting, assuming moderate sample sizes.
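
Harrell’s concordance statistic, the discrimination measure used in this abstract, can be illustrated with a short sketch. The code below is not the authors' implementation: the function name `harrell_c_index`, the toy data, and the choice to skip pairs with tied follow-up times are assumptions made for illustration only. It counts, among comparable patient pairs, how often the patient with the shorter observed survival was assigned the higher predicted risk.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's c-index for right-censored data (illustrative).

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if censored
    risk_scores: higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: tied times are skipped in this sketch
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk went with shorter survival
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: five patients, two of them censored.
times = [5, 8, 12, 3, 10]
events = [1, 0, 1, 1, 0]
risk = [0.9, 0.4, 0.2, 0.8, 0.3]
c = harrell_c_index(times, events, risk)
print(round(c, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported c-index values of 0.67 and > 0.80 should be read.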

Highlights

  • Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine

  • The aggressive sure independence screening (SIS) starts by randomly splitting the sample into two partitions, n1 and n2; SIS then computes the marginal correlation between each single variable and the survival outcome within each partition

  • The highest R²_BS(2) values were observed for principled SIS (PSIS)-least absolute shrinkage and selection operator (LASSO) and PSIS-adaptive LASSO (ALASSO) when n = 150 regardless of the signal strength
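
The split-sample screening step described in the highlights can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: it ranks variables by absolute Pearson correlation with a continuous outcome and therefore ignores censoring, it keeps only the variables ranked in the top k in both partitions (the intersection rule is an assumption, not stated in the text), and the function names and simulated data are hypothetical.

```python
import random
from statistics import mean


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return abs(sxy) / (sxx * syy) ** 0.5


def top_k_by_marginal_correlation(X, y, rows, k):
    """Indices of the k columns of X most correlated with y on `rows`."""
    y_sub = [y[r] for r in rows]
    scores = []
    for j in range(len(X[0])):
        col = [X[r][j] for r in rows]
        scores.append((abs_correlation(col, y_sub), j))
    scores.sort(reverse=True)
    return {j for _, j in scores[:k]}


def split_sample_sis(X, y, k, seed=0):
    """Screen within two random partitions and keep the overlap."""
    rng = random.Random(seed)
    rows = list(range(len(X)))
    rng.shuffle(rows)
    half = len(rows) // 2
    part1, part2 = rows[:half], rows[half:]  # partitions of size n1 and n2
    keep1 = top_k_by_marginal_correlation(X, y, part1, k)
    keep2 = top_k_by_marginal_correlation(X, y, part2, k)
    return sorted(keep1 & keep2)


# Simulated data: 200 subjects, 50 variables, only variables 0 and 1 matter.
rng = random.Random(1)
n, p = 200, 50
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 1) for row in X]
selected = split_sample_sis(X, y, k=5)
print(selected)
```

Requiring a variable to survive screening in both partitions is one way to limit false positives, since a noise variable must rank highly by chance twice. For a censored time-to-event outcome, the marginal correlation would need to be replaced by a utility that accounts for censoring.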


Summary

Introduction

Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. The challenges in ultra-high dimensional space are to reduce the dimensionality of the data and to retain the important variables that predict the outcome. LASSO, ALASSO, and RSF have been extended to time-to-event endpoints that are subject to censoring, and these methods are capable of reducing the number of variables in high-dimensional settings. To overcome the challenge of high dimensionality, Fan et al. [10] and Zhao and Li [11] proposed screening methods, such as the sure independence screening (SIS) [10], the iterative SIS (ISIS) [10], and the principled SIS (PSIS) [11], to expedite computing time and improve estimation accuracy in an ultra-high dimensional setting.

