Abstract

Background: In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome, we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available.

Results: We identify one variant, termed “block forest”, that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degree of improvement over random survival forest varied strongly across data sets. Moreover, making all clinical covariates mandatory candidates at each split improved the performance. This result should, however, be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application.

Conclusions: The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.

Highlights

  • In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient

  • Klau et al. [11] present the priority-Lasso, a lasso-type prediction method for multi-omics data that differs from the approaches described above in that its main focus is not prediction accuracy but practical applicability: with this method the user has to provide a priority order of the blocks, motivated for example by the costs of generating each type of data

  • For each of the five random forest variants for multi-omics data presented in this paper, we considered a variation that, for each split, includes all clinical covariates in the sets of variables tried out for finding the optimal split point
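
The idea of always offering the clinical covariates at each split can be illustrated with a minimal Python sketch. It is not the authors' implementation: the block names, the column indices, and the rule of drawing ceil(sqrt(block size)) variables from each non-clinical block are assumptions chosen only for illustration.

```python
import math
import random


def candidate_variables(blocks, clinical_block="clinical", rng=None):
    """Return the column indices tried at one split (illustrative only).

    blocks: dict mapping a block name to the list of column indices in it.
    Every index of `clinical_block` is always included. From each other
    block, ceil(sqrt(block size)) indices are drawn at random; this
    per-block size is an assumption, not taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    chosen = []
    for name, columns in blocks.items():
        if name == clinical_block:
            chosen.extend(columns)  # clinical covariates are mandatory
        else:
            k = min(len(columns), math.ceil(math.sqrt(len(columns))))
            chosen.extend(rng.sample(columns, k))
    return sorted(chosen)


# Hypothetical layout: 3 clinical, 100 expression, 25 mutation features.
blocks = {
    "clinical": [0, 1, 2],
    "mrna": list(range(3, 103)),
    "mutation": list(range(103, 128)),
}
picked = candidate_variables(blocks, rng=random.Random(1))
```

In this sketch every split sees all three clinical columns plus a random subset of each omics block, so the few clinical covariates are not crowded out by the much larger omics blocks.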

Summary

Introduction

In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. A PubMed search for the term “multi-omics” resulted in two papers from the year 2006 (the first mention of the term), five papers from the year 2010, but 368 papers from the year 2018. This long-lasting lack of prediction methods tailored to multi-omics covariate data was probably due to the fact that multi-omics data had not been available on a larger scale until recently. In the context of a comparison study, Boulesteix et al. [5] again consider an approach based on combining prediction rules, each learned using a single block: first, lasso is fitted to each block and, second, the resulting linear predictors are used as covariates in a low-dimensional regression model. An extensive review of various statistical procedures commonly used in practice in the analysis of multi-omics data can be found in Wu et al. [16].

