Predictive modelling using pathway scores: robustness and significance of pathway collections

Marcelo P. Segura-Lepe, Hector C. Keun, Timothy M. D. Ebbels

doi:10.1186/s12859-019-3163-0

Open Access

https://doi.org/10.1186/s12859-019-3163-0

Journal: BMC bioinformatics	Publication Date: Nov 4, 2019
Citations: 20	License type: open-access

Affiliation: Imperial College London, Hammersmith Hospital

Abstract

Background: Transcriptomic data is often used to build statistical models which are predictive of a given phenotype, such as disease status. Genes work together in pathways and it is widely thought that pathway representations will be more robust to noise in the gene expression levels. We aimed to test this hypothesis by constructing models based on either genes alone, or based on sample specific scores for each pathway, thus transforming the data to a ‘pathway space’. We progressively degraded the raw data by addition of noise and examined the ability of the models to maintain predictivity.

Results: Models in the pathway space indeed had higher predictive robustness than models in the gene space. This result was independent of the workflow, parameters, classifier and data set used. Surprisingly, randomised pathway mappings produced models of similar accuracy and robustness to true mappings, suggesting that the success of pathway space models is not conferred by the specific definitions of the pathway. Instead, predictive models built on the true pathway mappings led to prediction rules with fewer influential pathways than those built on randomised pathways. The extent of this effect was used to differentiate pathway collections coming from a variety of widely used pathway databases.

Conclusions: Prediction models based on pathway scores are more robust to degradation of gene expression information than the equivalent models based on ungrouped genes. While models based on true pathway scores are not more robust or accurate than those based on randomised pathways, true pathways produced simpler prediction rules, emphasizing a smaller number of pathways.

Highlights

Transcriptomic data is often used to build statistical models which are predictive of a given phenotype, such as disease status
Pathway space representation: to assess the contribution of pathway information to the robustness of predictive models, we defined pathway scores that combined the expression of genes in each pathway using principal components analysis (PCA)
Models in pathway space are more robust to noise than models in gene space: we examined the robustness of predictive models to degradation of the raw data
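
The pathway-space transformation and the noise degradation described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each pathway score is the projection of the centred gene block onto its first principal component, and that degradation is additive Gaussian noise; the page states only that PCA was used and that noise was added. The function names `pathway_scores` and `add_noise`, the toy pathway mapping and the noise level are hypothetical.

```python
import numpy as np


def pathway_scores(X, pathways):
    """Map a samples x genes matrix to a samples x pathways matrix.

    Assumption: each pathway score is the projection onto the first
    principal component (PC1) of the genes belonging to that pathway.
    `pathways` maps a pathway name to a list of gene column indices.
    """
    scores = np.empty((X.shape[0], len(pathways)))
    for j, gene_idx in enumerate(pathways.values()):
        block = X[:, gene_idx]
        block = block - block.mean(axis=0)  # centre each gene
        # For the centred block, U[:, 0] * S[0] is the PC1 projection.
        U, S, _ = np.linalg.svd(block, full_matrices=False)
        scores[:, j] = U[:, 0] * S[0]
    return scores


def add_noise(X, sd, rng):
    """Degrade the data with additive Gaussian noise (assumed noise model)."""
    return X + rng.normal(0.0, sd, size=X.shape)


# Toy data: 20 samples, 6 genes, two hypothetical pathways of 3 genes each.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
pathways = {"pathway_A": [0, 1, 2], "pathway_B": [3, 4, 5]}

scores = pathway_scores(X, pathways)
noisy_scores = pathway_scores(add_noise(X, sd=0.5, rng=rng), pathways)
```

A classifier would then be trained on `scores` and on `X` directly, and its accuracy compared as `sd` increases. The sign of a principal component is arbitrary, so scores computed on different data may be flipped relative to each other.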

Summary

Introduction

Transcriptomic data is often used to build statistical models which are predictive of a given phenotype, such as disease status. Data from all omics technologies are subject to a wide variety of technical noise and biological variation, which will degrade the performance of these models and limit the fidelity with which predictive signatures can be identified. Note that the uncontrolled variation may result from a variety of sources, including both technical noise and biological (inter-subject) variation. The latter often dominates the total variance in typical datasets, but it may not be useful for predicting the phenotype of interest. It can be seen as a kind of biological ‘noise’ against which the model must remain robust.

Objectives

Methods

Results

Discussion

Conclusion