A pathway-based data integration framework for prediction of disease progression

José A Seoane, Tom R Gaunt, Colin Campbell, Ian N M Day

doi:10.1093/bioinformatics/btt610

Open Access

Abstract

Motivation: Within medical research there is an increasing trend toward deriving multiple types of data from the same individual. The most effective prognostic prediction methods should use all available data, as this maximizes the amount of information used. In this article, we consider a variety of learning strategies to boost prediction performance based on the use of all available data.

Implementation: We consider data integration via supervised multiple kernel learning methods. We propose a scheme in which feature selection by statistical score is performed separately per data type and by pathway membership. We further consider the introduction of a confidence measure for the class assignment, both to remove some ambiguously labeled datapoints from the training data and to implement a cautious classifier that only makes predictions when the associated confidence is high.

Results: We use the METABRIC dataset for breast cancer, with prediction of survival at 2000 days from diagnosis. Predictive accuracy is improved by using kernels whose features are restricted to genes that are known members of particular pathways. We show that further improvements can be made by using a range of additional kernels based on clinical covariates such as estrogen receptor (ER) status. Using this range of measures to improve prediction performance, we show that the test accuracy on new instances is nearly 80%, though predictions are only made on 69.2% of the patient cohort.

Availability: https://github.com/jseoane/FSMKL

Contact: J.Seoane@bristol.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Within the biomedical sciences it is increasingly common to derive multiple types of data from the same individual.
The METABRIC data consist of clinical data, such as survival period, and data derived from gene expression (EXP) and copy number variation (CNV).
This dataset is derived from a collection of 2000 clinically annotated primary breast cancer specimens, with EXP and CNV data derived from each sample, as described in Curtis et al (2012).

Summary

Introduction

Within the biomedical sciences it is increasingly common to derive multiple types of data from the same individual. By maximizing the information content, models that use all the available data are intrinsically more powerful than models that use only one data type. For these reasons there has been increasing interest in data integration methods and their use with genomic datasets, both for unsupervised learning (Agius et al, 2009; Huopaniemi et al, 2010; Rogers et al, 2010; Savage et al, 2010; Yuan et al, 2011) and for supervised learning (Bach et al, 2004; Gonen and Alpaydin, 2011; Lanckriet et al, 2004; Rakotomamonjy et al, 2008). One way of combining data types for supervised learning is to associate a confidence measure with the vote of each member of a committee of classifiers and use these probabilistic measures to define each member's relative contribution to the final decision. Here, however, we follow the more direct route of encoding each type of data into objects called kernels and using a weighted combination of these in the final decision function, an approach called multiple kernel learning (MKL) (see Fig. 1). In Damoulas et al (2008), we found that the use of probabilistic assumptions led to a test accuracy lower than that achievable by non-probabilistic classifiers.
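The idea of a weighted combination of kernels can be illustrated with a minimal Python sketch. It is not the authors' FSMKL implementation: the toy data, the Gaussian kernel, the `gamma` value, and all function names are assumptions made for illustration. In particular, the kernel weights are set by a simple kernel-target alignment heuristic rather than learned by an MKL solver as in the paper.

```python
import math

def rbf_kernel(X, gamma):
    """Gram matrix of a Gaussian (RBF) kernel over the rows of X."""
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def alignment(K, y):
    """Kernel-target alignment <K, yy^T> / (||K||_F * ||yy^T||_F) for labels in {-1, +1}."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_K = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    return num / (norm_K * n)  # ||yy^T||_F equals n when every label is +1 or -1

def combine(kernels, weights):
    """Weighted sum of Gram matrices: K = sum_m w_m * K_m."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Toy cohort of six samples (hypothetical values, not METABRIC data).
y = [1, 1, 1, -1, -1, -1]
X_exp = [[2.0, 1.9], [2.1, 2.0], [1.9, 2.2],
         [-2.0, -1.8], [-2.1, -2.0], [-1.8, -2.1]]   # expression-like, informative
X_cnv = [[0.2], [1.0], [-0.8], [0.0], [0.9], [-1.0]]  # CNV-like, nearly uninformative

# One kernel per data type.
kernels = [rbf_kernel(X_exp, gamma=0.5), rbf_kernel(X_cnv, gamma=0.5)]

# Heuristic weights: non-negative alignments, normalised to sum to one.
scores = [max(alignment(K, y), 0.0) for K in kernels]
total = sum(scores)
weights = ([s / total for s in scores] if total > 0
           else [1.0 / len(kernels)] * len(kernels))

K_combined = combine(kernels, weights)
print("kernel weights:", [round(w, 3) for w in weights])
```

The combined Gram matrix can then be passed to any kernel classifier that accepts a precomputed kernel, for example a support vector machine. Because the weights are non-negative, the combination remains a valid kernel, and a data type that carries little information about the label receives a small weight.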

Methods

Results

Conclusion