Multi-study Integration of Brain Cancer Transcriptomes Reveals Organ-Level Molecular Signatures

Jaeyun Sung, Leroy Hood, Shuyi Ma, Yuliang Wang, Donald Geman, Nathan D Price, Andrew T Magis, Pan-Jun Kim, Cory C Funk, Isidore Rigoutsos

doi:10.1371/journal.pcbi.1003148

Abstract

We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. The resulting molecular signatures were based on a new method reported herein, Identification of Structured Signatures and Classifiers (ISSAC), which yielded a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, and others have known roles in cancer biology. Analyses of large-scale data from multiple sources must contend with substantial heterogeneity between published studies: as is typical, variation among individual studies often had a larger effect on the transcriptome than did phenotype differences. For this reason, we restricted our analysis to phenotypes for which at least two independent studies were available, and we reprocessed all raw data from those studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when the size of the training set was kept the same. This was most likely because the meta-signature encompassed more of the heterogeneity across different sources and conditions while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, we obtained 90% phenotype prediction accuracy, defined as the accuracy of identifying a particular brain cancer from the background of all phenotypes. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.

Highlights

One important goal in systems medicine is to develop molecular diagnostics that can accurately and comprehensively report health and disease states of an organ system [1,2].
From a multi-study, integrated transcriptomic dataset, we identified a marker panel for differentiating major human brain cancers at the gene-expression level.
We found that sufficient dataset integration across multiple studies greatly enhanced diagnostic performance on truly independent validation sets, whereas signatures learned from only one dataset typically led to high error rates.
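
The validation idea in the last highlight, training on some studies and testing on a study never seen during training, can be sketched as follows. This is only a minimal illustration and is not the ISSAC method: a nearest-centroid classifier is assumed as a stand-in, and the function names, study identifiers, phenotype labels, and expression values are all hypothetical.

```python
from collections import defaultdict


def nearest_centroid_fit(samples):
    """Fit a stand-in classifier (NOT ISSAC): one mean vector per phenotype.

    samples: list of (expression_vector, phenotype) pairs.
    Returns a dict mapping phenotype -> centroid (list of floats).
    """
    sums = {}
    counts = defaultdict(int)
    for x, y in samples:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def nearest_centroid_predict(centroids, x):
    """Return the phenotype whose centroid is closest to x (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def leave_one_study_out(data):
    """Hold out each study in turn, train on the rest, test on the held-out one.

    data: list of (study_id, expression_vector, phenotype) triples.
    Returns a dict mapping held-out study_id -> accuracy on that study.
    """
    studies = sorted({s for s, _, _ in data})
    results = {}
    for held_out in studies:
        train = [(x, y) for s, x, y in data if s != held_out]
        test = [(x, y) for s, x, y in data if s == held_out]
        model = nearest_centroid_fit(train)
        correct = sum(nearest_centroid_predict(model, x) == y for x, y in test)
        results[held_out] = correct / len(test)
    return results


# Made-up toy data: three hypothetical studies, two hypothetical phenotypes,
# two "genes" per sample. Values are illustrative only.
toy = [
    ("study_A", [1.0, 0.0], "phenotype_1"),
    ("study_A", [0.0, 1.0], "phenotype_2"),
    ("study_B", [1.2, 0.1], "phenotype_1"),
    ("study_B", [0.1, 1.3], "phenotype_2"),
    ("study_C", [0.9, -0.1], "phenotype_1"),
    ("study_C", [-0.1, 0.8], "phenotype_2"),
]

if __name__ == "__main__":
    for study, acc in leave_one_study_out(toy).items():
        print(study, acc)
```

Because every held-out study is excluded from training, the reported accuracy reflects performance on an independent source rather than on resampled data from the same study, which is the distinction the highlight draws.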

Introduction

One important goal in systems medicine is to develop molecular diagnostics that can accurately and comprehensively report health and disease states of an organ system [1,2]. The discovery of organ-level molecular signatures [3] from global biomolecule expression measurements would mark a significant advance toward this goal. In this regard, genome-wide transcriptomic data are readily available, making them a promising source for molecular signatures as well as a good means to study the robustness of signatures across different studies. While many molecular signature studies have focused on identifying differences between case (e.g., cancer) and control (e.g., normal), a more clinically relevant and challenging task is the multi-category classification problem. This task pertains especially to identifying signatures for molecular screening and monitoring purposes. The successful identification of more reliable and efficient molecular signatures will be critical for the blood-based, organ-specific diagnostics envisioned for the future [9].

Materials and Methods