A genome-wide approach to link genotype to clinical outcome by utilizing next generation sequencing and gene chip data of 6,697 breast cancer patients.

Lőrinc Pongor, András Szabó, Christos Hatzis, Lajos Pusztai, Máté Kormos, Balázs Győrffy

doi:10.1186/s13073-015-0228-1


Open Access


Abstract

Background: The use of somatic mutations for predicting clinical outcome is difficult because a mutation can indirectly influence the function of many genes, and also because clinical follow-up is sparse in the relatively young next generation sequencing (NGS) databanks. Here we approach this problem by linking sequence databanks to well annotated gene-chip datasets, using a multigene transcriptomic fingerprint as a link between gene mutations and gene expression in breast cancer patients.

Methods: The database consists of 763 NGS samples containing mutational status for 22,938 genes and RNA-seq data for 10,987 genes. The gene chip database contains 5,934 patients with 10,987 genes plus clinical characteristics. For the prediction, mutations present in a sample are first translated into a ‘transcriptomic fingerprint’ by running ROC analysis on mutation and RNA-seq data. Then correlation to survival is assessed by computing Cox regression for both up- and downregulated signatures.

Results: According to this approach, the top driver oncogenes having a mutation prevalence over 5% included AKT1, TRANK1, TRAPPC10, RPGR, COL6A2, RAPGEF4, ATG2B, CNTRL, NAA38, OSBPL10, POTEF, SCLT1, SUN1, VWDE, MTUS2, and PIK3CA, and the top tumor suppressor genes included PHEX, TP53, GGA3, RGS22, PXDNL, ARFGEF1, BRCA2, CHD8, GCC2, and ARMC4. The system was validated by computing correlation between RNA-seq and microarray data (r² = 0.73, P < 1E-16). Cross-validation using 20 genes with a prevalence of approximately 5% confirmed analysis reproducibility.

Conclusions: We established a pipeline enabling rapid clinical validation of a discovered mutation in a large breast cancer cohort. An online interface is available for evaluating any human gene mutation or combinations of up to three such genes (http://www.g-2-o.com).

Electronic supplementary material: The online version of this article (doi:10.1186/s13073-015-0228-1) contains supplementary material, which is available to authorized users.
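The fingerprint step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy expression values, and the AUC cutoff of 0.7 are all assumptions made for illustration, and the abstract does not state which thresholds the pipeline uses. For each gene, the sketch computes the ROC area under the curve (AUC) for separating mutated from wild-type samples by expression, then assigns the gene to an upregulated or downregulated signature.

```python
from statistics import mean


def auc(mutated, wild_type):
    """ROC AUC: probability that a mutated sample has higher expression
    than a wild-type sample, with ties counted as one half."""
    wins = 0.0
    for m in mutated:
        for w in wild_type:
            if m > w:
                wins += 1.0
            elif m == w:
                wins += 0.5
    return wins / (len(mutated) * len(wild_type))


def fingerprint(expression, is_mutated, auc_cutoff=0.7):
    """Split genes into up- and downregulated signatures.

    expression: dict mapping gene name -> list of expression values (one per sample)
    is_mutated: list of booleans, True where the sample carries the mutation
    auc_cutoff: hypothetical threshold, chosen only for this illustration
    """
    up, down = [], []
    for gene, values in expression.items():
        mut = [v for v, flag in zip(values, is_mutated) if flag]
        wt = [v for v, flag in zip(values, is_mutated) if not flag]
        area = auc(mut, wt)
        if area >= auc_cutoff:
            up.append(gene)
        elif area <= 1.0 - auc_cutoff:
            down.append(gene)
    return up, down


def signature_score(sample_expression, genes):
    """Mean expression of the signature genes in one sample."""
    return mean(sample_expression[g] for g in genes)


# Toy data: three mutated and three wild-type samples, placeholder gene names.
is_mutated = [True, True, True, False, False, False]
expression = {
    "GENE_A": [9.0, 8.5, 9.2, 5.0, 5.5, 4.8],  # higher in mutated samples
    "GENE_B": [2.0, 2.5, 1.8, 6.0, 6.3, 5.9],  # lower in mutated samples
    "GENE_C": [5.0, 6.0, 4.0, 5.5, 4.5, 5.2],  # no clear separation
}
up, down = fingerprint(expression, is_mutated)
score = signature_score({"GENE_A": 7.0, "GENE_B": 3.0, "GENE_C": 5.0}, up)
```

In the approach described above, a per-patient score of this kind would then be tested against survival in the gene chip cohort with Cox regression; that step is omitted here because it requires a survival analysis library.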

Highlights

The use of somatic mutations for predicting clinical outcome is difficult because a mutation can indirectly influence the function of many genes, and because clinical follow-up is sparse in the relatively young next generation sequencing (NGS) databanks.
The most important genes identified in 100 primary breast cancers included AKT1, BRCA1, CDH1, GATA3, PIK3CA, PTEN, RB, TP53, ARID1B, CASP8, and MAP3K1 [7].
Database setup: Central to our approach is the joint analysis of three breast cancer datasets, including somatic mutations and RNA-seq gene expression from The Cancer Genome Atlas (TCGA) project, and microarray and detailed survival data for a separate large cohort of breast cancer patients.

Summary

Introduction

The use of somatic mutations for predicting clinical outcome is difficult because a mutation can indirectly influence the function of many genes, and because clinical follow-up is sparse in the relatively young next generation sequencing (NGS) databanks. We approach this problem by linking sequence databanks to well annotated gene-chip datasets, using a multigene transcriptomic fingerprint as a link between gene mutations and gene expression in breast cancer patients. The most important genes identified in 100 primary breast cancers included AKT1, BRCA1, CDH1, GATA3, PIK3CA, PTEN, RB, TP53, ARID1B, CASP8, and MAP3K1 [7]. Ellis and associates performed NGS on biopsies from two neoadjuvant aromatase inhibitor clinical trials and found PIK3CA, TP53, GATA3, CDH1, RB1, MLL3, MAP3K1, and CDKN1B to be the primary genes affected [10].

Objectives

Methods

Results

Discussion

Conclusion