Abstract

OnkoInsight is a pipeline designed to detect cancer driver genes from large sequencing datasets. It includes the somatic mutation detection module SomaticSeq and the novel driver gene detection module GSMuta. SomaticSeq combines an ensemble of variant callers with machine learning to detect somatic mutations accurately. In Stage 5 of the ICGC-TCGA DREAM Somatic Mutation Challenge, SomaticSeq v1 placed #1 and #2 in the INDEL and SNV sub-challenges, respectively. In the current project, we used the improved SomaticSeq v2.2.2, which extracts most features directly from the BAM files instead of relying on SAMtools and can handle multiple variant calls at the same position. We incorporated MuTect, Indelocator, VarScan2, SomaticSniper, VarDict, MuSE, and LoFreq. GSMuta detects regions, genes, and pathways that are enriched for somatic mutations. It identifies cancer drivers and distinguishes between oncogenes and tumor suppressor genes. To demonstrate the capability and scalability of OnkoInsight, we deployed the tools as Docker images, developed the pipeline in Common Workflow Language, and analyzed over 1,000 TCGA lung cancer patients with tumor-normal whole exome sequencing data on the Cancer Genomics Cloud. The project involved 569 adenocarcinoma (LUAD) and 490 squamous cell carcinoma (LUSC) samples. On average, SomaticSeq detected over 700 somatic mutations per sample. The predicted mutation rate was consistent with the expected mutation rates of LUAD and LUSC. Once we obtained the high-confidence somatic mutations from SomaticSeq, we used GSMuta to detect driver genes in LUAD and LUSC separately. We detected 97 and 50 driver genes for LUAD and LUSC, respectively. To assess the quality of GSMuta’s driver gene predictions, we compared the results with known lung cancer driver genes. GSMuta reproduced 16 of the 18 LUAD driver genes reported by TCGA’s landmark study, including EGFR, KRAS, and BRAF. It also recovered 9 of the 10 LUSC driver genes reported by TCGA, such as PTEN. LUAD and LUSC shared nine predicted driver genes, and the pathway disruption was homogeneous across the two subtypes. GSMuta also detected some potential new driver genes. The project was completed in less than a week for over 1,000 pairs of exomes on the cloud, at a cost of less than ten dollars per pair. This demonstrates that the OnkoInsight pipeline is highly scalable and can be deployed to reliably analyze population-sized cancer data sets in a reasonable time frame. (L.T.F., M.M., and Y.F. contributed equally.)

Citation Format: Li Tai Fang, Marghoob Mohiyuddin, Yao Fu, Lijing Yao, Narges Bani Asadi, Hugo Y. Lam. OnkoInsight: an end-to-end cancer informatics pipeline to generate insights from large sequencing datasets [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 386. doi:10.1158/1538-7445.AM2017-386
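The comparison against known driver genes described in the abstract amounts to a set overlap. The following is a minimal sketch of that arithmetic, not GSMuta's actual evaluation code: the function name `recovered_fraction` and the toy gene sets are hypothetical (the abstract names only a few of the genes), while the counts 16/18, 9/10, 569, and 490 are taken from the abstract.

```python
def recovered_fraction(predicted, reference):
    """Return the reference genes found among the predictions, and the
    fraction of the reference set that they represent."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference gene set is empty")
    hits = set(predicted) & reference
    return sorted(hits), len(hits) / len(reference)


# Toy input for illustration only. These are NOT the study's gene lists;
# PLACEHOLDER_GENE is a made-up name standing in for a novel prediction.
toy_reference = {"EGFR", "KRAS", "BRAF"}
toy_predicted = {"EGFR", "KRAS", "PLACEHOLDER_GENE"}
toy_hits, toy_fraction = recovered_fraction(toy_predicted, toy_reference)

# Counts reported in the abstract.
luad_fraction = 16 / 18   # LUAD driver genes reproduced
lusc_fraction = 9 / 10    # LUSC driver genes reproduced
total_samples = 569 + 490  # LUAD + LUSC tumor-normal pairs

print(f"toy example: {toy_hits} ({toy_fraction:.1%} of reference)")
print(f"LUAD: {luad_fraction:.1%}, LUSC: {lusc_fraction:.1%}")
print(f"total samples: {total_samples}")
```

With the reported counts, the recovered fractions come to roughly 89% for LUAD and 90% for LUSC, and the two cohorts sum to 1,059 samples, in line with the "over 1,000" figure.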

