Abstract

Background

Accurate detection of somatic single nucleotide variants and small insertions and deletions from DNA sequencing experiments of tumour-normal pairs is a challenging task. Tumour samples are often contaminated with normal cells confounding the available evidence for the somatic variants. Furthermore, tumours are heterogeneous so sub-clonal variants are observed at reduced allele frequencies. We present here a cell-line titration series dataset that can be used to evaluate somatic variant calling pipelines with the goal of reliably calling true somatic mutations at low allele frequencies.

Results

Cell-line DNA was mixed with matched normal DNA at 8 different ratios to generate samples with known tumour cellularities, and exome sequenced on Illumina HiSeq to depths of >300×. The data was processed with several different variant calling pipelines and verification experiments were performed to assay >1500 somatic variant candidates using Ion Torrent PGM as an orthogonal technology. By examining the variants called at varying cellularities and depths of coverage, we show that the best performing pipelines are able to maintain a high level of precision at any cellularity. In addition, we estimate the number of true somatic variants undetected as cellularity and coverage decrease.

Conclusions

Our cell-line titration series dataset, along with the associated verification results, was effective for this evaluation and will serve as a valuable dataset for future somatic calling algorithm development. The data is available for further analysis at the European Genome-phenome Archive under accession number EGAS00001001016. Data access requires registration through the International Cancer Genome Consortium’s Data Access Compliance Office (ICGC DACO).

Electronic supplementary material

The online version of this article (doi:10.1186/s13104-015-1803-7) contains supplementary material, which is available to authorized users.
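To make the link between tumour cellularity and observed allele frequency concrete, here is a minimal sketch. It assumes a clonal, heterozygous somatic variant in a copy-number-neutral diploid region, so the expected variant allele fraction is half the cellularity. The function names and the cellularity values in the loop are hypothetical illustrations; the eight ratios used in the study are not listed in this abstract.

```python
def expected_vaf(cellularity: float) -> float:
    """Expected variant allele fraction of a clonal heterozygous variant.

    Assumption (not from the source): tumour and normal cells are both
    diploid at the locus, so one of two tumour alleles carries the variant.
    """
    if not 0.0 <= cellularity <= 1.0:
        raise ValueError("cellularity must be between 0 and 1")
    return cellularity / 2.0


def expected_alt_reads(cellularity: float, depth: float) -> float:
    """Expected number of variant-supporting reads at a given coverage depth."""
    return expected_vaf(cellularity) * depth


# Hypothetical cellularities, evaluated at a depth of 300x.
for c in (1.0, 0.5, 0.2, 0.1):
    print(f"cellularity={c:.2f}  VAF={expected_vaf(c):.3f}  "
          f"alt reads at 300x={expected_alt_reads(c, 300):.1f}")
```

Under these assumptions the supporting evidence shrinks linearly with cellularity, which is why both low cellularity and low coverage reduce the number of detectable variants.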

Highlights

  • Accurate detection of somatic single nucleotide variants and small insertions and deletions from DNA sequencing experiments of tumour-normal pairs is a challenging task

  • The Cancer Genome Atlas (TCGA) has made available a dataset consisting of sequencing reads from two public cell-lines with matched normals that were synthetically mixed together at varying ratios, and an additional dataset with a sub-clone simulated by artificially introducing variants [12]

  • We evaluated several data analysis pipelines that included two different sequence alignment tools (BWA [14] and Novoalign [15]), realignment and recalibration using the Genome Analysis Tool Kit (GATK) [16], and six different somatic variant callers (GATK, JointSNVMix [4], MuTect [5], Somatic Sniper [6], Strelka [7] and VarScan 2.3.2 [8])
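As an illustration of how a pipeline's calls might be scored against orthogonally verified candidates, the following is a minimal sketch and not the authors' evaluation code. The variant representation (chromosome, position, reference allele, alternate allele), the function name, and the toy data are all assumptions made for this example. Calls that were never assayed are ignored, because their true status is unknown.

```python
from typing import Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def precision_recall(
    called: Iterable[Variant],
    verified_true: Set[Variant],
    verified_false: Set[Variant],
) -> Tuple[float, float]:
    """Precision and recall of a call set, restricted to assayed candidates."""
    called_set = set(called)
    tp = len(called_set & verified_true)    # called and confirmed somatic
    fp = len(called_set & verified_false)   # called but failed verification
    fn = len(verified_true - called_set)    # confirmed somatic but not called
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Toy data, invented for illustration only.
verified_true = {("1", 100, "A", "T"), ("2", 200, "G", "C"), ("3", 300, "C", "A")}
verified_false = {("4", 400, "T", "G")}
calls = [
    ("1", 100, "A", "T"),
    ("2", 200, "G", "C"),
    ("4", 400, "T", "G"),
    ("5", 500, "A", "G"),  # not assayed, so it is ignored in the scoring
]

precision, recall = precision_recall(calls, verified_true, verified_false)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Running the same function on the call sets from each cellularity level would give one precision and recall pair per titration point.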

Summary

Introduction

Accurate detection of somatic single nucleotide variants and small insertions and deletions from DNA sequencing experiments of tumour-normal pairs is a challenging task. We present here a cell-line titration series dataset that can be used to evaluate somatic variant calling pipelines with the goal of reliably calling true somatic mutations at low allele frequencies. Several groups have developed tools to identify cancer-specific mutations from sequencing of tumour and normal sample pairs [4,5,6,7,8]. The Cancer Genome Atlas (TCGA) has made available a dataset consisting of sequencing reads from two public cell-lines with matched normals that were synthetically mixed together at varying ratios, and an additional dataset with a sub-clone simulated by artificially introducing variants [12]. The TCGA dataset has been included as part of the ICGC-TCGA DREAM Mutation Calling challenge, which is ongoing [13].

