PureCN: copy number calling and SNV classification using targeted short read sequencing.

Markus Riester, Derek Y Chiang, Kun Yu, Michael P Morrissey, Catarina D Campbell, Angad P Singh, A Rose Brannon

doi:10.1186/s13029-016-0060-z


Open Access

https://doi.org/10.1186/s13029-016-0060-z


Abstract

Background: Matched sequencing of both tumor and normal tissue is routinely used to classify variants of uncertain significance (VUS) into somatic vs. germline. However, assays used in molecular diagnostics focus on known somatic alterations in cancer genes and often only sequence tumors. Therefore, an algorithm that reliably classifies variants would be helpful for retrospective exploratory analyses. Contamination of tumor samples with normal cells results in differences in expected allelic fractions of germline and somatic variants, which can be exploited to accurately infer genotypes after adjusting for local copy number. However, existing algorithms for determining tumor purity, ploidy and copy number are not designed for unmatched short read sequencing data.

Results: We describe a methodology and corresponding open source software for estimating tumor purity, copy number, loss of heterozygosity (LOH), and contamination, and for classification of single nucleotide variants (SNVs) by somatic status and clonality. This R package, PureCN, is optimized for targeted short read sequencing data, integrates well with standard somatic variant detection pipelines, and has support for matched and unmatched tumor samples. Accuracy is demonstrated on simulated data and on real whole exome sequencing data.

Conclusions: Our algorithm provides accurate estimates of tumor purity and ploidy, even if matched normal samples are not available. This in turn allows accurate classification of SNVs. The software is provided as open source (Artistic License 2.0) R/Bioconductor package PureCN (http://bioconductor.org/packages/PureCN/).

Electronic supplementary material: The online version of this article (doi:10.1186/s13029-016-0060-z) contains supplementary material, which is available to authorized users.
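The Background argument rests on how normal-cell contamination shifts expected allelic fractions. The following is a minimal sketch of that relationship, assuming a diploid normal genome, a heterozygous germline variant, and no sequencing error or mapping bias. The function name and its parameters are illustrative and are not part of the PureCN API.

```python
def expected_allelic_fraction(purity, tumor_cn, multiplicity, germline, normal_cn=2):
    """Expected fraction of reads that carry the variant allele.

    purity       -- fraction of tumor cells in the sample (0..1)
    tumor_cn     -- total copy number of the locus in tumor cells
    multiplicity -- number of tumor copies carrying the variant
    germline     -- True for a heterozygous germline variant, False for somatic
    normal_cn    -- copy number in normal cells (assumed diploid)
    """
    # A heterozygous germline variant sits on one copy in every normal cell;
    # a somatic variant is absent from normal cells.
    normal_variant_copies = 1 if germline else 0
    variant = purity * multiplicity + (1 - purity) * normal_variant_copies
    total = purity * tumor_cn + (1 - purity) * normal_cn
    return variant / total


# Diploid locus in a sample of purity 0.7, variant on one tumor copy.
somatic = expected_allelic_fraction(0.7, 2, 1, germline=False)   # about 0.35
germline = expected_allelic_fraction(0.7, 2, 1, germline=True)   # about 0.50
print(somatic, germline)
```

At purity 1.0 both calls would return 0.5, so in this simple model it is the normal contamination that separates the two classes, and the separation depends on the local copy number and multiplicity.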

Highlights

Matched sequencing of both tumor and normal tissue is routinely used to classify variants of uncertain significance (VUS) into somatic vs. germline
If the assay includes copy number tiling probes highly enriched in heterozygous SNPs, an algorithm that jointly segments coverage and allelic fractions (e.g. FACETS or PSCBS [8, 23]) can sometimes provide better results; we provide a convenient wrapper function for using the PSCBS method instead of the default
Initial estimates of purity and ploidy were obtained in a grid search, and allelic fractions of single nucleotide variants (SNVs) were fitted to all local optima
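The grid search mentioned in the last highlight can be sketched as follows. This is a deliberately simplified stand-in and not PureCN's actual likelihood model: it scores each (purity, ploidy) candidate by the sum of squared log-ratio residuals after assigning every segment its best-fitting integer copy number, then reports the grid points that are no worse than any neighbour. All function names, grid ranges, and the simulated segment values are assumptions made for illustration.

```python
import math


def expected_log_ratio(purity, ploidy, copy_number):
    """Expected tumor/normal log2 coverage ratio for an integer tumor copy number."""
    tumor = purity * copy_number + 2 * (1 - purity)
    average = purity * ploidy + 2 * (1 - purity)
    return math.log2(tumor / average)


def fit_error(purity, ploidy, log_ratios, max_cn=7):
    """Sum of squared residuals, each segment using its best integer copy number."""
    return sum(
        min((obs - expected_log_ratio(purity, ploidy, c)) ** 2
            for c in range(max_cn + 1))
        for obs in log_ratios
    )


def grid_search(log_ratios, purities, ploidies):
    """Return (error, purity, ploidy) for every local optimum on the grid, best first."""
    err = [[fit_error(p, d, log_ratios) for d in ploidies] for p in purities]
    optima = []
    for i in range(len(purities)):
        for j in range(len(ploidies)):
            # Compare against the (up to eight) neighbouring grid points.
            neighbours = [
                err[a][b]
                for a in range(max(i - 1, 0), min(i + 2, len(purities)))
                for b in range(max(j - 1, 0), min(j + 2, len(ploidies)))
                if (a, b) != (i, j)
            ]
            if all(err[i][j] <= n for n in neighbours):
                optima.append((err[i][j], purities[i], ploidies[j]))
    return sorted(optima)


# Simulated, noise-free segments at purity 0.7 and ploidy 2.0 (hypothetical data).
purities = [round(0.30 + 0.05 * i, 2) for i in range(14)]   # 0.30 to 0.95
ploidies = [round(1.50 + 0.25 * j, 2) for j in range(15)]   # 1.50 to 5.00
true_cn = [2, 2, 2, 1, 3, 2, 2, 2]
log_ratios = [expected_log_ratio(0.7, 2.0, c) for c in true_cn]

optima = grid_search(log_ratios, purities, ploidies)
best_error, best_purity, best_ploidy = optima[0]
print(best_purity, best_ploidy, len(optima))
```

Coverage alone usually leaves several candidate solutions, which is why every local optimum is kept rather than only the best one; in the approach described above, the SNV allelic fractions are then fitted to each candidate.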

Summary

Results

Example: We applied our implementation to whole exome sequencing data from a male breast cancer metastasis sample [24]. Combining these data with the allelic fraction data (Fig. 3a), we find that the LOH of chromosome 1p is copy number neutral, that 12p and 16q have LOH due to copy loss, and that there is a copy number gain of 16p. For this sample, PureCN returned very similar maximum likelihood purity and ploidy estimates when run with and without the matched normal sample (0.7 for purity and 2.001 for ploidy). For 85% of the samples, the mean difference in absolute copy numbers was within ±1 when comparing PureCN with the Foundation Medicine calls (Fig. 8d).

Conclusions

Background


Limitations