Identification of copy number variants in whole-genome data using Reference Coverage Profiles.

Gustavo Glusman, Max Robinson, Mary E Brunkow, Dale L Bodian, Seth A Ament, Joseph G Vockley, Jared C Roach, Ilya Shmulevich, Varsha Dhankani, Denise E Mauldin, Alissa Severson, John E Niederhuber, Leroy Hood, Anna B Stittrich, Terry Farrah

doi:10.3389/fgene.2015.00045

Open Access

Abstract

The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150–1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1–100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP followed by hidden Markov model (HMM) segmentation enables efficient detection of CNVs and large deletions in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation.
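The normalization and segmentation steps described above can be illustrated with a minimal sketch. It is not the published implementation: the binned input, the median-based scaling, the four copy-number states, their expected coverage ratios, the Gaussian emission model, and the function names (`normalize_to_profile`, `viterbi_copy_number`, `call_segments`) are all assumptions chosen for illustration.

```python
import math
import statistics
from itertools import groupby

# Assumed copy-number states and expected coverage ratios relative to the
# diploid reference profile (illustrative values, not taken from the paper).
STATES = [0, 1, 2, 3]
EXPECTED_RATIO = {0: 0.02, 1: 0.5, 2: 1.0, 3: 1.5}

def normalize_to_profile(coverage, profile):
    """Divide binned coverage by the reference profile, then rescale so the
    median ratio is 1. Bins with a non-positive profile value are treated as
    uninformative and set to 1.0 (an assumption of this sketch)."""
    raw = [c / p if p > 0 else None for c, p in zip(coverage, profile)]
    usable = [r for r in raw if r is not None]
    scale = statistics.median(usable)
    if scale <= 0:
        raise ValueError("median coverage ratio must be positive")
    return [1.0 if r is None else r / scale for r in raw]

def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def viterbi_copy_number(ratios, sigma=0.15, stay=0.99):
    """Most likely copy-number state per bin under a simple HMM with Gaussian
    emissions and a fixed probability of staying in the same state."""
    if not ratios:
        return []
    n = len(STATES)
    log_stay = math.log(stay)
    log_switch = math.log((1.0 - stay) / (n - 1))
    # Uniform prior over states, so only the emission term matters at bin 0.
    score = [log_gauss(ratios[0], EXPECTED_RATIO[s], sigma) for s in STATES]
    back = []
    for x in ratios[1:]:
        new_score, pointers = [], []
        for j, s in enumerate(STATES):
            candidates = [score[i] + (log_stay if i == j else log_switch) for i in range(n)]
            best_i = max(range(n), key=lambda i: candidates[i])
            new_score.append(candidates[best_i] + log_gauss(x, EXPECTED_RATIO[s], sigma))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[i] for i in path]

def call_segments(states, diploid=2):
    """Collapse per-bin states into (start_bin, end_bin_exclusive, copy_number)
    tuples for every run that differs from the diploid state."""
    segments, start = [], 0
    for cn, run in groupby(states):
        length = len(list(run))
        if cn != diploid:
            segments.append((start, start + length, cn))
        start += length
    return segments

# Toy example: a locus-specific profile and a sample with bins 8-12 at half depth.
profile = [1.0, 1.2, 0.8, 1.1, 0.9] * 4
coverage = [40.0 * p for p in profile]
for i in range(8, 13):
    coverage[i] *= 0.5
ratios = normalize_to_profile(coverage, profile)
states = viterbi_copy_number(ratios)
segments = call_segments(states)
```

Dividing by the profile removes the locus-specific fluctuation before segmentation, so in this toy example the only remaining signal is the half-depth run, which the HMM reports as a single hemizygous segment. In real data the bin size, emission variance, and transition probability would have to be estimated rather than fixed.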

Highlights

Deletions, duplications, and other copy number variations (CNVs) are important components of genomic structural variation (SV) and need to be assessed when studying individual genomes in a personal or clinical context.
We have developed a new method for identifying deletions and CNVs in personal genomes, based on whole-genome sequencing (WGS) depth of coverage.
Whereas the depth of coverage fluctuates much more strongly along a single CGI genome assembly than along a single Illumina genome assembly, the fluctuation is much more consistent and predictable from one CGI assembly to another than from one Illumina assembly to another. These results suggest that computational methods for CNV detection that do not explicitly correct for locus-specific coverage differences should be more useful for analyzing genomes sequenced on the Illumina platform than for interpreting CGI genomes.

Summary

Introduction

Deletions, duplications, and other copy number variations (CNVs) are important components of genomic structural variation (SV) and need to be assessed when studying individual genomes in a personal or clinical context. Read-pair algorithms consider discordant read pairs, that is, pairs whose mapped distance or orientation diverges from what is expected. They cluster these pairs into independent events and apply quality filters. To improve sensitivity, some methods also include ambiguously mapped reads. These approaches, called soft clustering, assign each ambiguous read to a mapping that clusters with an event. Tools that employ this method include HYDRA (Quinlan et al., 2010).
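The read-pair strategy can be sketched as follows. This is a simplified illustration and not the algorithm used by HYDRA or any other published tool: the `ReadPair` record, the insert-size parameters, the gap threshold, and the minimum-support filter are hypothetical, and only hard clustering of uniquely mapped pairs is shown (soft clustering of ambiguous reads is omitted).

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom: str
    left: int                        # leftmost mapped coordinate of the pair
    right: int                       # rightmost mapped coordinate of the pair
    proper_orientation: bool = True  # False if the mates face unexpected directions

def is_discordant(pair, mean_insert=400, sd_insert=40, n_sd=3.0):
    """A pair is discordant if its orientation is unexpected or its span
    deviates from the expected insert size by more than n_sd standard deviations."""
    span = pair.right - pair.left
    return (not pair.proper_orientation) or abs(span - mean_insert) > n_sd * sd_insert

def cluster_discordant(pairs, max_gap=200, min_support=2):
    """Group discordant pairs whose left and right ends both lie within max_gap
    of the previous pair in the cluster, then keep clusters with at least
    min_support pairs (a stand-in for the quality filters mentioned above)."""
    discordant = sorted(
        (p for p in pairs if is_discordant(p)),
        key=lambda p: (p.chrom, p.left),
    )
    clusters, current = [], []
    for p in discordant:
        if (
            current
            and p.chrom == current[-1].chrom
            and p.left - current[-1].left <= max_gap
            and abs(p.right - current[-1].right) <= max_gap
        ):
            current.append(p)
        else:
            if len(current) >= min_support:
                clusters.append(current)
            current = [p]
    if len(current) >= min_support:
        clusters.append(current)
    return clusters

# Toy input: two concordant pairs, three pairs spanning the same candidate
# deletion, and one isolated discordant pair that fails the support filter.
pairs = [
    ReadPair("chr1", 1000, 1400),
    ReadPair("chr1", 2000, 2410),
    ReadPair("chr1", 10000, 15400),
    ReadPair("chr1", 10050, 15430),
    ReadPair("chr1", 10120, 15500),
    ReadPair("chr1", 50000, 58000),
]
events = cluster_discordant(pairs)
```

Each returned cluster is a candidate event supported by several independent pairs; the isolated discordant pair is discarded by the support filter.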

Methods

Results

Conclusion