Abstract

Copy number variants (CNVs) play an important role in a number of human diseases, but the accurate calling of CNVs remains challenging. Most current approaches to CNV detection use raw read alignments, which are computationally intensive to process. We use a regression tree‐based approach to call germline CNVs from whole‐genome sequencing (WGS, >18x) variant call sets in 6,898 samples across four European cohorts, and describe a rich large variation landscape comprising 1,320 CNVs. Eighty‐one percent of detected events have been previously reported in the Database of Genomic Variants. Twenty‐three percent of high‐quality deletions affect entire genes, and we recapitulate known events such as the GSTM1 and RHD gene deletions. We test for association between the detected deletions and 275 protein levels in 1,457 individuals to assess the potential clinical impact of the detected CNVs. We describe complex CNV patterns underlying an association with levels of the CCL3 protein (MAF = 0.15, p = 3.6 × 10⁻¹²) at the CCL3L3 locus, and a novel cis‐association between a low‐frequency NOMO1 deletion and NOMO1 protein levels (MAF = 0.02, p = 2.2 × 10⁻⁷). This study demonstrates that existing population‐wide WGS call sets can be mined for germline CNVs with minimal computational overhead, delivering insight into a less well‐studied, yet potentially impactful class of genetic variant.

Highlights

  • Up to 19.2% of the human genome is susceptible to copy number variation, which can have a severe impact on gene function (Zarrei et al., 2015)

  • Whole-genome sequencing (WGS) at high depth has been the gold standard for detecting large polymorphisms

  • Copy number variant (CNV) calling in 6,898 European samples. We apply the Unimaginatively Named CNV caller (UN-CNVc) to WGS data from 6,898 samples across four studies: the MANOLIS and Pomak isolated cohorts from the HELIC study, the TEENAGE cohort of Greek adolescents, and the INTERVAL study of blood donors in the UK

Summary

Introduction

Up to 19.2% of the human genome is susceptible to copy number variation, which can have a severe impact on gene function (Zarrei et al., 2015). Even recent WGS-based structural variant studies are usually conducted in a limited number of samples or concentrate on targeted regions of the genome (Kayser et al., 2018; Lu et al., 2017; Zarrei et al., 2015). This is because detecting structural variants requires a different study design from association studies: whereas for the latter, haplotype diversity and sample size are key (Alex Buerkle and Gompert, 2013; Le and Durbin, 2011), for the former, high depth of sequencing is paramount, leading to prohibitive costs for population-wide studies. Structural variant detection also poses a computational challenge, since most algorithms use aligned reads or read pileups as a starting point for event detection. As these file formats describe the entire read pool, processing them genome-wide across an entire population with high-depth WGS is demanding in terms of both running time and memory. We evaluate the effect of copy number variants on sequencing depth measured at variant sites using a novel tool (UN-CNVc), and provide a proof of concept for calling these large variations in population-wide WGS variant call sets.
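
To make the idea of reading copy number from depth at variant sites concrete, the following is a minimal sketch of a one-dimensional regression tree that segments relative depth along a chromosome. It is not the UN-CNVc implementation: the function names, the median normalisation, the `min_leaf` and `min_gain` stopping rules, and the 0.75 and 1.25 labelling thresholds are all illustrative assumptions. The only biology it relies on is that, in a diploid genome, a heterozygous deletion is expected to roughly halve depth and a single-copy duplication to raise it to about 1.5 times the baseline.

```python
from statistics import mean, median


def relative_depth(depths):
    """Scale per-site depths by the sample-wide median depth (assumed baseline)."""
    baseline = median(depths)
    return [d / baseline for d in depths]


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)


def best_split(values, min_leaf):
    """Return (index, gain) of the cut that most reduces the SSE, or None
    when the chunk is too short to leave min_leaf sites on each side."""
    total = sse(values)
    best = None
    for i in range(min_leaf, len(values) - min_leaf + 1):
        gain = total - sse(values[:i]) - sse(values[i:])
        if best is None or gain > best[1]:
            best = (i, gain)
    return best


def fit_segments(positions, ratios, min_leaf=5, min_gain=0.1):
    """Grow a regression tree on position; each leaf is a contiguous segment
    returned as (first_position, last_position, mean_ratio)."""
    def grow(lo, hi):
        chunk = ratios[lo:hi]
        split = best_split(chunk, min_leaf)
        if split is None or split[1] < min_gain:
            return [(positions[lo], positions[hi - 1], mean(chunk))]
        cut = lo + split[0]
        return grow(lo, cut) + grow(cut, hi)

    return grow(0, len(ratios)) if ratios else []


def label(ratio, low=0.75, high=1.25):
    """Hypothetical thresholds: below `low` is a deletion, above `high` a duplication."""
    if ratio < low:
        return "deletion"
    if ratio > high:
        return "duplication"
    return "normal"


def call_cnvs(positions, depths, **tree_options):
    """Segment depth at variant sites (sorted by position) and label each segment."""
    if not depths:
        return []
    ratios = relative_depth(depths)
    segments = fit_segments(positions, ratios, **tree_options)
    return [(start, end, ratio, label(ratio)) for start, end, ratio in segments]


# Toy data: 30 variant sites 1 kb apart at ~20x, with ~10x depth at sites 10 to 19.
positions = [1000 * i for i in range(30)]
depths = [(10 if 10 <= i < 20 else 20) + (i % 3 - 1) for i in range(30)]

for start, end, ratio, state in call_cnvs(positions, depths):
    print(start, end, round(ratio, 2), state)
```

The input here is only a position and a depth per variant site, which is the kind of information a variant call set already carries, so no read alignments need to be reopened. The exhaustive split search is quadratic in the number of sites and is kept for readability; a production caller would need a faster search and a principled way to choose the stopping rule.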

