Abstract

Structural variants that cause changes in copy number constitute an important component of genomic variability. They account for 0.7% of the genomic differences between two individual genomes, of which copy number variants (CNVs) are the largest component. A recent population-based CNV study revealed the need for better characterization of CNVs, especially small ones (<500 bp). We propose a three-step computational framework (Identification of germline Changes in Copy Number, or IgC2N) to discover and genotype germline CNVs. First, we detect candidate CNV loci by combining information across multiple samples without imposing restrictions on the number of coverage markers or on the variant size. Second, we fine-tune the detection of rare variants and infer the putative copy number classes for each locus. Last, for each variant we combine the relative distance between consecutive copy number classes with genetic information in a novel attempt to estimate the reference model bias. This computational approach is applied to genome-wide data from 1250 HapMap individuals. Novel variants were discovered and characterized in terms of size, minor allele frequency, type of polymorphism (gains, losses or both), and mechanism of formation. Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. Finally, we queried transcriptomic data from 129 individuals, determined by RNA-sequencing, as further validation and to assess the functional role of the new variants. We investigated possible enrichment for the variants' regulatory effect and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08).
Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies.
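
As an illustration of the characterization step mentioned above (type of polymorphism and minor allele frequency), the following is a minimal sketch, not the IgC2N implementation. It assumes that integer copy number calls per sample are already available for a locus, that the reference state is two copies, and, for the frequency calculation, that the locus is a biallelic deletion with copy numbers 0, 1 or 2. The function names and the constant are hypothetical.

```python
from typing import Optional, Sequence

# Assumption: diploid autosomal locus, so the reference state is two copies.
REFERENCE_COPIES = 2


def polymorphism_type(copy_numbers: Sequence[int]) -> str:
    """Label a locus as 'gain', 'loss', 'both' or 'none' from per-sample calls."""
    has_gain = any(cn > REFERENCE_COPIES for cn in copy_numbers)
    has_loss = any(cn < REFERENCE_COPIES for cn in copy_numbers)
    if has_gain and has_loss:
        return "both"
    if has_gain:
        return "gain"
    if has_loss:
        return "loss"
    return "none"


def deletion_maf(copy_numbers: Sequence[int]) -> Optional[float]:
    """Minor allele frequency under a biallelic deletion model.

    Returns None when the input is empty or contains copy numbers
    outside {0, 1, 2}, because the simple model does not apply there.
    """
    if not copy_numbers or any(cn not in (0, 1, 2) for cn in copy_numbers):
        return None
    # Each sample carries (2 - cn) deleted alleles out of 2 chromosomes.
    deleted_alleles = sum(REFERENCE_COPIES - cn for cn in copy_numbers)
    freq = deleted_alleles / (2 * len(copy_numbers))
    return min(freq, 1.0 - freq)
```

For example, `polymorphism_type([2, 2, 1, 0])` labels the locus as a loss, and `deletion_maf` on the same calls counts three deleted alleles out of eight chromosomes. Loci with gains would need a different allele model, which this sketch deliberately leaves out.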

Highlights

  • The extent to which genetic differences among humans are associated with human disease susceptibility is still unknown [1]

  • Note that the size of a copy number variant (CNV) is inherently related to the number of markers covering it, which depends on the marker distribution of the specific platform in question

  • The False Positive Rate (FPR) for detecting CNVs is provided in greater detail in File S1

Introduction

The extent to which genetic differences among humans are associated with human disease susceptibility is still unknown [1]. These differences, sometimes referred to as genomic variability, can be of several types, including single nucleotide polymorphisms (SNPs) and structural variation, which consists mainly of DNA copy number changes. Recent studies have shown that structural variation can account for variability in as much as 0.7% of the total nucleotide content, of which CNVs are the largest component [4]. Unlike the catalogue of known SNPs, the number and characterization of CNVs in humans remain incomplete. Earlier this year, Conrad et al. 2010 [5] presented the most comprehensive population-based CNV map, in which they discovered 80–90% of common CNVs (Minor Allele Frequency (MAF) >5%) greater than 1 kb in length and were able to genotype approximately 40% of these. It is believed that CNVs, especially smaller ones, and INDELs are underrepresented in existing databases and require better characterization [5].
