High resolution measurement of DUF1220 domain copy number from whole genome sequence data

David P Astling, James M Sikela, Kenneth L Jones, Ilea E Heft

doi:10.1186/s12864-017-3976-z

Open Access

Journal: BMC Genomics
Publication Date: Aug 14, 2017
Citations: 16
License type: open-access

Affiliation: University of Colorado Denver

Abstract

Background: DUF1220 protein domains, found primarily in Neuroblastoma BreakPoint Family (NBPF) genes, show the greatest human lineage-specific increase in copy number of any coding region in the genome. There are 302 haploid copies of DUF1220 in hg38 (~160 of which are human-specific), and the majority of these can be divided into 6 different subtypes (referred to as clades). Copy number changes of specific DUF1220 clades have been associated in a dose-dependent manner with brain size variation (both evolutionarily and within the human population), cognitive aptitude, autism severity, and schizophrenia severity. However, no published method can directly measure copies of DUF1220 with high accuracy, and none can distinguish between domains within a clade.

Results: Here we describe a novel method for measuring copies of DUF1220 domains, and of the NBPF genes in which they are found, from whole genome sequence data. We have characterized the effect that various sequencing and alignment parameters and strategies have on the accuracy and precision of the method, and defined the parameters that lead to optimal DUF1220 copy number measurement and resolution. We show that copy number estimates obtained using our read depth approach are highly correlated with those generated by ddPCR for three representative DUF1220 clades. By simulation, we demonstrate that our method provides sufficient resolution to analyze DUF1220 copy number variation at three levels: (1) DUF1220 clade copy number within individual genes and groups of genes (gene-specific clade groups), (2) genome-wide DUF1220 clade copies, and (3) gene copy number for DUF1220-encoding genes.

Conclusions: To our knowledge, this is the first method to accurately measure copies of all six DUF1220 clades and the first method to provide gene-specific resolution of these clades. This allows one to discriminate among the ~300 haploid human DUF1220 copies to an extent not possible with any other method. The result is a greatly enhanced capability to analyze the role that these sequences play in human variation and disease.
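
The abstract does not spell out the arithmetic behind a read depth estimate, so the following is only a minimal sketch of a generic read depth normalization, not necessarily the authors' pipeline. It assumes that reads from every copy of a clade pile up on the reference copies of that clade, and that a control region known to be present at two copies per diploid genome is available. The function name, its parameters, and the depth values are hypothetical.

```python
def estimate_clade_copies(mean_depth_per_reference_copy, control_mean_depth,
                          control_copies=2):
    """Estimate the diploid copy number of one clade from read depth.

    mean_depth_per_reference_copy: mean read depth over each reference
        copy of the clade (hypothetical input, one value per copy).
    control_mean_depth: mean read depth over a control region assumed
        to be present at `control_copies` copies per diploid genome.
    """
    if control_mean_depth <= 0:
        raise ValueError("control region must have positive coverage")
    # Depth contributed by a single copy is control_mean_depth / control_copies,
    # so the pooled clade depth divided by that value gives the copy count.
    pooled_depth = sum(mean_depth_per_reference_copy)
    return control_copies * pooled_depth / control_mean_depth


# Toy values, not data from the paper: three reference copies, control at 30x.
print(estimate_clade_copies([30.0, 30.0, 45.0], 30.0))  # 7.0
```

Pooling depth across all reference copies of a clade means that a read aligning to the wrong copy within the same clade still counts toward the clade total, which is one reason clade-level estimates can be more robust than copy-level ones.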

Highlights

DUF1220 protein domains, found primarily in Neuroblastoma BreakPoint Family (NBPF) genes, show the greatest human lineage-specific increase in copy number of any coding region in the genome.
We carried out a simulation in which 100 bp paired-end reads from each DUF1220 domain were generated from the human reference genome, hg38, and aligned back to the reference to determine the extent to which reads from each domain (CON1, CON2, CON3, HLS1, HLS2, and HLS3) selectively align to the correct gene and clade.
With 100 bp paired-end reads, the DUF1220 sequences from eight genes can be uniquely measured; 100% of the reads originating from them align to the originating gene and clade (e.g. NBPF7) (Fig. 2).
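
The simulation described in these highlights comes down to asking, for each gene and clade, what fraction of the reads generated from it align back to it. The sketch below shows that bookkeeping only. It assumes read generation and alignment have already been done and that each read is summarized as a tuple of its originating and aligned gene and clade. The record layout, the function name, and the placeholder gene labels are hypothetical.

```python
from collections import defaultdict


def alignment_specificity(read_records):
    """Fraction of simulated reads that align to their originating gene and clade.

    read_records: iterable of tuples
        (origin_gene, origin_clade, aligned_gene, aligned_clade).
    Returns a dict mapping (origin_gene, origin_clade) to a fraction in [0, 1].
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for origin_gene, origin_clade, aligned_gene, aligned_clade in read_records:
        key = (origin_gene, origin_clade)
        totals[key] += 1
        # A read is counted as correct only if both gene and clade match.
        if (aligned_gene, aligned_clade) == key:
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}


# Toy records with placeholder gene labels, not results from the paper.
records = [("GENE_A", "CON1", "GENE_A", "CON1")] * 4 + [
    ("GENE_B", "HLS1", "GENE_B", "HLS1"),
    ("GENE_B", "HLS1", "GENE_C", "HLS1"),
]
for key, fraction in alignment_specificity(records).items():
    print(key, fraction)
```

A gene and clade with a fraction of 1.0 corresponds to the "uniquely measured" case in the last highlight; lower fractions indicate reads being shared with other genes or clades.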

Summary

Introduction

DUF1220 protein domains, found primarily in Neuroblastoma BreakPoint Family (NBPF) genes, show the greatest human lineage-specific increase in copy number of any coding region in the genome. Duplicated sequences, including genes, are prevalent throughout the human genome [1]. While they have been linked to important evolutionary [2, 3] and medical phenotypes [4], they often go unexamined in studies of genetic disease due to their complexity. Previous reports have focused largely on measurement of gene copy number changes, but sequences can vary in copy number as a result of both gene dosage changes and intragenic domain expansion/contraction. Consideration of this fact is important for two reasons. First, intragenic sequence gains or losses can confound estimates of gene copy number. Second, changes in copy number arising from intragenic changes may have different phenotypic effects than those arising from gene dosage changes.

Methods

Results

Discussion

Conclusion