Abstract

Our capacity to sequence human genomes has exceeded our ability to interpret genetic variation. Current genomic annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Here, we describe Combined Annotation Dependent Depletion (CADD), a framework that objectively integrates many diverse annotations into a single, quantitative score. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human derived alleles from 14.7 million simulated variants. We pre-compute “C-scores” for all 8.6 billion possible human single nucleotide variants and enable scoring of short insertions/deletions. C-scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects, and complex trait associations, and highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious, and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current annotation.
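The training setup described above (a support vector machine separating observed derived alleles from simulated variants) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear kernel, the two features (a conservation-like score and a coding-region flag), the synthetic data, and the names `train_linear_svm` and `score` are all assumptions made for illustration.

```python
import random

def train_linear_svm(X, y, epochs=20, lr=0.01, lam=0.01, seed=0):
    """Fit a linear SVM by stochastic sub-gradient descent on the
    L2-regularised hinge loss. Labels in y must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Shrink weights (gradient of the L2 penalty).
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def score(w, b, x):
    """Raw decision value; higher means 'more like a simulated variant'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Synthetic toy data (hypothetical, not from the paper):
# label -1 = observed derived allele, label +1 = simulated variant.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    X.append([rng.gauss(-1.0, 0.5), float(rng.random() < 0.5)])
    y.append(-1)
    X.append([rng.gauss(1.0, 0.5), float(rng.random() < 0.5)])
    y.append(+1)

w, b = train_linear_svm(X, y)
conserved_site = score(w, b, [2.0, 1.0])   # strongly conserved position
neutral_site = score(w, b, [-2.0, 1.0])    # weakly conserved position
```

In this toy setting a variant at a strongly conserved position receives a higher raw score than one at a weakly conserved position, which mirrors the idea that variants resembling the simulated class are more likely to be deleterious.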

Highlights

  • The annotations span a range of data types including conservation metrics like GERP[8], phastCons[9], and phyloP[10]; regulatory information[11] like genomic regions of DNase hypersensitivity[18] and transcription factor binding[19]; transcript information like distance to exon-intron boundaries or expression levels in commonly studied cell lines[11]; and protein-level scores like Grantham[20], SIFT[7], and PolyPhen[6]

  • Given its considerable superiority over the best available protein-based and conservation metrics in terms of ranking known pathogenic variants in the complete spectrum of variation within personal genomes, it is likely that Combined Annotation Dependent Depletion (CADD) will improve the power of sequence-based disease studies beyond current standard approaches

Summary

Introduction

E gene discovery[1,2,3,4]. For example, exome sequencing is an effective discovery strategy because it focuses on protein-altering variation, which is enriched for causal effects[5]. Conservation metrics[8,9,10] are defined genome-wide but do not use functional information and are not allele-specific, while protein-based metrics[6,7] apply only to coding, and often only to missense, variants, thereby excluding >99% of human genetic variation.
