Abstract
Characterizing how variation at the level of individual nucleotides contributes to traits and diseases has been an area of growing interest since the first human genome was sequenced. Understanding how a single nucleotide polymorphism (SNP) leads to a pathogenic phenotype on a genome-wide scale is a fruitful endeavor for anyone interested in developing diagnostic tests or therapeutics, or simply wanting to understand the etiology of a disease or trait. To this end, many datasets and algorithms have been developed as resources and tools to annotate SNPs. One of the most common practices is to annotate coding SNPs that affect the protein sequence. Synonymous variants are often grouped as one type of variant; however, there are in fact many tools available to dissect their effects on gene expression. More recently, large consortia like ENCODE and GTEx have made it possible to annotate non-coding regions. Although annotating variants is a common technique among human geneticists, the constant advances in tools and biology surrounding SNPs require an updated summary of what is known and the trajectory of the field. This review will discuss the history behind SNP annotation, commonly used tools, and newer strategies for SNP annotation. Additionally, we will comment on the caveats that distinguish approaches from one another, along with gaps in the current state of knowledge and potential future directions. We do not intend for this to be a comprehensive review of any specific area of SNP annotation; rather, it is meant as a resource for those unfamiliar with computational tools used to functionally characterize SNPs. In summary, this review will help illustrate how each SNP annotation method impacts the way in which the genetic and molecular etiology of a disease is explored in silico.
Highlights
Scientific endeavors in human genetics, molecular biology, biochemistry, and bioinformatics have been progressively converging in order to more precisely describe how DNA variation explains differences in traits and diseases
Single base changes, called single nucleotide polymorphisms (SNPs), along with changes where DNA has been inserted or deleted, referred to as indels, have been popular forms of genetic variation to investigate. Another form of variation is copy number variants (CNVs), where large portions of the genome are duplicated or deleted
While it was funded by the National Institutes of Health (NIH) and Department of Energy (DOE), it was informally a product of international collaborations [3]
Summary
Scientific endeavors in human genetics, molecular biology, biochemistry, and bioinformatics have been progressively converging in order to more precisely describe how DNA variation explains differences in traits and diseases. Machine learning (ML) is a broad term for algorithms that learn from a training dataset to improve a mathematical model's prediction accuracy on a test dataset. Often, these methods use sequence conservation, amino acid physicochemical properties, gene regulatory annotations, allele frequency among sub-populations, and even the output of other tools. Combined Annotation-Dependent Depletion (CADD) annotations were derived from a support vector machine (SVM), a commonly used ML algorithm, to generate scores for 8.6 billion possible single nucleotide variants (SNVs) in the human reference genome based on 63 annotations that described conservation, gene regulatory information, and population frequencies [40].
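To make the train/test idea concrete, the following is a minimal sketch of a linear SVM fitted by stochastic subgradient descent on the hinge loss. It is not the CADD pipeline: the data are synthetic, the two features (`conservation`, `allele_freq`) and their value ranges are assumptions chosen for illustration, and every function name is hypothetical.

```python
import random


def make_variants(n, rng):
    """Generate n synthetic variants as ((conservation, allele_freq), label).

    Both features and their ranges are illustrative assumptions.
    Label +1 stands for 'deleterious-like', -1 for 'benign-like'.
    """
    data = []
    for _ in range(n):
        label = rng.choice([1, -1])
        if label == 1:
            # Assumed pattern: conserved site, rare alternate allele.
            features = (rng.uniform(0.7, 1.0), rng.uniform(0.0, 0.05))
        else:
            # Assumed pattern: poorly conserved site, common allele.
            features = (rng.uniform(0.0, 0.3), rng.uniform(0.05, 0.5))
        data.append((features, label))
    return data


def decision_value(w, b, x):
    """Signed score of the linear model: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(data, rng, epochs=100, lr=0.05, reg=0.01):
    """Fit weights and bias by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            margin = y * decision_value(w, b, x)
            # The L2 penalty shrinks the weights on every step.
            w = [wi - lr * reg * wi for wi in w]
            # Only samples inside the margin or misclassified push the model.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def accuracy(w, b, data):
    """Fraction of variants whose predicted label matches the true label."""
    correct = sum(1 for x, y in data if predict(w, b, x) == y)
    return correct / len(data)


rng = random.Random(0)           # fixed seed so the sketch is repeatable
train = make_variants(200, rng)  # used to fit the model
test = make_variants(100, rng)   # held out to estimate accuracy
w, b = train_linear_svm(train, rng)
test_accuracy = accuracy(w, b, test)
print(f"weights={w}, bias={b:.2f}, test accuracy={test_accuracy:.2f}")
```

The model is scored only on variants it never saw during training, which is the sense of "test dataset" used above. Real annotation tools use many more features and labelled or simulated variant sets rather than cleanly separated synthetic points.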