Abstract

Accurate detection of pathogenic single nucleotide variants (SNVs) is a key challenge in whole exome and whole genome sequencing studies. Several in silico tools have been developed to predict deleterious variants from this type of data, but they have limited power to detect new pathogenic variants, especially in non-coding regions. In this study, we evaluate a new metric, the Shannon Entropy of Locus Variability (SELV), calculated as the Shannon entropy of the variant frequencies reported in genome-wide population studies at a given locus. We assess SELV as a predictor of potentially pathogenic variants in non-coding nuclear and mitochondrial DNA, and in coding regions under a selective pressure other than that imposed by the genetic code, e.g., splice sites. For benchmarking, SELV was compared to predictors of pathogenicity in different genomic contexts. In nuclear non-coding DNA, SELV (ROC AUC = 0.97, PR-AUC = 0.96) outperformed CDTS. For non-coding mitochondrial variants, SELV (ROC AUC = 0.98, PR-AUC = 1.00) outperformed HmtVar. SELV was also compared against two state-of-the-art ensemble predictors of pathogenicity in splice sites, ada-score and rf-score, and matched their overall performance in both ROC (AUC = 0.95) and precision-recall (PR-AUC = 0.97) curves, with the advantage that SELV can be easily calculated for every position in the genome, as opposed to ada-score and rf-score. We therefore suggest that information about the observed genetic variability at a locus, as reported by large-scale population studies, could improve the prioritization of SNVs in splice sites and in non-coding regions.
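The idea behind SELV, the Shannon entropy of the allele frequencies observed at one locus, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `shannon_entropy` and `locus_entropy`, the choice of log base 2, and the decision to assign the remaining frequency mass to the reference allele are all assumptions made here for illustration, since the abstract does not specify them.

```python
import math


def shannon_entropy(frequencies, base=2.0):
    """Shannon entropy of a discrete distribution.

    `frequencies` are non-negative weights; they are normalized to sum to 1.
    Zero-frequency entries contribute nothing (0 * log 0 is taken as 0).
    """
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    entropy = 0.0
    for f in frequencies:
        if f < 0:
            raise ValueError("frequencies must be non-negative")
        if f == 0:
            continue
        p = f / total
        entropy -= p * math.log(p) / math.log(base)
    return entropy


def locus_entropy(alt_frequencies, base=2.0):
    """Entropy of one locus from its alternate-allele frequencies.

    Assumption (not from the paper): the reference allele carries the
    frequency mass not accounted for by the listed alternate alleles.
    """
    alt_total = sum(alt_frequencies)
    if alt_total > 1.0 + 1e-9:
        raise ValueError("alternate allele frequencies exceed 1")
    ref_frequency = max(0.0, 1.0 - alt_total)
    return shannon_entropy([ref_frequency, *alt_frequencies], base=base)


if __name__ == "__main__":
    # Hypothetical frequencies, chosen only to contrast two kinds of loci.
    conserved = locus_entropy([1e-5])        # one very rare variant
    variable = locus_entropy([0.20, 0.10])   # two common variants
    print(f"conserved locus: {conserved:.6f} bits")
    print(f"variable locus:  {variable:.6f} bits")
```

Under these assumptions, a locus where almost every individual carries the same allele has entropy near zero, while a locus with several common alleles has higher entropy. The frequencies in the usage example are made up and do not come from any population dataset.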

Highlights

  • Whole Exome and Whole Genome Sequencing (WES, WGS) have revolutionized the way we study a range of genetic diseases

  • For the prediction of deleterious variants in splice sites, we built a dataset of 131,002 unique variants (65,734 pathogenic, 65,268 neutral) retrieved from five independent benchmark datasets (HumVar [9], ExoVar [10], VariBench [11], predictSNP [12] and SwissVar [13]) and from variants selected from ClinVar [14] that were classified as benign or pathogenic

  • This result confirms that locus variability is a distinctive feature of variants located in splice sites vs coding regions of the genome

Introduction

Whole Exome and Whole Genome Sequencing (WES, WGS) have revolutionized the way we study a range of genetic diseases. Given the high degree of human variability, WES/WGS analysis yields a large number of variants, making it challenging to discriminate pathogenic from neutral variants. For this purpose, researchers have built several predictors to aid variant prioritization and the detection of deleterious variants. gnomAD represents the greatest effort to summarize human population genetic variability, including around 241 million variants detected in 125,748 WES and 15,708 WGS from unrelated individuals in v2.1 and 76,156 WGS from unrelated individuals in v3.1 [1]. It is therefore of interest to study how the number and distribution of the different variants present in a population at a given locus might provide relevant information about the pathogenicity of those variants.
