Abstract

State-of-the-art methods for assessing pathogenic non-coding variants have mostly been characterized on common disease-associated polymorphisms, yet with modest accuracy and strong positional biases. In this study, we curated 737 high-confidence pathogenic non-coding variants associated with monogenic Mendelian diseases. In addition to interspecies conservation, we explore a comprehensive set of recent and ongoing purifying selection signals in humans, accounting for lineage-specific regulatory elements. Supervised learning using gradient tree boosting on such features achieves high predictive performance and overcomes positional bias. The resulting method, NCBoost, performs consistently across diverse learning and independent testing data sets and outperforms existing reference methods.
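To make the learning step concrete, the following is a minimal sketch of gradient tree boosting for a binary pathogenic/benign classifier. It is an illustration only, not the NCBoost implementation: the two features (`conservation`, `constraint`), the synthetic data, and all function names are assumptions introduced here, and the trees are reduced to depth-one stumps fitted to the negative gradient of the logistic loss.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares fit of a one-split regression tree to the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    return best[1:]  # (feature index, threshold, left value, right value)

def fit_boosting(X, y, n_rounds=50, lr=0.3):
    """Additive model: each stump fits y - p, the negative log-loss gradient."""
    pos = sum(y) / len(y)
    base = math.log(pos / (1.0 - pos))  # initial log-odds
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lval, rval = fit_stump(X, residuals)
        stumps.append((j, thr, lval, rval))
        scores = [s + lr * (lval if row[j] <= thr else rval)
                  for s, row in zip(scores, X)]
    return base, stumps

def predict_proba(model, row, lr=0.3):
    base, stumps = model
    score = base
    for j, thr, lval, rval in stumps:
        score += lr * (lval if row[j] <= thr else rval)
    return sigmoid(score)

# Synthetic toy data (hypothetical): rows are [conservation, constraint].
random.seed(0)
pathogenic = [[random.gauss(0.8, 0.15), random.gauss(0.7, 0.2)] for _ in range(40)]
benign = [[random.gauss(0.3, 0.15), random.gauss(0.4, 0.2)] for _ in range(40)]
X = pathogenic + benign
y = [1] * 40 + [0] * 40

model = fit_boosting(X, y)
accuracy = sum((predict_proba(model, row) > 0.5) == bool(label)
               for row, label in zip(X, y)) / len(X)
print("training accuracy:", accuracy)
```

In practice, an established gradient boosting library and held-out evaluation would replace this hand-rolled loop; the sketch only shows how successive trees correct the residual errors of the current ensemble.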

Highlights

  • To date, more than 4000 Mendelian diseases have been clinically recognized [1], collectively affecting more than 25 million people in the USA alone [2]

  • Curation of a high-confidence set of pathogenic non-coding variants associated with monogenic Mendelian disease genes: pathogenic non-coding variants from the Human Gene Mutation Database [37] (HGMD-DM), ClinVar [38], and Smedley’2016 [20] were manually curated to obtain a high-confidence set of pathogenic variants associated with monogenic Mendelian diseases

  • Our curation effort further refined this set, retaining the fraction of pathogenic variants confidently associated with monogenic Mendelian disease genes (84%, 87%, and 98%, respectively; Fig. 1b)


Summary

Introduction

More than 4000 Mendelian diseases have been clinically recognized [1], collectively affecting more than 25 million people in the USA alone [2]. For around 50% of all known Mendelian diseases, the causal gene or variant has not yet been identified [3]. Despite the progress achieved through whole exome sequencing (WES)-based studies, recent reviews show highly heterogeneous diagnostic rates across disease types [4, 5], ranging from < 15% (e.g., congenital diaphragmatic hernia or syndromic congenital heart disease) to > 70% (e.g., ciliary dyskinesia). In those scenarios, a common working hypothesis is that non-coding variants could explain the etiology of many of the unresolved cases [5]. Whole genome sequencing (WGS) allows expanding the survey of pathogenic

