Abstract
State-of-the-art methods assessing pathogenic non-coding variants have mostly been characterized on common disease-associated polymorphisms, yet with modest accuracy and strong positional biases. In this study, we curated 737 high-confidence pathogenic non-coding variants associated with monogenic Mendelian diseases. In addition to interspecies conservation, a comprehensive set of recent and ongoing purifying selection signals in humans is explored, accounting for lineage-specific regulatory elements. Supervised learning using gradient tree boosting on such features achieves a high predictive performance and overcomes positional bias. NCBoost performs consistently across diverse learning and independent testing data sets and outperforms other existing reference methods.
Highlights
To date, more than 4000 Mendelian diseases have been clinically recognized [1], collectively affecting more than 25 million people in the USA alone [2].
Curation of a high-confidence set of pathogenic non-coding variants associated with monogenic Mendelian disease genes: pathogenic non-coding variants from the Human Gene Mutation Database [37] (HGMD-DM), ClinVar [38], and Smedley’2016 [20] were manually curated to obtain a high-confidence set of pathogenic variants associated with monogenic Mendelian diseases.
Our curation effort further refined this set, retaining the fraction of pathogenic variants confidently associated with monogenic Mendelian disease genes (84%, 87%, and 98%, respectively; Fig. 1b).
Summary
More than 4000 Mendelian diseases have been clinically recognized [1], collectively affecting more than 25 million people in the USA alone [2]. For around 50% of all known Mendelian diseases, the causal gene or variant has not yet been identified [3]. Despite the progress achieved through whole exome sequencing (WES)-based studies, recent reviews show highly heterogeneous diagnostic rates across disease types [4, 5], ranging from < 15% (e.g., congenital diaphragmatic hernia or syndromic congenital heart disease) to > 70% (e.g., ciliary dyskinesia). In those scenarios, a common working hypothesis is that non-coding variants could explain the etiology of many of the unresolved cases [5]. Whole genome sequencing (WGS) allows expanding the survey of pathogenic