Abstract

The rapid advancement of next-generation sequencing technology has greatly accelerated progress in understanding human inherited diseases through innovations such as exome sequencing. Nevertheless, identifying causative variants from sequencing data remains a great challenge. Traditional statistical genetics approaches such as linkage analysis and association studies have limited power in analyzing exome sequencing data, while relying on simple filtration strategies and predicted functional implications of mutations to pinpoint pathogenic variants is prone to producing false positives. To overcome these limitations, we propose a supervised learning approach, termed snvForest, to prioritize candidate nonsynonymous single nucleotide variants for a specific type of disease by integrating 11 functional scores at the variant level and 8 association scores at the gene level. We conduct a series of large-scale in silico validation experiments, demonstrating the effectiveness of snvForest across 2,511 diseases of different inheritance styles and the superiority of our approach over two state-of-the-art methods. We further apply snvForest to three real exome sequencing data sets of epileptic encephalopathies and intellectual disability to show the ability of our approach to identify causative de novo mutations for these complex diseases. The online service and standalone software of snvForest are available at http://bioinfo.au.tsinghua.edu.cn/jianglab/snvforest.

Highlights

  • A hallmark of exome sequencing is the ability to identify rare nonsynonymous single nucleotide variants, which occur at low allele frequency (≤ 1%) and are of particular interest for the discovery of novel disease-causing variants

  • We based the design of our method, termed snvForest, on the notion that the inference of disease-causing nonsynonymous single nucleotide variants should be made through the integration of genomic information at both variant and gene levels

  • We adopted a supervised learning method called the random forest to predict the strength of associations between the candidate variants and the query disease based on 11 functional scores at the variant level and 8 association scores at the gene level
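
The idea of scoring variants from 11 variant-level and 8 gene-level features can be illustrated with a small sketch. This is not the snvForest implementation: it is a deliberately simplified stand-in that bags one-split decision stumps with random feature subsets, trained on synthetic data. The feature values, the candidate names, and every function name below are assumptions made for illustration only; only the feature counts come from the text.

```python
import random

N_VARIANT_SCORES = 11  # variant-level functional scores (count from the text)
N_GENE_SCORES = 8      # gene-level association scores (count from the text)
N_FEATURES = N_VARIANT_SCORES + N_GENE_SCORES


def fit_stump(X, y, features):
    """Return the (feature, threshold, polarity) split with the fewest errors."""
    n = len(X)
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            # Errors when predicting "disease-causing" for scores above t.
            errors = sum((row[f] > t) != (label == 1) for row, label in zip(X, y))
            # Polarity -1 flips the rule, so its error count is n - errors.
            for polarity, e in ((1, errors), (-1, n - errors)):
                if best is None or e < best[0]:
                    best = (e, f, t, polarity)
    return best[1:]


def fit_forest(X, y, n_trees=30, n_sub=4, seed=0):
    """Bag stumps: each sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_sub)      # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def score(forest, row):
    """Fraction of stumps voting 'disease-causing' for one feature vector."""
    votes = sum((row[f] > t) == (polarity == 1) for f, t, polarity in forest)
    return votes / len(forest)


def make_synthetic(rng, n_per_class):
    """Synthetic training data (an assumption): positives score higher on average."""
    X, y = [], []
    for label, mean in ((1, 0.7), (0, 0.3)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 0.15) for _ in range(N_FEATURES)])
            y.append(label)
    return X, y


rng = random.Random(42)
X_train, y_train = make_synthetic(rng, 60)
forest = fit_forest(X_train, y_train)

# Hypothetical candidates, each a 19-dimensional vector of scores.
candidates = {
    "candidate_1": [0.9] * N_FEATURES,
    "candidate_2": [0.5] * N_FEATURES,
    "candidate_3": [0.1] * N_FEATURES,
}
ranked = sorted(
    ((name, score(forest, row)) for name, row in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, s in ranked:
    print(f"{name}\t{s:.2f}")
```

The ranking step is the part that mirrors the text: each candidate variant receives a score in [0, 1] and candidates are sorted by it. A real random forest would grow deeper trees and train on curated disease-causing and neutral variants rather than synthetic draws.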



Introduction

A hallmark of exome sequencing is the ability to identify rare nonsynonymous single nucleotide variants (nsSNVs), which occur at low allele frequency (≤ 1%) and are of particular interest for the discovery of novel disease-causing variants. Sifrim et al. proposed a method called eXtasy[23] that combined 7 types of variant functional prediction scores, 2 types of gene association scores and several disease phenotype-related scores through Endeavour[24] to prioritize candidate nonsynonymous variants. Wu et al. developed a method named SPRING[25] that integrated 6 types of variant scores and 5 types of gene scores with a rigorous statistical model to predict disease-causing variants. Although these methods improved the accuracy of inferring pathogenic variants in exome sequencing studies, each has its own limitations. We provide the online service and the standalone software of snvForest at http://bioinfo.au.tsinghua.edu.cn/jianglab/snvforest.

