Abstract

Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, such as the number of association publications. We found that ML methods which do not incorporate genetic information have limited utility for prioritization of ASD risk genes. These studies perform at a level comparable to generic measures of the likelihood that a gene is involved in any condition, and do not outperform genetic association studies. Future efforts to discover disease genes should focus on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than on developing complex machine learning methods that use heterogeneous biological data of unknown reliability.

Highlights

  • Discovering genes involved in complex human genetic disorders is a major challenge

  • Each study provided scores for genes based on the authors' assessment of their probability of contributing to autism spectrum disorder (ASD) risk. We evaluated the studies' ability to prioritize novel and known high-confidence ASD genes using receiver operating characteristic (ROC) and precision-recall curves, with 95% confidence intervals for the area under the ROC curve (AUROC) and for precision at 20% recall

  • Systems‐based guilt by association (GBA) machine learning (ML) methods do not prioritize novel high‐confidence ASD genes well compared to other disease gene prioritization methods
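The evaluation metrics named in the highlights can be illustrated with a minimal sketch. It assumes each study's output is a list of per-gene scores with 0/1 labels marking high-confidence ASD genes. The function names, the toy data, and the percentile bootstrap used for the 95% intervals are assumptions made for illustration, not the procedure confirmed by the paper.

```python
import random

def auroc(scores, labels):
    """AUROC from the Mann-Whitney rank sum; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision_at_recall(scores, labels, recall=0.2):
    """Precision at the first rank cutoff where recall reaches the target.

    Ties are broken by input order (Python's sort is stable).
    """
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits = 0
    for k, (_, y) in enumerate(ranked, start=1):
        hits += y
        if hits / n_pos >= recall:
            return hits / k
    return 0.0

def bootstrap_ci(metric, scores, labels, n_boot=1000, seed=0):
    """Percentile 95% interval (an assumed choice) from resampling genes."""
    rng = random.Random(seed)
    n = len(scores)
    vals = []
    while len(vals) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the classes
            vals.append(metric([scores[i] for i in idx], ys))
    vals.sort()
    return vals[int(0.025 * n_boot)], vals[int(0.975 * n_boot) - 1]

# Hypothetical toy data: 10 genes, 3 of them labeled as true ASD genes.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
toy_p20 = precision_at_recall(toy_scores, toy_labels, recall=0.2)
toy_ci = bootstrap_ci(auroc, toy_scores, toy_labels, n_boot=200)
```

Precision at a fixed low recall focuses on the top of the ranking, which is the part a gene prioritization would actually be used for, while AUROC summarizes the whole ranking.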

Summary

Introduction

Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. Annotations and number of associations can be correlated, and this turns out to be a driver of GBA behavior: GBA tends to ascribe new functions to genes which are highly connected within the network rather than learning additional, novel information from the connection patterns [6,8]. The implication of this “multifunctionality bias” is that GBA can seem to work in cross-validation settings, while providing predictions with little specific value.
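The multifunctionality bias can be illustrated with a toy sketch. It assumes an unnormalized neighbor-voting score as a simple stand-in for GBA; this is not the method of any study evaluated here, and the network, gene names, and seed set are hypothetical. In this example the most connected gene tops both the GBA ranking and a plain degree ranking, even though degree uses no disease information at all.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (gene, gene) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def neighbor_voting(adj, seeds):
    """Simple GBA score: how many of a gene's neighbors are known disease genes."""
    return {gene: len(nbrs & seeds) for gene, nbrs in adj.items()}

def node_degree(adj):
    """Generic score: number of network partners, ignoring the disease entirely."""
    return {gene: len(nbrs) for gene, nbrs in adj.items()}

# Hypothetical toy network: HUB is linked to every other gene.
edges = [("HUB", g) for g in ("A", "B", "C", "D", "E")] + [("A", "B"), ("D", "E")]
adj = build_adjacency(edges)
seeds = {"A", "D"}  # assumed "known disease genes" for this example

gba = neighbor_voting(adj, seeds)
deg = node_degree(adj)
top_gba = max(gba, key=gba.get)
top_deg = max(deg, key=deg.get)
```

Because a hub has more neighbors, it collects more votes from any seed set, which is one way a GBA method can appear to perform well in cross-validation while mostly recovering connectivity.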

