Abstract

Background: Regions of interest identified through genetic linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. Traditionally this number is reduced by matching functional annotation to knowledge of the disease or phenotype in question. However, here we show that disease genes share patterns of sequence-based features that can provide a good basis for automatic prioritization of candidates by machine learning.

Results: We examined a variety of sequence-based features and found that for many of them there are significant differences between the sets of genes known to be involved in human hereditary disease and those not known to be involved in disease. We have created an automatic classifier called PROSPECTR based on those features using the alternating decision tree algorithm which ranks genes in the order of likelihood of involvement in disease. On average, PROSPECTR enriches lists for disease genes two-fold 77% of the time, five-fold 37% of the time and twenty-fold 11% of the time.

Conclusion: PROSPECTR is a simple and effective way to identify genes involved in Mendelian and oligogenic disorders. It performs markedly better than the single existing sequence-based classifier on novel data. PROSPECTR could save investigators looking at large regions of interest time and effort by prioritizing positional candidate genes for mutation detection and case-control association studies.
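The enrichment figures quoted in the abstract can be illustrated with a small sketch. It assumes one common reading of "k-fold enrichment", which the text here does not confirm: a ranked list of n candidate genes is enriched k-fold if the true disease gene appears at rank r with n / r >= k. The function names and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def fold_enrichment(rank: int, list_size: int) -> float:
    """Fold enrichment for a disease gene at 1-based `rank` in a ranked list.

    Assumed definition (not confirmed by the source): list_size / rank.
    """
    if not 1 <= rank <= list_size:
        raise ValueError("rank must lie between 1 and list_size")
    return list_size / rank


def fraction_enriched(results: List[Tuple[int, int]], fold: float) -> float:
    """Fraction of ranked lists that reach at least `fold` enrichment.

    `results` holds one (rank, list_size) pair per ranked list.
    """
    hits = sum(1 for rank, size in results if fold_enrichment(rank, size) >= fold)
    return hits / len(results)


# Hypothetical outcomes: four candidate lists of 100 genes each, with the
# disease gene ranked 1st, 10th, 50th and 80th respectively.
example_results = [(1, 100), (10, 100), (50, 100), (80, 100)]
```

Under this assumed definition, statements such as "two-fold 77% of the time" correspond to evaluating `fraction_enriched` over many test lists at `fold=2`.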

Highlights

  • Defining features and building the training set: A set of features was chosen based on a comparative study of ~18,000 known genes from Ensembl [14] that are not known to be involved in human disease and the 1,084 Ensembl genes listed in Online Mendelian Inheritance in Man (OMIM) [15]

  • We found that the genes listed in OMIM were far more likely to have well conserved best reciprocal hit (BRH) homologs with other species and in particular with mice; this concurs with previous studies [13,16]
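The best reciprocal hit (BRH) criterion mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the text does not say how similarity scores were obtained, so the score tables, gene identifiers, and function names below are all hypothetical. Two genes form a BRH pair when each is the other's highest-scoring match in the opposite species.

```python
from typing import Dict, List, Tuple

# Maps (query gene, target gene) to a similarity score; higher is better.
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its single highest-scoring target gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def best_reciprocal_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical similarity scores between human and mouse genes.
human_vs_mouse: Scores = {
    ("HUMAN_G1", "MOUSE_G1"): 950.0,
    ("HUMAN_G1", "MOUSE_G2"): 400.0,
    ("HUMAN_G2", "MOUSE_G2"): 300.0,
    ("HUMAN_G3", "MOUSE_G1"): 500.0,
}
mouse_vs_human: Scores = {
    ("MOUSE_G1", "HUMAN_G1"): 940.0,
    ("MOUSE_G2", "HUMAN_G1"): 410.0,
    ("MOUSE_G2", "HUMAN_G2"): 290.0,
}

brh_pairs = best_reciprocal_hits(human_vs_mouse, mouse_vs_human)
```

In this toy data only `HUMAN_G1` and `MOUSE_G1` pick each other, so `HUMAN_G2` and `HUMAN_G3` would be treated as lacking a BRH homolog in mouse.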

Summary

Introduction

Regions of interest identified through complex-trait linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. The traditional candidate-gene approach to reducing this number of genes to a manageable level involves attempting to match functional annotation to knowledge of the disease or phenotype under investigation. This approach has ...
