Abstract

Background: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task.

Results: Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits differences in gene frequency between phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes using 417 published genome sequences. Both CGP methods achieved best areas under the receiver operating characteristic curve (AUC) of 0.911 in the Escherichia coli K-12 (EC-K12) genome and 0.978 in the Streptococcus agalactiae 2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. In the rediscovery of the peptidoglycan metabolism genes, a median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group. In the second experiment, a maximum 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms.

Conclusion: Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and the discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.
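To make the idea behind statistical CGP concrete, the following minimal sketch ranks genes by how much more often they occur in genomes that show a phenotype than in genomes that do not. The abstract does not specify the statistic used, so the plain frequency difference here is an assumption chosen for illustration, as are the function name `rank_genes_by_frequency_difference`, the toy genome labels, and the gene assignments.

```python
from typing import Dict, List, Set, Tuple


def rank_genes_by_frequency_difference(
    genome_genes: Dict[str, Set[str]],
    positive_genomes: List[str],
    negative_genomes: List[str],
) -> List[Tuple[str, float]]:
    """Rank genes by (frequency in positive group) - (frequency in negative group).

    This is an illustrative score only; the published method may use a
    different statistic.
    """
    if not positive_genomes or not negative_genomes:
        raise ValueError("both genome groups must be non-empty")

    all_genes = set().union(*genome_genes.values())
    scored = []
    for gene in all_genes:
        pos_freq = sum(gene in genome_genes[g] for g in positive_genomes) / len(positive_genomes)
        neg_freq = sum(gene in genome_genes[g] for g in negative_genomes) / len(negative_genomes)
        scored.append((gene, pos_freq - neg_freq))

    # Highest score first; ties broken by gene name so the order is deterministic.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Toy data (hypothetical): which genes occur in which genome.
toy_genomes = {
    "genome_A": {"murA", "murB", "recA"},
    "genome_B": {"murA", "recA"},
    "genome_C": {"recA", "ldhA"},
    "genome_D": {"recA"},
}
ranking = rank_genes_by_frequency_difference(
    toy_genomes,
    positive_genomes=["genome_A", "genome_B"],  # genomes with the phenotype
    negative_genomes=["genome_C", "genome_D"],  # genomes without it
)
for gene, score in ranking:
    print(f"{gene}\t{score:+.2f}")
```

Genes found in every genome of both groups score zero and fall to the middle of the list, while genes specific to the positive group rise to the top, which is the behaviour a prioritisation score needs.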

Highlights

  • In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score

  • High area under the receiver operating characteristic curve (AUC) values were obtained from stratified cross-validation of the inductive CGP experiments

  • Support vector machines (SVM) achieved near-perfect AUCs in both the C and B validation sets of Streptococcus agalactiae 2603 (SA-2603), whereas the alternating decision tree (ADTree) had the best AUC, 0.975, for the M set genes
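Since AUC is the main evaluation measure quoted here, a short sketch of how it can be computed from a gene ranking may help. It uses the standard pairwise interpretation of AUC: the probability that a randomly chosen target gene receives a higher score than a randomly chosen non-target gene, with ties counted as half. The function name `roc_auc` and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical prioritisation scores for four genes; label 1 marks a known target gene.
example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [1, 0, 1, 0]
example_auc = roc_auc(example_scores, example_labels)
print(example_auc)
```

The quadratic loop is easy to check by hand; for genome-scale gene lists a rank-based formulation would be the more practical choice.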


Summary

Introduction

In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. Supervised approaches based on phylogenetic profiles have been applied to infer protein networks (with canonical correlation analysis [16]), to predict protein functional class in Saccharomyces cerevisiae (with tree-based kernels [17]), to discover protein localisation in eukaryotes [18], and to annotate gene functions (by correlation enrichment [19]). These studies suggest that phylogenetic profiles are a valuable tool for predicting gene-function linkage. It was hypothesised that phylogenetic profiles can be exploited as gene features for prioritising genes that contribute to a particular phenotypic trait of interest, providing a practical and generalisable tool to guide microbiologists in gene selection.
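As a rough illustration of what using phylogenetic profiles as gene features can mean, the sketch below turns gene presence or absence across a fixed list of genomes into one binary vector per gene. Such vectors are the kind of input a supervised learner could be trained on. The function name `build_phylogenetic_profiles` and the toy genomes are assumptions made for this example; how the study itself encoded its features is not described in this excerpt.

```python
from typing import Dict, List, Set


def build_phylogenetic_profiles(
    genome_genes: Dict[str, Set[str]],
    genome_order: List[str],
) -> Dict[str, List[int]]:
    """Return one binary presence/absence vector per gene.

    Position i of a vector is 1 if the gene occurs in genome_order[i], else 0.
    """
    all_genes = sorted(set().union(*genome_genes.values()))
    return {
        gene: [1 if gene in genome_genes[genome] else 0 for genome in genome_order]
        for gene in all_genes
    }


# Toy data (hypothetical): gene content of three genomes.
toy_gene_content = {
    "genome_A": {"geneX", "geneY"},
    "genome_B": {"geneX"},
    "genome_C": {"geneY", "geneZ"},
}
profile_vectors = build_phylogenetic_profiles(
    toy_gene_content, ["genome_A", "genome_B", "genome_C"]
)
for gene, vector in profile_vectors.items():
    print(gene, vector)
```

Fixing the genome order up front matters because every gene's vector must use the same column for the same genome before any two profiles can be compared.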

