Abstract

Advances in genome sequencing and annotation have eased the difficulty of identifying new gene sequences. Predicting the functions of these newly identified genes remains challenging. Genes descended from a common ancestral sequence are likely to have common functions. As a result, homology is widely used for gene function prediction. This means functional annotation errors also propagate from one species to another. Several approaches based on machine learning classification algorithms were evaluated for their ability to accurately predict gene function from non-homology gene features. Among the eight supervised classification algorithms evaluated, random-forest-based prediction consistently provided the most accurate gene function prediction. Non-homology-based functional annotation provides complementary strengths to homology-based annotation, with higher average performance in Biological Process GO terms, the domain where homology-based functional annotation performs the worst, and weaker performance in Molecular Function GO terms, the domain where the accuracy of homology-based functional annotation is highest. GO prediction models trained with homology-based annotations were able to successfully predict annotations from a manually curated "gold standard" GO annotation set. Non-homology-based functional annotation based on machine learning may ultimately prove useful both as a method to assign predicted functions to orphan genes which lack functionally characterized homologs, and to identify and correct functional annotation errors which were propagated through homology-based functional annotations.

Highlights

  • The rapid acceleration in genome sequencing is providing complete sequences for dozens of new plant species each year (Chen et al, 2018; Michael & Jackson, 2013)

  • Of the 28,775 annotated gene models in the TAIR10 A. thaliana reference genome, only 19.2% have functional annotations supported by mutant phenotypes and 24.5% have functional annotations supported by other types of experimental evidence

  • We evaluated the potential for using supervised machine-learning-based classification algorithms to predict the function of annotated maize genes using purely non-homology-based features, and seek to determine which kinds of molecular, structural, or chromatin features are likely to be more or less beneficial additions when estimating gene function using algorithms of this type

Introduction

The rapid acceleration in genome sequencing is providing complete sequences for dozens of new plant species each year (Chen et al, 2018; Michael & Jackson, 2013). Of the 28,775 annotated gene models in the TAIR10 A. thaliana reference genome, only 19.2% have functional annotations supported by mutant phenotypes (evidence code IMP) and 24.5% have functional annotations supported by other types of experimental evidence (e.g. IDA, IPI, IGI, IEP, HDA (inferred from high throughput direct assay), and HEP (inferred from high throughput expression pattern)). An additional 30.4% of A. thaliana gene models are functionally annotated based solely on protein features, sequence similarity, or other forms of evidence which are used to infer homology. These include GO (gene ontology) terms supported by the evidence codes ISS (inferred from sequence or structural similarity), ISM (inferred from sequence model), IBA (inferred from biological aspect of ancestor), IEA (inferred from electronic annotation), and RCA (inferred from reviewed computational analysis). The final 6.2% of Arabidopsis gene models lack any functional annotation (Lamesch et al, 2011).
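The evidence-code grouping described above can be written as a small lookup. The following is a minimal sketch, not the authors' procedure: the group names, the function names, and the gene identifiers are hypothetical. It also assumes each gene is counted under its strongest supporting category (mutant phenotype first, then other experimental evidence, then homology-type inference), which the text does not state explicitly.

```python
from collections import Counter

# Evidence-code groups as listed in the paragraph above.
# Group names are illustrative labels, not terms from the source.
EVIDENCE_GROUPS = {
    "mutant_phenotype": {"IMP"},
    "other_experimental": {"IDA", "IPI", "IGI", "IEP", "HDA", "HEP"},
    "homology_inferred": {"ISS", "ISM", "IBA", "IEA", "RCA"},
}

# Assumed ranking: a gene is assigned to the strongest group it has.
PRIORITY = ["mutant_phenotype", "other_experimental", "homology_inferred"]


def classify_code(code):
    """Map one GO evidence code to its group, or 'unlisted' if not named above."""
    code = code.strip().upper()
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return "unlisted"


def best_support(codes):
    """Return the strongest evidence group among a gene's annotation codes."""
    codes = list(codes)
    if not codes:
        return "unannotated"
    groups = {classify_code(c) for c in codes}
    for group in PRIORITY:
        if group in groups:
            return group
    return "unlisted"


# Hypothetical genes and codes, for illustration only.
annotations = {
    "gene_a": ["IEA", "IMP"],
    "gene_b": ["ISS"],
    "gene_c": [],
}
summary = Counter(best_support(codes) for codes in annotations.values())
```

Dividing each count in `summary` by the number of gene models would give per-category fractions comparable in form to the percentages quoted above, under the stated ranking assumption.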

