EnzML: multi-label prediction of enzyme classes using InterPro signatures

Luna De Ferrari, Jano Van Hemert, Stuart Aitken, Igor Goryanin

doi:10.1186/1471-2105-13-61

Abstract

Background: Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. In this work we explore the capacity of InterPro sequence signatures to automatically predict enzymatic function.

Results: We present EnzML, a multi-label classification method that can also efficiently handle proteins with multiple enzymatic functions, of which there are 50,000 in UniProt. EnzML was evaluated using a standard set of 300,747 proteins for which the manually curated Swiss-Prot and KEGG databases have agreeing Enzyme Commission (EC) annotations. EnzML achieved more than 98% subset accuracy (exact match of all correct Enzyme Commission classes of a protein) for the entire dataset and between 87 and 97% subset accuracy in reannotating eight entire proteomes: human, mouse, rat, mouse-ear cress, fruit fly, the S. pombe yeast, the E. coli bacterium and the M. jannaschii archaebacterium. To understand the role played by the dataset size, we compared the cross-evaluation results of smaller datasets, either constructed at random or from specific taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. The results were confirmed even when the redundancy in the dataset was reduced using UniRef100, UniRef90 or UniRef50 clusters.

Conclusions: InterPro signatures are a compact and powerful attribute space for the prediction of enzymatic function. This representation makes multi-label machine learning feasible in reasonable time (30 minutes to train on 300,747 instances with 10,852 attributes and 2,201 class values) using the Mulan Binary Relevance Nearest Neighbours algorithm implementation (BR-kNN).
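To make the ideas of binary relevance nearest-neighbour prediction and subset accuracy concrete, here is a minimal Python sketch. It is not the Mulan BR-kNN implementation used in the paper: the Jaccard distance, the majority-vote threshold, the value of k, and the toy signature identifiers (`IPR_A` and so on) paired with EC numbers are all assumptions made purely for illustration.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Distance between two signature sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def br_knn_predict(train, query, k=1):
    """Predict a set of EC labels for `query` from its k nearest neighbours.

    train: list of (signature_set, ec_label_set) pairs.
    Each label is decided independently (binary relevance): it is
    predicted when at least half of the neighbours carry it.
    """
    neighbours = sorted(train, key=lambda item: jaccard_distance(item[0], query))[:k]
    votes = Counter(label for _, labels in neighbours for label in labels)
    return {label for label, count in votes.items() if 2 * count >= len(neighbours)}


def subset_accuracy(true_sets, predicted_sets):
    """Fraction of proteins whose predicted label set matches exactly."""
    matches = sum(1 for t, p in zip(true_sets, predicted_sets) if t == p)
    return matches / len(true_sets)


# Toy training data: invented signature sets with invented EC assignments.
train = [
    ({"IPR_A", "IPR_B"}, {"1.1.1.1"}),
    ({"IPR_C"}, {"2.7.1.1"}),
    ({"IPR_A", "IPR_C"}, {"1.1.1.1", "2.7.1.1"}),  # multi-functional protein
    ({"IPR_D"}, set()),                             # non-enzyme: no EC label
]

queries = [{"IPR_A", "IPR_B"}, {"IPR_A", "IPR_C"}, {"IPR_B"}]
truth = [{"1.1.1.1"}, {"1.1.1.1", "2.7.1.1"}, {"1.1.1.1", "2.7.1.1"}]

predictions = [br_knn_predict(train, q, k=1) for q in queries]
accuracy = subset_accuracy(truth, predictions)
print(predictions, accuracy)
```

In this toy run the third query receives only one of its two true labels, so it counts as a miss: subset accuracy gives no credit for partially correct label sets, which is what makes it a strict measure for multi-functional enzymes.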

Highlights

Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing
Despite some known limitations, such as some inconsistencies between the rules set by the nomenclature committee and the actual class definitions [7], we use the NC-IUBMB Enzyme Commission (EC) nomenclature to define enzymatic reactions, as it is the current standard for enzyme function classification
For each taxonomic domain we have investigated the individual proteome having most proteins in the SwissProt KEGG set: Methanocaldococcus jannaschii for archaea, Escherichia coli for bacteria, Schizosaccharomyces pombe for fungi, Drosophila melanogaster for invertebrates, Arabidopsis thaliana for plants, Homo sapiens for vertebrates

Summary

Introduction

Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. Assigning enzymatic function to the proteins in a genome is one of the first essential steps of metabolic reconstruction, which is important for biology, medicine, industrial production and environmental studies. At the current rate of genome sequencing and manual annotation, manual curation will never complete the functional annotation of all available proteomes [2]. In this work we propose and evaluate a method that uses InterPro sequence signatures to predict enzymatic functions automatically. Despite some known limitations, such as some inconsistencies between the rules set by the nomenclature committee and the actual class definitions [7], we use the NC-IUBMB Enzyme Commission (EC) nomenclature to define enzymatic reactions, as it is the current standard for enzyme function classification. In an EC number, the first three digits represent an increasingly detailed definition of the reaction class, while the last digit represents the accepted substrates.
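The hierarchical structure of an EC number can be illustrated with a short Python sketch. The helper names `ec_levels` and `shared_depth` are hypothetical and are not taken from the paper; the sketch also assumes complete four-field EC numbers and does not handle partial ones.

```python
def ec_levels(ec_number):
    """Return the prefixes of an EC number, from top-level class to full number."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return [".".join(parts[:i]) for i in range(1, 5)]


def shared_depth(ec_a, ec_b):
    """Count how many leading EC levels two numbers have in common (0 to 4)."""
    depth = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        depth += 1
    return depth


levels = ec_levels("1.1.1.1")
same_reaction_class = shared_depth("1.1.1.1", "1.1.1.2")
print(levels, same_reaction_class)
```

Two EC numbers that agree on their first three fields describe the same reaction class and differ only in the last field, which the text above associates with the accepted substrates.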

Methods

Results

Discussion

Conclusion


Journal: BMC Bioinformatics	Publication Date: Apr 25, 2012
Citations: 61	License type: cc-by

Similar Papers

ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature
Alperen Dalkiran ... Ahmet Sureyya Rifaioglu
BMC Bioinformatics | VOL. 19
21 Sep 2018

Relationship between global structural parameters and Enzyme Commission hierarchy: Implications for function prediction
Marcelo Boareto ... Vitor B.P Leite
Computational Biology and Chemistry | VOL. 40
13 Aug 2012

Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context
Yong-Cui Wang ... Yong Wang
BMC Systems Biology | VOL. 5
20 Jun 2011

New avenues in protein function prediction
Iddo Friedberg ... Martin Jambon
Protein Science | VOL. 15
01 Jun 2006
