Abstract

We present a novel ab initio predictor of protein enzymatic class. The predictor can classify proteins, based solely on their sequences, into one of six classes extracted from the Enzyme Commission (EC) classification scheme, and is trained on a large, curated database of over 6,000 non-redundant proteins which we have assembled in this work. The predictor is powered by an ensemble of N-to-1 Neural Networks, a novel architecture which we have recently developed. N-to-1 Neural Networks operate on the full sequence and not on predefined features. All motifs of a predefined length (31 residues in this work) are considered and are compressed by an N-to-1 Neural Network into a feature vector which is automatically determined during training. We test our predictor in 10-fold cross-validation and obtain state-of-the-art results, with 96% correct classification and 86% generalized correlation. All six classes are predicted with a specificity of at least 80% and false positive rates never exceeding 7%. We are currently investigating enhanced input encoding schemes which include structural information, and are analyzing trained networks to mine the motifs that are most informative for the prediction and hence likely to be functionally relevant.
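To make the N-to-1 idea concrete, the following is a minimal, untrained sketch of a forward pass, not the authors' implementation. The abstract states only that every 31-residue motif is compressed into a learned feature vector. Everything else here is an assumption made for illustration: the one-hot encoding, the zero padding at the sequence ends, the averaging of per-motif vectors, the feature size `N_FEATURES`, the absence of bias terms, and the helper names `one_hot_window`, `make_layer`, `apply_layer` and `n_to_1_forward`.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 31        # motif length stated in the abstract
N_FEATURES = 8     # hypothetical size of the learned feature vector
N_CLASSES = 6      # six top-level EC classes


def one_hot_window(seq, center, window=WINDOW):
    """Flat one-hot encoding of the motif centred on `center`.

    Positions that fall outside the sequence stay all-zero (assumed padding).
    """
    half = window // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        slot = [0.0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            idx = AMINO_ACIDS.find(seq[pos])
            if idx >= 0:
                slot[idx] = 1.0
        vec.extend(slot)
    return vec


def make_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows, n_in columns); a stand-in for trained weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x, activation):
    """Dense layer without bias: activation(W x)."""
    return [activation(sum(w * v for w, v in zip(row, x))) for row in weights]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def n_to_1_forward(seq, motif_net, output_net):
    """Map a sequence of any length (N motifs) to one class distribution."""
    features = [0.0] * N_FEATURES
    for center in range(len(seq)):
        hidden = apply_layer(motif_net, one_hot_window(seq, center), math.tanh)
        features = [f + h for f, h in zip(features, hidden)]
    # Averaging keeps the feature vector the same size for every sequence length.
    features = [f / len(seq) for f in features]
    logits = apply_layer(output_net, features, lambda v: v)
    return features, softmax(logits)


rng = random.Random(0)
motif_net = make_layer(WINDOW * len(AMINO_ACIDS), N_FEATURES, rng)
output_net = make_layer(N_FEATURES, N_CLASSES, rng)
features, probs = n_to_1_forward("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", motif_net, output_net)
```

The point of the sketch is the shape of the computation: the same motif network is applied at every position, so sequences of different lengths all yield a feature vector of the same size, which a second network then maps to the six classes. In a trained model both weight sets would be learned jointly; here they are random, so the resulting probabilities carry no meaning.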

Highlights

  • Genome sequencing projects and high-throughput experimental procedures have produced a rapid growth in protein databases but only a small fraction of known sequences have been determined to have a function by experimental means

  • In spite of substantial interest by the research community in the prediction of protein functions, this, to date, remains a difficult problem for a number of reasons, partly because function itself is to an extent ill-defined, partly because we still lack a complete understanding of the complex relationship between sequences, structures and functions

  • It has been shown that secondary structure does not provide enough information to classify functions [17], and this is especially true for enzymes, which can exhibit both large amounts of structural variation within a single class and, conversely, very different enzymatic activities in spite of nearly identical structures

Introduction

Genome sequencing projects and high-throughput experimental procedures have produced a rapid growth in protein databases, but only a small fraction of known sequences have had their function determined experimentally. Determining or accurately predicting protein functions and enhancing the annotation of sequence databases is of paramount importance, both to expand our knowledge of the mechanisms of life and to develop new drugs [1]. While some predictive methods rely on amino acid sequence analysis only, others take advantage of physicochemical and structural properties, phylogenetic information or protein interactions, and many others rely on a combination of multiple data types. Predicting protein function from the three-dimensional structure has been the most successful method but, since protein structures are known for less than 1% of known protein sequences, most proteins of newly sequenced genomes have to be characterized by their amino acid sequences alone [1].
