Abstract

There are currently 151 plants with draft genomes available, but levels of functional annotation for putative protein products are low. Accurate computational predictions are therefore essential, both to annotate genomes in the first instance and to provide focus for the more costly and time-consuming functional assays that follow. DNA-binding proteins are an important class of proteins that require annotation, but current computational methods are not applicable to genome-wide predictions in plant species. Here, we explore the use of species- and lineage-specific models for the prediction of DNA-binding proteins in plants. We show that a species-specific support vector machine model based on Arabidopsis sequence data is more accurate (accuracy 81%) than a generic model (74%), and based on this we develop a plant-specific model for predicting DNA-binding proteins. We apply this model to the tomato proteome and demonstrate its ability to perform accurate high-throughput prediction of DNA-binding proteins. In doing so, we have annotated 36 currently uncharacterised proteins by assigning a putative DNA-binding function. Our model is publicly available, and we propose that it be used in combination with existing tools to help increase annotation levels of DNA-binding proteins encoded in plant genomes.

Highlights

  • There are currently 151 draft plant genomes available in the NCBI genome database, collectively representing more than 115 gigabases

  • The first large-scale Critical Assessment of protein Function Annotation (CAFA) experiment featured more than 50 competing algorithms [3], and found that tools that predicted molecular functions were more accurate than those that predicted a protein’s involvement in a biological process

  • We identified five limitations shared by previous models: (i) the use of sequences from mixed prokaryotic, eukaryotic and species data sets for training, (ii) the restriction of training data sets to proteins with solved structures bound to DNA, (iii) reliance upon evolutionary relationships evident in position-specific scoring matrices (PSSMs), (iv) use of complex models with large numbers of feature

Summary

INTRODUCTION

There are currently 151 draft plant genomes available in the NCBI genome database (accessed 09/02/15), collectively representing more than 115 gigabases. A number of models have previously been developed to predict DNA-binding proteins (DNA-BPs) from amino acid sequence (summarised in Table 1), but most have limitations that restrict their application to whole-genome DNA-BP annotation in plants. There is a need to assess the performance of species- and lineage-specific prediction models for DNA-BPs. Restricting training data to DNA-BPs for which a solved structure of the protein–DNA complex is available in the Protein Data Bank (PDB) severely limits the number of proteins that can be used to train an SVM effectively. In developing our model, we first demonstrate that species-specific prediction models, built for the dicotyledonous model plant Arabidopsis (Arabidopsis thaliana) and for yeast (Saccharomyces cerevisiae), give more accurate predictions than the generic DNAbinder model [10]. Building upon this, we created a plant-specific model and tested its application to realistic data sets designed to simulate the relative proportion of DNA-BPs within a plant genome. These predictions reveal a large number of tomato proteins (1459) with possible DNA-binding activity and, from these, we highlight 36 currently uncharacterised proteins that we propose to be putative DNA-BPs.
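Testing on data sets that mirror the genomic proportion of DNA-BPs matters because precision depends on how rare the positive class is. The following minimal sketch illustrates that arithmetic only. The function names and the sensitivity, specificity and prevalence values are hypothetical, chosen for illustration, and are not taken from the study.

```python
def expected_precision(sensitivity, specificity, prevalence):
    """Expected fraction of positive predictions that are true DNA-BPs.

    All three arguments are proportions in [0, 1]. The values used in
    the demo below are hypothetical and are not results from the paper.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    total_pos = true_pos + false_pos
    if total_pos == 0.0:
        return 0.0  # no positive predictions are expected at all
    return true_pos / total_pos


def expected_accuracy(sensitivity, specificity, prevalence):
    """Expected fraction of all predictions that are correct."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


if __name__ == "__main__":
    # Hypothetical classifier: 80% sensitivity and 80% specificity.
    sens, spec = 0.80, 0.80
    # Balanced test set versus two assumed, rarer proportions of DNA-BPs.
    for prevalence in (0.50, 0.10, 0.05):
        acc = expected_accuracy(sens, spec, prevalence)
        prec = expected_precision(sens, spec, prevalence)
        print(f"prevalence={prevalence:.2f}  accuracy={acc:.2f}  precision={prec:.2f}")
```

Under these assumed values the accuracy is the same at every prevalence, while the expected precision falls as DNA-BPs become rarer. A balanced benchmark alone can therefore overstate how reliable genome-wide positive predictions will be.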

MATERIALS AND METHODS
Evaluation of SVM prediction models
RESULTS AND DISCUSSION