Abstract

A complete map of the Arabidopsis (Arabidopsis thaliana) proteome is a major goal for the plant research community, because it would help determine the function and regulation of each encoded protein. Genome-wide prediction tools, such as those that localize gene products at the subcellular level, will substantially advance Arabidopsis gene annotation. To this end, we performed a comprehensive study in Arabidopsis and created an integrative support vector machine-based localization predictor called AtSubP (for Arabidopsis subcellular localization predictor). AtSubP is based on the combinatorial presence of diverse protein features: amino acid composition, sequence-order effects, terminal information, Position-Specific Scoring Matrix, and similarity search-based Position-Specific Iterated-Basic Local Alignment Search Tool information. When used to predict seven subcellular compartments in a 5-fold cross-validation test, our best hybrid-based classifier achieved an overall sensitivity of 91%, with high-confidence precision and Matthews correlation coefficient values of 90.9% and 0.89, respectively. Benchmarking AtSubP on two independent data sets, one from Swiss-Prot and another containing green fluorescent protein- and mass spectrometry-determined proteins, showed a significant improvement in the prediction accuracy of species-specific AtSubP over some widely used "general" tools such as TargetP, LOCtree, PA-SUB, MultiLoc, WoLF PSORT, Plant-PLoc, and our newly created All-Plant method. Cross-comparison of AtSubP on six nontrained eukaryotic organisms (rice [Oryza sativa], soybean [Glycine max], human [Homo sapiens], yeast [Saccharomyces cerevisiae], fruit fly [Drosophila melanogaster], and worm [Caenorhabditis elegans]) revealed inferior predictions for these organisms. AtSubP significantly outperformed all the prediction tools currently used for Arabidopsis proteome annotation and, therefore, may serve as a better complement for the plant research community.
A supplemental Web site that hosts all the training/testing data sets and whole proteome predictions is available at http://bioinfo3.noble.org/AtSubP/.
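
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how sensitivity, precision, and the Matthews correlation coefficient are conventionally computed for one compartment treated as a one-versus-rest problem. The function name and the counts in the usage note are illustrative assumptions, not values from the study, and the abstract does not state how the per-compartment values were averaged into the overall figures.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, precision, MCC) from one-vs-rest confusion counts.

    Hypothetical helper for illustration; standard textbook definitions.
    """
    # Sensitivity (recall): fraction of true members of the class recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: fraction of predicted members that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # MCC ranges from -1 to 1; defined as 0 here when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, precision, mcc
```

As a usage example with made-up counts, `binary_metrics(90, 10, 890, 10)` gives a sensitivity and precision of 0.9 each and an MCC of 8/9, roughly 0.89.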

Highlights

  • In the 5-fold cross-validation test, among all the approaches tried to maximize performance, the best overall sensitivity was achieved by a hybrid-based technique (H-IX) that combines the simple amino acid composition (AA), Position-Specific Scoring Matrix (PSSM)-based evolutionary information, and terminal-based N-Center-C composition with the binary output of Position-Specific Iterated (PSI)-BLAST (Table I)

  • We have provided a few examples of experimentally proven sequences from the SUBcellular location database for Arabidopsis (SUBA) for which TargetP gave incorrect or no predictions, whereas the AtSubP predictions matched the corresponding GFP data (Supplemental Table S21)
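
Two of the sequence features named in the highlights, the simple amino acid composition and the terminal-based N-Center-C composition, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names and the 25-residue terminal lengths are hypothetical, since the segment sizes used by AtSubP are not given here. The PSSM and PSI-BLAST components are omitted because they require external tools and databases.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Fraction of each standard amino acid in seq (20 values)."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard letters (e.g. X) are ignored
    return [c / total if total else 0.0 for c in counts]

def n_center_c_composition(seq, n_len=25, c_len=25):
    """Concatenated compositions of the N-terminal, center, and
    C-terminal parts of seq (60 values).

    n_len and c_len are assumed values chosen for illustration only.
    """
    n_part = seq[:n_len]
    rest = seq[n_len:]
    # For short sequences the center (and possibly C part) is empty.
    c_part = rest[-c_len:] if c_len else ""
    center = rest[:-c_len] if c_len else rest
    return (aa_composition(n_part)
            + aa_composition(center)
            + aa_composition(c_part))
```

The resulting fixed-length vectors are the kind of input a support vector machine can consume. Splitting the sequence keeps terminal signals, such as targeting peptides, from being diluted by the composition of the whole protein.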

Introduction

A complete map of the Arabidopsis (Arabidopsis thaliana) proteome is a major goal for the plant research community, because it would help determine the function and regulation of each encoded protein. Genome-wide prediction tools, such as those that localize gene products at the subcellular level, will substantially advance Arabidopsis gene annotation. A recent computational effort developed a plant species-specific prediction system, RSLpred, for genome-wide subcellular localization annotation of rice (Oryza sativa) proteins (Kaundal and Raghava, 2009). In Arabidopsis, the subcellular localization has been experimentally established (e.g. using GFP fusions, mass spectrometry [MS], or other approaches) for only about 6,000 proteins out of the total 27,379 protein-coding genes predicted by The Arabidopsis Information Resource (TAIR) release 9 (www.arabidopsis.org). To narrow this huge gap between the large number of predicted genes in the Arabidopsis genome and the limited experimental characterization of their corresponding proteins, a fully automatic and reliable prediction system for complete subcellular annotation of the Arabidopsis proteome would be very useful.
