The prediction of organelle-targeting peptides in eukaryotic proteins with Grammatical-Restrained Hidden Conditional Random Fields

Valentina Indio, Castrense Savojardo, Piero Fariselli, Pier Luigi Martelli, Rita Casadio

doi:10.1093/bioinformatics/btt089

Open Access

Abstract

Targeting peptides are the most important signal controlling the import of nuclear-encoded proteins into mitochondria and plastids. In the absence of experimental information, their prediction is an essential step in proteome annotation for inferring both the localization and the sequence of mature proteins. We developed TPpred, a new predictor of organelle-targeting peptides based on Grammatical-Restrained Hidden Conditional Random Fields. TPpred is trained on a non-redundant dataset of 297 protein sequences in which the presence of a targeting peptide was experimentally validated. When tested on the 297 positive examples and 8010 negative examples, TPpred outperformed available methods in both accuracy and Matthews correlation index (96% and 0.58, respectively). Given its very low false-positive rate (3.0%), TPpred is well suited for large-scale analyses at the proteome level. We predicted that between ∼4 and 9% of the sequences in the human, Arabidopsis thaliana and yeast proteomes contain targeting peptides and are, therefore, likely to be localized in mitochondria and plastids. Where experimental annotation of subcellular localization is available, TPpred predictions correlate with it to a good extent. TPpred was also trained and tested to predict the cleavage site of the organelle-targeting peptide: on this task, the average error of TPpred on mitochondrial and plastidic proteins is 7 and 15 residues, respectively. These values are lower than the errors reported by other currently available methods. The TPpred datasets are available at http://biocomp.unibo.it/valentina/TPpred/. TPpred is available on request from the authors. Supplementary data are available at Bioinformatics online.
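The abstract reports accuracy, the Matthews correlation index (the quantity more commonly called the Matthews correlation coefficient) and the false-positive rate, plus an average cleavage-site error in residues. As a minimal sketch of how such figures are typically computed, the Python below uses the standard confusion-matrix definitions. The function names and the counts in the usage example are hypothetical and are not taken from the paper, and the cleavage-site error is assumed here to be the mean absolute difference between predicted and annotated positions, which the abstract does not state explicitly.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, Matthews correlation and false-positive rate from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # MCC denominator is the geometric mean of the four marginal totals.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # False-positive rate: fraction of true negatives flagged as positive.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "mcc": mcc, "fpr": fpr}


def mean_cleavage_error(predicted, observed):
    """Mean absolute distance (in residues) between predicted and annotated
    cleavage-site positions. Assumed definition, not confirmed by the source."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


if __name__ == "__main__":
    # Made-up counts for illustration only; NOT the TPpred results.
    print(confusion_metrics(tp=8, tn=85, fp=5, fn=2))
    # Made-up cleavage-site positions for two hypothetical proteins.
    print(mean_cleavage_error([30, 42], [25, 45]))
```

On a dataset as imbalanced as the one described (297 positives against 8010 negatives), accuracy is dominated by the negative class, which is why a correlation measure such as MCC is usually reported alongside it.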
