Prediction of Mitochondrial Proteins Using Support Vector Machine and Hidden Markov Model

Manish Kumar, Ruchi Verma, Gajendra P. S. Raghava

doi:10.1074/jbc.m511061200

Abstract

Mitochondria are among the core organelles of eukaryotic cells; hence, prediction of mitochondrial proteins is one of the major challenges in genome annotation. This study describes a method, MitPred, developed for predicting mitochondrial proteins with high accuracy. The data set used in this study was obtained from Guda, C., Fahy, E. & Subramaniam, S. (2004) Bioinformatics 20, 1785-1794. First, support vector machine-based modules were developed using the amino acid and dipeptide composition of proteins; these achieved accuracies of 78.37 and 79.38%, respectively. The accuracy improved further to 83.74% when the split amino acid composition (25 N-terminal residues, 25 C-terminal residues, and the remaining residues) of proteins was used. Next, a BLAST search and the support vector machine-based method were combined to reach 88.22% accuracy. Finally, we developed a hybrid approach that combined hidden Markov model profiles of domains (exclusively found in mitochondrial proteins) with the support vector machine-based method. We were able to predict mitochondrial proteins with 100% specificity at 56.36% sensitivity and with 80.50% specificity at 98.95% sensitivity. The method estimated 9.01, 6.35, 4.84, 3.95, and 4.25% of proteins to be mitochondrial in the Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, mouse, and human proteomes, respectively. MitPred is based on this hybrid approach.
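The three sequence encodings named in the abstract (amino acid composition, dipeptide composition, and split amino acid composition) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the use of percentages, the ordering of the three segments, and the handling of sequences shorter than 50 residues are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return 20 percentages, one per standard amino acid."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 percentages, one per ordered pair of adjacent residues."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs[d] for d in DIPEPTIDES)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * pairs[d] / total for d in DIPEPTIDES]


def split_amino_acid_composition(seq, n_term=25, c_term=25):
    """Return 60 values: composition of the N-terminal segment,
    the C-terminal segment, and the remaining residues, in that order.

    Assumption: for short sequences the N-terminal segment is filled
    first, then the C-terminal segment; empty segments yield zeros.
    c_term must be positive.
    """
    head = seq[:n_term]
    rest = seq[n_term:]
    tail = rest[-c_term:]    # whole of `rest` if it is shorter than c_term
    middle = rest[:-c_term]  # empty if `rest` is shorter than c_term
    return (amino_acid_composition(head)
            + amino_acid_composition(tail)
            + amino_acid_composition(middle))


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real protein.
    toy = "M" * 25 + "A" * 10 + "K" * 25
    features = split_amino_acid_composition(toy)
    print(len(features))
```

Each function returns a fixed-length vector (20, 400, or 60 values) regardless of protein length, which is what allows such features to be passed to a support vector machine. Splitting the sequence keeps the N-terminal composition separate from that of the rest of the protein, rather than averaging it away over the whole sequence.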

Highlights

A majority of mitochondrial proteins are synthesized in the cytoplasm, from where they are transported into mitochondria
First, we carried out a systematic analysis of the amino acid composition of both mitochondrial and non-mitochondrial proteins and, on the basis of the conclusions drawn, developed the prediction method
A machine learning technique, the support vector machine (SVM), was used for classifying proteins instead of the pI score used in MITOPRED

Summary

Introduction

A majority of mitochondrial proteins are synthesized in the cytoplasm, from where they are transported into mitochondria. Proteins that are imported into mitochondria contain a leader sequence at the N terminus that carries all the information needed to localize to mitochondria (6). This is not true for all mitochondrial proteins, however. A number of methods fall under the second category, where sorting signals present on the protein itself are used for prediction. This category includes TargetP (8), SignalP (9), and PSORT II (10). Although these methods are quite popular, their major limitation is that not all proteins have signals; for example, only around 25% of yeast mitochondrial proteins have “matrix-targeting signals” at the N terminus (11). Some popular methods in this category are ESLpred (12), HSLpred (13), NNPSL (7), and LOCSVMPSI (14). Although their overall performance is very good, their accuracy in predicting mitochondrial proteins is much lower than for proteins in other locations. A machine learning technique, the support vector machine (SVM), was used for classifying proteins instead of the pI score used in MITOPRED.

Methods

Results

Conclusion