Abstract

Identifying virulent proteins in a de novo sequenced genome helps in estimating its pathogenic ability and in understanding the mechanism of pathogenesis. Similarly, identifying such proteins is valuable when comparing the metagenomes of healthy and diseased individuals and estimating the proportion of pathogenic species. The common challenge in both tasks is the identification itself, since a significant proportion of genomic and metagenomic proteins are novel and not yet annotated. The currently available tools for identifying virulent proteins provide limited accuracy and cannot be used on large datasets. We have therefore developed MP3, a standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 uses an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out fast, sensitive and accurate prediction of pathogenic proteins. On a blind dataset constructed from complete proteins, it achieved sensitivity, specificity, Matthews correlation coefficient (MCC) and accuracy values of 92%, 100%, 0.92 and 96%, respectively. On the two metagenomic blind datasets (Blind A: 51–100 amino acids and Blind B: 30–50 amino acids), it achieved sensitivity, specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A, and 71.60%, 94.48%, 0.67 and 81.86% for Blind B. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets. To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100–150 bp) metagenomic reads while also performing well on complete protein sequences. MP3 is publicly available at http://metagenomics.iiserb.ac.in/mp3/index.php.
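
For readers less familiar with the four reported measures, the following minimal sketch shows how sensitivity, specificity, accuracy and MCC are computed from a binary confusion matrix. The counts used here (100 pathogenic and 100 non-pathogenic proteins) are hypothetical and were chosen only because they reproduce the percentages reported for the complete-protein blind dataset; the actual dataset sizes are not stated in this section, and the function name `binary_metrics` is our own.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC for a binary classifier."""
    sensitivity = tp / (tp + fn)          # fraction of true pathogenic proteins recovered
    specificity = tn / (tn + fp)          # fraction of non-pathogenic proteins rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (an assumption, not taken from the paper).
metrics = binary_metrics(tp=92, fp=0, tn=100, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because MCC uses all four cells of the confusion matrix, it stays informative even when one class dominates, which is why it is commonly reported alongside accuracy.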

Highlights

  • At the default threshold of zero, the accuracies and Matthews Correlation Coefficient (MCC) values of the two modules were similar; however, the sensitivity (76.12%) of the dipeptide composition based module (Table 1) was much higher than the sensitivity (63.20%) of the Amino Acid Composition (AAC) based module (Table S1 and Figure S2 in File S1)
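
The two feature types compared above can be illustrated with a short sketch. It assumes the commonly used definitions: amino acid composition as the fraction of each of the 20 standard residues (20 features), and dipeptide composition as the fraction of each ordered residue pair among all overlapping pairs (400 features). The exact scaling and the handling of non-standard residues in MP3 are not described in this section, so those details, along with the function names and the example sequence, are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 features)."""
    seq = seq.upper()
    total = len(seq)
    return {aa: (seq.count(aa) / total if total else 0.0) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among overlapping pairs (400 features)."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped (assumption)
            counts[pair] += 1
    total = max(len(seq) - 1, 0)
    return {p: (c / total if total else 0.0) for p, c in counts.items()}


# Hypothetical five-residue fragment, for illustration only.
aac = amino_acid_composition("MKKLA")
dpc = dipeptide_composition("MKKLA")
print(len(aac), len(dpc))
print(aac["K"], dpc["KK"])
```

Unlike amino acid composition, dipeptide composition retains some local order information, which may partly explain why a module built on it can behave differently.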

Introduction

Comparisons of completed bacterial genome sequences of closely related species have revealed significant genomic variation between pathogenic and nonpathogenic bacteria [1]. One of the major differences is the presence of virulence-related genes in pathogenic bacteria. These virulence genes can be present on bacterial plasmids or chromosomes, sometimes as pathogenicity islands, and are absent in nonpathogenic strains of the same or closely related species [2]. Another study indicated that differences in capsular proteins between pathogenic Cryptococcus species and environmental species influence their virulence [5]. Sequence analysis of pathogenic and nonpathogenic Entamoeba histolytica isolates revealed significant evolutionary divergence and indicated that the pathogenic isolates are genetically distinct from the nonpathogenic isolates [6].
