Abstract

Cystine-stabilized peptides have great utility as they naturally block ion channels, inhibit acetylcholine receptors, or inactivate microbes. However, only a tiny fraction of these peptides has been characterized. Exploration for novel peptides most efficiently starts with the identification of candidates from genome sequence data. Unfortunately, though cystine-stabilized peptides have shared structures, they have low DNA sequence similarity, restricting the utility of BLAST and even more powerful sequence alignment-based annotation algorithms, such as PSI-BLAST and HMMER. In contrast, a supervised machine learning approach may improve discovery and function assignment of these peptides. To this end, we employed our previously described m-NGSG algorithm, which utilizes hidden signatures embedded in peptide primary sequences that define and categorize structural or functional classes of peptides. From the generalized m-NGSG framework, we derived five specific models that categorize cystine-stabilized peptide sequences into specific functional classes. When compared with PSI-BLAST, HMMER, and existing function-specific models, our novel approach (named CSPred) consistently demonstrates superior performance in discovery and function assignment. We also report an interactive version of CSPred, available through download (https://bitbucket.org/sm_islam/cystine-stabilized-proteins/src) or web interface (watson.ecs.baylor.edu/cspred), for the discovery of cystine-stabilized peptides of specific function from genomic datasets and for genome annotation. In the Availability section following the Discussion, we fully describe the quick and simple usage of the CSPred website to automatically deliver function assignments for batch submissions of peptide sequences.

Highlights

  • Cystine-stabilized peptides are impressively abundant and widespread across the taxa

  • The positive and negative datasets for ion channel blockers (ICB), antimicrobial peptides (AMP), acetylcholine receptor inhibitors (ACRI), serine protease inhibitors (SPI), and hemolytic proteins (HLP) are generated by obtaining protein sequences from UniProtKB[44] using the search keys mentioned in Supplement Table 1

  • From the positive and negative datasets of each selected functional group, 90% of the chains are retained for training sets, while 10% of the chains are reserved for out-of-sample test sets using a random shuffle-split process
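
The 90%/10% shuffle-split described in the last highlight can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `shuffle_split`, the fixed seeds, and the placeholder sequence identifiers are all hypothetical, and the positive and negative sets are split separately here so that each class contributes 10% of its chains to the test set.

```python
import random


def shuffle_split(sequences, test_fraction=0.1, seed=0):
    """Shuffle a copy of `sequences` and hold out `test_fraction` of it.

    Returns (train, test). The name and defaults are illustrative only.
    """
    rng = random.Random(seed)          # seeded for a reproducible split
    shuffled = list(sequences)         # copy so the input is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: placeholder identifiers, not real UniProtKB entries.
positives = [f"pos_{i}" for i in range(20)]
negatives = [f"neg_{i}" for i in range(20)]

# Split each class on its own so both contribute 10% to the test set.
pos_train, pos_test = shuffle_split(positives, seed=1)
neg_train, neg_test = shuffle_split(negatives, seed=2)

train_set = pos_train + neg_train
test_set = pos_test + neg_test
```

Because the held-out chains never enter training, performance measured on `test_set` is an out-of-sample estimate.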


Summary

Introduction

Cystine-stabilized peptides are impressively abundant and widespread across the taxa. Machine learning-based supervised models are widely used to predict the functional and structural classes of proteins that are difficult to predict using sequence alignment-based algorithms. We demonstrated a complete classifier-construction pipeline in which the feature generation model is integrated with a logistic regression algorithm[38]. This training pipeline is denoted m-NGSG (modified n-gram and skip-gram): a framework based on modified n-grams[39] and skip-grams[40] generates descriptors from the protein sequences and uses the hidden signatures in those descriptors for supervised classification[41]. We applied m-NGSG to build five individual models to predict ion channel blockers, antimicrobial peptides, acetylcholine receptor inhibitors, serine protease inhibitors, and hemolytic proteins from disulfide-stabilized proteins. We present three classifiers that assign ion channel blockers to three subclasses: sodium, potassium, and calcium channel blockers.
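
To make the idea of n-gram and skip-gram descriptors concrete, the following minimal Python sketch counts plain n-grams and skip-grams in a peptide sequence. It is an assumption-laden illustration, not the m-NGSG algorithm itself: the modifications that m-NGSG applies are not described in this section, and the function names, the `A_1_D` key format, the default parameters, and the example sequence are all hypothetical.

```python
from collections import Counter


def ngrams(seq, n):
    """All contiguous substrings of length n."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def skipgrams(seq, k):
    """Residue pairs with exactly k residues skipped between them.

    Each pair is written as 'X_k_Y' (an illustrative key format).
    """
    return [f"{seq[i]}_{k}_{seq[i + k + 1]}" for i in range(len(seq) - k - 1)]


def featurize(seq, max_n=2, max_skip=2):
    """Count n-grams (n = 1..max_n) and skip-grams (k = 1..max_skip)."""
    feats = Counter()
    for n in range(1, max_n + 1):
        feats.update(ngrams(seq, n))
    for k in range(1, max_skip + 1):
        feats.update(skipgrams(seq, k))
    return feats


# Hypothetical peptide string, used only to show the shape of the output.
features = featurize("CKGKGAKCSRLMYDCC")
```

Count dictionaries of this kind can be turned into fixed-length vectors over a shared vocabulary and passed to a classifier such as the logistic regression mentioned above.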

