Abstract

Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.

Highlights

  • SPs are short N-terminal amino acid sequences that target proteins to the secretory (Sec) pathway in eukaryotes and for translocation across the plasma membrane in prokaryotes

  • We opted for the bidirectional encoder representations from transformers (BERT) protein language model (LM), which is available in ProtTrans[6,7] and was trained on UniRef100 (Fig. 1b)

  • Most of the competing single-class predictors considered for benchmarking are optimized for detecting their respective SP type in a dataset of true and non-SP sequences; MCC1 best captures their performance on the task they were designed for
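
For readers unfamiliar with the metric named in the last highlight, the Matthews correlation coefficient (MCC) can be computed from the four counts of a binary confusion matrix. The sketch below is illustrative only: the function name `mcc` and the zero-denominator convention are assumptions, and it does not reproduce how the benchmark assembles the dataset of true and non-SP sequences on which MCC1 is evaluated.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp/fn: SP sequences predicted as SP / as non-SP
    tn/fp: non-SP sequences predicted as non-SP / as SP
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A perfect predictor scores 1, a fully inverted one scores -1, and a predictor no better than chance scores close to 0, which is why the choice of negative set matters when comparing single-class and multi-class tools.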


Summary

Introduction

SPs are short N-terminal amino acid sequences that target proteins to the secretory (Sec) pathway in eukaryotes and for translocation across the plasma (inner) membrane in prokaryotes. We combined the LM with a conditional random field (CRF) probabilistic model[14] to predict the SP region at each sequence position together with the SP type, yielding the SignalP 6.0 architecture (Fig. 1d). This confirms for the two underrepresented types, Sec/SPIII and Tat/SPII
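
To illustrate why a CRF is placed on top of per-position scores, the following minimal sketch runs Viterbi decoding for a linear-chain CRF. It is not the SignalP 6.0 implementation: the two-label state set, the emission scores (standing in for LM outputs), and the transition matrix are hypothetical toy values chosen only to show how transition constraints yield a consistent label path.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label path of a linear-chain CRF.

    emissions[t][k]   : score of label k at sequence position t
    transitions[j][k] : score of moving from label j to label k
    """
    n_labels = len(transitions)
    score = list(emissions[0])   # best path score ending in each label
    backptr = []                 # best previous label, per position and label
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
            ptr.append(best_j)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical toy labels: 0 = signal peptide, 1 = mature protein.
FORBIDDEN = -1e9
transitions = [[0.0, 0.0],         # SP -> SP, SP -> mature
               [FORBIDDEN, 0.0]]   # mature -> SP is disallowed
emissions = [[2.0, 0.0], [1.0, 0.5], [0.0, 1.0], [0.6, 0.5], [0.0, 2.0]]

path = viterbi(emissions, transitions)
print(path)  # [0, 0, 1, 1, 1]
```

Taking the best label independently at each position would give 0, 0, 1, 0, 1 here, which re-enters the signal peptide after the mature protein has started. The transition scores rule that out, so the decoded path contains a single SP region followed by the mature protein.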
