Abstract
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.
Highlights
SPs are short N-terminal amino acid sequences that target proteins to the secretory (Sec) pathway in eukaryotes and direct their translocation across the plasma membrane in prokaryotes
We opted for the bidirectional encoder representations from transformers (BERT) protein language model (LM), which is available in ProtTrans[6,7] and was trained on UniRef100 (Fig. 1b)
Most of the competing single-class predictors considered for benchmarking are optimized for detecting their respective SP type in a dataset of true and non-SP sequences; MCC1 best captures their performance on the task they were designed for
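For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) for a binary SP versus non-SP task can be sketched as follows. This is the standard textbook formula only; how the benchmark sets for MCC1 are assembled is not described in this summary, and the function name `mcc` and the zero-denominator convention are assumptions made for illustration.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp/fn: true SP sequences predicted as SP / as non-SP.
    tn/fp: non-SP sequences predicted as non-SP / as SP.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when a row or column of the
        # confusion matrix is empty and the coefficient is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A perfect predictor scores 1, a predictor that is always wrong scores -1, and one that does no better than chance scores 0, which makes the metric informative even when SP and non-SP sequences are unevenly represented.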
Summary
SPs are short N-terminal amino acid sequences that target proteins to the secretory (Sec) pathway in eukaryotes and direct their translocation across the plasma (inner) membrane in prokaryotes. We combined the LM with a conditional random field (CRF) probabilistic model[14] to predict the SP region at each sequence position together with the SP type, yielding the SignalP 6.0 architecture (Fig. 1d). This confirms for the two underrepresented types, Sec/SPIII and Tat/SPII
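To illustrate how a CRF turns per-position scores into a consistent labeling, here is a minimal Viterbi decoding sketch for a linear-chain CRF. It is not the SignalP 6.0 implementation: the two labels ("S" for signal peptide, "O" for the rest of the protein), the hard-coded emission scores (which in the real model would come from the LM) and the transition scores are all hypothetical, and SP type prediction is left out.

```python
import math

NEG_INF = -math.inf


def viterbi(emissions, transitions, start):
    """Return the highest-scoring label path for a linear-chain CRF.

    emissions[t][k]: score of label k at position t.
    transitions[j][k]: score of moving from label j to label k.
    start[k]: score of starting in label k.
    """
    n_labels = len(start)
    score = [start[k] + emissions[0][k] for k in range(n_labels)]
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            # Best previous label for arriving at label k.
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
            ptr.append(best_j)
        score = new_score
        backptr.append(ptr)
    # Trace the best path back from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical toy setup: label 0 = "S" (signal peptide), label 1 = "O".
LABELS = ["S", "O"]
start = [0.0, 0.0]
# Once the SP has ended, re-entering it is forbidden (O -> S = -inf).
transitions = [[0.0, 0.0],
               [NEG_INF, 0.0]]
# Made-up per-position scores standing in for LM-derived emissions.
emissions = [[2.0, 0.0], [1.0, 0.5], [-1.0, 1.0], [0.5, 0.0], [-1.0, 2.0]]

decoded = "".join(LABELS[k] for k in viterbi(emissions, transitions, start))
print(decoded)  # SSOOO
```

Position 4 on its own scores higher as "S", but the forbidden O to S transition keeps the decoded SP a single contiguous N-terminal region. Enforcing that kind of structural constraint is what the CRF contributes beyond independent per-position classification.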