SignalP 6.0 predicts all five types of signal peptides using protein language models

Felix Teufel, Ole Winther, Magnús Halldór Gíslason, Henrik Nielsen, José Juan Almagro Armenteros, Silas Irby Pihl, Alexander Rosenberg Johansen, Søren Brunak, Konstantinos D Tsirigos, Gunnar von Heijne

doi:10.1038/s41587-021-01156-3

Abstract

Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.

Highlights

SPs are short N-terminal amino acid sequences that target proteins to the secretory (Sec) pathway in eukaryotes and direct their translocation across the plasma membrane in prokaryotes
We opted for the bidirectional encoder representations from transformers (BERT) protein language model (LM), which is available in ProtTrans[6,7] and was trained on UniRef100 (Fig. 1b)
Most of the competing single-class predictors considered for benchmarking are optimized for detecting their respective SP type in a dataset of true SP and non-SP sequences; MCC1 best captures their performance on the task they were designed for
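
To make the metric in the last highlight concrete, here is a minimal sketch of a binary Matthews correlation coefficient (MCC), assuming MCC1 denotes the MCC computed on a dataset where each sequence is labeled either SP (1) or non-SP (0). The function name and the toy labels are illustrative only and are not taken from the paper or its benchmark.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC for two equal-length sequences of 0/1 labels.

    1 = sequence has a signal peptide, 0 = it does not (assumed encoding).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy labels: two true SPs, two non-SPs, one SP missed.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
print(round(matthews_corrcoef(toy_true, toy_pred), 4))  # 0.5774
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when non-SP sequences greatly outnumber SP sequences.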

Summary

Introduction

SPs are short N-terminal amino acid sequences that target proteins to the secretory (Sec) pathway in eukaryotes and direct their translocation across the plasma (inner) membrane in prokaryotes. We combined the LM with a conditional random field (CRF) probabilistic model[14] to predict the SP region at each sequence position together with the SP type, yielding the SignalP 6.0 architecture (Fig. 1d). This confirms for the two underrepresented types, Sec/SPIII and Tat/SPII
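
To illustrate how a CRF layer can turn per-position scores into a consistent region labeling, here is a minimal sketch of Viterbi decoding for a linear-chain CRF. It is not the SignalP 6.0 implementation: the two-label state space ("S" for signal peptide, "M" for mature protein), the scores, and the function name are all hypothetical. In the real architecture the emission scores would come from the LM, and the label set would also encode SP type and region.

```python
import math

NEG_INF = -math.inf


def viterbi(emissions, transitions, start):
    """Return the highest-scoring label path of a linear-chain CRF.

    emissions[t][k]   : score of label k at position t
    transitions[i][j] : score of moving from label i to label j
    start[k]          : score of starting in label k
    """
    if not emissions:
        return []
    n_labels = len(start)
    score = [start[k] + emissions[0][k] for k in range(n_labels)]
    backptrs = []

    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j at position t.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)

    # Backtrack from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical toy example. Labels: 0 = "S" (signal peptide), 1 = "M" (mature).
labels = ["S", "M"]
start = [0.0, 0.0]
transitions = [
    [0.0, 0.0],      # S -> S, S -> M allowed
    [NEG_INF, 0.0],  # M -> S forbidden: an SP cannot restart after it ends
]
emissions = [
    [2.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.8, 0.0],  # locally prefers "S", but the transition constraint overrides it
    [0.0, 2.0],
]
print([labels[k] for k in viterbi(emissions, transitions, start)])
# ['S', 'S', 'M', 'M', 'M']
```

A per-position argmax would label the fourth residue "S" and produce a broken SP region. Decoding over the whole sequence with transition constraints yields a single contiguous N-terminal region, which is the kind of structural consistency a CRF adds on top of position-wise scores.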

Results

Conclusion