An analysis of protein language model embeddings for fold prediction.

Amelia Villegas-Morcillo, Angel M Gomez, Victoria Sanchez

doi:10.1093/bib/bbac142

Open Access

https://doi.org/10.1093/bib/bbac142

Journal: Briefings in Bioinformatics
Publication Date: Apr 21, 2022
Citations: 25
License type: cc-by-nc-nd

Affiliation: University of Granada

Abstract

The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignments (MSAs) as the input source. In contrast, protein language models (LMs) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction that uses pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at the amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. Taken together, these results suggest that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
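As a rough illustration of the ideas named in the abstract, the sketch below pools amino acid-level embeddings into one protein-level vector, scores a query against labeled templates for pairwise fold recognition, and averages class scores across models as a simple ensemble. It is not the authors' implementation: the abstract does not say which pooling, similarity measure, or ensemble rule is used, so mean pooling, cosine similarity, and score averaging are assumptions here, and all function names and toy vectors are hypothetical.

```python
import math


def mean_pool(residue_embeddings):
    """Average per-residue vectors (L x D) into one protein-level vector (D).
    Mean pooling is an assumption; other pooling schemes are possible."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[d] for row in residue_embeddings) / length for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_fold_recognition(query, templates):
    """Return the fold label of the template most similar to the query.
    `templates` is a list of (fold_label, embedding) pairs."""
    best_label, _ = max(templates, key=lambda t: cosine(query, t[1]))
    return best_label


def ensemble_scores(score_lists):
    """Average per-class scores from several models (assumed ensemble rule)."""
    n_models = len(score_lists)
    return [sum(column) / n_models for column in zip(*score_lists)]


# Toy data standing in for real protein-LM embeddings (hypothetical values).
query = mean_pool([[1.0, 0.0], [0.8, 0.2]])
templates = [("fold_A", [1.0, 0.1]), ("fold_B", [0.0, 1.0])]
print(pairwise_fold_recognition(query, templates))  # prints "fold_A"
```

In practice the vectors would come from a pre-trained model such as those compared in the paper, and the similarity or classification step would be learned by a fine-tuning network instead of being fixed as it is here.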

Similar Papers

MUSTANG-MR Structural Sieving Server: Applications in Protein Structural Analysis and Crystallography
Arun S Konagurthu ... James A Irving
PLoS ONE | VOL. 5
06 Apr 2010

The evolution of contact prediction: evidence that contact selection in statistical contact prediction is changing.
Mark Chonofsky ... Charlotte M Deane
Bioinformatics | VOL. 36
06 Nov 2019

Combining evolutionary information and neural networks to predict protein secondary structure.
Burkhard Rost ... Chris Sander
Proteins: Structure, Function, and Bioinformatics | VOL. 19
01 May 1994

Integration of Alignment and Phylogeny in the Whole-Genome Era
18 Jun 2015
