Protein embedding based alignment

Benjamin Giovanni Iovino, Yuzhen Ye

doi:10.1186/s12859-024-05699-5

Abstract

Purpose: Despite much progress in alignment algorithms, aligning divergent protein sequences with less than 20–35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have relied on substitution matrices to generate alignments since the matrices were created in the 1970s; however, these matrices do not score alignments well within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the matching score for a pair of amino acids is based on the similarity of their embeddings from a protein language model.

Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, showing how well PEbA aligns under different circumstances.

Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments. The size of the improvement in alignment quality varied with the level of sequence similarity (PEbA performed over four times as well for pairs of sequences with <10% identity). We also compared PEbA using embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.

Conclusion: Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
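
The core idea in the abstract, a Smith-Waterman-style local alignment whose match score comes from embedding similarity rather than a substitution matrix, can be illustrated with a minimal sketch. This is not the authors' implementation. Cosine similarity as the similarity measure, the linear gap penalty, the `scale` factor, and the function names are all assumptions made for illustration, and the toy one-hot vectors stand in for real per-residue embeddings from a protein language model.

```python
import math

STOP, DIAG, UP, LEFT = 0, 1, 2, 3


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_local_align(seq1, seq2, emb1, emb2, gap=-1.0, scale=1.0):
    """Local alignment where the match score is scale * cosine(emb1[i], emb2[j]).

    emb1 and emb2 hold one vector per residue of seq1 and seq2.
    gap is a linear per-position penalty (an assumption; affine gaps are also common).
    Returns (best_score, aligned_seq1, aligned_seq2).
    """
    n, m = len(seq1), len(seq2)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]      # score matrix
    ptr = [[STOP] * (m + 1) for _ in range(n + 1)]   # traceback pointers
    best, best_pos = 0.0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0.0, STOP),  # local alignment: scores never drop below zero
                (H[i - 1][j - 1] + scale * cosine(emb1[i - 1], emb2[j - 1]), DIAG),
                (H[i - 1][j] + gap, UP),
                (H[i][j - 1] + gap, LEFT),
            ]
            H[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best-scoring cell until a STOP pointer is reached.
    i, j = best_pos
    a1, a2 = [], []
    while ptr[i][j] != STOP:
        if ptr[i][j] == DIAG:
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == UP:
            a1.append(seq1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


# Toy one-hot "embeddings" (hypothetical; real embeddings would come from a language model).
TOY = {"A": [1.0, 0.0, 0.0], "C": [0.0, 1.0, 0.0], "D": [0.0, 0.0, 1.0]}


def toy_embed(seq):
    return [TOY[r] for r in seq]


if __name__ == "__main__":
    print(embedding_local_align("ACD", "AD", toy_embed("ACD"), toy_embed("AD"), gap=-0.5))
```

With real embeddings, each residue's vector depends on its sequence context, so two identical amino acids can score differently at different positions. A fixed substitution matrix cannot capture this.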

Journal: BMC Bioinformatics	Publication Date: Feb 28, 2024
Citations: 4	License type: CC BY 4.0
