Contrastive learning on protein embeddings enlightens midnight zone.

Michael Heinzinger, Maria Littmann, Ian Sillitoe, Nicola Bordin, Christine Orengo, Burkhard Rost

doi:10.1093/nargab/lqac043

Abstract

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), which transfers information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce the use of single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by the hierarchical classification of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the ‘midnight zone’ of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques, which yielded performance comparable to or better than that of existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
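
The two ideas in the abstract, contrastive learning on embeddings and annotation transfer by embedding lookup, can be illustrated with a minimal Python sketch. It is not taken from the ProtTucker code base. The abstract does not name the loss function, so the triplet margin loss below is an assumption (one common contrastive objective), as are the Euclidean distance, the function names, the toy 2-D vectors and the placeholder labels "superfamily_A" and "superfamily_B".

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Assumed contrastive objective (hypothetical choice, not from the abstract).

    The loss is zero once the negative lies at least `margin` farther
    from the anchor than the positive does; otherwise it is positive.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def transfer_annotation(query, lookup):
    """Embedding-based annotation transfer as a nearest-neighbour lookup.

    `lookup` is a list of (embedding, label) pairs with known annotation.
    Returns the label of the closest entry and its distance to the query.
    """
    best_label, best_dist = None, math.inf
    for embedding, label in lookup:
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Toy 2-D vectors standing in for real pLM embeddings (illustrative only).
lookup = [
    ([0.0, 0.0], "superfamily_A"),  # placeholder label
    ([5.0, 5.0], "superfamily_B"),  # placeholder label
]
query = [0.5, 0.0]

label, dist = transfer_annotation(query, lookup)

# A triplet that already satisfies the margin, and one that violates it.
loss_satisfied = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
loss_violated = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

In this sketch the query inherits the annotation of its nearest neighbour in embedding space, and the returned distance could serve as a rough reliability signal: the larger it is, the less trustworthy the transfer. No alignment is computed at any point, which is consistent with the speed advantage the abstract reports.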

Journal: NAR genomics and bioinformatics	Publication Date: Mar 31, 2022
Citations: 54	License type: CC BY 4.0

