Nearest neighbor search on embeddings rapidly identifies distant protein relations.

Konstantin Schütze, Michael Heinzinger, Martin Steinegger, Burkhard Rost

doi:10.3389/fbinf.2022.1033775

Abstract

Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") have used sequences and sequence profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges of applying it to multi-domain proteins and presented some ideas for how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial for providing deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and we made all our methods readily available for others.
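To make the core idea concrete, the following is a minimal sketch of a brute-force nearest neighbor search over per-protein embeddings. It is not the authors' implementation: the abstract does not state which distance measure or index was used, so cosine similarity is an assumption here, and the function names, protein identifiers, and 4-dimensional vectors are made up for illustration (real pLM embeddings have far more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, database, k=1):
    """Return the ids of the k database entries most similar to the query, best first.

    Brute force: the query is compared against every entry. The choice of
    cosine similarity is an assumption made for this sketch.
    """
    scored = [(cosine_similarity(query, emb), pid) for pid, emb in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]


# Hypothetical toy embeddings, one fixed-length vector per protein.
database = {
    "protein_A": [0.9, 0.1, 0.0, 0.2],
    "protein_B": [0.1, 0.8, 0.3, 0.0],
    "protein_C": [0.0, 0.2, 0.9, 0.4],
}
query = [0.8, 0.2, 0.1, 0.1]

hits = nearest_neighbors(query, database, k=2)
print(hits)
```

A single vector per protein is the simplest setup, and it fits the single-domain case best. For multi-domain proteins, one vector has to summarize several domains at once, which is one way to picture the challenge the abstract mentions.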

Journal: Frontiers in Bioinformatics
Publication Date: Nov 17, 2022
Citations: 22
License type: CC BY 4.0

Similar Papers

Fold homology detection using sequence fragment composition profiles of proteins
Armando D Solis ... Shalom R Rackovsky
Proteins: Structure, Function, and Bioinformatics | VOL. 78
16 Aug 2010

A fast Speaker Identification method using nearest neighbor distance
Hossein Zeinali ... Bagher Babaali
01 Oct 2012

Expanding the nitrogen regulatory protein superfamily: Homology detection at below random sequence identity.
Lisa N Kinch ... Nick V Grishin
Proteins: Structure, Function, and Bioinformatics | VOL. 48
09 May 2002

Rapid and enhanced remote homology detection by cascading hidden Markov model searches in sequence space.
Swati Kaushik ... Ramanathan Sowdhamini
Bioinformatics | VOL. 32
10 Oct 2015
