Detecting remote evolutionary relationships among proteins by large-scale semantic embedding.

Iain Melvin, William Stafford Noble, Christina Leslie, Jason Weston

doi:10.1371/journal.pcbi.1001047

Abstract

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional “semantic space.” Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.

Highlights

Using sequence similarity between proteins to detect evolutionary relationships—protein homology detection—is one of the most fundamental and longest studied problems in computational biology
Searching a protein or DNA sequence database to find sequences that are evolutionarily related to a query is one of the foundational problems in computational biology
These database searches rely on pairwise comparisons of sequence similarity between the query and targets, but despite years of method refinements, pairwise comparisons still often fail to detect more distantly related targets

Introduction

Using sequence similarity between proteins to detect evolutionary relationships, known as protein homology detection, is one of the most fundamental and longest studied problems in computational biology. Because protein sequence data will always be far more abundant than high-quality 3D structural data, the computational challenge is to infer evolutionarily conserved structure and function from subtle sequence similarities. Stated in purely computational terms, remote homology detection involves searching a protein database for sequences that are evolutionarily related (even remotely) to a given query sequence. Most work in this area has focused on developing more sensitive pairwise comparisons between the query and target sequences, including sequence-sequence local alignments (BLAST [1], Smith-Waterman [2]); profile-sequence (PSI-BLAST [3]) and HMM-sequence comparisons (HMMER [4]); and, most recently, profile-profile [5] and HMM-HMM (HHPred/HHSearch [6]) comparisons. Motivated by the success of Google’s PageRank algorithm, we previously developed RANKPROP [7], an algorithm that uses graph diffusion on the protein similarity network, defined on a large protein sequence database, in order to re-rank target sequences relative to the query and substantially improve remote homology detection.
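
The idea of re-ranking targets by diffusion over a similarity network can be illustrated with a small sketch. This is a generic diffusion in the style of personalized PageRank, not the published RANKPROP update rule; the function name `diffuse_scores`, the parameters `alpha` and `n_iter`, and the toy weights are all assumptions made for illustration.

```python
def diffuse_scores(similarity, query_scores, alpha=0.5, n_iter=20):
    """Propagate direct query scores over a target-target similarity network.

    similarity[i][j]: non-negative edge weight between targets i and j
                      (hypothetical input, e.g. derived from pairwise scores)
    query_scores[i]:  direct pairwise score of target i against the query
    alpha:            weight given to scores propagated from neighbours
    """
    n = len(query_scores)
    # Row-normalise weights so each target averages over its neighbours.
    norm = []
    for row in similarity:
        total = sum(row)
        norm.append([w / total if total > 0 else 0.0 for w in row])

    scores = list(query_scores)
    for _ in range(n_iter):
        # Each new score mixes the direct score with the neighbours' scores.
        scores = [
            (1 - alpha) * query_scores[i]
            + alpha * sum(norm[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy network: target 2 has no direct similarity to the query,
# but is linked to target 0, which matches the query strongly.
# Target 1 has a weak direct score and no neighbours.
similarity = [
    [0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0],
]
query_scores = [1.0, 0.2, 0.0]

scores = diffuse_scores(similarity, query_scores)
ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranked)  # target 2 now ranks above target 1
```

In this toy case the direct scores alone would order the targets 0, 1, 2, whereas the diffused scores promote target 2 because of its link to a strong hit. That is the intuition behind using network structure to recover remote homologs that a purely pairwise comparison misses.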


Journal: PLoS Computational Biology
Publication Date: Jan 27, 2011
Citations: 50
License type: CC BY 4.0
