Rapid identification of novel protein families using similarity searches.

Matt Jeffryes, Alex Bateman

doi:10.12688/f1000research.17315.1

Abstract

Protein family databases are an important tool for biologists trying to dissect the function of proteins. Comparing potential new families to the thousands of existing entries is an important task when operating a protein family database. This comparison helps to establish whether a collection of protein regions forms a novel family or overlaps with existing families of proteins. In this paper, we describe a method for performing this analysis with an adjustable level of accuracy, depending on the desired speed, enabling interactive comparisons. This method is based upon the MinHash algorithm, which we have further extended to calculate the Jaccard containment rather than the Jaccard index estimated by the original MinHash technique. Testing this method with the Pfam protein family database, we are able to compare a potential new family to the more than 17,000 existing families in Pfam in less than a second, with little loss in accuracy.
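
For readers unfamiliar with the two measures named above, the standard set-theoretic definitions are given here. The notation (A for the candidate family, B for an existing family, each treated as a finite, non-empty set) is an assumption made for illustration and is not taken from the paper.

```latex
% Jaccard index: symmetric overlap relative to the union
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Jaccard containment: fraction of A that is also in B (not symmetric)
C(A, B) = \frac{|A \cap B|}{|A|}

% Using |A \cup B| = |A| + |B| - |A \cap B|, the two are related by
|A \cap B| = \frac{J(A, B)\,\bigl(|A| + |B|\bigr)}{1 + J(A, B)}
\quad\Longrightarrow\quad
C(A, B) = \frac{J(A, B)\,\bigl(|A| + |B|\bigr)}{\bigl(1 + J(A, B)\bigr)\,|A|}
```

The containment is the more natural measure when a small candidate family may sit entirely inside a much larger existing family: the Jaccard index would then be small even though every member of the candidate is already covered.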

Highlights

Protein family databases are an important resource for biologists seeking to characterise the function of proteins.
For each of 50 randomly chosen Pfam families, we timed the exact calculation of the Jaccard index between that family and every family in Pfam, as well as the MinHash estimate of the Jaccard index with n values of 25, 50, 100, and 200.
For the family sizes tested, calculating the Jaccard containment was faster than calculating the Jaccard index.
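
The comparison described in these highlights can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that each family is represented as a set of string identifiers, that n is the number of hash functions in a MinHash signature, and that a containment estimate is obtained by converting the Jaccard estimate using the known set sizes. All function names are hypothetical.

```python
import hashlib


def _hash(item: str, seed: int) -> int:
    """Seeded 64-bit hash of a string (BLAKE2b with the seed as salt)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, n=100):
    """Signature of n minima, one per seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(n)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def estimate_containment(sig_a, sig_b, size_a, size_b):
    """Estimate |A & B| / |A| from the Jaccard estimate and the set sizes.

    Assumption for illustration: uses |A & B| = J * (|A| + |B|) / (1 + J).
    """
    j = estimate_jaccard(sig_a, sig_b)
    intersection = j * (size_a + size_b) / (1 + j)
    return min(1.0, intersection / size_a)


def exact_jaccard(a, b):
    """Exact Jaccard index of two non-empty sets."""
    return len(a & b) / len(a | b)


def exact_containment(a, b):
    """Exact fraction of a that is also in b."""
    return len(a & b) / len(a)
```

A signature is computed once per family and can be stored, so comparing a candidate against every existing family costs n integer comparisons per family regardless of family size. Larger n lowers the variance of the estimate at the price of more hashing and more comparisons, which is the speed and accuracy trade-off the timings above explore.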

Summary

Introduction

Protein family databases are an important resource for biologists seeking to characterise the function of proteins. The domains, motifs and other features found in a protein form an important organisational structure that can be used to design and interpret experiments on the protein of interest. Protein family databases generally describe a particular family using a sequence profile, often in the form of a hidden Markov model (HMM)[1]. The profile HMM is a representation of the multiple sequence alignment of a number of representatives of a family. The likelihood that a given sequence is a member of a family (that is, that it is homologous to the other members of the family) is estimated by the probability of its alignment to this profile HMM.

Methods

Results

Conclusion

Journal: F1000Research
Publication Date: Dec 24, 2018
Citations: 2
License type: CC BY 4.0