Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

Thomas J Sharpton, Jonathan A Eisen, Katherine S Pollard, Dongying Wu, Guillaume Jospin, Morgan GI Langille

doi:10.1186/1471-2105-13-264

Abstract

Background: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences.

Results: We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as “Sifting Families,” or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology–based analyses.

Conclusions: We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).
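The sifting idea in the Results paragraph (recruit homologs of known families first, then search only the leftovers for new families) can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the authors' pipeline: the k-mer Jaccard score stands in for a real homology search, the single-linkage grouping stands in for the actual clustering step, and the names `similarity`, `sift`, `cluster_leftovers`, `update_families`, the `fam` ID prefix, and the 0.5 threshold are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy stand-in for a homology score: Jaccard overlap of k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def sift(families, new_proteins, threshold=0.5):
    """Assign new proteins to known families; return (updated, leftovers)."""
    updated = {fid: list(members) for fid, members in families.items()}
    leftovers = []
    for seq in new_proteins:
        best_id, best_score = None, 0.0
        for fid, members in families.items():
            score = max(similarity(seq, m) for m in members)
            if score > best_score:
                best_id, best_score = fid, score
        if best_id is not None and best_score >= threshold:
            updated[best_id].append(seq)   # homolog of a known family
        else:
            leftovers.append(seq)          # candidate member of a new family
    return updated, leftovers


def cluster_leftovers(leftovers, threshold=0.5):
    """Toy single-linkage grouping of the sequences that were not recruited."""
    clusters = []
    for seq in leftovers:
        merged, remaining = [seq], []
        for cluster in clusters:
            if any(similarity(seq, m) >= threshold for m in cluster):
                merged.extend(cluster)
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


def update_families(families, new_proteins, threshold=0.5):
    """One sifting round: recruit first, then cluster only the leftovers."""
    updated, leftovers = sift(families, new_proteins, threshold)
    next_id = len(updated) + 1
    for cluster in cluster_leftovers(leftovers, threshold):
        while f"fam{next_id}" in updated:  # avoid reusing an existing ID
            next_id += 1
        updated[f"fam{next_id}"] = cluster
    return updated


known = {"fam1": ["MKTAYIAKQR"]}
incoming = ["MKTAYIAKQL", "GGSWWPLLHE", "GGSWWPLLHD"]
result = update_families(known, incoming)
for family_id, members in result.items():
    print(family_id, members)
```

The point of the design is that the expensive all-against-all comparison in `cluster_leftovers` only sees sequences that no existing family could recruit, which is where the reduction in computational cost would come from.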

Highlights

New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects
With the goal of maximizing the coverage of protein family diversity, we developed a high-throughput procedure for identifying protein families that focuses on efficiency, automation, and updatability
Our procedure uses the Markov clustering algorithm (MCL) with pairwise BLASTP percent sequence identity as a distance measure to group protein sequences into families
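As a rough illustration of the last highlight, the following is a minimal pure-Python sketch of Markov clustering on a small graph whose edge weights are pairwise percent identities. It is not the MCL program itself and not the authors' code. The assumptions are mine: percent identity is used directly as an edge weight (higher means more similar), each protein gets a self-loop of weight 100, the inflation value is 2.0, and the function names `mcl`, `normalize_columns`, and `matmul` and the toy identities below are hypothetical.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product; fine for toy-sized inputs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(ids, edges, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster `ids` given `edges`: {(id_a, id_b): percent_identity}."""
    index = {name: i for i, name in enumerate(ids)}
    n = len(ids)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), weight in edges.items():
        i, j = index[a], index[b]
        m[i][j] = m[j][i] = weight
    for i in range(n):
        m[i][i] = 100.0  # assumed self-loop: a protein is identical to itself
    m = normalize_columns(m)

    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: flow spreads along two-step walks
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded]
        )  # inflation: strong flows are boosted, weak flows fade
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Rows that retain mass are attractors; the columns they attract form
    # a cluster. Duplicate rows for the same cluster collapse in the set.
    clusters = set()
    for i in range(n):
        members = frozenset(ids[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


ids = ["p1", "p2", "p3", "p4", "p5"]
edges = {
    ("p1", "p2"): 90.0,
    ("p2", "p3"): 85.0,
    ("p1", "p3"): 88.0,
    ("p4", "p5"): 95.0,
}
clusters = mcl(ids, edges)
print(clusters)
```

In this toy input the two groups share no edges, so flow can never pass between them. On real data the inflation parameter controls how finely connected proteins are split into families.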

Summary

Introduction

New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. Protein databases differ in a variety of ways, including how protein families are defined, the phylogenetic diversity represented in the database, the method by which protein families are identified, and the format in which family data is encoded. These resources have proven to be critical components of genome and metagenome annotation [6,18], evolutionary analysis (e.g., the evolution of biological functions [19]), and community ecology (e.g., the functional assortment of communities [20]). Maintaining an up-to-date catalog of protein family diversity is important to the development of these research fields.



Journal: BMC Bioinformatics
Publication Date: Oct 13, 2012
Citations: 54
License type: CC BY 2.0
