Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

Shibu Yooseph,Granger Sutton,Weizhong Li

doi:10.1186/1471-2105-9-182


Open Access

https://doi.org/10.1186/1471-2105-9-182


Abstract

Background: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases, etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools.

Results: We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all compute that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA (http://camera.calit2.net).

Conclusion: The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.
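The general idea of incremental clustering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the k-mer Jaccard score stands in for a real sequence comparison, and the names `add_sequences`, `kmer_set`, the `threshold` value, and the cluster layout are all assumptions made for illustration. The point it shows is that each new sequence is compared only against existing cluster representatives, so no all-against-all comparison is needed.

```python
from typing import Dict, FrozenSet, List


def kmer_set(seq: str, k: int = 3) -> FrozenSet[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return frozenset(seq[i:i + k] for i in range(len(seq) - k + 1))


def similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two k-mer sets (a stand-in for an alignment score)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def add_sequences(clusters: List[Dict], new_seqs: List[str],
                  threshold: float = 0.5, k: int = 3) -> List[Dict]:
    """Recruit each new sequence to the best-matching existing cluster,
    or start a new cluster when no representative is similar enough.

    Each cluster is a dict {"rep": str, "members": [str, ...]}.
    Only representatives are compared, never all pairs of members.
    """
    # Copy so the caller's cluster list is left untouched.
    result = [{"rep": c["rep"], "members": list(c["members"])} for c in clusters]
    rep_kmers = [kmer_set(c["rep"], k) for c in result]

    for seq in new_seqs:
        query = kmer_set(seq, k)
        best_idx, best_score = -1, 0.0
        for idx, rep in enumerate(rep_kmers):
            score = similarity(query, rep)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx >= 0 and best_score >= threshold:
            result[best_idx]["members"].append(seq)       # join existing family
        else:
            result.append({"rep": seq, "members": [seq]})  # found a novel family
            rep_kmers.append(query)
    return result


existing = [{"rep": "MKTAYIAKQRQISFVKSHFSRQ",
             "members": ["MKTAYIAKQRQISFVKSHFSRQ"]}]
updated = add_sequences(existing, ["MKTAYIAKQRQISFVKSHFSRQL",
                                   "GGGGGWWWWWPPPPP",
                                   "GGGGGWWWWWPPPPPA"])
for cluster in updated:
    print(cluster["rep"], len(cluster["members"]))
```

In this toy run the first new sequence joins the existing cluster, the second founds a new one, and the third is recruited to that new cluster. A real pipeline would replace the Jaccard score with a proper sequence search against representatives or profiles.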

Highlights

The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities.
We present a computational improvement to a sequence clustering method that we introduced previously to analyze large microbial metagenomic datasets, and that was used in the Global Ocean Sampling (GOS) study [9].
We called 377,570 Open Reading Frames (ORFs) on these reads, and these were made available as input to our incremental clustering method.
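The text above does not say how the ORFs were called, so the following is only a generic sketch of one common way to do it: translate a read in all six frames and keep stop-to-stop peptides above a minimum length. The function names, the `min_len` parameter, and the choice of the standard genetic code are assumptions for illustration, not details taken from the paper.

```python
from itertools import product
from typing import List

# Standard genetic code, built in TCAG order (assumed; the paper's table is not stated here).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int) -> str:
    """Translate one reading frame; unknown codons (e.g. containing N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def call_orfs(read: str, min_len: int = 60) -> List[str]:
    """Return stop-to-stop peptides of at least min_len residues from all six frames."""
    read = read.upper()
    orfs = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            # Splitting on '*' yields the stretches between stop codons.
            for peptide in translate(strand, frame).split("*"):
                if len(peptide) >= min_len:
                    orfs.append(peptide)
    return orfs


print(call_orfs("ATGAAATTTGGGTAA", min_len=5))
```

Peptides produced this way could then be passed to a clustering step; a much smaller `min_len` is used in the example only because the toy read is short.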

Summary

Methodology article

Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering
Shibu Yooseph*†1, Weizhong Li†2 and Granger Sutton1
Address: 1J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA and 2California Institute for Telecommunications and Information Technology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Published: 10 April 2008
BMC Bioinformatics 2008, 9:182 doi:10.1186/1471-2105-9-182



Journal: BMC Bioinformatics
Publication Date: Apr 10, 2008
Citations: 89
License type: CC BY 2.0
