Abstract

Background

The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets contain organisms with varying GC composition, codon usage biases, and other properties, so gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools.

Results

We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The improvement is based on an incremental clustering method that does not require the expensive all-against-all computation needed by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA (http://camera.calit2.net).

Conclusion

The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.
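To make the incremental idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the k-mer Jaccard `similarity` function, the `threshold` value, and the names `kmers` and `incremental_cluster` are all illustrative assumptions standing in for a real sequence comparison step. The point it shows is that each new sequence is compared only against cluster representatives, never against every other sequence, and that unmatched sequences seed new clusters that later sequences can join.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Toy similarity (assumption): Jaccard index over k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def incremental_cluster(
    clusters: Dict[str, List[str]],
    new_seqs: List[str],
    threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Add new sequences to existing clusters without all-against-all work.

    `clusters` maps a representative sequence to its members
    (the representative itself included). The input is not modified.
    """
    updated = {rep: list(members) for rep, members in clusters.items()}
    for seq in new_seqs:
        best_rep, best_score = None, threshold
        # Compare against representatives only, not against every member.
        for rep in updated:
            score = similarity(seq, rep)
            if score >= best_score:
                best_rep, best_score = rep, score
        if best_rep is None:
            # No existing family matches: seed a candidate novel cluster.
            updated[seq] = [seq]
        else:
            updated[best_rep].append(seq)
    return updated
```

In this sketch a novel family emerges when a second unmatched sequence resembles an earlier unmatched one, because the earlier sequence has already become a representative. A real pipeline would replace `similarity` with a profile or alignment based comparison to retain remote homology detection.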

Highlights

  • We present a computational improvement to a sequence clustering method that we introduced previously to analyze large microbial metagenomic datasets, and that was used in the Global Ocean Sampling (GOS) study [9]

  • We called 377,570 Open Reading Frames (ORFs) on these reads, and these ORFs were made available as input to our incremental clustering method


Summary

Methodology article

Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

Shibu Yooseph*†1, Weizhong Li†2 and Granger Sutton1

1J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA and 2California Institute for Telecommunications and Information Technology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA

Published: 10 April 2008. BMC Bioinformatics 2008, 9:182. doi:10.1186/1471-2105-9-182
