Abstract

Background

The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size but also by many other features, defy conventional analysis and annotation methods.

Methodology/Principal Findings

In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations.

Conclusion/Significance

Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.

Highlights

  • The vast majority of microbes cannot be grown in pure cultures

  • One of our goals is to identify such Open Reading Frames (ORFs) by specific features of their clusters

  • Since the majority of Global Ocean Sampling (GOS) ORFs are partial sequences, we allowed a short sequence to be clustered with a long sequence if it was completely contained within the latter
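
The containment rule in the last highlight can be illustrated with a minimal sketch. It is not the CD-HIT implementation: it assumes exact substring containment, whereas the study clusters by sequence similarity, and the function name `cluster_by_containment` and the example sequences are hypothetical. It shows only the greedy, longest-first idea, in which a short (possibly partial) ORF joins a longer sequence that fully contains it.

```python
def cluster_by_containment(seqs):
    """Greedy longest-first clustering (illustrative only).

    seqs maps a sequence name to its amino-acid string. A sequence joins
    the first existing representative that contains it as an exact
    substring; otherwise it becomes a new representative. Exact matching
    is an assumption made for brevity, not the similarity-based test
    used in the study.
    """
    # Process longer sequences first so they become representatives.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters = {}  # representative name -> list of member names
    for name in order:
        seq = seqs[name]
        for rep in clusters:
            if seq in seqs[rep]:  # short sequence fully contained in long one
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # no container found: new cluster
    return clusters


# Made-up example sequences; orf2 and orf4 are fragments of orf1 and orf3.
example = {
    "orf1": "MKTAYIAKQRQISFVKSHFSRQ",
    "orf2": "AYIAKQRQ",
    "orf3": "MSLLTEVETPIR",
    "orf4": "TEVET",
}
result = cluster_by_containment(example)
```

Sorting by length first means every fragment is compared only against sequences at least as long as itself, which is what lets a partial ORF be absorbed by a longer one rather than the reverse.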

Introduction

The vast majority of microbes cannot be grown in pure cultures. Advances in sequencing technology allow us to study such microbes directly in their environment, without isolation and culturing. The first leg of the Sorcerer II Global Ocean Sampling (GOS) expedition sampled 41 locations from the northwestern Atlantic through the eastern tropical Pacific and obtained nearly 8 million environmental DNA reads. Such studies, with their great scale and diversity of data, challenge both our technical and conceptual approaches to gene and genome annotation. The GOS expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, by their sheer size and by many other features, defy conventional analysis and annotation methods.
