Analysis and comparison of very large metagenomes with fast clustering and functional annotation

Weizhong Li

doi:10.1186/1471-2105-10-359

Abstract

Background: The remarkable advance of metagenomics presents significant new challenges in data analysis. Metagenomic datasets (metagenomes) are large collections of sequencing reads from anonymous species within particular environments. Computational analyses for very large metagenomes are extremely time-consuming, and there are often many novel sequences in these metagenomes that are not fully utilized. The number of available metagenomes is rapidly increasing, so fast and efficient metagenome comparison methods are in great demand.

Results: The new metagenomic data analysis method Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) was developed using an ultra-fast sequence clustering algorithm, fast protein family annotation tools, and a novel statistical metagenome comparison method that employs a unique graphic interface. RAMMCAP processes extremely large datasets with only moderate computational effort. It identifies raw read clusters and protein clusters that may include novel gene families, and compares metagenomes using clusters or functional annotations calculated by RAMMCAP. In this study, RAMMCAP was applied to the two largest available metagenomic collections, the "Global Ocean Sampling" and the "Metagenomic Profiling of Nine Biomes".

Conclusion: RAMMCAP is a very fast method that can cluster and annotate one million metagenomic reads in only hundreds of CPU hours. It is available from .

Summary

Introduction

The remarkable advance of metagenomics presents significant new challenges in data analysis. Metagenomic data consist of enormous numbers of fragmented sequences, which makes analysis difficult both methodologically and computationally. To address these challenges, new methods and resources have been developed, including simulated datasets[10], IMG/M[11], CAMERA[12], MG-RAST[13], taxonomy tools[14,15], statistical comparison[16], functional diversity analysis[17], and binning[18,19,20]. The Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) presented here aims to address the particular computational challenges imposed by the huge size and great diversity of metagenomic data. The scale of these challenges is considerable: the protein analysis of the Global Ocean Sampling (GOS) study[2] took more than one million CPU hours.
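To make the idea of comparing metagenomes through clusters more concrete, the following is a minimal sketch of one possible comparison, assuming that reads from two samples have already been clustered together and counted per cluster. It uses a standard pooled two-proportion z-score, which is a stand-in chosen for illustration and not necessarily the statistic used by RAMMCAP. The function name `cluster_zscores` and the dictionary input format are hypothetical.

```python
import math


def cluster_zscores(counts_a, counts_b):
    """Return a two-proportion z-score per cluster for two samples.

    counts_a and counts_b map a cluster id to the number of reads from
    sample A and sample B in that cluster (hypothetical input format).
    A positive score means the cluster is relatively enriched in A.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample must contain at least one read")

    scores = {}
    for cluster in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(cluster, 0)
        b = counts_b.get(cluster, 0)
        # Pooled proportion of reads falling in this cluster.
        pooled = (a + b) / (total_a + total_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
        # se is zero when the cluster holds none or all of the reads.
        scores[cluster] = 0.0 if se == 0 else (a / total_a - b / total_b) / se
    return scores


if __name__ == "__main__":
    # Toy counts, invented purely for illustration.
    sample_a = {"cluster_1": 80, "cluster_2": 20}
    sample_b = {"cluster_1": 20, "cluster_2": 80}
    for cluster, z in cluster_zscores(sample_a, sample_b).items():
        print(cluster, round(z, 2))
```

Clusters with a large absolute score are candidates for being over- or under-represented in one sample. With many clusters, a correction for multiple testing would be needed before drawing conclusions.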

Methods

Results

Conclusion