Abstract

Background: Metagenomic assembly is a challenging problem due to the presence of genetic material from multiple organisms. The problem becomes even more difficult when short reads produced by next-generation sequencing technologies are used. Although whole-genome assemblers are not designed to assemble metagenomic samples, they are being used for metagenomics because assemblers capable of dealing with metagenomic samples are lacking. We present an evaluation of the assembly of simulated short-read metagenomic samples using a state-of-the-art de Bruijn graph-based assembler.

Results: We assembled simulated metagenomic reads from datasets of various complexities using a state-of-the-art de Bruijn graph-based parallel assembler. We also studied the effect of the k-mer size used in the de Bruijn graph on metagenomic assembly and developed a clustering solution to pool the contigs obtained from different assembly runs, which allowed us to obtain longer contigs. We also assessed the degree of chimericity of the assembled contigs using an entropy/impurity metric and compared the metagenomic assemblies to assemblies of isolated individual source genomes.

Conclusions: Our results show that the accuracy of the assembled contigs was better than expected for the metagenomic samples with a few dominant organisms and was especially poor in samples containing many closely related strains. Clustering contigs obtained with different k-mer sizes of the de Bruijn graph allowed us to obtain longer contigs; however, the clustering resulted in an accumulation of erroneous contigs, thus increasing the error rate in the clustered contigs.
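To make the idea of an entropy-based chimericity measure concrete, the following is a minimal sketch and not the exact metric used in the study, whose definition is not given here. It assumes each read placed in a contig carries a label naming its source genome (available for simulated data), and it scores a contig by the Shannon entropy of those labels. The function name `contig_entropy` and the label format are hypothetical.

```python
import math
from collections import Counter


def contig_entropy(read_sources):
    """Shannon entropy (in bits) of the source-genome labels of the reads
    in one contig. Assumption: one label per read, known from simulation.
    A contig built from a single genome scores 0; the score grows as reads
    from more genomes are mixed in more evenly."""
    counts = Counter(read_sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("contig has no reads")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy += p * math.log2(1.0 / p)
    return entropy


if __name__ == "__main__":
    pure = ["genome_A"] * 4
    mixed = ["genome_A", "genome_A", "genome_B", "genome_B"]
    print(contig_entropy(pure), contig_entropy(mixed))
```

Under this sketch, a contig whose reads all come from one genome is treated as non-chimeric, while an even mix of two genomes yields one bit of entropy.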

Highlights

  • Metagenomic assembly is a challenging problem due to the presence of genetic material from multiple organisms

  • The assembly of a smaller dataset consisting of reads from 30 E. coli strains showed that the contigs obtainable through co-assembly of related strains are considerably shorter than those generated using isolate assemblies

  • We have evaluated metagenomic assemblies based on the accuracy of the generated contigs using alignment-based similarity to the source genomes, contig length statistics, and the proportions of the source genomes recovered by the contigs
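As an illustration of two of the evaluation criteria listed above, the sketch below computes N50, a widely used contig length statistic, and the fraction of a source genome covered by contig alignments. Which length statistics the study actually reports is not stated here, so N50 is an assumption, as are the function names and the half-open `(start, end)` interval format.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half
    of the total assembled bases."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def covered_fraction(alignments, genome_length):
    """Fraction of a genome covered by contig alignments.
    Assumption: alignments are half-open (start, end) intervals in
    genome coordinates; overlapping intervals are merged before summing."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(alignments):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


if __name__ == "__main__":
    print(n50([100, 200, 300, 400]))
    print(covered_fraction([(0, 50), (40, 100)], 200))
```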


Summary

Introduction

Metagenomic assembly is a challenging problem due to the presence of genetic material from multiple organisms. Metagenomics provides an unbiased view of the diversity and biological potential of microbial communities [1], and analysis of community samples from several different microbial environments has provided some key insights into the understandings of. One of the major challenges related to metagenomic processing is the assembly of short reads obtained from community samples. We have evaluated the performance of a state-of-the-art Eulerian-path-based sequence assembler on simulated metagenomic datasets using a read length of 36 base pairs (bp), as produced by the Solexa/Illumina sequencing technology. The datasets were meant to reflect the different complexities of real metagenomic samples [5]. They included a low-complexity dataset with one dominant organism, a high-complexity dataset with no dominant organism, and a medium-complexity dataset with a few dominant organisms. Since the metagenomic read datasets are voluminous, we used a parallel sequence assembly algorithm (ABySS [6]) that can be deployed on a commodity Linux cluster.
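For readers unfamiliar with Eulerian-path or de Bruijn graph assembly, the following toy sketch shows how short reads are broken into k-mers and linked into a graph. It is an illustration only and does not reflect how ABySS is implemented; the function name `de_bruijn_graph` is hypothetical, and error correction, reverse complements, and parallelism are all omitted.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, and each k-mer
    in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        # A read of length n yields n - k + 1 k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG"]
    for k in (3, 4):
        graph = de_bruijn_graph(reads, k)
        edges = sum(len(targets) for targets in graph.values())
        print(k, len(graph), edges)
```

With 36 bp reads, each read contributes 36 - k + 1 k-mers, so a larger k leaves fewer overlaps per read while separating similar sequences more sharply. This trade-off is one reason the choice of k-mer size matters for metagenomic samples.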
