Abstract

The advent of next-generation sequencing technologies has greatly advanced the field of metagenomics, which studies genetic material recovered directly from an environment. Characterizing the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. The genomes contained in a metagenomic sample can be identified and quantified through homology searches of sequence reads against known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, which compromises accurate estimation of the relative abundance of the genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree, as many existing methods do, we propose a statistical framework to model the identified candidate genomes to which sequence reads have hits. After the proportion of reads generated by each genome has been estimated, sequence reads are assigned to the candidate genomes and the taxonomy tree based on an estimated probability that takes into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on simulated datasets and on two real datasets. It assigns reads to the low taxonomic ranks very accurately. Our statistical approach to the taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
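
The idea of first estimating genome proportions and then assigning reads by a probability that combines alignment evidence with abundance can be illustrated with a generic finite mixture fitted by expectation-maximization. The sketch below is an illustration only and is not the TAMER implementation: the function names, the toy weight matrix, and the assumption that each alignment score has already been converted to a non-negative weight (zero where a read has no hit) are all hypothetical.

```python
# Illustrative sketch only; not the TAMER code. Assumes weights[i][k] >= 0 is
# a score-derived weight for read i against candidate genome k (0 = no hit).

def posterior(weights, proportions):
    """Per-read probability of each genome, proportional to proportion * weight."""
    result = []
    for row in weights:
        joint = [p * w for p, w in zip(proportions, row)]
        total = sum(joint)
        if total == 0:
            raise ValueError("every read needs at least one hit with positive weight")
        result.append([j / total for j in joint])
    return result


def estimate_proportions(weights, n_iter=200, tol=1e-10):
    """EM updates for the mixing proportions, with the weights held fixed."""
    n_reads = len(weights)
    n_genomes = len(weights[0])
    proportions = [1.0 / n_genomes] * n_genomes  # start from a uniform guess
    for _ in range(n_iter):
        resp = posterior(weights, proportions)  # E-step
        updated = [sum(row[k] for row in resp) / n_reads
                   for k in range(n_genomes)]   # M-step
        change = max(abs(a - b) for a, b in zip(proportions, updated))
        proportions = updated
        if change < tol:
            break
    return proportions


def assign_reads(weights, proportions):
    """Assign each read to the genome with the highest posterior probability."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in posterior(weights, proportions)]


# Toy data (hypothetical): three reads hit only genome 0, one hits only
# genome 1, and the last read hits both genomes equally well.
weights = [
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
proportions = estimate_proportions(weights)
labels = assign_reads(weights, proportions)
```

In this toy case the alignment weights alone cannot place the last read, but the estimated proportions favor genome 0, so the read is assigned there rather than being pushed to a higher, less specific rank.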

Highlights

  • Traditional and classical methods of genomics and microbiology allow researchers to study an individual microbial species obtained from the environment by isolating the organism into pure colonies using microbial culture techniques

  • To identify which of the K candidate genomes in the scoring matrix are truly contained in the metagenomic sample, we propose a statistical framework to model the matches between the reads and reference sequences

  • Results for Simulation Study 2: for the CARMA3 evaluation dataset, the results of TAMER and MEGAN are listed in Table 2, alongside the CARMA3 results reported in the original paper [10]


Summary

Introduction

Traditional and classical methods of genomics and microbiology allow researchers to study an individual microbial species obtained from the environment by isolating the organism into pure colonies using microbial culture techniques. This approach cannot capture the structure of the broader microbial community within the environmental sample, the relative representation of multiple genomes, or their interactions with each other and with the environment. Next-generation sequencers, e.g., Illumina/Solexa, Applied Biosystems’ SOLiD, and Roche’s 454 Life Sciences sequencing systems, have emerged as the future of genomics because they can rapidly generate large amounts of sequence data [3,4]. These new technologies greatly increase throughput while lowering the cost of metagenomic studies. Illumina/Solexa and SOLiD generate reads of 35–100 base pairs, while Roche 454 reads are approximately 100–400 base pairs in length.

