Abstract

High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Such sequence contamination is a serious concern for the quality of the data used in downstream analysis, as it can cause misassembly of sequence contigs and lead to erroneous conclusions. The removal of sequence contaminants is therefore a necessary step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (150 bp mean read length). DeconSeq is publicly available in standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we analyzed 202 previously published microbial and viral metagenomes for possible human DNA contamination and found it in 145 (72%) of them, with contaminating sequences making up as much as 64% of a metagenome. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human-associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.
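
The classification step mentioned above (flagging possible contaminants while discarding hits that match a non-contaminant genome more closely) can be illustrated with a minimal Python sketch. This is not DeconSeq's actual code: the `Hit` structure, the function names, the category labels, and the 90% identity and coverage thresholds are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Best alignment of one read against one database (hypothetical structure)."""
    identity: float  # percent identity of the alignment, 0-100
    coverage: float  # percent of the read covered by the alignment, 0-100


def passes(hit: Optional[Hit], min_identity: float, min_coverage: float) -> bool:
    """True if a hit exists and meets both thresholds."""
    return (
        hit is not None
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
    )


def classify_read(
    contaminant_hit: Optional[Hit],
    retain_hit: Optional[Hit],
    min_identity: float = 90.0,  # illustrative threshold, not a documented default
    min_coverage: float = 90.0,  # illustrative threshold, not a documented default
) -> str:
    """Label a read as 'clean', 'contaminant', or 'retained'.

    contaminant_hit: best hit against the contaminant database (e.g. human).
    retain_hit: best hit against the non-contaminant reference genomes.
    """
    if not passes(contaminant_hit, min_identity, min_coverage):
        return "clean"
    # A contaminant hit is treated as redundant when a non-contaminant
    # genome matches the read with higher identity.
    if (
        passes(retain_hit, min_identity, min_coverage)
        and retain_hit.identity > contaminant_hit.identity
    ):
        return "retained"
    return "contaminant"
```

In this sketch a read is removed only when its contaminant hit passes both thresholds and no non-contaminant genome explains it better, which mirrors the behavior described in the abstract without claiming to reproduce the tool's internals.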

Highlights

  • High-throughput sequencing technologies have made a huge impact on microbiology, providing a rapid and cost-effective way of generating draft genomes and allowing metagenomic exploration of microbial diversity

  • The National Institutes of Health (NIH) Roadmap Human Microbiome Project (HMP) initiative was jump-started to examine microbes associated with health and disease in several areas of the human body [7,8]

  • Sequence contamination is a serious concern to the quality of genomic and metagenomic data used for downstream analysis

Introduction

High-throughput sequencing technologies have made a huge impact on microbiology, providing a rapid and cost-effective way of generating draft genomes and allowing metagenomic exploration of microbial diversity. Metagenomics, the survey of microbial or viral communities (and their encoded metabolic activities) from distinct environments, has been rapidly expanding over the past several years from its origins in environmental microbiology [1,2,3,4,5,6]. Metagenomics has been enabled by advances in second-generation sequencing, with current sequencing machines generating reads that are shorter than those generated with gel-capillary technology. The performance characteristics of high-throughput sequencing machines such as Roche/454’s GS FLX, Illumina/Solexa’s GA IIx, and Life Technologies’ SOLiD system are changing rapidly with respect to machine capacity, run time, read length, error profile, and cost per base.
