Abstract

Error correction of sequenced reads remains a difficult task, especially in single-cell sequencing projects with extremely non-uniform coverage. Existing error correction tools designed for standard (multi-cell) sequencing data usually fall short in single-cell sequencing projects, while the algorithms actually used for single-cell error correction have so far been very simplistic. We introduce several novel algorithms based on Hamming graphs and Bayesian subclustering in our new error correction tool BAYESHAMMER. While BAYESHAMMER was designed for single-cell sequencing, we demonstrate that it also improves on existing error correction tools for multi-cell sequencing data while working much faster on real-life datasets. We benchmark BAYESHAMMER on both k-mer counts and actual assembly results with the SPADES genome assembler.
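
To make the term "Hamming graph" concrete, here is a minimal sketch under stated assumptions: vertices are the distinct k-mers found in the reads, and two k-mers are joined when their Hamming distance is at most a threshold `tau`. The function names, the toy reads, and the choice `tau=1` are all hypothetical and are not taken from BAYESHAMMER. The last step, picking the most frequent k-mer in each component, is only a naive stand-in and does not implement Bayesian subclustering.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def hamming_components(kmers, tau=1):
    """Connected components of the graph joining k-mers at distance <= tau."""
    kmers = sorted(kmers)
    parent = {km: km for km in kmers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Quadratic all-pairs scan: acceptable for a toy example only.
    for a, b in combinations(kmers, 2):
        if hamming(a, b) <= tau:
            parent[find(a)] = find(b)

    groups = {}
    for km in kmers:
        groups.setdefault(find(km), []).append(km)
    return sorted(groups.values())


# Hypothetical toy reads: three copies of one sequence, one copy with a
# single substitution, and one unrelated read.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACCTAC", "TTGGAA"]
counts = kmer_counts(reads, k=6)
clusters = hamming_components(counts, tau=1)

# Naive stand-in for subclustering (an assumption, not the paper's method):
# treat the most frequent k-mer of each component as its representative.
centers = [max(cluster, key=counts.__getitem__) for cluster in clusters]
```

Grouping k-mers by similarity, rather than discarding them by a global count cutoff, is what lets an approach of this kind avoid assuming uniform coverage: a k-mer seen once is only suspicious relative to its neighbours in the same component.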

Highlights

  • Single-cell sequencing [1,2] based on the Multiple Displacement Amplification (MDA) technology [1,3] allows one to sequence genomes of important uncultivated bacteria that until recently had been viewed as unamenable to genome sequencing

  • We introduce the BAYESHAMMER error correction tool that does not rely on uniform coverage

  • Paired-end libraries were generated by an Illumina Genome Analyzer IIx from MDA-amplified single-cell DNA and from multi-cell genomic DNA prepared from cultured E. coli, respectively. These datasets consist of 100 bp paired-end reads with insert size 220; both E. coli datasets have average coverage ≈ 600×, but the coverage is highly non-uniform in the single-cell case


Introduction

Single-cell sequencing [1,2] based on the Multiple Displacement Amplification (MDA) technology [1,3] allows one to sequence genomes of important uncultivated bacteria that until recently had been viewed as unamenable to genome sequencing. Existing metagenomic approaches (aimed at genes rather than genomes) are clearly limited for studies of such bacteria, even though these bacteria represent the majority of species in such important studies as the Human Microbiome Project [4,5] or the discovery of new antibiotics-producing bacteria [6]. Single-cell sequencing datasets have extremely non-uniform coverage that may vary from single digits to thousands along a single genome (Figure 1). For many existing error correction tools, most notably QUAKE [7], uniform coverage is a prerequisite: in the case of non-uniform coverage they either do not work or produce poor results. Error correction tools often employ a simple idea of discarding rare k-mers, which
