NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly.

Jamison M Mccorrison, Indresh Singh, Pratap Venepally, Roger S Lasken, Barbara A Methé, Derrick E Fouts

doi:10.1186/s12859-014-0357-3

Open Access

Journal: BMC bioinformatics	Publication Date: Nov 19, 2014
Citations: 41	License type: CC BY 2.0

Affiliation: J. Craig Venter Institute

Abstract

Background: Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence independent single primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.

Results: Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median kmer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. The algorithm increases the incorporation of the most unique, lowest coverage, segments of a genome using an error-corrected data set. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm showed an increase in the rate at which the most unique reads in a genome were included in the assembled consensus while also reducing the count of duplicative and erroneous contigs (strings of high confidence overlaps) in the deliverable consensus. The results obtained from conventional Overlap-Layout-Consensus (OLC) were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input using sequence before and after normalization of coverage. Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms.

Conclusions: The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables High Throughput Sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source and available at https://github.com/bioh4x/NeatFreq.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-014-0357-3) contains supplementary material, which is available to authorized users.
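
The idea of binning reads by their median kmer frequency (RMKF) can be illustrated with a short Python sketch. This is not the NeatFreq implementation: the function names, the fixed bin width, and the "keep at most N reads per bin" rule are all assumptions made for illustration. It also ignores reverse complements, read uniqueness, and paired-end handling, which the abstract names as part of the actual method.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer of seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count k-mer occurrences across the whole read set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_median_kmer_frequency(read, counts, k):
    """Median data-set-wide frequency of the k-mers in one read (RMKF)."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0
    return median(counts[km] for km in read_kmers)


def bin_reads_by_rmkf(reads, k, bin_width):
    """Group reads into bins keyed by floor(RMKF / bin_width).

    bin_width is a hypothetical parameter chosen for this sketch.
    """
    counts = count_kmers(reads, k)
    bins = {}
    for read in reads:
        rmkf = read_median_kmer_frequency(read, counts, k)
        bins.setdefault(int(rmkf // bin_width), []).append(read)
    return bins


def select_up_to(bins, per_bin):
    """Keep at most per_bin reads from each bin, lowest-frequency bins first.

    Assumed simplification: sparse bins survive whole, while bins holding
    many high-frequency reads are truncated.
    """
    kept = []
    for key in sorted(bins):
        kept.extend(bins[key][:per_bin])
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
    bins = bin_reads_by_rmkf(reads, k=4, bin_width=2)
    print(select_up_to(bins, per_bin=2))
```

In this toy input the five identical reads fall into a high-frequency bin and are cut down to two, while the single rare read is kept, which is the behaviour a coverage normalizer aims for on a much larger scale.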

Highlights

Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes
Traditional OLC bacterial assemblers like Newbler and Celera WGS prefer 40-80-fold of uniform coverage across a single genome. Both algorithms often fail during consensus resolution of genomic regions with high coverage peaks, as shown by the failed experimental single cell sample assemblies missing in Additional file 3: Table S2
The targeted method is more effective at recruiting low coverage regions resulting from single cell amplification bias in variable coverage regions, including 0-fold regions
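
To make the fold-coverage figures in the highlights concrete, the sketch below computes average coverage as (number of reads × read length) / genome size and the fraction of reads a uniform random subsample would need to keep to reach a target fold. The function names and example numbers are hypothetical. A uniform random subsample scales every region by the same fraction, so it cannot flatten coverage peaks or rescue low coverage regions; it is shown only for the arithmetic and is not NeatFreq's selection method.

```python
def fold_coverage(num_reads, read_length, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def subsample_fraction(current_fold, target_fold):
    """Fraction of reads to keep so the average coverage drops to target_fold.

    Returns 1.0 when the data set is already at or below the target.
    """
    if current_fold <= 0:
        raise ValueError("current_fold must be positive")
    return min(1.0, target_fold / current_fold)


if __name__ == "__main__":
    # Hypothetical example: 20 million 100 bp reads on a 5 Mbp genome.
    current = fold_coverage(20_000_000, 100, 5_000_000)
    print(current, subsample_fraction(current, 60.0))
```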

Summary

Introduction

Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. Deep coverage variations in short-read data sets and high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence independent single primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases. Genomic libraries are randomly sampled from a population of molecules; this sampling is biased due to sample content and preparation. Such selection bias is even more prominent when MDA is used to amplify DNA from a single cell [3,4,5]. These tools are affected by the quality and level of coverage variability in the data set and often reduce fragmentation while increasing the quantity of erroneous or duplicative contigs that may obscure sequence representing true overlaps.

Results

Discussion

Conclusion