Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2.

Nicola Prezza,Alberto Policriti,Francesco Vezzi,Max Käller

doi:10.1186/s12859-016-0910-3

Open Access

https://doi.org/10.1186/s12859-016-0910-3

Abstract

Background: Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect cytosine unmethylation rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be compressed, given its significant size.

Methods: The data structures proposed in the literature for aligning bisulfite-treated reads can be roughly grouped into two main categories: those storing pointers at each text position (e.g. hash tables, suffix trees/arrays), and those using the information-theoretic minimum number of bits (e.g. FM indexes and compressed suffix arrays). The former are fast but memory-consuming; the latter are light but much slower. In this paper, we try to close this gap by proposing a data structure for aligning bisulfite-treated reads which is at the same time fast, light, and very accurate. We reach this objective by combining a recent theoretical result on succinct hashing with a bisulfite-aware hash function. Furthermore, the new versions of the tools implementing our ideas (the aligner ERNE-BS5 2 and the caller ERNE-METH 2) have been extended with increased downstream compatibility (EPP/Bismark cov output formats), output compression, and support for target enrichment protocols.

Results: Experimental results on public and simulated WGBS libraries show that our algorithmic solution is a competitive tradeoff between hash-based and BWT-based indexes, being as fast and accurate as the former, and as memory-efficient as the latter.

Conclusions: The new functionalities of our bisulfite aligner and caller make it a fast and memory-efficient tool, useful to analyze big datasets with little computational resources, to easily process target enrichment data, and to produce statistics such as protocol efficiency and coverage as a function of the distance from target regions.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0910-3) contains supplementary material, which is available to authorized users.

Highlights

Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes
We compared the performances of our tool with two of the most widely used bisulfite aligners: Bismark version 0.14.3 [4] combined with both Bowtie 1 [18] and Bowtie 2 [8], and BSMAP version 2.90 [9]
We ran a test on a simulated high-coverage dataset (Arabidopsis thaliana genome, 24.6x coverage) in order to demonstrate the correctness of our methylation caller ERNE-METH 2 (ERNE: Extended Randomized Numerical alignEr)

Summary

Introduction

Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect cytosine unmethylation rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be compressed, given its significant size. Reads coming from highly unmethylated genomic regions are characterized by low cytosine content (since most of the Cs are converted into Ts). This loss of genomic complexity results in a higher number of ambiguous alignments in such regions than in more methylated ones, leading to potential biases. Space is often a concern during both the alignment and the methylation calling phases: the tools should use light data structures (fitting in main memory), and the methylation annotations (several data fields for each cytosine on both strands) should be compressed on-the-fly by the caller itself.
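The loss of complexity described above can be illustrated with a minimal Python sketch. It is not taken from ERNE: the helper names `bisulfite_convert` and `distinct_kmers` and the toy sequence are hypothetical, and the model assumes a fully unmethylated forward strand, so that every C is read as T.

```python
def bisulfite_convert(seq: str) -> str:
    """Simulate full bisulfite conversion of an unmethylated forward strand:
    every C is read as T (assumption: no methylated cytosines)."""
    return seq.upper().replace("C", "T")


def distinct_kmers(seq: str, k: int) -> int:
    """Count the distinct substrings of length k occurring in seq."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})


# Toy reference sequence, chosen only for illustration.
reference = "ACGTCCGATTCGACCTAGCTTACGCATCG"
converted = bisulfite_convert(reference)

# Conversion maps k-mers to k-mers, so the number of distinct k-mers
# can only stay the same or decrease.
print(distinct_kmers(reference, 4), distinct_kmers(converted, 4))

# Two different genomic k-mers become indistinguishable after conversion,
# which is the source of ambiguous alignments.
print(bisulfite_convert("ACG") == bisulfite_convert("ATG"))
```

Because distinct genomic k-mers can collapse onto the same converted k-mer, a read from a low-methylated region may match more genomic positions than the unconverted sequence would.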

Methods

Results

Discussion

Conclusion