Fast and efficient short read mapping based on a succinct hash index

Haowen Zhang,Bertil Schmidt,Weiguo Liu,Kaichao Fan,Yuandong Chan

doi:10.1186/s12859-018-2094-5

Abstract

Background: Various indexing techniques have been applied by next generation sequencing read mapping tools. The choice of a particular data structure is a trade-off between memory consumption, mapping throughput, and construction time.

Results: We present the succinct hash index, a novel data structure for read mapping that is a variant of the classical q-gram index with a particularly small memory footprint, occupying between 3.5 and 5.3 GB for a human reference genome under typical parameter settings. The succinct hash index features two novel seed selection algorithms (group seeding and variable-length seeding) and an efficient parallel construction algorithm, which we have implemented to design the FEM (Fast(F) and Efficient(E) read Mapper(M)) mapper. FEM can return all read mappings within a given edit distance. Our experimental results show that FEM is scalable and outperforms other state-of-the-art all-mappers in terms of both speed and memory footprint. Compared to Masai, FEM is an order of magnitude faster using a single thread and two orders of magnitude faster when using multiple threads. Furthermore, we observe a speedup of up to 2.8-fold compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3.

Conclusions: The presented succinct index is the first feasible implementation of the q-gram index functionality that occupies around 3.5 GB of memory for a whole human reference genome. FEM is freely available at https://github.com/haowenz/FEM.
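
For orientation, the classical q-gram index that the succinct hash index is described as a variant of can be sketched as below. This is a minimal illustration of the textbook structure only, not of the succinct hash index or of FEM's implementation; the function names and the toy reference string are assumptions made for the example.

```python
from collections import defaultdict


def build_qgram_index(reference: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of the reference to the start positions where it occurs.

    Hypothetical helper for illustration; a production index would use a
    compact integer encoding rather than Python strings and lists.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - q + 1):
        index[reference[pos:pos + q]].append(pos)
    return dict(index)


def lookup(index: dict[str, list[int]], seed: str) -> list[int]:
    """Return all reference positions of a seed, or an empty list if absent."""
    return index.get(seed, [])


# Toy reference (assumption): positions are 0-based.
reference = "ACGTACGTGA"
index = build_qgram_index(reference, 3)
hits = lookup(index, "ACG")  # candidate mapping locations for the seed "ACG"
```

Storing one position per reference q-gram is what makes the plain structure memory-hungry for a whole genome, which is the cost the abstract says the succinct variant reduces.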

Highlights

Various indexing techniques have been applied by next generation sequencing read mapping tools
Variable-length seeding: to tolerate indels, we propose variable-length seeding as another novel seed selection algorithm
Based on Lemma 2, we propose the basic idea of a naive variable-length seeding algorithm consisting of three steps. Step 1: we estimate the frequency of each seed of length k by accumulating the frequencies of its lstep sub-seeds
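
The frequency estimate in the first step can be sketched as follows. This is one possible reading of that step, not the authors' implementation: it assumes that lstep is the number of sub-seeds per seed, that the sub-seeds are the consecutive overlapping q-grams of the seed with q = k - lstep + 1, and that sub-seed frequencies are counted over the reference. The names `qgram_frequencies`, `estimate_seed_frequencies`, and `num_subseeds` are hypothetical.

```python
from collections import Counter


def qgram_frequencies(reference: str, q: int) -> Counter:
    """Count how often each q-gram occurs in the reference."""
    return Counter(reference[i:i + q] for i in range(len(reference) - q + 1))


def estimate_seed_frequencies(read: str, k: int, num_subseeds: int,
                              freq: Counter) -> list[int]:
    """Estimate the frequency of every length-k seed of the read.

    Assumption: each seed is covered by `num_subseeds` overlapping sub-seeds
    of length q = k - num_subseeds + 1, and the estimate is the sum of their
    frequencies in `freq`.
    """
    q = k - num_subseeds + 1
    if q < 1:
        raise ValueError("num_subseeds must not exceed k")
    estimates = []
    for start in range(len(read) - k + 1):
        seed = read[start:start + k]
        # Accumulate the frequencies of the sub-seeds contained in this seed.
        estimates.append(sum(freq[seed[off:off + q]] for off in range(num_subseeds)))
    return estimates


# Toy data (assumption): 3-gram frequencies from a short reference.
freq = qgram_frequencies("ACGTACGTGA", 3)
estimates = estimate_seed_frequencies("ACGTG", k=4, num_subseeds=2, freq=freq)
print(estimates)  # [4, 3]
```

Seeds with low estimates are the natural candidates for lookup, since they yield fewer candidate locations to verify; the remaining two steps of the algorithm are not given in this summary and are not sketched here.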

Summary

Introduction

Various indexing techniques have been applied by next generation sequencing read mapping tools. Next generation sequencing platforms can quickly and cheaply generate up to billions of short reads in a single run, which in turn increases the computational burden of genomic data analysis. The first step of most associated pipelines is the mapping of the generated reads to a reference genome. Best-mappers use heuristics to identify one or a few top mapping locations for each read, which can significantly improve speed. For some applications, such as ChIP-seq experiments [9], copy number variation, and RNA-seq transcript abundance quantification [10], it is often more desirable to use all-mappers, which identify all mapping locations of each read. We focus on designing an efficient and scalable all-mapper algorithm.

Methods

Results

Conclusion