GNUMAP 4.0: Space and Time Efficient NGS Read Mapping Using the FM-Index

M Stanley Fujimoto, Cole A Lyman, Paul M Bodily, Mark J Clement, Quinn Snell

doi:10.36959/317/680

Abstract

In this article, we present GNUMAP (Genomic Next-generation Universal MAPper), a next-generation sequence read mapper that uses a Full-text index in Minute Space (FM-index), built on the Burrows-Wheeler Transform (BWT), as the reference genome data structure. Using the FM-index, GNUMAP maps reads with a dramatic decrease in memory usage while maintaining the same probabilistic mapping characteristics as previous versions.

Highlights

Next-generation sequencing (NGS) read mapping continues to be an important process in many “-omic” analysis pipelines
We have shown the benefits of using the FM-index in place of the hashmap data structure in the Genomic Next-generation Universal MAPper (GNUMAP)
We have also shown that parameter tuning, coupled with the other optimizations and bug fixes, overcomes the possible run-time issues

Summary

Introduction

Next-generation sequencing (NGS) read mapping continues to be an important process in many “-omic” analysis pipelines. The Genomic Next-generation Universal MAPper (GNUMAP) is a mapping algorithm that differentiates itself from other mappers by using a probabilistic Needleman-Wunsch (NW) alignment algorithm and by calculating a posterior probability score for multimapped reads [3]. The probabilistic NW is unique because it aligns reads to the reference genome using the raw Solexa/Illumina intensity or probability files rather than a single called base per position. This increases confidence in the mapping because all possible bases at a particular position are aligned with base-specific uncertainty taken into account. However, the data structure used to store the indexed reference genome can be too large to fit into memory on many machines, requiring nearly 40 GB of RAM for the human reference genome at certain k-mer sizes.
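To make the idea of aligning with base-specific uncertainty concrete, the following is a minimal sketch of a Needleman-Wunsch recurrence in which each read position is a probability distribution over A, C, G and T, and the substitution score is the expected score under that distribution. It is an illustration only, not GNUMAP's implementation: the function names (`one_hot`, `expected_score`, `probabilistic_nw`) and the match, mismatch and gap values are assumptions chosen for the example.

```python
from typing import Dict, List

BASES = "ACGT"


def one_hot(read: str) -> List[Dict[str, float]]:
    """Hypothetical helper: turn a called read into per-position
    distributions that put all probability on the called base."""
    return [{b: (1.0 if b == c else 0.0) for b in BASES} for c in read]


def expected_score(probs: Dict[str, float], ref_base: str,
                   match: float, mismatch: float) -> float:
    """Expected substitution score of one uncertain read position
    against one reference base (illustrative scoring scheme)."""
    return sum(p * (match if b == ref_base else mismatch)
               for b, p in probs.items())


def probabilistic_nw(read_probs: List[Dict[str, float]], reference: str,
                     match: float = 1.0, mismatch: float = -1.0,
                     gap: float = -2.0) -> float:
    """Global alignment score between an uncertain read and a reference
    string, using the standard Needleman-Wunsch recurrence with a
    linear gap penalty. All parameter values are assumptions."""
    n, m = len(read_probs), len(reference)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # First row and column: alignments that consist only of gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + expected_score(
                read_probs[i - 1], reference[j - 1], match, mismatch)
            up = score[i - 1][j] + gap      # gap in the reference
            left = score[i][j - 1] + gap    # gap in the read
            score[i][j] = max(diag, up, left)
    return score[n][m]


# A fully confident read, and one whose first base is equally
# likely to be A or C.
confident = one_hot("ACGT")
uncertain = one_hot("ACGT")
uncertain[0] = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}

confident_score = probabilistic_nw(confident, "ACGT")
uncertain_score = probabilistic_nw(uncertain, "ACGT")
```

With these assumed parameters, the uncertain read scores lower than the confident one against the same reference, because the ambiguous first position contributes only its expected score instead of a full match.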

Methods

Results

Conclusion