Sapling: accelerating suffix array queries with learned data models.

Melanie Kirsche, Arun Das, Michael C Schatz, Robinson Peter

doi:10.1093/bioinformatics/btaa911

Open Access

Abstract

As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query it requires widespread memory accesses, causing a large number of cache misses on large datasets. Here, we present Sapling, an algorithm for sequence alignment that uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, analyzing several neural network models and providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding <1% to the suffix array's memory footprint. The source code and tutorial are available open-source at https://github.com/mkirsche/sapling. Supplementary data are available at Bioinformatics online.
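To make the idea in the abstract concrete, here is a minimal sketch of a suffix array query guided by a piecewise linear model. It is an illustration under stated assumptions, not Sapling's actual implementation: the class name `LearnedSuffixArray`, the seed length `K`, the number of segments `NUM_BINS`, the naive suffix array construction and the toy sequence are all hypothetical choices made for this example. The model predicts where a k-mer should sit in the suffix array, and the worst prediction errors seen at build time bound a small window that is then binary searched.

```python
from bisect import bisect_left

K = 4          # seed length for this toy example (assumption)
NUM_BINS = 8   # number of linear segments in the model (assumption)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer into an integer (2 bits per base), preserving sort order."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


class LearnedSuffixArray:
    def __init__(self, text):
        self.text = text
        # Naive suffix array over suffixes that are at least K bases long.
        self.sa = sorted(range(len(text) - K + 1), key=lambda i: text[i:])
        # Encoded K-base prefixes, non-decreasing along the suffix array.
        keys = [encode(text[i:i + K]) for i in self.sa]
        self.bin_width = 4 ** K // NUM_BINS
        # Knot j holds the number of suffixes whose prefix is below j * bin_width.
        self.knots = [bisect_left(keys, j * self.bin_width)
                      for j in range(NUM_BINS + 1)]
        # Worst under- and over-prediction observed on the indexed k-mers.
        errors = [pos - self.predict(key) for pos, key in enumerate(keys)]
        self.min_err, self.max_err = min(errors), max(errors)

    def predict(self, key):
        """Linearly interpolate a suffix array position between two knots."""
        j = key // self.bin_width
        frac = (key - j * self.bin_width) / self.bin_width
        return int(self.knots[j] + frac * (self.knots[j + 1] - self.knots[j]))

    def find(self, kmer):
        """Return a text position where kmer occurs, or None if it is absent."""
        pred = self.predict(encode(kmer))
        lo = max(0, pred + self.min_err)
        hi = min(len(self.sa), pred + self.max_err + 1)
        # Binary search only inside the window allowed by the error bounds.
        while lo < hi:
            mid = (lo + hi) // 2
            if self.text[self.sa[mid]:self.sa[mid] + K] < kmer:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.sa) and self.text[self.sa[lo]:self.sa[lo] + K] == kmer:
            return self.sa[lo]
        return None


index = LearnedSuffixArray("GATTACAGATTACCAGGATC")
print(index.find("TTAC"))  # a position in the text where "TTAC" starts
```

Because the error bounds are measured on every k-mer that was indexed, any k-mer present in the text is guaranteed to fall inside the searched window; a k-mer that is absent simply fails the final equality check. The narrower the window, the fewer scattered memory accesses the search needs compared with a binary search over the whole array.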

Highlights

Aligning sequencing reads to a reference genome or collection of genomes is a key component of many genomic analysis pipelines, including variant calling (Nielsen et al, 2011), quantifying gene expression levels (RNA-seq) (Wang et al, 2009), identifying DNA-protein binding sites (ChIP-seq) (Park, 2009) and several others (Soon et al, 2013).
2.3 Modeling with Artificial Neural Networks (ANNs): The first method we explored for modeling the suffix array distribution was to use an Artificial Neural Network (ANN) (Cybenko, 1989) to learn the true mapping T(x).
We found that increasing the width of the ANN used for each bin in the model resulted in improved performance, without adding much overhead.

Summary

Introduction

Aligning sequencing reads to a reference genome or collection of genomes is a key component of many genomic analysis pipelines, including variant calling (Nielsen et al, 2011), quantifying gene expression levels (RNA-seq) (Wang et al, 2009), identifying DNA-protein binding sites (ChIP-seq) (Park, 2009) and several others (Soon et al, 2013). A common heuristic is seed-and-extend: short exact matches between the read and the reference are found first. These exact matches are used as candidate alignment sites, and each is scored based on how well the whole read aligns in the surrounding region. This heuristic has been shown to perform well in many genomic applications and is used by a large number of leading short- and long-read aligners, including STAR (Dobin et al, 2013), Bowtie (Langmead and Salzberg, 2012), BWA-MEM (Li, 2013), NGMLR (Sedlazeck et al, 2018) and many others. It is also used as a core routine for whole genome alignment (Marçais et al, 2018) and many other applications (Altschul et al, 1990).
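The heuristic described above, in which exact matches nominate candidate alignment sites that are then scored against the whole read, can be sketched as follows. This is a simplified illustration and not the method of any of the aligners cited: the function names `build_seed_index` and `align_read` are hypothetical, a plain dictionary stands in for the suffix array as the exact-match index, and candidates are scored by counting mismatches rather than by the gapped alignment that real aligners use.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align_read(read, reference, index, k):
    """Return (best_start, mismatches), or None if no seed matches exactly."""
    # Seeding: each exact k-mer match implies a candidate start for the read.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            start = ref_pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Extension: score each candidate by comparing the whole read to the region.
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


reference = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
seed_index = build_seed_index(reference, 4)
# A read copied from the reference with one base changed.
read = reference[8:13] + "C" + reference[14:20]
print(align_read(read, reference, seed_index, 4))
```

The seeding step is the part that a suffix array, or a learned model on top of it, would accelerate: every seed lookup is an exact-match query against the reference.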

Methods

Results

Conclusion