Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space.

Richard Wilton,Tamas Budavari,Ben Langmead,Steven L Salzberg,Alexander S Szalay,Sarah J Wheelan

doi:10.7717/peerj.808

Abstract

When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We followed this approach in implementing a read aligner called Arioc that uses GPU-based parallel sort and reduction techniques to identify high-priority locations where potential alignments may be found. We then carried out a read-by-read comparison of Arioc’s reported alignments with the alignments found by several leading read aligners. With simulated reads, Arioc has comparable or better accuracy than the other read aligners we tested. With human sequencing reads, Arioc demonstrates significantly greater throughput than the other aligners we evaluated across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.

Highlights

The cost and throughput of DNA sequencing have improved rapidly in the past several years (Glenn, 2011), with recent advances reducing the cost of sequencing a single human genome at 30-fold coverage to around $1,000 (Hayden, 2014).
The first and usually the most time-consuming step in analyzing such datasets is read alignment, the process of determining the point of origin of each sequencing read with respect to a reference genome.
We evaluated published results for a number of CPU-based and graphics processing unit (GPU)-based read aligners (Supplementary Table T1) and identified four whose speed or sensitivity made them candidates for direct comparison with the Arioc implementation.

Summary

INTRODUCTION

The cost and throughput of DNA sequencing have improved rapidly in the past several years (Glenn, 2011), with recent advances reducing the cost of sequencing a single human genome at 30-fold coverage to around $1,000 (Hayden, 2014).

GPUs are well-suited to software implementations where computations on many thousands of data items can be carried out independently in parallel. This characteristic has inspired a number of attempts to develop high-throughput read aligners that use GPU acceleration.

The salient problem in engineering a GPU-accelerated read aligner is that the most biologically relevant sequence-alignment algorithm (Smith & Waterman, 1981; Gotoh, 1982) is memory-intensive and involves dynamic programming dependencies that are awkward to compute efficiently in parallel. This consideration has militated against the development of parallel-threaded GPU implementations (Khajeh-Saeed, Poole & Perot, 2010) in which multiple threads of execution cooperate to compute a single alignment.

Seed-coverage prioritization

At run time, Arioc implements a heuristic that prioritizes alignments where a read contains two or more seeds that map to adjacent or nearby locations in the reference.
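
The seed-coverage idea can be illustrated with a small CPU-side sketch. It is not Arioc's implementation: the function name `prioritize_by_seed_coverage`, the parameters `max_gap` and `min_seeds`, and the representation of a seed hit as a `(read_offset, ref_pos)` pair are all assumptions made for illustration, and an ordinary Python sort and loop stand in for the GPU-based parallel sort and reduction mentioned in the abstract. Each hit is projected onto the reference position where the read would start, the projected positions are sorted, and nearby positions are merged so that locations covered by two or more distinct seeds can be ranked first.

```python
from typing import List, Tuple


def prioritize_by_seed_coverage(
    seed_hits: List[Tuple[int, int]],
    max_gap: int = 8,      # hypothetical tolerance for "nearby" (allows small indels)
    min_seeds: int = 2,    # "two or more seeds", as described in the text
) -> List[Tuple[int, int]]:
    """Rank candidate reference locations by how many distinct seeds support them.

    seed_hits holds (read_offset, ref_pos) pairs: the seed starting at
    read_offset in the read was found at ref_pos in the reference.
    Returns (candidate_read_start, seed_count) pairs, best-covered first.
    """
    # Sort step: key each hit by the implied start of the read on the reference.
    keyed = sorted((ref_pos - read_off, read_off) for read_off, ref_pos in seed_hits)

    # Reduction step: merge runs of nearby implied starts into one candidate.
    clusters = []  # each entry: [first_start, last_start, set_of_read_offsets]
    for start, read_off in keyed:
        if clusters and start - clusters[-1][1] <= max_gap:
            clusters[-1][1] = start
            clusters[-1][2].add(read_off)
        else:
            clusters.append([start, start, {read_off}])

    # Keep only candidates supported by enough distinct seeds, then rank them.
    candidates = [
        (first, len(offsets))
        for first, _, offsets in clusters
        if len(offsets) >= min_seeds
    ]
    candidates.sort(key=lambda c: (-c[1], c[0]))
    return candidates


# Illustrative (made-up) hits: three seeds agree near 1000, two near 7000,
# and two isolated hits have no supporting neighbour.
example_hits = [(0, 1000), (20, 1020), (40, 1043), (0, 5000),
                (0, 7000), (20, 7020), (20, 9020)]
ranked = prioritize_by_seed_coverage(example_hits)
```

In this sketch, locations supported by a single seed are dropped from the ranked list altogether; a real aligner would more plausibly defer them to a later, lower-priority pass rather than discard them.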

METHODS

RESULTS

DISCUSSION