Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads

Jordà Polo, Zoran Jakšić, Nicola Cadenelli, David Carrera

doi:10.1016/j.future.2018.11.028

Abstract

The recent upsurge in available health data and the advances in next-generation sequencing are setting the ground for the long-awaited precision medicine. To process this deluge of data, bioinformatics workloads are becoming more complex and more computationally demanding. For this reason, they have been extended to support different computing architectures, such as GPUs and FPGAs, to leverage the form of parallelism typical of each architecture. The paper describes how a genomics workload such as k-mer frequency counting that takes advantage of a GPU can be offloaded to one or more FPGAs. Moreover, it performs a comprehensive analysis of the FPGA acceleration, comparing its performance to a non-accelerated configuration and to a GPU-accelerated one. Lastly, the paper focuses on how, when using accelerators with a throughput-oriented workload, one should take into consideration both kernel execution time and how well each accelerator board overlaps kernels and PCIe transfers. Results show that acceleration with two FPGAs can improve both time- and energy-to-solution for the entire accelerated part by a factor of 1.32x. In contrast, acceleration with one GPU delivers an improvement of 1.77x in time-to-solution but a lower 1.49x in energy-to-solution due to persistently higher power consumption. The paper also evaluates how future FPGA boards with components (i.e., off-chip memory and PCIe) on par with those of the GPU board could provide an energy-efficient alternative to GPUs.
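
The relation between the reported time and energy factors can be made explicit with a short worked calculation. This is a sketch under assumptions not stated in the abstract: energy is taken as average power times elapsed time over the accelerated part, and both factors are assumed to be measured against the same non-accelerated baseline. The symbols S_t, S_E, and \bar{P} are introduced here for illustration, and the resulting power ratios are derived values, not figures reported by the paper.

```latex
% Assumption: E = \bar{P}\, t, with \bar{P} the average power over the accelerated part.
% S_t = t_{base} / t_{acc}  (time-to-solution improvement)
% S_E = E_{base} / E_{acc}  (energy-to-solution improvement)
\begin{align*}
S_E &= \frac{E_{base}}{E_{acc}}
     = \frac{\bar{P}_{base}\, t_{base}}{\bar{P}_{acc}\, t_{acc}}
     = \frac{\bar{P}_{base}}{\bar{P}_{acc}}\, S_t
\quad\Longrightarrow\quad
\frac{\bar{P}_{acc}}{\bar{P}_{base}} = \frac{S_t}{S_E} \\[4pt]
\text{one GPU:}\quad  & \frac{\bar{P}_{acc}}{\bar{P}_{base}} = \frac{1.77}{1.49} \approx 1.19 \\
\text{two FPGAs:}\quad & \frac{\bar{P}_{acc}}{\bar{P}_{base}} = \frac{1.32}{1.32} = 1.00
\end{align*}
```

Under these assumptions, the GPU configuration would draw roughly 19% more average power than the baseline, which is consistent with the abstract's remark about persistently higher power consumption, while the two-FPGA configuration would draw about the same average power as the baseline.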

Highlights

The recent upsurge in the available amount of health data and the advances in next-generation sequencing are setting the ground for the long-awaited precision medicine
We describe how the OpenCL GPU algorithm for k-mer generation and shuffling was redesigned from scratch to run efficiently on FPGAs using a multi-kernel approach
We present the results of porting and optimizing a k-mer frequency counting workload from GPUs to FPGA boards

Summary

Introduction

The recent upsurge in available health data and the advances in next-generation sequencing are setting the ground for the long-awaited precision medicine. Porting an application from one architecture to another requires substantial refactoring of the offloaded code to adopt device-specific optimizations. Discrete FPGA boards are, as of today, a younger product that is still catching up and that usually offers much slower memory (e.g., DDR4) and a slower half-duplex connection with the host system (e.g., PCIe Gen x8 with a single copy engine). A typical input of a genomics application consists of sequenced DNA samples, usually taking hundreds of GB. Such samples are stored as heavily compressed data and include short sequenced strings of DNA nucleobases called reads. As a consequence of this enormous amount of data, offloading genomics applications to accelerators such as FPGAs and GPUs has become a common trend.
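
To make the workload concrete, the following is a minimal CPU-only sketch of k-mer frequency counting over a handful of reads. It is not the OpenCL implementation the paper describes: the function name `count_kmers`, the sample reads, and the choice of k are all hypothetical, and the sketch ignores details a real pipeline would likely handle, such as compressed input, ambiguous bases, and reverse complements.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across an iterable of reads.

    Hypothetical helper for illustration only; not the paper's algorithm.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers
        # (none at all when n < k, because the range is then empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Two made-up reads, purely for illustration.
reads = ["ACGTAC", "GTACG"]
counts = count_kmers(reads, 3)

for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Each read contributes its k-mers independently of every other read, which is the kind of data parallelism that makes this workload a candidate for offloading to GPUs or FPGAs.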

Methods

Results

Conclusion