Abstract

Extracting discriminative k-mers is an important and challenging problem in DNA sequence analysis, with applications in metagenomics and motif discovery. Despite the availability of multiple computational tools designed for this purpose, most discriminative k-mer discovery methods suffer from long execution times and high memory usage when processing large datasets. This paper presents a novel approach for discriminative k-mer discovery in DNA sequences, which leverages streaming and sketch algorithms to reduce space complexity and expose data parallelism, enabling the use of parallel platforms to accelerate computationally intensive operations. To assess the performance of our method, we designed and implemented two versions of the algorithm that exploit parallelism at different levels: (i) a software version tailored for multithreading and vector instructions in commodity CPUs, and (ii) a custom architecture implemented on a Field-Programmable Gate Array (FPGA) accelerator that exploits fine-grain parallelism and deep pipelining on reconfigurable logic. Experimental results show that, when mining discriminative k-mers from a set of well-known ChIP-seq sequences, our parallel software implementation executes at least 15% faster than exact-counting tools and requires at least five times less memory when processing large datasets. More importantly, the custom FPGA-based accelerator, implemented on a Xilinx KCU1500 board, achieves speedups above 78x over our parallel software implementation on the largest datasets. The accelerator uses less than 3% of the logic resources available on the on-board XCKU115 Kintex UltraScale FPGA, and between 12% and 70% of the memory resources, depending on the size of the dataset.

Highlights

  • Identifying discriminative k-mers is a fundamental operation in applications of DNA sequencing analysis, such as metagenomics for quantification of evolutionary relatedness [1], [2] and discriminative DNA motif discovery [3]–[6]

  • We developed a script that generates tailored register-transfer level (RTL) code according to input parameters such as counter matrix dimensions, counter bit-width, test and control dataset sizes, range of k-mer lengths, and the discrimination thresholds for heavy-hitter detection in the test stage and discriminative k-mer detection in the control stage

  • Our method uses a similar approach to FCMotif [12] and MCES [11], but adds streaming algorithms based on sketches to reduce memory usage and expose fine-grain parallelism to hardware acceleration platforms
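
The parameters listed above (counter matrix dimensions and counter bit-width) determine how much memory a sketch occupies. The following is a rough illustration of that arithmetic, not the authors' generator script. The function names, the example dimensions, and the direct-addressed exact table used for comparison are all assumptions made for this sketch.

```python
def sketch_bytes(rows: int, cols: int, counter_bits: int) -> int:
    """Memory of a rows x cols counter matrix (hypothetical sketch layout)."""
    return rows * cols * counter_bits // 8


def exact_table_bytes(k: int, counter_bits: int) -> int:
    """Memory of a direct-addressed table with one counter per possible k-mer.

    Assumption: a 4-letter DNA alphabet, so there are 4**k possible k-mers.
    Real exact-counting tools typically use hash tables instead.
    """
    return (4 ** k) * counter_bits // 8


if __name__ == "__main__":
    # Example values chosen only for illustration.
    print(sketch_bytes(rows=4, cols=1 << 16, counter_bits=16))  # bytes for the sketch
    print(exact_table_bytes(k=16, counter_bits=16))             # bytes for the exact table
```

With these illustrative values the sketch needs 512 KiB, while the direct-addressed table needs 8 GiB. The sketch size is independent of k, which is why a fixed counter matrix can serve a range of k-mer lengths.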

Summary

INTRODUCTION

A. Saavedra et al.: Mining Discriminative K-Mers in DNA Sequences Using Sketches and Hardware Acceleration

Identifying discriminative k-mers is a fundamental operation in applications of DNA sequencing analysis, such as metagenomics for quantification of evolutionary relatedness [1], [2] and discriminative DNA motif discovery [3]–[6]. [...] insertion, deletion, and mutations, the problem of finding common patterns of unknown length is computationally difficult. Discriminative DNA motif discovery algorithms use discriminative k-mer extraction as a first stage to find the most overrepresented subsequences in the test dataset [3]–[6], [9]–[12], which significantly reduces the data processed in the later stages of the algorithm. Our approach uses a streaming algorithm based on sketches, which exposes different levels of spatial and temporal parallelism and exploits them using multi-core processors, CPU vector extensions, and a Field-Programmable Gate Array (FPGA)-based accelerator. This yields a novel sketch-based method that finds discriminative k-mers and produces the same results as exact frequency-counting algorithms over common datasets used for DNA motif discovery, with much lower memory usage.
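
To make the idea of sketch-based discriminative k-mer mining concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes a Count-Min sketch as the sketch structure, a simple count threshold for heavy-hitter detection in the test stage, and a test-to-control ratio as the discrimination criterion in the control stage. The class and function names, the default dimensions, and both thresholds are hypothetical.

```python
import hashlib


class CountMinSketch:
    """Fixed-size counter matrix; estimates never undercount (assumed sketch type)."""

    def __init__(self, depth: int = 4, width: int = 1 << 16):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        # One salted hash per row selects the column to update or read.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(2, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        for row, col in self._cells(item):
            self.rows[row][col] += 1

    def estimate(self, item: str) -> int:
        return min(self.rows[row][col] for row, col in self._cells(item))


def kmers(seq: str, k: int):
    """Yield every length-k substring of seq, in stream order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def discriminative_kmers(test_seqs, control_seqs, k, min_test_count, min_ratio):
    """Return {kmer: (test_estimate, control_estimate)} for discriminative k-mers."""
    test_sketch, control_sketch = CountMinSketch(), CountMinSketch()
    candidates = set()

    # Test stage: stream the test set and keep heavy hitters as candidates.
    for seq in test_seqs:
        for kmer in kmers(seq, k):
            test_sketch.add(kmer)
            if test_sketch.estimate(kmer) >= min_test_count:
                candidates.add(kmer)

    # Control stage: count only the candidates in the control set.
    for seq in control_seqs:
        for kmer in kmers(seq, k):
            if kmer in candidates:
                control_sketch.add(kmer)

    # Keep candidates that are overrepresented in test relative to control.
    result = {}
    for kmer in candidates:
        t, c = test_sketch.estimate(kmer), control_sketch.estimate(kmer)
        if t >= min_ratio * max(c, 1):
            result[kmer] = (t, c)
    return result


if __name__ == "__main__":
    test = ["ACGTACGTACGT", "TTACGTAA"]
    control = ["ACGTTTACGTTTACGT"]
    print(discriminative_kmers(test, control, k=4, min_test_count=3, min_ratio=2))
```

In this toy run, `ACGT` and `TACG` are frequent in the test set but also appear in the control set, so only `CGTA` passes the ratio filter. Each update touches one counter per row and the rows are independent, which is the kind of fine-grain parallelism that vector instructions or a hardware pipeline could exploit.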

RELATED WORK
PROPOSED APPROACH
CPU ALGORITHM FOR VECTOR INSTRUCTIONS AND MULTITHREADING
CUSTOM HARDWARE ACCELERATOR
EXPERIMENTAL EVALUATION
PERFORMANCE WITH DIFFERENT SKETCHES
DISCUSSION
Findings
CONCLUSIONS
