Abstract

The k-mer processing techniques based on partitioning the data set on disk using minimizer-type seeds have led to a significant reduction in memory requirements; however, they add processes (the search for and distribution of super k-mers) that can be intensive given the large volume of data. This paper presents a massively parallel processing model that enables the efficient use of heterogeneous computation to accelerate the seed-based (minimizer or signature) search for super k-mers. The model includes three main contributions: a new data structure, called CISK, that represents super k-mers and their minimizers in an indexed and compact way, and two massive parallelization patterns, one for obtaining the canonical m-mers of a set of reads and another for searching for super k-mers based on minimizers. The model was implemented as two OpenCL kernels. The evaluation of the kernels shows favorable execution times and memory requirements, making the model suitable for constructing heterogeneous solutions with simultaneous execution (workload distribution) that co-process reads using current super k-mer search methods on the CPU and the methods presented herein on the GPU. The model implementation code is available in the repository: https://github.com/BioinfUD/K-mersCL.
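
To make the first parallelization pattern concrete, the following is a minimal scalar sketch in C of the per-position work it distributes: computing the canonical m-mer (the lexicographically smaller of an m-mer and its reverse complement) at a given offset of a read. The 2-bit base encoding and the comparison order are illustrative assumptions, not the paper's kernel code, which is written in OpenCL.

    #include <stdint.h>

    /* Map a base to a 2-bit code: A=0, C=1, G=2, T=3 (assumed encoding). */
    static inline uint64_t base_code(char b) {
        switch (b) {
            case 'A': return 0; case 'C': return 1;
            case 'G': return 2; default:  return 3; /* 'T' */
        }
    }

    /* Pack the m-mer at read[pos .. pos+m) and its reverse complement into
     * 2-bit-per-base integers (assumes m <= 32) and return the smaller of the
     * two, i.e., the canonical m-mer. */
    uint64_t canonical_mmer(const char *read, int pos, int m) {
        uint64_t fwd = 0, rev = 0;
        for (int i = 0; i < m; ++i) {
            uint64_t c = base_code(read[pos + i]);
            fwd = (fwd << 2) | c;             /* forward strand                */
            rev |= (3ULL - c) << (2 * i);     /* complement in reverse order   */
        }
        return fwd < rev ? fwd : rev;         /* canonical = lexicographic min */
    }

Since each read position can be evaluated independently, one natural mapping on a GPU is one work-item per m-mer position, which is what makes this pattern amenable to massive parallelization.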

Highlights

  • The search for the super k-mers of a genomic read is a task that requires finding the seed of each possible k-mer and comparing the seeds with each other in order to identify contiguous k-mers that share the same minimizer [1]

  • This paper proposes a massively parallel processing model for the search of super k-mers whose memory requirements and execution times make it suitable for developing efficient heterogeneous solutions with simultaneous CPU-GPU execution

  • A processing model was obtained that efficiently parallelizes the search for super k-mers on many-core architectures, using two new parallelization algorithms that maximize operational intensity and a data structure that substantially reduces the memory required to represent the output data (see the sketch after this list)
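
The exact layout of the CISK structure is not detailed here, so the following is only a hypothetical sketch, in C, of a compact and indexed output layout in that spirit: each super k-mer is described by its position and length within the original read instead of being copied, so the output grows with the number of super k-mers rather than with their total length. All names and fields are illustrative assumptions.

    #include <stdint.h>

    /* Hypothetical compact, indexed representation of super k-mers (not the
     * paper's actual CISK layout): entries reference positions in the reads
     * instead of storing the sequences themselves. */
    typedef struct {
        uint32_t read_id;    /* read that contains the super k-mer              */
        uint32_t start;      /* start offset of the super k-mer within the read */
        uint16_t length;     /* length in bases (always >= k)                   */
        uint64_t minimizer;  /* packed canonical minimizer shared by its k-mers */
    } cisk_entry_t;

    typedef struct {
        cisk_entry_t *entries;     /* one entry per super k-mer             */
        uint32_t     *read_index;  /* index of the first entry of each read */
        uint32_t      n_entries;
        uint32_t      n_reads;
    } cisk_t;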

Introduction

The search for the super k-mers of a genomic read is a task that requires finding the seed (canonical minimizer or signature) of each possible k-mer and comparing the seeds with each other in order to identify contiguous k-mers that share the same minimizer [1]. Because reads can be processed independently of one another, the search for super k-mers is highly suitable for acceleration by simultaneous heterogeneous processing: the workload is partitioned and processed simultaneously on the CPU and the GPU(s), through either a static, dynamic [2], or hybrid [3] distribution. For this type of processing to be carried out efficiently, the following challenge must be overcome: when massively parallelized, the search for super k-mers has a very high and unpredictable memory requirement, because the space required depends on the data that are generated rather than on the input data.
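
As a reference for the semantics described above, the following sequential C sketch scans a read, computes the canonical minimizer of each k-mer naively (the minimum canonical m-mer inside the window, repeating the helpers from the sketch after the abstract so the example is self-contained), and closes a super k-mer whenever the minimizer changes. It only illustrates the task; it is not the paper's parallel method, and the encoding and the toy parameters in main are assumptions.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static uint64_t base_code(char b) {
        switch (b) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; default:  return 3; }
    }

    /* Canonical (strand-independent) packed value of the m-mer at read[pos..pos+m). */
    static uint64_t canonical_mmer(const char *read, int pos, int m) {
        uint64_t fwd = 0, rev = 0;
        for (int i = 0; i < m; ++i) {
            uint64_t c = base_code(read[pos + i]);
            fwd = (fwd << 2) | c;
            rev |= (3ULL - c) << (2 * i);
        }
        return fwd < rev ? fwd : rev;
    }

    /* Canonical minimizer of the k-mer starting at pos: the minimum canonical m-mer. */
    static uint64_t kmer_minimizer(const char *read, int pos, int k, int m) {
        uint64_t best = UINT64_MAX;
        for (int i = 0; i + m <= k; ++i) {
            uint64_t v = canonical_mmer(read, pos + i, m);
            if (v < best) best = v;
        }
        return best;
    }

    /* Print the super k-mers of one read: maximal runs of contiguous k-mers
     * that share the same canonical minimizer. */
    static void super_kmers(const char *read, int k, int m) {
        int n = (int)strlen(read);
        if (n < k) return;
        int start = 0;                        /* first base of the current super k-mer */
        uint64_t cur = kmer_minimizer(read, 0, k, m);
        for (int p = 1; p + k <= n; ++p) {
            uint64_t mz = kmer_minimizer(read, p, k, m);
            if (mz != cur) {                  /* minimizer changed: close the run */
                printf("super k-mer [%d, %d) minimizer=%llu\n",
                       start, p - 1 + k, (unsigned long long)cur);
                start = p;
                cur = mz;
            }
        }
        printf("super k-mer [%d, %d) minimizer=%llu\n",
               start, n, (unsigned long long)cur);
    }

    int main(void) {
        super_kmers("ACGTACGTTGCAACGT", 7, 3);   /* toy read, k = 7, m = 3 */
        return 0;
    }

A parallel version cannot know in advance how many super k-mers each read will produce, which is exactly the unpredictable-output problem noted above.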
