Performance Optimization of a Parallel Error Correction Tool

Marco Martínez-Sánchez, Juan Touriño, Roberto R. Expósito

doi:10.3390/engproc2021007034

Abstract

Continuous advances in Next Generation Sequencing (NGS) technologies allow researchers to obtain larger genetic samples in less time, which makes it increasingly important to improve the algorithms that enhance the quality of the generated reads. In this work, we present a Big Data tool implemented on top of the open-source Apache Spark framework that executes validated error-correction algorithms with improved performance. The experimental evaluation, conducted on a multi-core cluster, shows significant reductions in execution time, with a maximum speedup of 9.5 over existing error correction tools when processing an NGS dataset with 25 million reads.

Highlights

In recent years, the development of effective and fast techniques for processing large volumes of genetic data has gained relevance, because progress in biology-related scientific fields relies on these reads.
Significant progress has been made in the Big Data field, where for many years some of the main approaches were based on the MapReduce paradigm [5], a programming model proposed by Google that defines multiple programmable and non-programmable phases to decouple the data transformation logic from the communication and load distribution tasks.
Alternatives such as the Apache Spark framework [6] have been proposed with this goal in mind. They relieve both data scientists and Big Data developers from operating directly on the MapReduce framework and let them work with a higher-level API.
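
The split between programmable and non-programmable phases described above can be illustrated with a minimal, single-process sketch in plain Python. It is not code from the tool presented here: the k-mer counting task, the function names, and the toy reads are all assumptions chosen for illustration. The map and reduce functions stand for the user-programmable phases, while the grouping step stands in for the shuffle that a MapReduce framework would perform on the user's behalf.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k):
    """Programmable map step: emit a (k-mer, 1) pair for every k-mer in a read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_phase(pairs):
    """Non-programmable step: group emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Programmable reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads, k):
    """Chain the three phases over a collection of reads."""
    mapped = chain.from_iterable(map_phase(read, k) for read in reads)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical toy input, not taken from the evaluated datasets.
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, 3)
```

In Spark, a pipeline of this shape is typically written as a short chain of higher-level operations on a distributed collection (for example, `flatMap` followed by `reduceByKey` on an RDD), which is the kind of API the last highlight refers to.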

Summary

Introduction

The development of effective and fast techniques for processing large volumes of genetic data has gained relevance, because progress in biology-related scientific fields relies on these reads. CloudEC [3] is a Big Data tool built upon the Apache Hadoop framework [4] that corrects genetic datasets by running multiple alignment steps over the input samples and replacing the lowest-quality bases among the aligned samples with other representations of higher quality.
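
The idea of replacing low-quality bases using higher-quality aligned reads can be sketched as follows. This is a simplified illustration and not CloudEC's actual algorithm: it assumes the reads are already aligned column by column and have equal length, that qualities are Phred-style integers, and that a fixed threshold (`min_qual`, a hypothetical parameter) separates trusted from untrusted bases.

```python
from collections import defaultdict


def correct_column(bases, quals, min_qual=20):
    """Correct one aligned column.

    Bases with quality below min_qual are replaced by the base that has the
    highest summed quality among the trusted (>= min_qual) bases.
    """
    support = defaultdict(int)
    for base, qual in zip(bases, quals):
        if qual >= min_qual:
            support[base] += qual
    if not support:
        # No trusted evidence in this column: leave it unchanged.
        return list(bases)
    consensus = max(support, key=support.get)
    return [base if qual >= min_qual else consensus
            for base, qual in zip(bases, quals)]


def correct_reads(reads, quals, min_qual=20):
    """Correct equally long reads that are assumed to be already aligned."""
    columns = [correct_column(col_bases, col_quals, min_qual)
               for col_bases, col_quals in zip(zip(*reads), zip(*quals))]
    # Transpose the corrected columns back into reads.
    return ["".join(row) for row in zip(*columns)]


# Hypothetical toy input: the third read has a low-quality 'G' at position 1.
reads = ["ACGT", "ACGT", "AGGT"]
quals = [[30, 30, 30, 30], [30, 32, 30, 30], [30, 5, 30, 30]]
corrected = correct_reads(reads, quals)
```

Here the low-quality base in the third read is replaced by the base supported by the two high-quality reads in the same column. A real corrector also has to compute the alignments themselves, which this sketch takes as given.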

Results

Conclusion