Abstract

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed-memory clusters of commodity hardware, and several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based tool for the detection and quantification of species composition in food samples, has been proposed. It can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can deal with large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version introduced here. Results show that, with 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark when building a metagenomics database. When querying this database, also with 32 processes, the MPI version is 3.11× faster while using 55.56% of the memory used by Spark.
We conclude that the new MetaCache-MPI version is faster than MetaCacheSpark in both building and querying the database and uses less RAM, while keeping the accuracy of the original implementation.
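MetaCache and its distributed descendants are built around minhashing: instead of storing every k-mer of a reference genome, only a small subsample of the smallest k-mer hashes is kept, and query reads are classified by matching their own sketch against the database. The following toy Python sketch illustrates this general idea only; all names, parameter values, and sequences are illustrative and do not reflect MetaCache's actual implementation or API.

```python
# Toy illustration of minhash-based read classification (illustrative only,
# not MetaCache's actual data structures or parameters).
import hashlib

K = 4             # k-mer length (toy value; real tools use k around 16)
SKETCH_SIZE = 32  # toy value large enough that these tiny sequences are
                  # not subsampled; real tools keep only a small fraction

def kmer_hashes(seq, k=K):
    """Hash every k-mer of a sequence to an integer."""
    return [int(hashlib.md5(seq[i:i + k].encode()).hexdigest(), 16)
            for i in range(len(seq) - k + 1)]

def minhash_sketch(seq, s=SKETCH_SIZE):
    """Keep the s smallest k-mer hashes as the sequence's sketch."""
    return set(sorted(kmer_hashes(seq))[:s])

def build_database(references):
    """Map each sketch feature to the reference species containing it."""
    db = {}
    for species, genome in references.items():
        for feature in minhash_sketch(genome):
            db.setdefault(feature, set()).add(species)
    return db

def classify(read, db):
    """Vote: the species hit by the most sketch features of the read wins."""
    votes = {}
    for feature in minhash_sketch(read):
        for species in db.get(feature, ()):
            votes[species] = votes.get(species, 0) + 1
    return max(votes, key=votes.get) if votes else None

# Two toy "reference genomes" and one read taken from the first of them.
refs = {"speciesA": "ACGTACGTGGCCTTAACGT", "speciesB": "TTGGCCAATTGGCCAATT"}
db = build_database(refs)
print(classify("ACGTACGTGG", db))  # → speciesA
```

In a distributed setting, both the Spark and the MPI version essentially partition this feature-to-reference hash table across the memory of many nodes, which is what allows large eukaryotic reference collections to fit in aggregate cluster RAM.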

Highlights

  • Continuous advances in next-generation sequencing (NGS) technologies have led to a constant production of huge amounts of genomic data

  • The results of the original MetaCache version are closest to the expected data, but the Message Passing Interface (MPI) version produces nearly identical results, with only very small differences

  • Results obtained with the MPI version are almost equivalent to those of the sequential version, while memory consumption and execution time are both lower with MetaCache-MPI


Introduction

Continuous advances in next-generation sequencing (NGS) technologies have led to a constant production of huge amounts of genomic data. Exascale computing refers to supercomputers capable of executing 10^18 floating point operations per second (FLOPS), i.e., one exaFLOPS. To reach this performance, future supercomputers require data delivery to be fast and efficient, both from memory and disk, and across the network and between processors. Developers will need exascale Application Programming Interfaces (APIs) to facilitate the exploitation of exceptional amounts of parallelism in applications, to enable the processing of significant amounts of data, and to support different architectures, including those based on heterogeneous cores or accelerators. These APIs and their implementations will need to carefully manage the different kinds of memory within each node. Exascale software systems will also need to ensure that jobs continue to run despite the occurrence of system failures and other kinds of hardware or software errors.

