Abstract

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed-memory clusters of commodity hardware, and several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based tool for the detection and quantification of species composition in food samples, has been proposed. It can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can deal with large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version introduced here. Results show that, with 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark when building a metagenomics database. When querying this database, also with 32 processes, the MPI version is 3.11× faster while using 55.56% of the memory used by Spark.
We conclude that the new MetaCache-MPI version is faster than MetaCacheSpark in both building and querying the database and uses less RAM, while keeping the accuracy of the original implementation.
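MetaCache and its distributed descendants are built around minhashing: instead of storing every k-mer of a reference genome, only a small subsample of the smallest k-mer hashes is kept, and query reads are classified by matching their own sketch against the database. The following toy Python sketch illustrates this general idea only; all names, parameter values, and sequences are illustrative and do not reflect MetaCache's actual implementation or API.

```python
# Toy illustration of minhash-based read classification (illustrative only,
# not MetaCache's actual data structures or parameters).
import hashlib

K = 4             # k-mer length (toy value; real tools use k around 16)
SKETCH_SIZE = 32  # toy value large enough that these tiny sequences are
                  # not subsampled; real tools keep only a small fraction

def kmer_hashes(seq, k=K):
    """Hash every k-mer of a sequence to an integer."""
    return [int(hashlib.md5(seq[i:i + k].encode()).hexdigest(), 16)
            for i in range(len(seq) - k + 1)]

def minhash_sketch(seq, s=SKETCH_SIZE):
    """Keep the s smallest k-mer hashes as the sequence's sketch."""
    return set(sorted(kmer_hashes(seq))[:s])

def build_database(references):
    """Map each sketch feature to the reference species containing it."""
    db = {}
    for species, genome in references.items():
        for feature in minhash_sketch(genome):
            db.setdefault(feature, set()).add(species)
    return db

def classify(read, db):
    """Vote: the species hit by the most sketch features of the read wins."""
    votes = {}
    for feature in minhash_sketch(read):
        for species in db.get(feature, ()):
            votes[species] = votes.get(species, 0) + 1
    return max(votes, key=votes.get) if votes else None

# Two toy "reference genomes" and one read taken from the first of them.
refs = {"speciesA": "ACGTACGTGGCCTTAACGT", "speciesB": "TTGGCCAATTGGCCAATT"}
db = build_database(refs)
print(classify("ACGTACGTGG", db))  # → speciesA
```

In a distributed setting, both the Spark and the MPI version essentially partition this feature-to-reference hash table across the memory of many nodes, which is what allows large eukaryotic reference collections to fit in aggregate cluster RAM.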

Highlights

  • Continuous advances in next-generation sequencing (NGS) technologies have led to a constant production of huge amounts of genomic data

  • The results of the original MetaCache version are closest to the expected data, but the Message Passing Interface (MPI) version produces nearly identical results, with only very small differences

  • Results obtained with the MPI version are almost equivalent to those of the sequential version, while memory consumption and execution time are both lower with MetaCache-MPI


Introduction

Continuous advances in next-generation sequencing (NGS) technologies have led to a constant production of huge amounts of genomic data. Exascale computing refers to supercomputers capable of executing 10^18 floating point operations per second (FLOPS), i.e., one exaFLOPS. To reach this performance, future supercomputers require data delivery to be fast and efficient, both from memory and disk, and across the network and between processors. Developers will need exascale Application Programming Interfaces (APIs) to facilitate the exploitation of exceptional amounts of parallelism in applications, to enable the processing of significant amounts of data, and to support different architectures, including those based on heterogeneous cores or accelerators. These APIs and their implementations will need to carefully manage the different kinds of memory within each node. Exascale software systems will also need to ensure that jobs continue to run despite the occurrence of system failures and other kinds of hardware or software errors.

