Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY

Silvina Caino-Lores, Tom Peterka, Jesus Carretero, Bogdan Nicolae, Orcun Yildiz

doi:10.1109/access.2019.2949836

Open Access

https://doi.org/10.1109/access.2019.2949836

Journal: IEEE Access
Publication Date: Jan 1, 2019
Citations: 128
License type: CC BY 4.0

Affiliation: Argonne National Laboratory

Abstract

Convergence between high-performance computing (HPC) and big data analytics (BDA) is now an established research area that has opened new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables interoperability between established BDA and HPC execution models. The model reflects the key design features of interest to both communities and includes an abstract data collection and operational model that provides a unified interface for hybrid applications. The architecture can be implemented in different ways, depending on the process- and data-centric platforms chosen and the mechanisms used to meet its requirements. The paper introduces the Spark-DIY platform as a prototype implementation of the proposed architecture. Spark-DIY preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. The performance of Spark-DIY is then analyzed on a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications that integrate HPC simulations within a BDA environment.

Highlights

Convergence between high-performance computing (HPC) and big data analytics (BDA) is an established research area that has spawned new research topics such as data-intensive scientific computing, high-performance data analytics, and hybrid platforms and infrastructures based on virtualization techniques and novel storage hierarchies.
We summarize our contributions as follows: 1) A definition of a generic unified distributed data abstraction (UDDA) and its associated unified operational model (UOM), which sets the foundation of a theoretical framework for the analysis and definition of composite HPC-BDA applications.
The rest of this paper is organized as follows: Sections II and III introduce the BDA and HPC ecosystems, respectively, and discuss their current state; Section IV presents relevant works related to the HPC-BDA convergence problem; Section V analyzes the challenges and opportunities of the convergence of such paradigms; Section VI details the proposal of an abstract architecture suitable for the interoperation of process- and data-centric platforms, which is later implemented in Section VII, using Apache Spark and a communication library built on Message Passing Interface (MPI), and evaluated in Section VIII on a real use case from the hydrogeology domain; and Section IX summarizes this work, its applications, and directions for future research.

Summary

INTRODUCTION

Convergence between high-performance computing (HPC) and big data analytics (BDA) is an established research area that has spawned new research topics such as data-intensive scientific computing, high-performance data analytics, and hybrid platforms and infrastructures based on virtualization techniques and novel storage hierarchies. The rest of this paper is organized as follows: Sections II and III introduce the BDA and HPC ecosystems, respectively, and discuss their current state; Section IV presents relevant works related to the HPC-BDA convergence problem; Section V analyzes the challenges and opportunities of the convergence of such paradigms; Section VI details the proposal of an abstract architecture suitable for the interoperation of process- and data-centric platforms, which is later implemented in Section VII, using Apache Spark and a communication library built on MPI, and evaluated in Section VIII on a real use case from the hydrogeology domain; and Section IX summarizes this work, its applications, and directions for future research.

BIG DATA ANALYTICS ECOSYSTEM

CURRENT TRENDS IN HPC AND BDA CONVERGENCE

CONVERGENCE CHALLENGES AND OPPORTUNITIES

IMPLEMENTATION OF THE ARCHITECTURE

USE CASE

Findings

CONCLUSION

Similar Papers

A review of sector-specific big data analytics models
Roimah Dollah ... Hazleen Aris
01 Nov 2017

Approaches of enhancing interoperations among high performance computing and big data analytics via augmentation
Ajeet Ram Pathak ... Siddharth S Rautaray
Cluster Computing | VOL. 23
03 Aug 2019

HPC and the Big Data challenge
Violeta Holmes ... Matthew Newall
Safety and Reliability | VOL. 36
02 Jul 2016

Contemporary High-Performance Computing for Big Data Applications
S Ayyasamy
Journal of Information Technology and Digital World | VOL. 5
01 Dec 2023
