Abstract

Data-intensive applications are becoming commonplace in all science disciplines. They comprise a rich set of sub-domains such as data engineering, deep learning, and machine learning, and they are built around efficient data abstractions and operators suited to the applications of each domain. The lack of a clear definition of data structures and operators in the field has often led to implementations that do not work well together. The HPTMT architecture that we proposed recently identifies a set of data structures, operators, and an execution model for creating rich data applications that link all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application in which deep learning and data engineering parts work together. Our analysis shows that the proposed system architecture is better suited to high-performance computing environments than current big data processing systems. Furthermore, the proposed system emphasizes the importance of efficient, compact data structures such as the Apache Arrow tabular data representation, which is designed for high performance. The system integration we propose thus scales a sequential computation to a distributed computation while retaining optimal performance and a highly usable application programming interface.
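As a rough illustration of the compact columnar data structures the abstract refers to, the sketch below builds an Apache Arrow table and applies a simple columnar operator through Arrow's Python API (pyarrow). The column names and values are invented for illustration; the paper's own pipeline applies Cylon's distributed operators on top of the same Arrow representation.

    import pyarrow as pa
    import pyarrow.compute as pc

    # An Arrow table stores each column in contiguous, typed buffers, which is
    # what makes it compact and efficient to pass between operators and
    # processes without copying or re-serializing.
    table = pa.table({
        "id": [1, 2, 3, 4],
        "value": [10.0, 20.5, 30.25, 40.0],
    })

    # A simple columnar operator: keep only rows where "value" is above 15.
    mask = pc.greater(table["value"], 15.0)
    filtered = table.filter(mask)
    print(filtered.num_rows)  # 3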

Highlights

  • Data engineering and data science are two major branches of data-intensive applications

  • Cylon and deep learning frameworks can work together according to the HPTMT architecture

  • We proposed the HPTMT architecture, which defines an operator and execution model for scaling data-intensive applications

Summary

INTRODUCTION

Data engineering and data science are two major branches of data-intensive applications. Frameworks such as Dask and Cylon further enhance the ability to run such computations in parallel to support computation-intensive jobs. On top of these computation systems, frameworks such as PyTorch and TensorFlow allow running complex mathematical models based on machine learning or deep learning algorithms. This paper showcases the importance of the HPTMT architecture through an application that uses various data abstractions in a single distributed environment to compose a rich application. It highlights the scalability of the architecture and its applicability to high-performance computing systems.
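To make the composition described above concrete, the following sketch shows, under assumed and simplified inputs, how the output of a data-engineering step can be handed to a deep learning framework. A local pandas join stands in for the distributed table operators that Cylon or Dask would provide; the tensor conversion and model are plain PyTorch.

    import pandas as pd
    import torch

    # Data-engineering step: join two tables to assemble features and labels.
    # In the paper's setting this would be a distributed Cylon/Dask operation;
    # a local pandas merge stands in for it here.
    left = pd.DataFrame({"id": [1, 2, 3], "feature": [0.1, 0.2, 0.3]})
    right = pd.DataFrame({"id": [1, 2, 3], "label": [0.0, 1.0, 1.0]})
    joined = left.merge(right, on="id")

    # Deep-learning step: convert the engineered columns into tensors and
    # compute a loss and its gradients with a tiny PyTorch model.
    x = torch.tensor(joined["feature"].to_numpy(), dtype=torch.float32).unsqueeze(1)
    y = torch.tensor(joined["label"].to_numpy(), dtype=torch.float32).unsqueeze(1)
    model = torch.nn.Linear(1, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()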

HPTMT ARCHITECTURE
Principles
Operators
Distributed Execution
HPTMT FRAMEWORKS
Deep Learning Frameworks
Deep Learning and Data Engineering
UNOMT APPLICATION
Background
Deep Learning Component
Data Engineering Component
PERFORMANCE EVALUATION
Sequential Execution Performance
Distributed Execution Performance
Deep Learning Execution
RELATED WORK
LIMITATIONS AND FUTURE
CONCLUSION
Findings
DATA AVAILABILITY STATEMENT