Abstract

Data-intensive applications are becoming commonplace in all science disciplines. They comprise a rich set of sub-domains such as data engineering, deep learning, and machine learning, and they are built around efficient data abstractions and operators suited to the applications of each domain. The lack of a clear definition of data structures and operators in the field has often led to implementations that do not work well together. The HPTMT architecture that we proposed recently identifies a set of data structures, operators, and an execution model for creating rich data applications that link all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application in which deep learning and data engineering parts work together. Our analysis shows that the proposed system architecture is better suited to high-performance computing environments than current big data processing systems. Furthermore, the proposed system emphasizes the importance of efficient, compact data structures such as the Apache Arrow tabular data representation, which is designed for high performance. The system integration we propose thus scales a sequential computation to a distributed computation while retaining optimal performance and a highly usable application programming interface.
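As a rough illustration of the compact columnar data structures the abstract refers to, the sketch below builds an Apache Arrow table and applies a simple columnar operator through Arrow's Python API (pyarrow). The column names and values are invented for illustration; the paper's own pipeline applies Cylon's distributed operators on top of the same Arrow representation.

    import pyarrow as pa
    import pyarrow.compute as pc

    # An Arrow table stores each column in contiguous, typed buffers, which is
    # what makes it compact and efficient to pass between operators and
    # processes without copying or re-serializing.
    table = pa.table({
        "id": [1, 2, 3, 4],
        "value": [10.0, 20.5, 30.25, 40.0],
    })

    # A simple columnar operator: keep only rows where "value" is above 15.
    mask = pc.greater(table["value"], 15.0)
    filtered = table.filter(mask)
    print(filtered.num_rows)  # 3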

Highlights

  • Data engineering and data science are two major branches of data-intensive applications

  • Cylon and deep learning frameworks can work together according to the HPTMT architecture

  • We proposed the HPTMT architecture, which defines an operator and execution model for scaling data-intensive applications

Summary

INTRODUCTION

Data engineering and data science are two major branches of data-intensive applications. Frameworks such as Dask and Cylon further enhance the ability to run such computations in parallel to support computation-intensive jobs. On top of these computation systems, frameworks such as PyTorch and TensorFlow allow running complex mathematical models based on machine learning or deep learning algorithms. This paper showcases the importance of the HPTMT architecture through an application that uses various data abstractions in a single distributed environment to compose a rich application. It highlights the scalability of the architecture and its applicability to high-performance computing systems.
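To make the composition described above concrete, the following sketch shows, under assumed and simplified inputs, how the output of a data-engineering step can be handed to a deep learning framework. A local pandas join stands in for the distributed table operators that Cylon or Dask would provide; the tensor conversion and model are plain PyTorch.

    import pandas as pd
    import torch

    # Data-engineering step: join two tables to assemble features and labels.
    # In the paper's setting this would be a distributed Cylon/Dask operation;
    # a local pandas merge stands in for it here.
    left = pd.DataFrame({"id": [1, 2, 3], "feature": [0.1, 0.2, 0.3]})
    right = pd.DataFrame({"id": [1, 2, 3], "label": [0.0, 1.0, 1.0]})
    joined = left.merge(right, on="id")

    # Deep-learning step: convert the engineered columns into tensors and
    # compute a loss and its gradients with a tiny PyTorch model.
    x = torch.tensor(joined["feature"].to_numpy(), dtype=torch.float32).unsqueeze(1)
    y = torch.tensor(joined["label"].to_numpy(), dtype=torch.float32).unsqueeze(1)
    model = torch.nn.Linear(1, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()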

HPTMT ARCHITECTURE
Principles
Operators
Distributed Execution
HPTMT FRAMEWORKS
Deep Learning Frameworks
Deep Learning and Data Engineering
UNOMT APPLICATION
Background
Deep Learning Component
Data Engineering Component
PERFORMANCE EVALUATION
Sequential Execution Performance
Distributed Execution Performance
Deep Learning Execution
RELATED WORK
LIMITATIONS AND FUTURE
CONCLUSION
Findings
DATA AVAILABILITY STATEMENT