Abstract

Parallel computing is an essential tool for solving large-scale, computationally demanding problems. Because the currently available parallel processing techniques and paradigms are so diverse and heterogeneous, it is usually difficult to find a solution that performs well on every performance metric. Apache Spark, one of the recent developments in parallel computing, can process petabyte-scale data and offers fault tolerance, scalability, load balancing and in-memory computation across the nodes of a cluster. All of these features are attractive for high-performance scientific computing. It has been shown that Apache Spark outperforms Hadoop implementations of some machine learning algorithms by orders of magnitude. Since the Hadoop platform is not well suited to iterative computation, which is typical of many computational problems, this study investigates the performance characteristics of Apache Spark on scientific computing problems, in particular on solving the Dirichlet problem for Poisson's equation. An algorithm for solving the Dirichlet problem for Poisson's equation is described, analyzed and compared with optimized Hadoop-based implementations. Apache Spark uses a new distributed data structure called the Resilient Distributed Dataset (RDD). The presented algorithm consists of RDD operations such as mapping, grouping and partitioning. The benefits and drawbacks of the algorithm, as well as its applicability to stencil-type computations, are discussed.

Highlights

  • Many modern large-scale, computationally intensive problems require highly efficient and well-designed approaches

  • We present an iterative Apache Spark solution to the Dirichlet problem for Poisson’s equation on a three-dimensional computational domain, which allows efficient iterative execution and caches locally kept chunks of data

  • The results showed that the MPI/OpenMP approach is still more than 10 times faster in terms of running time. One should note, however, that Spark has the advantage of caching, which the authors did not mention in their paper

Summary

Introduction

Many modern large-scale, computationally intensive problems require highly efficient and well-designed approaches. Apache Spark is a framework for large-scale data processing whose main features include the following (Zaharia et al., 2010). Data abstractions called Resilient Distributed Datasets (RDDs) allow bulk operations to be performed on the data in parallel and intermediate results to be cached in memory. The data locality property means that each task should operate on those RDD partitions that reside in its own local memory or that can be fetched from other nodes with minimal network load and computational cost. Hadoop performs poorly on iterative tasks, since each iteration loses the job's execution context and the data needed for the next iteration must be loaded into memory again from HDFS. In this work we present an iterative Apache Spark solution to the Dirichlet problem for Poisson’s equation on a three-dimensional computational domain, which allows efficient iterative execution and caches locally kept chunks of data.
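
To make the iterative structure concrete, the following is a minimal sketch of a Jacobi iteration for the Dirichlet problem on the unit cube, written as a "map" phase (each grid point sends its value to its six neighbours) followed by a "group by key" phase (each interior point sums what it received). It is not the algorithm from the paper: the Jacobi scheme, the uniform grid, the unit cube and all names (jacobi_step, solve, OFFSETS) are assumptions made for illustration, and a plain Python dictionary stands in for an RDD so that the example runs without a cluster. In Spark the two phases would roughly correspond to RDD transformations such as flatMap and reduceByKey, with the grid cached between iterations.

```python
from collections import defaultdict
from itertools import product

# Offsets of the six face neighbours in the 7-point stencil.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def is_interior(p, n):
    """True if grid index p = (i, j, k) is not on the boundary of an n^3 grid."""
    return all(0 < c < n - 1 for c in p)


def jacobi_step(u, f, n, h):
    """One Jacobi sweep for laplace(u) = f, phrased as map + group-by-key."""
    sums = defaultdict(float)
    # "Map" phase: every point emits (neighbour, value) pairs.
    # "Group" phase: values with the same key are summed.
    for (i, j, k), value in u.items():
        for di, dj, dk in OFFSETS:
            target = (i + di, j + dj, k + dk)
            if is_interior(target, n):
                sums[target] += value
    # Boundary points keep their Dirichlet values; interior points are updated.
    new_u = dict(u)
    for p, s in sums.items():
        new_u[p] = (s - h * h * f[p]) / 6.0
    return new_u


def solve(n, rhs, boundary, iterations):
    """Run a fixed number of Jacobi sweeps on an n^3 grid over the unit cube."""
    h = 1.0 / (n - 1)
    u, f = {}, {}
    for p in product(range(n), repeat=3):
        x, y, z = (c * h for c in p)
        f[p] = rhs(x, y, z)
        # Interior starts at zero; boundary is fixed by the Dirichlet data.
        u[p] = 0.0 if is_interior(p, n) else boundary(x, y, z)
    for _ in range(iterations):
        u = jacobi_step(u, f, n, h)
    return u, h


# Usage: u = x^2 + y^2 + z^2 satisfies laplace(u) = 6, so it can serve as a check.
def exact(x, y, z):
    return x * x + y * y + z * z


u, h = solve(5, lambda x, y, z: 6.0, exact, 100)
max_error = max(abs(u[p] - exact(*(c * h for c in p))) for p in u)
print(max_error)
```

Each sweep needs only the values from the previous sweep, which is why keeping the grid in memory between iterations matters: on a platform that reloads its input from disk for every iteration, that reload cost is paid once per sweep.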

