Comparison of sort algorithms in Hadoop and PCJ

Marek Nowicki

doi:10.1186/s40537-020-00376-9

Open Access

https://doi.org/10.1186/s40537-020-00376-9

Journal: Journal of Big Data	Publication Date: Nov 16, 2020
Citations: 4	License type: open-access

Affiliation: Nicolaus Copernicus University

Abstract

Sorting algorithms are among the most commonly used algorithms in computer science and modern software. An efficient sorting implementation is necessary for a wide spectrum of scientific applications. This paper describes a sorting algorithm written using the partitioned global address space (PGAS) model and implemented with the Parallel Computing in Java (PCJ) library. The implementation is described iteratively to outline possible performance issues and provide means to resolve them. The key idea of the implementation is to provide an efficient building block that can be easily integrated into many application codes. This paper also compares the performance of the PCJ implementation with the MapReduce approach, using the Apache Hadoop TeraSort implementation. The comparison shows that the PCJ implementation achieves efficiency similar to that of the Hadoop implementation.

Highlights

Sorting is one of the most fundamental algorithmic problems found in a wide range of fields
The implementation using the Parallel Computing in Java (PCJ) library is presented iteratively, showing the possible performance problems and the ways to overcome them
The comparison of TeraSort implementations indicates that PCJ performance is similar to Hadoop on a properly configured cluster and even more efficient on clusters with configuration drawbacks

Summary

Introduction

Sorting is one of the most fundamental algorithmic problems found in a wide range of fields. The basic metrics for data analysis include minimum, maximum, median, and top-K values. It is easy to write simple O(n) algorithms that calculate the first two of those metrics without sorting, but finding the median value and its variants requires more work. Hoare’s quickselect algorithm [1] can be used for finding the median of unsorted data, but its worst-case time complexity is O(n^2). Some algorithms, like binary search, require data to be sorted before execution. Existing O(n) sorting algorithms, adapted for parallel execution, like count sort [2, 3] or radix sort [4, 5], require specific input data structure, which limits their application to more general cases.
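The contrast between the single-pass metrics and the median can be illustrated with a minimal Java sketch. It is not taken from the paper and does not use PCJ; the class name SelectionSketch and its methods are hypothetical names chosen for illustration, and the input is assumed to be a non-empty int array. The partition step uses a random pivot (Lomuto scheme), which gives expected O(n) time, although the O(n^2) worst case remains possible.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only: hypothetical helper class, not the paper's code.
final class SelectionSketch {

    /** Returns {min, max} of a non-empty array in one O(n) pass, without sorting. */
    static int[] minMax(int[] data) {
        int min = data[0];
        int max = data[0];
        for (int value : data) {
            if (value < min) min = value;
            if (value > max) max = value;
        }
        return new int[] {min, max};
    }

    /** Returns the k-th smallest element (0-based) using quickselect on a copy. */
    static int select(int[] data, int k) {
        int[] a = data.clone();            // leave the caller's array untouched
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);  // a[p] is now at its sorted position
            if (k == p) return a[p];
            if (k < p) hi = p - 1; else lo = p + 1;
        }
        return a[lo];
    }

    /** Lower median: the element at index (n - 1) / 2 of the sorted order. */
    static int lowerMedian(int[] data) {
        return select(data, (data.length - 1) / 2);
    }

    /** Lomuto partition around a random pivot; returns the pivot's final index. */
    private static int partition(int[] a, int lo, int hi) {
        int pivotIndex = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, pivotIndex, hi);
        int pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

For example, for the array {7, 2, 9, 4, 1, 8, 3}, minMax yields 1 and 9 and lowerMedian yields 4. Only the side of the partition that contains the requested index is processed further, which is why the expected cost stays linear rather than O(n log n).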
