Abstract
The Message Passing Interface (MPI) has been the de facto programming model for scientific parallel applications. However, data-driven applications with irregular communication patterns are harder to implement using MPI. Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. PGAS languages such as UPC are growing in popularity because they provide a shared-memory programming model over distributed-memory machines. However, since UPC is an emerging standard, it is unlikely that entire applications will be rewritten in it. Instead, unified communication runtimes have paved the way for a new class of hybrid applications that can leverage the benefits of both MPI and PGAS models. Such unified runtimes must be designed in a high-performance, scalable manner to improve the performance of emerging hybrid applications. Collective communication primitives offer a flexible, portable way to implement group communication operations and are supported in both MPI and PGAS programming models. Owing to these advantages, they are widely used across scientific parallel applications. Over the years, MPI libraries have relied on aggressive software-based, hardware-based, and kernel-assisted optimizations to deliver low communication latency for various collective operations. However, there is much room for improvement in the collective operations of state-of-the-art, open-source implementations of UPC. In this paper, we address the challenges associated with improving the performance of collective primitives in UPC. Further, we explore design alternatives that enable UPC collective primitives to directly leverage the designs available in the MVAPICH2 MPI library. Our experimental evaluations show that our designs improve the performance of the UPC broadcast and all-gather operations by 25X and 18X, respectively, for a 128KB message at 2,048 processes. Our designs also improve the performance of the UPC 2D-Heat kernel by up to 2X at 2,048 processes, and the NAS-FT benchmark by 12% at 256 processes.
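For readers unfamiliar with UPC collectives, the following minimal sketch (not taken from the paper) shows the kind of broadcast call whose performance is reported above. It assumes only the standard UPC collectives API in <upc_collective.h>; the array size NELEMS and the verification output are illustrative.

/*
 * Minimal UPC broadcast example: thread 0's block of src is copied
 * into every thread's block of dst by upc_all_broadcast.
 */
#include <upc.h>
#include <upc_collective.h>
#include <stdio.h>

#define NELEMS 1024   /* ints per thread block; illustrative size only */

/* One block of NELEMS ints with affinity to each thread. */
shared [NELEMS] int src[NELEMS * THREADS];
shared [NELEMS] int dst[NELEMS * THREADS];

int main(void)
{
    /* Thread 0 fills the block that will be broadcast. */
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = i;

    /* Broadcast thread 0's block of src into every thread's block of dst.
     * The ALLSYNC flags make the call synchronize on entry and exit, so no
     * explicit upc_barrier is needed around it. */
    upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    /* Each thread inspects one element of its own block of dst. */
    printf("thread %d of %d: dst[%d] = %d\n",
           MYTHREAD, THREADS, MYTHREAD * NELEMS + 5,
           (int) dst[MYTHREAD * NELEMS + 5]);

    return 0;
}

In the unified-runtime approach described in the abstract, a call such as this can be serviced internally by the tuned collective engine of the MPI library (for example, the broadcast designs in MVAPICH2) rather than by a point-to-point schedule inside the UPC runtime; the exact mapping is part of the paper's design and is not shown here.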