Abstract
The Message Passing Interface (MPI) has been the de facto programming model for scientific parallel applications. However, data-driven applications with irregular communication patterns are harder to implement using MPI. Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. PGAS languages such as UPC are growing in popularity because they provide a shared-memory programming model over distributed-memory machines. However, since UPC is an emerging standard, it is unlikely that entire applications will be rewritten in it. Instead, unified communication runtimes have paved the way for a new class of hybrid applications that can leverage the benefits of both MPI and PGAS models. Such unified runtimes must be designed in a high-performance, scalable manner to improve the performance of emerging hybrid applications. Collective communication primitives offer a flexible, portable way to implement group communication operations and are supported in both MPI and PGAS programming models. Owing to these advantages, they are widely used across scientific parallel applications. Over the years, MPI libraries have relied on aggressive software-based, hardware-based, and kernel-assisted optimizations to deliver low communication latency for various collective operations. However, there is much room for improvement in the collective operations of state-of-the-art, open-source implementations of UPC. In this paper, we address the challenges associated with improving the performance of collective primitives in UPC. Further, we explore design alternatives that enable UPC collective primitives to directly leverage the designs available in the MVAPICH2 MPI library. Our experimental evaluations show that our designs improve the performance of the UPC broadcast and all-gather operations by 25X and 18X, respectively, for a 128KB message at 2,048 processes. Our designs also improve the performance of the UPC 2D-Heat kernel by up to 2X at 2,048 processes, and the NAS-FT benchmark by 12% at 256 processes.
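For readers unfamiliar with UPC collectives, the following minimal sketch (not taken from the paper) shows the kind of broadcast call whose performance is reported above. It assumes only the standard UPC collectives API in <upc_collective.h>; the array size NELEMS and the verification output are illustrative.

/*
 * Minimal UPC broadcast example: thread 0's block of src is copied
 * into every thread's block of dst by upc_all_broadcast.
 */
#include <upc.h>
#include <upc_collective.h>
#include <stdio.h>

#define NELEMS 1024   /* ints per thread block; illustrative size only */

/* One block of NELEMS ints with affinity to each thread. */
shared [NELEMS] int src[NELEMS * THREADS];
shared [NELEMS] int dst[NELEMS * THREADS];

int main(void)
{
    /* Thread 0 fills the block that will be broadcast. */
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = i;

    /* Broadcast thread 0's block of src into every thread's block of dst.
     * The ALLSYNC flags make the call synchronize on entry and exit, so no
     * explicit upc_barrier is needed around it. */
    upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    /* Each thread inspects one element of its own block of dst. */
    printf("thread %d of %d: dst[%d] = %d\n",
           MYTHREAD, THREADS, MYTHREAD * NELEMS + 5,
           (int) dst[MYTHREAD * NELEMS + 5]);

    return 0;
}

In the unified-runtime approach described in the abstract, a call such as this can be serviced internally by the tuned collective engine of the MPI library (for example, the broadcast designs in MVAPICH2) rather than by a point-to-point schedule inside the UPC runtime; the exact mapping is part of the paper's design and is not shown here.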