Optimizing UPC Programs for Multi-Core Systems

Yili Zheng

doi:10.1155/2010/646829

Abstract

The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems to get good performance. First, we describe several UPC program optimization techniques that are important to achieving good performance on NUMA multi-core computers with examples and quantitative performance results. Second, we use two numerical computing kernels, parallel matrix–matrix multiplication and parallel 3-D FFT, to demonstrate the end-to-end development and optimization for UPC applications. Our results show that the optimized UPC programs achieve very good and scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases.

Highlights

Multi-core processors have become mainstream: they are in almost all types of computing devices from commodity laptops to customized supercomputers
The Unified Parallel C (UPC) with FFTW implementation uses FFTW only for the local 1-D Fast Fourier Transform (FFT), so FFTW only has to search for the best 1-D FFT solution
The scalability of the UPC with FFTW implementation gives it the best performance when running on 32 cores

Summary

Introduction

Multi-core processors have become mainstream: they are in almost all types of computing devices, from commodity laptops to customized supercomputers. OpenMP provides compiler directives to parallelize for loops, but the speedups may be dismal if the data distribution and access locality are not optimized. There are several research compiler infrastructures that support UPC, such as ROSE [14] and OpenUH [13]. Both IBM UPC [1] and Berkeley UPC [11] have demonstrated performance scalability on tens of thousands of cores. Though UPC and other PGAS languages were initially focused on large-scale distributed-memory machines, they are a good fit for emerging multi-core systems because the data partitioning capability of PGAS programming models helps users manage data locality efficiently and achieve high performance.

Optimization techniques

Casting shared pointers to local pointers

Selecting memory consistency model

Managing data affinity for NUMA systems

Case studies

Combining UPC with other programming models

Summary

Journal: Scientific Programming
Publication Date: Jan 1, 2010
Citations: 14
License type: CC BY 3.0
