Abstract

The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems to get good performance. First, we describe, with examples and quantitative performance results, several UPC program optimization techniques that are important for achieving good performance on NUMA multi-core computers. Second, we use two numerical computing kernels, parallel matrix–matrix multiplication and parallel 3-D FFT, to demonstrate the end-to-end development and optimization of UPC applications. Our results show that the optimized UPC programs achieve very good and scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases.

Highlights

  • Multi-core processors have become mainstream: they are in almost all types of computing devices from commodity laptops to customized supercomputers

  • The Unified Parallel C (UPC) with FFTW implementation uses FFTW only for local 1-D Fast Fourier Transforms (FFTs), and FFTW searches only for the best 1-D FFT solution

  • The scalability of the UPC with FFTW implementation gives it the best performance when running on 32 cores

Summary

Introduction

Multi-core processors have become mainstream: they are in almost all types of computing devices, from commodity laptops to customized supercomputers. OpenMP provides compiler directives to parallelize for loops, but the speedups may be dismal if the data distribution and access locality are not optimized. There are several research compiler infrastructures that support UPC, such as ROSE [14] and OpenUH [13]. Both IBM UPC [1] and Berkeley UPC [11] have demonstrated performance scalability on tens of thousands of cores. Though UPC and other PGAS languages initially focused on large-scale distributed-memory machines, they are a good fit for emerging multi-core systems because the data partitioning capability of PGAS programming models helps users manage data locality efficiently and achieve high performance.

Optimization techniques
Casting shared pointers to local pointers
Selecting memory consistency model
Managing data affinity for NUMA systems
Case studies
Combining UPC with other programming models
Summary