Abstract
The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems to get good performance. First, we describe, with examples and quantitative performance results, several UPC program optimization techniques that are important for achieving good performance on NUMA multi-core computers. Second, we use two numerical computing kernels, parallel matrix–matrix multiplication and parallel 3-D FFT, to demonstrate the end-to-end development and optimization of UPC applications. Our results show that the optimized UPC programs achieve very good and scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases.
Highlights
Multi-core processors have become mainstream: they are in almost all types of computing devices from commodity laptops to customized supercomputers
The Unified Parallel C (UPC) with FFTW implementation uses FFTW only for the local 1-D Fast Fourier Transform (FFT), and FFTW searches only for the best 1-D FFT solution
The scalability of the UPC with FFTW implementation gives it the best performance when running on 32 cores
Summary
Multi-core processors have become mainstream: they are in almost all types of computing devices from commodity laptops to customized supercomputers. OpenMP provides compiler directives to parallelize for loops, but the speedups may be dismal if the data distribution and access locality are not optimized. There are several research compiler infrastructures that support UPC, such as ROSE [14] and OpenUH [13]. Both IBM UPC [1] and Berkeley UPC [11] have demonstrated performance scalability on tens of thousands of cores. Though UPC and other PGAS languages were initially focused on large-scale distributed-memory machines, they are a good fit for emerging multi-core systems because the data partitioning capability of PGAS programming models helps users manage data locality efficiently and achieve high performance.