Abstract
We have developed an environment, based upon robust, existing, open source software, for tuning applications written using MPI, OpenMP or both. The goal of this effort, which integrates the OpenUH compiler and several popular performance tools, is to increase user productivity by providing an automated, scalable performance measurement and optimization system. In this paper we describe our environment, show how these complementary tools can work together, and illustrate the synergies possible by exploiting their individual strengths and combined interactions. We also present a methodology for performance tuning that is enabled by this environment. One of the benefits of using compiler technology in this context is that it can direct the performance measurements to capture events at different levels of granularity and help assess their importance, which we have shown to significantly reduce the measurement overheads. The compiler can also help when attempting to understand the performance results: it can supply information on how a code was translated and whether optimizations were applied. Our methodology combines two performance views of the application to find bottlenecks. The first is a high level view that focuses on OpenMP/MPI performance problems such as synchronization cost and load imbalances; the second is a low level view that focuses on hardware counter analysis with derived metrics that assess the efficiency of the code. Our experiments have shown that our approach can significantly reduce overheads for both profiling and tracing to acceptable levels and limit the number of times the application needs to be run with selected hardware counters. In this paper, we demonstrate the workings of this methodology by illustrating its use with selected NAS Parallel Benchmarks and a cloud resolving code.
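As a minimal sketch of the two performance views described above, the snippet below computes one high level indicator (load imbalance across threads or ranks) and one low level derived metric (instructions per cycle). The function names, the max/mean definition of imbalance, and the choice of instructions per cycle are illustrative assumptions; the abstract does not specify which metrics the environment actually derives.

```python
from statistics import mean


def load_imbalance(times):
    """High level view: ratio of the slowest thread/rank time to the mean time.

    A value of 1.0 means perfectly balanced work; larger values mean that
    some threads or ranks wait for the slowest one. (Assumed definition.)
    """
    if not times:
        raise ValueError("need at least one timing")
    avg = mean(times)
    if avg == 0:
        return 1.0
    return max(times) / avg


def instructions_per_cycle(instructions, cycles):
    """Low level view: a derived metric built from two hardware counters.

    The counter values are assumed to come from some measurement tool;
    this function only combines them.
    """
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles
```

For example, per-thread times of 1, 1, 1 and 5 seconds give a mean of 2 seconds and an imbalance of 2.5, which would point at the high level view before any hardware counter runs are needed.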
Highlights
The difficulty of developing high performance applications has increased greatly with the growth in size and architectural complexity of each new generation of supercomputers.
All processors/nodes see a single address space, and the global memory is based on a cache-coherent Non-Uniform Memory Access (ccNUMA) system implemented via the NUMAlink4 interconnect.
In this paper we have presented a methodology for solving performance problems that exploits the capabilities of an integrated tuning environment created in a collaboration between open source compiler developers and performance tools providers.
Summary
The difficulty of developing high performance applications has increased greatly with the growth in size and architectural complexity of each new generation of supercomputers. Sampling entire applications can yield large amounts of low level information that can be overwhelming for the user. Some processor architectures, such as the Itanium 2 processors [15] and the PowerPC [23], support sampling by providing specialized hardware such as performance monitoring units. PDT [12] is a toolkit designed to overcome the lack of a portable compiler instrumentation API with support for C, C++, Fortran and OpenMP. It gathers static program information via a parser and represents it in a portable format suitable for use in source code instrumentation. These capabilities play a significant role in reducing the number of instrumentation points, the instrumentation overhead and the size of performance trace files, and in improving a user’s ability to determine the impact of program optimizations.
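The idea of reducing instrumentation points can be sketched as a simple filter over candidate code regions. Everything in the sketch below is hypothetical: the `Region` record, the per-call cycle estimate, the call-count estimate and both thresholds are illustrative assumptions, not the compiler's actual cost model or the selection rules used by the tools named above.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    est_cycles_per_call: float  # hypothetical static estimate of work per call
    est_calls: int              # hypothetical estimate of how often it is called


def select_regions(regions, min_cycles_per_call=500.0, max_calls=100_000):
    """Return the names of regions worth instrumenting.

    A region is skipped only when it is both tiny and called very often,
    because the cost of the measurement probes would then dominate the
    work being measured.
    """
    selected = []
    for region in regions:
        tiny = region.est_cycles_per_call < min_cycles_per_call
        hot = region.est_calls > max_calls
        if not (tiny and hot):
            selected.append(region.name)
    return selected


candidates = [
    Region("solve", 1_000_000.0, 10),
    Region("get_index", 20.0, 5_000_000),
    Region("init", 50.0, 1),
]
print(select_regions(candidates))
```

With these made-up inputs, the small and frequently called `get_index` is excluded, while `solve` and the rarely called `init` remain instrumented. Fewer probes in hot, tiny routines is what lowers both the runtime overhead and the trace size.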