A Performance Tuning Methodology with Compiler Support

Oscar Hernandez, Haoqiang Jin, Barbara Chapman

doi:10.1155/2008/752801

Open Access

https://doi.org/10.1155/2008/752801

Abstract

We have developed an environment, based upon robust, existing, open-source software, for tuning applications written using MPI, OpenMP or both. The goal of this effort, which integrates the OpenUH compiler and several popular performance tools, is to increase user productivity by providing an automated, scalable performance measurement and optimization system. In this paper we describe our environment, show how these complementary tools can work together, and illustrate the synergies possible by exploiting their individual strengths and combined interactions. We also present a methodology for performance tuning that is enabled by this environment. One of the benefits of using compiler technology in this context is that it can direct the performance measurements to capture events at different levels of granularity and help assess their importance, which we have shown to significantly reduce measurement overhead. The compiler can also help when attempting to understand the performance results: it can supply information on how a code was translated and whether optimizations were applied. Our methodology combines two performance views of the application to find bottlenecks. The first is a high-level view that focuses on OpenMP/MPI performance problems such as synchronization cost and load imbalances; the second is a low-level view that focuses on hardware counter analysis with derived metrics that assess the efficiency of the code. Our experiments have shown that our approach can reduce the overhead of both profiling and tracing to acceptable levels and limit the number of times the application needs to be run with selected hardware counters. In this paper, we demonstrate the workings of this methodology by illustrating its use with selected NAS Parallel Benchmarks and a cloud-resolving code.
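As an illustration of the "derived metrics" mentioned in the abstract, the following minimal sketch computes a few commonly used ratios from raw hardware counter totals. The counter names, the sample values, and the choice of metrics are assumptions made for this example; they are not taken from the paper or from any specific counter library.

```python
def derived_metrics(counters):
    """Compute a few common derived metrics from raw counter totals.

    `counters` is a plain dict; its keys are hypothetical names chosen
    for this sketch, not the event names of any real tool.
    """
    cycles = counters["cycles"]
    return {
        # Instructions retired per cycle: a coarse efficiency indicator.
        "ipc": counters["instructions"] / cycles,
        # Fraction of L2 accesses that missed.
        "l2_miss_ratio": counters["l2_misses"] / counters["l2_accesses"],
        # Floating-point operations per cycle.
        "flops_per_cycle": counters["fp_ops"] / cycles,
    }


# Made-up counter totals for one code region.
sample = {
    "cycles": 2_000_000,
    "instructions": 3_000_000,
    "l2_accesses": 50_000,
    "l2_misses": 5_000,
    "fp_ops": 1_000_000,
}
metrics = derived_metrics(sample)
```

Ratios like these are easier to compare across code regions than raw counts, which is presumably why a low-level view would rely on them.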

Highlights

The difficulty of developing high performance applications has increased greatly with the growth in size and architectural complexity of each new generation of supercomputers.
A single address space is seen by all the processors/nodes and its global memory is based on a cache-coherent Non-Uniform Memory Access system implemented via the NUMAlink4.
In this thesis we have presented a methodology for solving performance problems that exploits the capabilities of an integrated tuning environment created in a collaboration between open source compiler developers and performance tools providers.

Summary

Introduction

The difficulty of developing high performance applications has increased greatly with the growth in size and architectural complexity of each new generation of supercomputers. Sampling entire applications can yield large amounts of low-level information that can be overwhelming for the user. Some processor architectures, such as the Itanium 2 processors [15] and the PowerPC [23], support sampling by providing specialized hardware such as performance monitoring units. PDT [12] is a toolkit that was designed in an attempt to overcome the lack of a portable compiler instrumentation API with support for C, C++, Fortran and OpenMP. It gathers static program information via a parser and represents it in a portable format suitable for use in source code instrumentation. These capabilities play a significant role in the reduction of instrumentation points, in reducing the instrumentation overhead and the size of performance trace files, and in improving a user's ability to determine the impact of program optimizations.
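The idea of reducing instrumentation points can be sketched as a simple filter over compiler-provided estimates. This is only an illustrative heuristic, not the cost model used by the authors: the `Procedure` fields, the thresholds, and the example procedure names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Procedure:
    name: str
    est_calls: int   # estimated invocation count (hypothetical compiler estimate)
    est_weight: int  # estimated work per call, e.g. statement count (hypothetical)


def select_for_instrumentation(procs, min_weight=50, max_calls=100_000):
    """Return names of procedures worth instrumenting.

    A procedure is skipped only when it is both small and called very
    often, because probe overhead would then dominate its own runtime.
    Both thresholds are arbitrary values chosen for this sketch.
    """
    selected = []
    for p in procs:
        small_and_hot = p.est_weight < min_weight and p.est_calls > max_calls
        if not small_and_hot:
            selected.append(p.name)
    return selected


procs = [
    Procedure("solve", est_calls=200, est_weight=900),
    Procedure("get_index", est_calls=5_000_000, est_weight=4),
    Procedure("init", est_calls=1, est_weight=30),
]
selected = select_for_instrumentation(procs)
```

Here `get_index` is the only procedure filtered out: it is tiny and called millions of times, so instrumenting it would add far more overhead and trace data than it is worth.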

Contents of the paper

Related work

The OpenUH compiler and Dragon analysis tool

PerfSuite

Tools interactions

Compile time instrumentation

Tuning methodology and selective instrumentation

Description of the methodology

Selective instrumentation analysis

Case studies

Application description

Evaluating selective instrumentation

Performance analysis for the BT MPI benchmark

Performance analysis of the cloud code

Findings

Conclusions and future work

Journal: Scientific Programming
Publication Date: Jan 1, 2008
Citations: 20
License type: CC BY 3.0
