Abstract

Developing distributed and parallel programs on today's multiprocessor architectures is still a challenging task. Particularly distressing is the lack of effective performance tools that support the programmer in evaluating changes in code, problem and machine sizes, and target architectures. In this paper we introduce P3T+, a performance estimator for mostly regular HPF (High Performance Fortran) programs that also partially covers message passing programs (MPI). P3T+ is unique in that it models programs, compiler code transformations, and parallel and distributed architectures. It computes at compile time a variety of performance parameters including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. Several novel techniques are employed to compute these parameters: loop iteration spaces, array access patterns, and data distributions are modeled by highly effective symbolic analysis. Communication is estimated by simulating the behavior of the communication library used by the underlying compiler. Computation times are predicted through pre-measured kernels on every target architecture of interest. We carefully model the most critical architecture-specific factors such as cache line sizes, number of cache lines available, startup times, message transfer time per byte, etc. P3T+ has been implemented and is closely integrated with the Vienna High Performance Compiler (VFC) to support programmers in developing parallel and distributed applications. Experimental results for realistic kernel codes taken from real-world applications are presented to demonstrate both the accuracy and the usefulness of P3T+.
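To give an intuition for one of the parameters listed above, the following is a minimal sketch of a compile-time cache-miss estimate for a one-dimensional strided array access. It is an illustrative simplification under stated assumptions (cold cache, no reuse between iterations), not the actual P3T+ cache model; the function name and parameters are hypothetical.

```python
# Illustrative sketch (not the P3T+ model): estimate cold-cache misses
# for the access sequence a[0], a[s], ..., a[(n-1)*s].

def estimated_cache_misses(n, stride, elem_size, line_size):
    """Estimate cache misses for n accesses with a fixed element stride.

    Assumes a cold cache and no inter-iteration reuse, so every
    distinct cache line touched counts as exactly one miss.
    """
    stride_bytes = stride * elem_size
    if stride_bytes >= line_size:
        # Each access lands on a different cache line.
        return n
    # Consecutive accesses share lines; the accesses span
    # (n - 1) * stride_bytes + elem_size bytes in total.
    span = (n - 1) * stride_bytes + elem_size
    return -(-span // line_size)  # ceiling division

# 1024 contiguous doubles (8 bytes) with 64-byte lines
# touch 1024 * 8 / 64 = 128 distinct lines.
print(estimated_cache_misses(1024, 1, 8, 64))  # -> 128
```

Even this crude model shows why the cache line size and the number of available cache lines appear among the architecture parameters the estimator must know: the predicted miss count changes qualitatively once the stride exceeds the line size.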

Highlights

  • Parallelizing and optimizing programs for multiprocessor systems with distributed memory is still a challenging task: programmers need to know the effect of a code change on the performance of a program, what happens to the performance if problem and machine sizes are modified, and how much performance can be gained by changing a specific machine parameter (e.g. communication bandwidth or cache size). Tools providing accurate performance information to examine such effects are of paramount importance.

  • Very few performance estimators consider code transformations and optimizations applied by a compiler.


Summary

Introduction

Parallelizing and optimizing programs for multiprocessor systems with distributed memory is still a challenging task. Programmers face questions such as:
– What is the effect of a code change on the performance of a program?
– What happens to the performance if problem and machine sizes are modified?
– How much performance can be gained by changing a specific machine parameter (e.g. communication bandwidth or cache size)?
Clearly this list is incomplete, but it shows that tools providing accurate performance information to examine some of these effects are of paramount importance.

Historically, there have been two classes of performance tools. First, there is extensive work on monitoring distributed and parallel applications, but these approaches have several drawbacks: the program and target architecture must be available, execution times are long, the measured performance data is perturbed, and vast amounts of performance data are produced. Second, there is the class of performance estimators that try to statically examine a program's performance without executing it on a target architecture. This approach suffers mostly from restrictions on the programs and machines that can be modeled, as well as from less accurate results, but the time needed to compute performance information can be very short.
