IBM SP2 Research Articles

This paper starts from a well-known idea, that structure in irregular problems improves sequential performance, and shows that the same structure can also be exploited to parallelize irregular problems on a distributed-memory multicomputer. In particular, we extend a well-known parallelization technique called run-time compilation to use structure information that is explicit in the array subscripts. We present a number of internal representations suited to particular access patterns and show how preprocessing structures such as translation tables, trace arrays, and interprocessor communication schedules can be encoded in terms of one or more of these representations. We show that loop and index normalization are important for detecting irregularity in array references, as well as the presence of locality in such references. We then present methods for detecting irregularity, determining the feasibility of inspection, and placing inspectors and interprocessor communication schedules. We show that this process can be automated through extensions to an HPF/Fortran-77 distributed-memory compiler (PARADIGM) and a new run-time support for irregular problems (PILAR) that uses a variety of internal representations of communication patterns. We devise performance measures that relate the inspection cost, the execution cost, and the number of times the executor is invoked, so that competing schemes can be compared independently of the number of iterations. Finally, we show experimental results on an IBM SP-2 that validate our approach. These results show that dramatic improvements in both memory requirements and execution time can be achieved by using these techniques.
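To make the inspector and executor roles mentioned in this abstract concrete, here is a minimal single-process sketch in Python. It is not the PARADIGM or PILAR implementation: the block distribution, the function names (`inspector`, `gather`, `executor`), and the dictionary used as a communication schedule are all assumptions made for illustration, and the "communication" is simulated by reading from a global list.

```python
def inspector(index_array, block_size, my_rank):
    """Scan the indirection array once and build a schedule:
    for each remote owner, the sorted distinct global indices to fetch.
    Assumes a simple block distribution (owner = index // block_size)."""
    schedule = {}
    for g in index_array:
        owner = g // block_size
        if owner != my_rank:
            schedule.setdefault(owner, set()).add(g)
    return {owner: sorted(idxs) for owner, idxs in schedule.items()}


def gather(schedule, global_data):
    """Simulated communication: copy the scheduled off-processor
    elements into a ghost buffer keyed by global index."""
    return {g: global_data[g] for idxs in schedule.values() for g in idxs}


def executor(index_array, local_data, ghost, block_size, my_rank):
    """Run the irregular loop (here, a sum over x[index_array[i]]),
    reading owned elements locally and the rest from the ghost buffer."""
    lo = my_rank * block_size
    total = 0.0
    for g in index_array:
        if g // block_size == my_rank:
            total += local_data[g - lo]
        else:
            total += ghost[g]
    return total


if __name__ == "__main__":
    global_data = [float(i) for i in range(8)]   # hypothetical data, 2 ranks x 4 elements
    index_array = [0, 5, 3, 5, 7]                # irregular subscripts seen by rank 0
    sched = inspector(index_array, block_size=4, my_rank=0)
    ghost = gather(sched, global_data)
    print(sched, executor(index_array, global_data[0:4], ghost, 4, 0))
```

In this sketch the schedule is built once and can be reused by every later call to `executor` as long as `index_array` does not change, which is why the abstract weighs inspection cost against the number of executor invocations.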


To efficiently execute a finite element application on a distributed-memory multicomputer, the nodes of the finite element graph must be distributed to processors as evenly as possible while minimizing the communication cost among processors. This partitioning problem is known to be NP-complete, so many heuristics have been proposed to find satisfactory sub-optimal solutions, and many graph partitioners have been developed from them. Among these, Jostle, Metis, and Party are considered the best graph partitioners available to date. To minimize the total cut-edges, these three partitioners generally allow a 3% to 5% load imbalance among processors, which is a tradeoff between the communication cost and the computation cost of the partitioning problem. In this paper, we propose an optimization method, the dynamic diffusion method (DDM), to balance the 3% to 5% load imbalance allowed by these three graph partitioners while minimizing the total cut-edges among partitioned modules. To evaluate the proposed method, we compare the dynamic diffusion method with the directed diffusion method and the multilevel diffusion method on an IBM SP2 parallel machine. Three 2D and two 3D irregular finite element graphs are used as test samples, and each is tested under 3% and 5% load imbalance. From the experimental results, we draw the following conclusions. (1) The dynamic diffusion method can improve the partition results of these three partitioners in terms of the total cut-edges and the execution time of a Laplace solver in most test cases, while the directed diffusion method and the multilevel diffusion method may fail in many cases. (2) The optimization results of the dynamic diffusion method are better than those of the directed diffusion method and the multilevel diffusion method in terms of the total cut-edges and the execution time of a Laplace solver for most test cases. (3) The dynamic diffusion method can balance the load of processors for all test cases.
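For readers unfamiliar with diffusion-based load balancing, the following Python sketch shows the general idea on a small processor graph. It is a generic first-order diffusion scheme, not the dynamic diffusion method (DDM) proposed in the abstract, whose details are not given here. The function name `diffuse`, the parameter `alpha`, and the example loads are assumptions for illustration; loads are treated as real numbers, whereas an actual partitioner would migrate whole graph nodes and also track cut-edges.

```python
def diffuse(loads, edges, alpha=0.25, steps=50):
    """Generic first-order diffusion on a processor graph.

    loads: initial load per processor.
    edges: undirected processor pairs (i, j) that may exchange load.
    alpha: fraction of each pairwise load difference moved per step;
           it should be small (e.g. at most 1 / (max degree + 1))
           for the iteration to stay stable.
    """
    loads = [float(x) for x in loads]
    for _ in range(steps):
        delta = [0.0] * len(loads)
        for i, j in edges:
            # Load flows from the heavier processor to the lighter one.
            flow = alpha * (loads[i] - loads[j])
            delta[i] -= flow
            delta[j] += flow
        loads = [l + d for l, d in zip(loads, delta)]
    return loads


if __name__ == "__main__":
    # Hypothetical example: 4 processors in a ring, up to 5% imbalance.
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(diffuse([105, 100, 95, 100], ring))
```

Each step conserves the total load and only moves work between neighboring processors, which is the property that makes diffusion attractive as a post-processing step after an initial partition.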


Articles published on IBM SP2

A two-level parallelization strategy for Genetic Algorithms applied to optimum shape design

Optimizing irregular HPF applications using halos

Parallelizing the Dual Simplex Method

BSP clusters: High performance, reliable and very low cost

A Case Study On The Importance Of Compiler And Other Optimizations For Improving Super-scalar Processor Performance

A Parallel Material-point Method With Application To 3D Solid Mechanics

High-Performance Radix-2, 3 and 5 Parallel 1-D Complex FFT Algorithms for Distributed-Memory Parallel Computers

Compiler and run-time support for exploiting regularity within irregular applications

An Interleaving Transformation for Parallelizing Reductions for Distributed-Memory Parallel Machines

A distributed memory parallel element-by-element scheme for semiconductor device simulation

Semicoarsening Multigrid on Distributed Memory Machines

Parallel sparse linear algebra and application to structural mechanics

Interactive Scientific Visualization Using Parallel Redundant Prediction Method

A Dynamic Diffusion Optimization Method for Irregular Finite Element Graph Partitioning

ParaPART: parallel mesh partitioning tool for distributed systems

Parallel Krylov Methods for Econometric Model Simulation

A generalized basic-cycle calculation method for efficient array redistribution

Efficient Methods for Multi-Dimensional Array Redistribution

Efficient Parallelization of a Three-Dimensional Navier-Stokes Solver on MIMD Multiprocessors
