To exploit the flexibility of OpenMP in parallelizing large-scale multi-physics applications, where different modes of parallelism are needed for efficient computation, it is first necessary to scale OpenMP codes as well as MPI on large core counts. In this research we implemented fine-grained OpenMP parallelism in a large CFD code, GenIDLEST, and investigated its performance from 1 to 256 cores using a variety of performance optimization and measurement tools. Weak and strong scaling studies show that OpenMP performance can be made to match that of MPI on SGI Altix systems for up to 256 cores. Data placement and locality were established to be key components in obtaining good scalability with OpenMP. It is also shown that a hybrid implementation on a dual-core system gives the same performance as standalone MPI or OpenMP. Finally, it is shown that in irregular multi-physics applications that do not adhere solely to the SPMD (Single Program, Multiple Data) mode of computation, as encountered in tightly coupled fluid-particulate systems, the flexibility of OpenMP can provide a significant performance advantage over MPI.
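The following is a minimal illustrative sketch, not code from the paper: it shows the general pattern of fine-grained loop-level OpenMP parallelism combined with first-touch initialization, the kind of data-placement technique the abstract identifies as key to OpenMP scalability on NUMA systems such as the SGI Altix. All array names and sizes are hypothetical; the stencil update merely stands in for a CFD kernel.

```c
/* Hypothetical sketch: fine-grained OpenMP parallelism with
 * first-touch data placement. Not taken from GenIDLEST. */
#include <omp.h>
#include <stdlib.h>

#define N 1024

int main(void)
{
    double *a = malloc(N * N * sizeof(double));
    double *b = malloc(N * N * sizeof(double));

    /* First-touch: initialize with the same parallel schedule as
     * the compute loop, so each thread's memory pages are placed
     * on the NUMA node local to the core that will use them. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i * N + j] = 0.0;
            b[i * N + j] = 1.0;
        }

    /* Fine-grained loop-level parallelism: a simple 5-point
     * stencil update standing in for a CFD kernel. */
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            a[i * N + j] = 0.25 * (b[(i - 1) * N + j] + b[(i + 1) * N + j]
                                 + b[i * N + j - 1] + b[i * N + j + 1]);

    free(a);
    free(b);
    return 0;
}
```

Keeping the initialization and compute loops on the same static schedule is what makes first-touch placement effective: data and computation stay co-located across the run.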