Impact of Data Distribution on Performance of Irregular Reductions on Multithreaded Architectures⋆

Gary Zoppetti,Gagan Agrawal,Rishi Kumar

doi:10.1007/3-540-48228-8_49

Abstract

Computations from many scientific and engineering domains use irregularmeshes and/or sparse matrices. The codes expressing these computations involve irregular reductions. The main characteristics of irregular reduction loops are 1) elements of left-hand-side arrays may be incremented in multiple iterations of the loop, but only using associative and commutative operations (these arrays are called reduction arrays), 2) there are no loop carried dependencies, except on elements of reduction arrays, and 3) one or more arrays are accessed using indirection arrays.It is very challenging to efficiently parallelize codes involving irregular reductions, especially on large parallel machines. Because of accesses through indirection arrays, communication and locality are hard to manage. Not only is the total communication volume large, but the communication requirements typically cannot be determined at compile time. It is also hard to efficiently allocate space for non-local elements. Particularly, there are no effective solutions for parallelization of adaptive irregular reductions. In an adaptive irregular reduction, the elements of the indirection arrays are modified after every few iterations. This significantly increases the overhead associated with partitioning and runtime preprocessing routines [7,11], which have been critical for achieving locality, communication efficiency, and effective buffer management.Recently, there has been much interest in multithreaded architectures. A multiprocessor based upon a multithreaded architecture supports multiple threads of execution on each processor. These architectures also support low-cost thread initiation, low-overhead communication, and efficient communication and synchronization between threads on different processors. Multithreaded architectures are considered a promising medium for scalable parallelization of irregular applications, where the frequent communication and synchronization make parallelization hard on conventional parallel machines.We have developed an execution strategy for irregular reductions on a multithreaded architecture. The key idea in our execution model is that the frequency and volume of communication is independent of the contents of the indirection arrays. Thus, unlike other approaches to scalable parallelization of irregular reductions, our approach does not require mesh partitioning [1], array renumberingThough the frequency and volume of communication is independent of the data distribution, other factors are not.KeywordsBlock StrategyLoad ImbalanceExecution StrategyBlock VersionProcessor VersionThese keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Full Text