Abstract

Loop tiling and communication optimization, such as message pipelining and aggregation, can achieve optimized and robust memory performance by proactively managing storage and data movement. In this paper, we generalize these techniques to pointer-based data structures (PBDSs). Our approach, dynamic pointer alignment (DPA), has two components. The compiler decomposes a program into non-blocking threads that operate on specific pointers and labels thread creation sites with their corresponding pointers. At runtime, an explicit mapping from pointers to dependent threads is updated at thread creation and is used to dynamically schedule both threads and communication, such that threads using the same objects execute together, communication overlaps with local work, and messages are aggregated. We have implemented DPA to optimize remote reads to global PBDSs on parallel machines. Our empirical results on the force computation phases of two applications that use sophisticated PBDSs, Barnes-Hut and FMM, show that DPA achieves good absolute performance and speedups by enabling tiling and communication optimization on the CRAY T3D.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call