Abstract

X10 is a partitioned global address space programming language that supports the notion of places ; a place consists of some data and some lightweight tasks called activities. Each activity runs at a place and may invoke a place-change operation (using the at-construct) to synchronously perform some computation at another place. These place-change operations can be very expensive, as they need to copy all the required data from the current place to the remote place. However, identifying the necessary number of place-change operations and the required data during each place-change operation are non-trivial tasks, especially in the context of irregular applications (like graph applications) that contain complex code with large amounts of cross-referencing objects—not all of those objects may be actually required, at the remote place. In this article, we present AT-Com, a scheme to optimize X10 code with place-change operations. AT-Com consists of two inter-related new optimizations: (i) AT-Opt, which minimizes the amount of data serialized and communicated during place-change operations, and (ii) AT-Pruning, which identifies/elides redundant place-change operations and does parallel execution of place-change operations. AT-Opt uses a novel abstraction, called abstract-place-tree , to capture place-change operations in the program. For each place-change operation, AT-Opt uses a novel inter-procedural analysis to precisely identify the data required at the remote place in terms of the variables in the current scope. AT-Opt then emits the appropriate code to copy the identified data-items to the remote place. AT-Pruning introduces a set of program transformation techniques to emit optimized code such that it avoids the redundant place-change operations. We have implemented AT-Com in the x10v2.6.0 compiler and tested it over the IMSuite benchmark kernels. Compared to the current X10 compiler, the AT-Com optimized code achieved a geometric mean speedup of 18.72× and 17.83× on a four-node (32 cores per node) Intel and two-node (16 cores per node) AMD system, respectively.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call