On a computing platform composed of several homogeneous processors, any parallel schedule of an algorithm involves three basic costs: arithmetic throughput on each processor, data movement between processors, and synchronization latency among processors. The trade-offs between these three costs determine lower bounds on the execution time of an algorithm. Trade-off analysis is therefore important for evaluating the optimality of a proposed schedule, and it often yields new insights for parallel optimization. In this paper, we focus on the trade-offs between computation, communication, and synchronization in the stencil-collective alternate update, which is executed repeatedly in the multi-stage workflows of many numerical methods, such as the conjugate gradient (CG) method and the nonlinear time integration method in the dynamical core of a global atmospheric general circulation model (AGCM). First, to formalize a workflow with multiple distinct stages, we propose a novel operator representation of parallel algorithms. Based on this operator representation, we find the minimum vertex separator of the dependency graph for a stencil-collective alternate update, which makes it possible to derive the cost lower bounds. Next, we establish a general trade-off theory for the stencil-collective alternate update, which extends the recent trade-off theory to a more general theoretical context. Finally, by applying the general theoretical result to two algorithms, namely the CG method and the nonlinear time integration method in AGCM, we obtain the corresponding lower bounds on computational cost, communication throughput, and synchronization latency. The general theory can also be used to analyze other complex numerical methods in real-world applications.