Abstract
Optimal partitioning of a square computational domain over several heterogeneous processors, balancing the load of the processors and minimizing the inter-processor communication cost, is crucial for data-parallel dense linear algebra and other applications with a similar communication pattern on modern hybrid servers. Although a solution has been found for two processors, the cases of three and more processors are still open. The state-of-the-art solution for three processors uses an approximate communication cost function that fails to accurately account for the total amount of data moved between processors, thus leaving the question of its global optimality unanswered. In this work, we formulate and solve a mathematical problem of optimally partitioning a real-valued square over three heterogeneous processors using a new cost function, which accurately accounts for the total amount of data communicated between processors. We also develop an original method for accurate experimental evaluation of the communication time of data movement between memories of the compute devices in the hybrid platform during the execution of data-parallel applications. We successfully use this method in the experimental validation of our mathematical results. Finally, we propose a communication energy model predicting the dynamic energy consumption of data movement between processors and experimentally validate its accuracy. This model predicts, and the experiments confirm, that the performance-optimal partition is not necessarily energy-optimal.
Highlights
The problem of matrix partitioning over heterogeneous processors originates in dense linear algebra on heterogeneous platforms
In this optimization problem, the communication cost was first included by Beaumont et al. [3], aiming to minimize the computation time and the communication cost. The latter is formalized as the sum of half-perimeters of rectangular submatrices [3], which is motivated by the communication cost of 2D parallel matrix-matrix multiplication algorithms [4]
We propose a solution to this problem using a communication cost function that accurately represents the total amount of data moved between the processors, and prove the global optimality of the identified optimal partitions
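As an illustration of the sum-of-half-perimeters cost mentioned in the highlights, here is a minimal sketch in Python. The function name `sum_half_perimeters`, the `(height, width)` representation of a rectangle, and the two example layouts are assumptions made for this example and are not taken from the paper or from [3].

```python
from typing import Iterable, Tuple

# A rectangular submatrix, described only by its (height, width).
# This representation is an assumption made for illustration.
Rect = Tuple[int, int]


def sum_half_perimeters(rects: Iterable[Rect]) -> int:
    """Approximate communication cost: sum of (height + width) over all rectangles."""
    return sum(height + width for height, width in rects)


# Two hypothetical ways to split a 6x6 matrix into three rectangles.
# Three vertical strips, each 6 rows by 2 columns:
column_layout = [(6, 2), (6, 2), (6, 2)]
# One 6x3 strip plus two stacked 3x3 blocks:
mixed_layout = [(6, 3), (3, 3), (3, 3)]

strip_cost = sum_half_perimeters(column_layout)  # (6+2) * 3 = 24
mixed_cost = sum_half_perimeters(mixed_layout)   # 9 + 6 + 6 = 21
```

Both layouts cover the same 36 elements, yet the second has the lower half-perimeter sum, which is why the shape of the partition matters and not only the area assigned to each processor.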
Summary
The problem of matrix partitioning over heterogeneous processors originates in dense linear algebra on heterogeneous platforms. We propose a solution to this problem using a communication cost function that accurately represents the total amount of data moved between the processors, and prove the global optimality of the identified optimal partitions. The main contributions of the presented work include: 1) We propose an integer-valued cost function of an arbitrary partition of a square matrix over three heterogeneous processors, which returns the exact number of matrix elements moved between the processors during the execution of the parallel matrix-matrix multiplication algorithm. 2) We use the proposed accurate cost function to introduce a mathematical problem of globally optimal partitioning of a real-valued square. We solve this problem and prove the correctness of the solution.
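To make the idea of an exact, integer-valued communication count concrete, the sketch below counts the elements each processor must receive under one simple model. The model is an assumption for illustration only and is not the paper's cost function, whose definition is not given here: matrices A, B and C = A x B share one partition, and a processor that owns C[i][j] needs all of row i of A and all of column j of B. The function name `communication_volume` and the `owner` grid are hypothetical.

```python
from typing import List


def communication_volume(owner: List[List[int]]) -> int:
    """Count matrix elements received by all processors under the assumed model.

    owner[i][j] is the id of the processor that owns element (i, j)
    of A, B and C alike (assumption: identical partitioning).
    """
    n = len(owner)
    processors = {p for row in owner for p in row}
    total = 0
    for p in processors:
        # Rows and columns in which processor p owns at least one element of C.
        rows = [i for i in range(n) if p in owner[i]]
        cols = [j for j in range(n) if any(owner[i][j] == p for i in range(n))]
        # Elements of A in the needed rows that p does not own.
        total += sum(1 for i in rows for k in range(n) if owner[i][k] != p)
        # Elements of B in the needed columns that p does not own.
        total += sum(1 for j in cols for k in range(n) if owner[k][j] != p)
    return total


# Hypothetical 4x4 partitions over processors 0, 1 and 2.
# Vertical strips: columns 0-1 to P0, column 2 to P1, column 3 to P2.
strips = [[0, 0, 1, 2] for _ in range(4)]
# Two single-element corners for P1 and P2, the rest to P0.
corners = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
]

strips_volume = communication_volume(strips)    # 8 + 12 + 12 = 32
corners_volume = communication_volume(corners)  # 4 + 6 + 6 = 16
```

Under this assumed model the count depends on where each processor's elements sit, not just on how many it owns, which is the kind of information a half-perimeter approximation can miss for non-rectangular shapes.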