Designing Algorithms on RP3

Luigi Brochard, Alex Freau

doi:10.1002/cpe.4330040106

Abstract

We study the behavior of two numerical algorithms (matrix multiplication and a finite difference method) on RP3, a multi‐processor with a three‐level memory hierarchy. Using versions of these algorithms that differ in data placement (global, local, global and cacheable, local and cacheable) and in data access (blocked or non‐blocked), we study the impact of these parameters on program performance. The performance analysis uses a very accurate monitoring system (VPMC), which records instructions, memory requests, cache requests and misses. We also perform a theoretical performance analysis of these programs using a model of computation and communication. Good agreement is found between theoretical and experimental results. In conclusion, we discuss the use of local memory on such a machine and show that it is ineffective given the communication speed ratios of RP3's cache, local memory and global memory. We also discuss optimal use of the cache and show that the optimum can only be reached when the cache has certain properties (a private store‐in cache with user control of write‐back), and that blocked algorithms must be used to reach it. Comparing the programming of shared and distributed memory multi‐processors, we remark that optimized algorithms for shared memory systems use the same blocking techniques used for programming distributed memory systems, leading to a common programming paradigm.
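
The difference between non-blocked and blocked data access mentioned above can be illustrated with matrix multiplication. The following is a minimal sketch, not the authors' code: the function names and the `block` parameter are hypothetical, Python is used only for readability, and nothing here models RP3's memory hierarchy or VPMC. It only shows that the blocked version computes the same product while working on one small tile of each matrix at a time, which is the property that lets a tile stay in fast memory.

```python
def matmul_naive(a, b):
    """Non-blocked product: each row of a is streamed against every column of b."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=2):
    """Blocked product: the loops are tiled so that each innermost triple
    touches only block-by-block tiles of a, b and c.
    `block` is an illustrative tile size, not a value taken from the paper."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                # Accumulate the contribution of one tile of a and one tile of b
                # into the corresponding tile of c.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, p)):
                        s = c[i][j]
                        for k in range(k0, min(k0 + block, m)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c
```

In a real setting the tile size would be chosen so that the three tiles fit together in the cache; that choice depends on the machine and is left open here.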
