Abstract

AbstractWe study here the behavior of two numerical algorithms (matrix multiplication and finite difference method) on a three‐level memory hierarchy multi‐processor RP3. Using different versions of these algorithms, which differ on data placement (global, local, global and cacheable, local and cacheable) and on data access (blocked or non‐blocked), we study the impact of these parameters on the performance of the program. This performance analysis is done using a very accurate monitoring system (VPMC) which records instructions, memory requests, cache requests and misses. We perform also a theoretical performance analysis of these programs using a model of computation and communication. Good agreement is found between theoretical and experimental results. As a conclusion we discuss the use of local memory on such a machine and show that it is ineffective with RP3 cache, local and global memory communication speed ratios. We also discuss optimal use of cache and show that the optima can only be realized under some cache properties (private store‐in cache with user control of write‐back) and show that blocked optimal algorithms are to be used to find it. Comparing programming of shared and distributed memory multi‐processors, we remark that optimized algorithms for shared memory systems utilize the same blocking techniques used for programming distributed memory systems, leading to a common programming paradigm.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.