The main contribution of this paper is to introduce two parallel memory machines, the discrete memory machine (DMM) and the unified memory machine (UMM). Unlike well-studied theoretical parallel computational models such as parallel random access machines (PRAMs), these parallel memory machines are practical and capture the essential features of memory access on graphics processing units (GPUs). As a first step in developing algorithmic techniques for the DMM and the UMM, we first evaluate the computing time of contiguous access and stride access to memory on these models. We then present parallel algorithms to transpose a 2D array on these models and evaluate their performance. Finally, we show that, for any permutation given offline, the data in an array can be moved efficiently according to that permutation on both the DMM and the UMM. Since the computing time of our permutation algorithms on the DMM and the UMM equals the sum of the lower bounds derived from the memory-bandwidth limitation and the latency limitation, they are optimal from a theoretical point of view. We believe that the DMM and the UMM can serve as good theoretical platforms for developing algorithmic techniques for GPUs.
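As an illustrative sketch only (not part of the paper's models or algorithms), the two CUDA kernels below contrast the contiguous and stride access patterns whose cost the DMM and the UMM are intended to capture; the kernel names, the stride value, and the array size are hypothetical choices made for this example.

// Illustrative sketch, not from the paper: contiguous vs. stride memory access.
#include <cuda_runtime.h>

// Contiguous access: consecutive threads read consecutive addresses, so each
// warp's requests can be served by few memory transactions.
__global__ void copyContiguous(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Stride access: consecutive threads read addresses 'stride' elements apart,
// scattering each warp's requests over many transactions (or many memory
// banks on a DMM-like memory).
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];  // assumes i * stride fits in an int
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    copyContiguous<<<(n + 255) / 256, 256>>>(in, out, n);
    copyStrided<<<(n + 255) / 256, 256>>>(in, out, n, 33);  // stride coprime to n, so each element is read once
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Timing the two kernels on an actual GPU typically shows the strided version running noticeably slower; this is the kind of behavior the DMM and the UMM formalize through their bandwidth and latency parameters.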