Although parallel algorithms using linked lists, trees, and graphs have been studied extensively by the research community, implementations have met with limited success, even for the simplest algorithms. In this paper we present the results of a careful implementation study of parallel list ranking (and the related list-scan operation) and show that it can achieve substantial speedup over fast workstations. Obtaining good parallel performance for list ranking is a challenge for two reasons. First, although there are many asymptotically work-efficient algorithms, it is hard to keep the constants comparable to those of the sequential algorithm. In this paper we introduce a new parallel algorithm that is both work efficient and has small constants, but sacrifices logarithmic time: it achieves only O(log² n) time. We contend, however, that work efficiency and small constants are more important, given that multiprocessor machines are used for problems that are much larger than the number of processors and, therefore, the optimal O(log n) time is never achieved in practice. Second, list ranking is highly communication intensive and its memory access patterns are dynamic. We show, however, that by using high-memory-bandwidth multiprocessors, such as the Cray C90, programmed with virtual processors to hide latency, we can ameliorate these problems. To the best of our knowledge, our implementation of list ranking and list scan on the Cray C90 is the fastest to date and is the first implementation that substantially outperforms fast workstations. The success of our algorithm is due to its moderate grain size and simplicity; the success of the implementation is due to pipelining reads and writes through vectorization to hide latency and to optimizing performance by analyzing the expected execution time of the algorithm.
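List ranking computes, for every node of a linked list stored in unordered array positions, its distance to the end of the list (list scan generalizes this to prefix sums along the list). The abstract does not spell out the new O(log² n) algorithm, so the sketch below only illustrates the problem and the classic pointer-jumping approach (Wyllie's algorithm), which does O(n log n) work and is therefore not work efficient; identifiers such as succ, rank, and NIL are assumptions for this illustration, not names from the paper.

```c
/*
 * Sequential simulation of pointer-jumping (Wyllie-style) list ranking.
 * NOT the paper's work-efficient algorithm; shown only to illustrate the
 * list-ranking problem and the parallel-rounds structure.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NIL (-1)   /* marks the last node of the list */

/* On return, rank[i] is the distance from node i to the end of the list. */
void list_rank(int n, int *succ, int *rank)
{
    int *old_succ = malloc(n * sizeof(int));
    int *old_rank = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++)
        rank[i] = (succ[i] == NIL) ? 0 : 1;

    /* ceil(log2 n) rounds; in a PRAM each round is one synchronous
     * parallel step, which the snapshot arrays imitate here. */
    for (int round = 0; (1 << round) < n; round++) {
        memcpy(old_succ, succ, n * sizeof(int));
        memcpy(old_rank, rank, n * sizeof(int));
        for (int i = 0; i < n; i++) {        /* "parallel for" over nodes */
            if (old_succ[i] != NIL) {
                rank[i] = old_rank[i] + old_rank[old_succ[i]];
                succ[i] = old_succ[old_succ[i]];   /* jump the pointer */
            }
        }
    }
    free(old_succ);
    free(old_rank);
}

int main(void)
{
    /* A 5-node list stored in arrays: 3 -> 0 -> 4 -> 1 -> 2 */
    int succ[5] = {4, 2, NIL, 0, 1};
    int rank[5];

    list_rank(5, succ, rank);
    for (int i = 0; i < 5; i++)
        printf("node %d: rank %d\n", i, rank[i]);   /* 3, 1, 0, 4, 2 */
    return 0;
}
```

Each pointer jump halves the remaining distance to the end, giving the O(log n) round count; the total work of n operations per round is exactly the constant-factor and work-efficiency cost that the paper's algorithm is designed to avoid.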