Summary

The Floyd‐Warshall algorithm is a well‐known algorithm for computing the shortest‐path distances between all pairs of nodes in a graph. The Blocked Floyd‐Warshall algorithm, a variant of the Floyd‐Warshall algorithm, has been proposed to accelerate the computation on graphics processing unit (GPU) architectures. Previously published GPU implementations of the Blocked Floyd‐Warshall algorithm issue many separate kernel calls to achieve barrier synchronization, which is costly. The main contribution of this article is an efficient implementation of the Blocked Floyd‐Warshall algorithm that performs no barrier synchronization and invokes only one kernel call. Experimental results on an NVIDIA Tesla V100 show that our implementation runs 1.05‐1.31 times faster than the previously published one, and that our implementation with SIMD functions runs 1.00‐1.28 times faster than the previously published one. We also propose efficient GPU implementations that execute the Blocked Floyd‐Warshall algorithm for many graphs at the same time. In the experimental results, our single‐kernel implementation runs 1.03‐1.60 times faster than the multiple‐kernel one; with SIMD functions, our single‐kernel implementation runs 1.01‐1.89 times faster than the multiple‐kernel one. In addition, we propose low‐latency implementations for many graphs. Finally, we implemented the parallel Floyd‐Warshall algorithm on multicore processors.
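
For readers unfamiliar with the blocked scheme mentioned in the summary, the following is a minimal sequential sketch of the standard three‐phase Blocked Floyd‐Warshall algorithm, shown next to the plain algorithm for comparison. It is an illustration only, not the single‐kernel GPU implementation of this article: the function names, the block size `b`, and the list‐of‐lists matrix layout are assumptions made for the example. On a GPU, each phase would typically need all blocks of the previous phase to be finished, which is where the barrier synchronization discussed in the summary comes from.

```python
import math

INF = math.inf  # marks "no edge" in the input matrix


def floyd_warshall(dist):
    """Plain Floyd-Warshall; returns a new all-pairs distance matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def _relax_block(d, bi, bj, bk, b, n):
    """Relax block (bi, bj) through the intermediate nodes of block bk."""
    for k in range(bk * b, min((bk + 1) * b, n)):
        for i in range(bi * b, min((bi + 1) * b, n)):
            for j in range(bj * b, min((bj + 1) * b, n)):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]


def blocked_floyd_warshall(dist, b=2):
    """Three-phase blocked variant (hypothetical helper, sequential sketch)."""
    n = len(dist)
    d = [row[:] for row in dist]
    nb = (n + b - 1) // b  # number of blocks per dimension
    for bk in range(nb):
        # Phase 1: the pivot block depends only on itself.
        _relax_block(d, bk, bk, bk, b, n)
        # Phase 2: blocks in the pivot row and the pivot column.
        for x in range(nb):
            if x != bk:
                _relax_block(d, bk, x, bk, b, n)
                _relax_block(d, x, bk, bk, b, n)
        # Phase 3: all remaining blocks, using the phase 2 results.
        for bi in range(nb):
            for bj in range(nb):
                if bi != bk and bj != bk:
                    _relax_block(d, bi, bj, bk, b, n)
    return d


# Small example graph with 5 nodes (not a multiple of the block size).
graph = [
    [0, 3, INF, 7, INF],
    [8, 0, 2, INF, INF],
    [5, INF, 0, 1, INF],
    [2, INF, INF, 0, 4],
    [INF, 1, INF, INF, 0],
]
result = blocked_floyd_warshall(graph, b=2)
```

Within one iteration of the outer loop, the blocks of phase 2 are independent of each other, as are the blocks of phase 3, which is what makes the blocked formulation attractive for parallel hardware.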