Abstract

Recursive tree traversals appear in many application domains, including data mining, graphics, machine learning and scientific simulations. In the past few years there has been growing interest in deploying applications based on graph data structures on many-core devices. A few recent efforts have focused on optimizing the execution of multiple serial tree traversals on GPUs, and have reported performance trends that vary across algorithms. In this work, we aim to understand how to select the implementation and platform best suited to a given tree traversal algorithm and dataset. To this end, we perform a systematic study of recursive tree traversal on CPU, GPU and the Intel Phi processor. We first identify four tree traversal patterns: three perform multiple serial traversals concurrently, and the fourth performs a single parallel level-order traversal. For each of these patterns, we consider different code variants, covering both existing and new optimization methods, and we characterize their control-flow and memory access patterns. We implement these code variants and evaluate them on CPU, GPU and Intel Phi. Our analysis shows that no single code variant and platform achieves the best performance on all tree traversal patterns, and it provides guidelines for selecting the implementation best suited to a given tree traversal pattern and input dataset.
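The abstract distinguishes two broad families of traversal: many independent serial traversals running concurrently, and one level-order traversal that is itself parallel. The sketch below illustrates that distinction only. It is sequential Python rather than the CPU, GPU or Intel Phi code evaluated in the paper, it does not reproduce the four specific patterns (the abstract does not define them), and every name in it (`Node`, `insert`, `contains`, `multi_query`, `level_order`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Binary search tree node (illustrative; not the paper's data layout)."""
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert a key recursively and return the (possibly new) subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def contains(root: Optional[Node], key: int) -> bool:
    """One serial recursive traversal: a single query walking down the tree."""
    if root is None:
        return False
    if key == root.key:
        return True
    return contains(root.left if key < root.key else root.right, key)


def multi_query(root: Optional[Node], keys: List[int]) -> List[bool]:
    """Family 1: many independent serial traversals.

    Each query is independent of the others, so on a many-core device
    each one could be assigned to its own thread. Here they simply run
    one after another.
    """
    return [contains(root, k) for k in keys]


def level_order(root: Optional[Node]) -> List[List[int]]:
    """Family 2: a single level-order traversal.

    All nodes in the current frontier are independent of each other, so
    the work inside one level could be done in parallel. Levels
    themselves must be processed in order.
    """
    levels: List[List[int]] = []
    frontier = [root] if root is not None else []
    while frontier:
        levels.append([n.key for n in frontier])
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels
```

In the first family, different queries may follow different paths through the tree, which is one source of the divergent control flow and irregular memory accesses that the study characterizes. In the second, the amount of available parallelism depends on the width of each level.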
