Exploiting Parallelism and Vectorisation in Breadth-First Search for the Intel Xeon Phi

Mireya Paredes, Graham Riley, Mikel Lujan

doi:10.1109/tpds.2019.2927451

Open Access

https://doi.org/10.1109/tpds.2019.2927451

Abstract

Modern applications generate massive amounts of data that are challenging to process or analyse. Graph algorithms have emerged as a solution for the analysis of such data because they can represent the entities participating in the generation of large-scale datasets as vertices and their relationships as edges. Graph analysis algorithms are used for finding patterns within these relationships, aiming to extract information to be further analysed. Breadth-first search (BFS) is one of the main graph search algorithms used for graph analysis, and its optimisation has been widely researched on different parallel computers. However, the parallelisation of BFS has been shown to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, which limit its scalability. This paper investigates the optimisation of BFS on the Xeon Phi (Knights Corner, KNC), a modern parallel architecture provided with an advanced vector processor supporting the AVX-512 instruction set, using a bespoke development framework integrated with the Graph 500 benchmark. In addition, to demonstrate portability, we show results for a direct port of the algorithms to a more recent version of the Xeon Phi (Knights Landing, KNL) and to a Skylake CPU, which supports most of the AVX-512 instruction set. Optimised parallel versions of two high-level algorithms for BFS were created using vectorisation, starting with the conventional top-down BFS algorithm and, building on this, a hybrid BFS algorithm. On the KNC, our best implementations result in speedups of 1.37x (top-down) and 1.37x (hybrid) for a graph with one million vertices, compared to the state of the art. On the KNL and Skylake, the performance is higher than on the KNC. In addition, we show results of our best hybrid algorithm on real-world graphs from the SNAP datasets, with speedups of up to 1.3x on the KNC. Performance on the KNL and Skylake is again higher, demonstrating the robustness and portability of our algorithm. The hybrid BFS algorithm can be further used to speed up other graph analysis algorithms, and the lessons learned from vectorisation can be applied to other algorithms targeting existing and future models of the Xeon Phi and other advanced vector architectures.
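
As a rough illustration of the two high-level algorithms named in the abstract, the following sketch shows a level-synchronous top-down step, a bottom-up step, and a hybrid driver that switches between them. It is a minimal sequential sketch, not the authors' implementation: the function names, the adjacency-list representation, the assumption of an undirected graph, and the simple `switch_fraction` threshold are all illustrative assumptions, and no parallelism or vectorisation is shown.

```python
def top_down_step(adj, frontier, parent):
    """Expand the frontier: every frontier vertex claims its unvisited neighbours."""
    next_frontier = []
    for u in frontier:
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                next_frontier.append(v)
    return next_frontier


def bottom_up_step(adj, frontier, parent):
    """Every unvisited vertex looks for any neighbour that is in the frontier."""
    in_frontier = set(frontier)
    next_frontier = []
    for v in range(len(adj)):
        if parent[v] != -1:
            continue
        for u in adj[v]:
            if u in in_frontier:
                parent[v] = u
                next_frontier.append(v)
                break  # one parent is enough, so stop scanning early
    return next_frontier


def hybrid_bfs(adj, source, switch_fraction=0.05):
    """Return a parent array; -1 marks vertices unreachable from source.

    switch_fraction is a hypothetical tuning knob: bottom-up is used whenever
    the frontier holds more than that fraction of all vertices.
    """
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    while frontier:
        if len(frontier) > switch_fraction * n:
            frontier = bottom_up_step(adj, frontier, parent)
        else:
            frontier = top_down_step(adj, frontier, parent)
    return parent
```

Setting `switch_fraction` to 1.0 or higher forces pure top-down traversal, which gives a convenient baseline for checking that both directions reach the same vertices.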

Highlights

Modern applications process impressive amounts of data.
This paper presents the vectorisation of the bottom-up approach of the hybrid Breadth-First Search (BFS) algorithm, which at first sight is not vector friendly.
The contributions of this work are, first, a systematic analysis on the KNC of the performance of the vectorised version of the bottom-up BFS algorithm, based on hardware performance counters read using the Performance Application Programming Interface (PAPI) library, achieving a maximum speedup of 33 percent for a graph size of one million vertices.
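
To make the remark about vector friendliness more concrete, the sketch below emulates in plain Python how a bottom-up step could be reorganised so that neighbours are tested in fixed-width chunks, in the way a vector unit would test several lanes at once. This is only an assumption-laden emulation, not the paper's kernel: `LANES = 16` assumes sixteen 32-bit lanes in a 512-bit register, the frontier is held as a Python integer bitmap, and `bottom_up_step_chunked` is a hypothetical name. One plausible reason the scalar bottom-up loop resists vectorisation is its early exit on the first parent found; the chunked form moves that exit to chunk boundaries.

```python
LANES = 16  # assumption: sixteen 32-bit lanes, as in a 512-bit vector register


def bottom_up_step_chunked(adj, frontier_bits, parent):
    """One bottom-up level with neighbours tested LANES at a time.

    frontier_bits is an integer bitmap: bit u is set if vertex u is in the
    current frontier. Returns the bitmap of the next frontier and updates
    parent in place.
    """
    next_bits = 0
    for v in range(len(adj)):
        if parent[v] != -1:
            continue  # already visited
        neighbours = adj[v]
        for start in range(0, len(neighbours), LANES):
            chunk = neighbours[start:start + LANES]
            # Emulates a gather plus bit test: one frontier bit per lane.
            mask = [(frontier_bits >> u) & 1 for u in chunk]
            if any(mask):
                lane = mask.index(1)  # first lane whose neighbour is in the frontier
                parent[v] = chunk[lane]
                next_bits |= 1 << v
                break  # the early exit now happens per chunk, not per neighbour
    return next_bits
```

In a real vector implementation the per-lane list comprehension would correspond to a gather of bitmap words followed by a masked comparison; here it only shows the data flow.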

Summary

Introduction

Modern applications process impressive amounts of data. Graph analysis has emerged as a key area for the analysis of this data, as graphs can represent entities as vertices and their relationships as edges. It is common to look for patterns within these relationships, aiming to extract information to be further analysed. Breadth-First Search (BFS) is one of the main graph search algorithms used for graph analysis, and its optimisation has been widely researched on different parallel and distributed systems. These studies have shown the parallelisation of BFS to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, which limit its scalability. Only 6 papers (see Table 1) have looked at recent parallel architectures using advanced vector units, e.g., SIMD Intel AVX-512.

Journal: IEEE Transactions on Parallel and Distributed Systems
Publication Date: Oct 15, 2019
Citations: 18
License type: CC BY 4.0
