CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube

Cheng Qian, Zhiying Wang, Hui Guo, Libo Huang, Bruce Childers

doi:10.3390/electronics7110307

Open Access

Abstract

Graph traversal is widely used in map routing, social network analysis, causal discovery and many other applications. Because it is a memory-bound process, graph traversal puts significant pressure on the memory subsystem. Due to poor spatial locality and the increasing size of today’s datasets, graph traversal consumes an ever-larger part of application execution time. One way to mitigate this cost is memory prefetching, which issues requests from the processor to the memory in anticipation of needing certain data. However, traditional prefetching does not work well for graph traversal due to data dependencies, the parallel nature of graphs and the need to move vast amounts of data from memory to the caches. In this paper, we propose a compressed sparse row representation-based graph accelerator on the Hybrid Memory Cube (HMC), called CGAcc. CGAcc combines the Compressed Sparse Row (CSR) graph representation with in-memory prefetching and processing to improve the performance of graph traversal. Our approach integrates the prefetching and processing in the logic layer of a 3D stacked Dynamic Random-Access Memory (DRAM) architecture, based on Micron’s HMC. We selected HMC to implement CGAcc because it provides high bandwidth and low access latency. Furthermore, this device has multiple DRAM layers connected to internal logic that controls memory access and performs rudimentary computation. Using the CSR representation, CGAcc deploys prefetchers in the HMC to exploit the short transaction latency between the logic and DRAM layers, which also avoids large data movement costs. At runtime, CGAcc pipelines its prefetches from the DRAM arrays to improve memory-level parallelism. To further reduce access latency, several optimized internal caches hold the prefetched data to be Processed In-Memory (PIM). A comprehensive evaluation shows the effectiveness of CGAcc: compared to a conventional HMC main memory equipped with a stream prefetcher, CGAcc achieved an average 3.51× speedup with moderate hardware cost.
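To make the access pattern behind the abstract concrete, the following is a minimal software sketch of BFS over a CSR graph. It is not CGAcc's hardware design. The array names `offsets`, `neighbors` and `visited`, and the helper names `build_csr` and `bfs_csr`, are assumptions chosen for illustration; the abstract does not name them. The comments mark the chain of dependent reads (per-vertex offsets, then the edge list, then per-vertex state) that a conventional stream prefetcher cannot predict.

```python
from collections import deque


def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, neighbors) from a directed edge list."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1            # count out-edges per vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]     # prefix sum gives row start positions
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()         # next free slot for each vertex
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors


def bfs_csr(offsets, neighbors, source):
    """Return vertices in BFS visiting order, starting from `source`."""
    visited = [False] * (len(offsets) - 1)
    visited[source] = True
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        start, end = offsets[v], offsets[v + 1]   # read 1: per-vertex offsets
        for i in range(start, end):
            u = neighbors[i]                      # read 2: depends on read 1
            if not visited[u]:                    # read 3: depends on read 2
                visited[u] = True
                queue.append(u)
    return order


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
offsets, neighbors = build_csr(5, edges)
print(bfs_csr(offsets, neighbors, 0))
```

Each neighbor lookup cannot be issued until the preceding offset read returns, which is why shortening the latency between the requester and the DRAM arrays matters for this workload.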

Highlights

We live in the Big Data era, and today’s memory-bound applications put even more pressure on the memory system than before, forcing memory techniques to advance and new approaches to be adopted.
Memory accesses are split into three parts according to their target array, and each part is assigned to a corresponding prefetcher.
To illustrate the impact of this optimization, we compared performance with and without it.

Summary

Introduction

We live in the Big Data era, and today’s memory-bound applications put even more pressure on the memory system than before, forcing memory techniques to advance and new approaches to be adopted. Prefetching and parallel architectures both rely on high bandwidth, but neither can substantially improve the performance of an application with an irregular access pattern, so such an application cannot benefit from that bandwidth. These applications perform poorly because of their unpredictable access patterns, poor spatial locality and, in some cases, data-dependent accesses.
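As a small illustration of what "irregular" means here, the sketch below compares the index strides of a sequential scan with those of a data-dependent lookup, where the indices touched come from a neighbor list. The graph, the array names and the helpers `access_strides` and `neighbor_lookups` are hypothetical and chosen only for this example.

```python
def access_strides(indices):
    """Absolute distance between consecutive accessed indices."""
    return [abs(b - a) for a, b in zip(indices, indices[1:])]


def neighbor_lookups(offsets, neighbors, frontier):
    """Indices of a per-vertex array touched while scanning a frontier."""
    touched = []
    for v in frontier:
        for i in range(offsets[v], offsets[v + 1]):
            touched.append(neighbors[i])   # the index is itself loaded data
    return touched


# Hypothetical 8-vertex graph in CSR form: only vertices 0 and 1 have edges.
offsets = [0, 3, 6, 6, 6, 6, 6, 6, 6]
neighbors = [5, 2, 7, 6, 0, 3]

sequential = list(range(6))
irregular = neighbor_lookups(offsets, neighbors, [0, 1])

print(access_strides(sequential))   # constant stride: easy to prefetch
print(access_strides(irregular))    # varying stride: no pattern to learn
```

A stride-based or stream prefetcher can follow the first sequence but has no fixed stride to lock onto in the second, because the next index is only known once the previous load completes.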

Objectives

Results

Conclusion