Large-Scale Graph Processing on FPGAs with Caches for Thousands of Simultaneous Misses

Mikhail Asiatici,Paolo Ienne

doi:10.1109/isca52012.2021.00054

Abstract

Efficient large-scale graph processing is crucial to many disciplines. Yet, while graph algorithms naturally expose massive parallelism opportunities, their performance is limited by the memory system because of irregular memory accesses. State-of-the-art FPGA graph processors, such as ForeGraph and FabGraph, address the memory issues by using scratchpads and regularly streaming edges from DRAM, but then they end up wasting bandwidth on unneeded data. Yet, where classic caches and scratchpads fail to deliver, FPGAs make powerful unorthodox solutions possible. In this paper, we resort to extreme nonblocking caches that handle tens of thousands of outstanding read misses. They significantly increase the ability of memory systems to coalesce multiple accelerator accesses into fewer DRAM memory requests; essentially, when latency is not the primary concern, they bring the advantages expected from a very large cache at a fraction of the cost. We prove our point with an adaptable graph accelerator running on Amazon AWS f1; our implementation takes into account all practical aspects of such a design, including the challenges involved when working with modern multidie FPGAs. Running classic algorithms (PageRank, SCC, and SSSP) on large graphs, we achieve 3× geometric mean speedup compared to state-of-the-art FPGA accelerators, 1.1-5.8 × higher bandwidth efficiency and 3.0-15.3 × better power efficiency than multicore CPUs, and we support much larger graphs than the state-of-the-art on GPUs.

Full Text