Parallel Prefetching Research Articles

We present GRIP, a graph neural network accelerator architecture designed for low-latency inference. Accelerating GNNs is challenging because they combine two distinct types of computation: arithmetic-intensive <i>vertex-centric</i> operations and memory-intensive <i>edge-centric</i> operations. GRIP splits GNN inference into a three edge- and vertex-centric execution phases that can be implemented in hardware. GRIP specializes each unit for the unique computational structure found in each phase. For vertex-centric phases, GRIP uses a high performance matrix multiply engine coupled with a dedicated memory subsystem for weights to improve reuse. For edge-centric phases, GRIP use multiple parallel prefetch and reduction engines to alleviate the irregularity in memory accesses. Finally, GRIP supports several GNN optimizations, including an optimization called vertex-tiling that increases the reuse of weight data. We evaluate GRIP by performing synthesis and place and route for a <inline-formula><tex-math notation="LaTeX">$28 \;\mathrm{n}\mathrm{m}$</tex-math></inline-formula> implementation capable of executing inference for several widely-used GNN models (GCN, GraphSAGE, G-GCN, and GIN). Across several benchmark graphs, it reduces 99th percentile latency by a geometric mean of <inline-formula><tex-math notation="LaTeX">$17\times$</tex-math></inline-formula> and <inline-formula><tex-math notation="LaTeX">$23\times$</tex-math></inline-formula> compared to a CPU and GPU baseline, respectively, while drawing only <inline-formula><tex-math notation="LaTeX">$5 \;\mathrm{W}$</tex-math></inline-formula> .

External sorting—the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks—is a fundamental subroutine in database systems[G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM ) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K,Section 5.4.9] recently identified SRM (which he calls ``randomized striping'') as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment[TPI]; whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM ), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM (since SRM requires fewer passes). The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files. Since both parallel disk merging and multimedia processing deal with streams that get ``consumed'' at nonuniform and partially predictable rates, our techniques for lookahead based upon forecasting data may have relevance in video server applications.

Parallel Prefetching Research Articles

Related Topics

Articles published on Parallel Prefetching

GRIP: A Graph Neural Network Accelerator Architecture

Parallel Prefetching for Canonical Ensemble Monte Carlo Simulations.

Parallel Prefetching for Canonical Ensemble Monte Carlo Simulations.

CAPre: Code-Analysis based Prefetching for Persistent Object Stores

A Simple and Efficient Parallel Disk Mergesort

Near-Optimal Parallel Prefetching and Caching

Run Placement Policies for Concurrent Mergesorts Using Parallel Prefetching

A trace-driven comparison of algorithms for parallel prefetching and caching

Integrated parallel prefetching and caching

Performance comparison of thrashing control policies for concurrent Mergesorts with parallel prefetching

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Parallel Prefetching Research Articles

Related Topics

Articles published on Parallel Prefetching

GRIP: A Graph Neural Network Accelerator Architecture

Parallel Prefetching for Canonical Ensemble Monte Carlo Simulations.

Parallel Prefetching for Canonical Ensemble Monte Carlo Simulations.

CAPre: Code-Analysis based Prefetching for Persistent Object Stores

A Simple and Efficient Parallel Disk Mergesort

Near-Optimal Parallel Prefetching and Caching

Run Placement Policies for Concurrent Mergesorts Using Parallel Prefetching

A trace-driven comparison of algorithms for parallel prefetching and caching

Integrated parallel prefetching and caching

Performance comparison of thrashing control policies for concurrent Mergesorts with parallel prefetching