Memory Hierarchy Performance Research Articles

Graph convolutional neural networks (GCNs) are emerging neural networks for graph structures that include large features associated with each vertex. The operations of GCN can be divided into two phases - aggregation and combination. While the combination just performs matrix multiplications using trained weights and aggregated features, the aggregation phase requires graph traversal to collect features from adjacent vertices. Even though neural network applications rely on GPU’s massively parallel processing, GCN aggregation kernels exhibit rather low performance since graph processing using compressed graph structures provokes frequent irregular accesses in GPUs. In order to investigate the performance hurdles of GCN aggregation on GPU, we perform an in-depth analysis of the aggregation kernels using real GPU hardware and a cycle-accurate GPU simulator. We first analyze the characteristics of the popular graph datasets used for GCN studies. We reveal the fractions of non-zero elements in feature vectors are diverse among datasets. Based on the observation, we build two types of aggregation kernels that handle uncompressed and compressed feature vectors. Our evaluation exhibits the performance of aggregation can be significantly influenced by kernel design approaches and feature density. We also analyze the individual loads that access the data arrays of the aggregation kernels to specify critical loads. Our analysis reveals the performance of GPU memory hierarchy is influenced by access patterns and feature size of graph datasets. Based on our observations we discuss possible kernel design approaches and architectural ideas that can improve the performance of GCN aggregation.

Read full abstract

With the emergence of highly multithreaded architectures, performance monitoring techniques face new challenges in efficiently locating sources of performance discrepancies in the program source code. For example, the state-of-the-art performance counters in highly multithreaded graphics processing units (GPUs) report only the overall occurrences of microarchitecture events at the end of program execution. Furthermore, even if supported, any fine-grained sampling of performance counters will distort the actual program behavior and will make the sampled values inaccurate. On the other hand, it is difficult to achieve high resolution performance information at low sampling rates in the presence of thousands of concurrently running threads. In this paper, we present a novel software-based approach for monitoring the memory hierarchy performance in highly multithreaded general-purpose graphics processors. The proposed analysis is based on memory traces collected for snapshots of an application execution. A trace-based memory hierarchy model with a Monte Carlo experimental methodology generates statistical bounds of performance measures without being concerned about the exact inter-thread ordering of individual events but rather studying the behavior of the overall system. The statistical approach overcomes the classical problem of disturbed execution timing due to fine-grained instrumentation. The approach scales well as we deploy an efficient parallel trace collection technique to reduce the trace generation overhead and a simple memory hierarchy model to reduce the simulation time. The proposed scheme also keeps track of individual memory operations in the source code and can quantify their efficiency with respect to the memory system. A cross-validation of our results shows close agreement with the values read from the hardware performance counters on an NVIDIA Tesla C2050 GPU. Based on the high resolution profile data produced by our model we optimized memory accesses in the sparse matrix vector multiply kernel and achieved speedups ranging from 2.4 to 14.8 depending on the characteristics of the input matrices.

Read full abstract

Memory Hierarchy Performance Research Articles

Related Topics

Articles published on Memory Hierarchy Performance

Analyzing GCN Aggregation on GPU

Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP.

Revisiting loop fusion in the polyhedral framework

Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors

An Empirical and Architectural Study of Using an SSD-Aware Hybrid Storage System to Improve the Performance of the Data Intensive Applications

Modèles de stockage orientés interrogation pour bases de données temporelles

Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software

Memory hierarchy performance measurement of commercial dual-core desktop processors

Model-guided empirical tuning of loop fusion

Precise automatable analytical modeling of the cache behavior of codes with indirections

Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion

Array control for high-performance SIMD systems

A compiler tool to predict memory hierarchy performance of scientific codes

Tiling, block data layout, and memory hierarchy performance

MisSPECulation

Probabilistic miss equations: evaluating memory hierarchy performance

Using cache line coloring to perform aggressive procedure inlining

Achieving high instruction cache performance with an optimizing compiler

Possible uses of charge-transfer devices and magnetic-domain devices in memory hierarchies

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Memory Hierarchy Performance Research Articles

Related Topics

Articles published on Memory Hierarchy Performance

Analyzing GCN Aggregation on GPU

Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP.

Revisiting loop fusion in the polyhedral framework

Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors

An Empirical and Architectural Study of Using an SSD-Aware Hybrid Storage System to Improve the Performance of the Data Intensive Applications

Modèles de stockage orientés interrogation pour bases de données temporelles

Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software

Memory hierarchy performance measurement of commercial dual-core desktop processors

Model-guided empirical tuning of loop fusion

Precise automatable analytical modeling of the cache behavior of codes with indirections

Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion

Array control for high-performance SIMD systems

A compiler tool to predict memory hierarchy performance of scientific codes

Tiling, block data layout, and memory hierarchy performance

MisSPECulation

Probabilistic miss equations: evaluating memory hierarchy performance

Using cache line coloring to perform aggressive procedure inlining

Achieving high instruction cache performance with an optimizing compiler

Possible uses of charge-transfer devices and magnetic-domain devices in memory hierarchies