Enable back memory and global synchronization on LLC buffer

Licheng Yu, Tiefei Zhang, Yulong Pei, Xueqing Lou, Minghui Wu, Tianzhou Chen

doi:10.1007/s11227-017-2093-8

Abstract

The last-level cache (LLC) shared by heterogeneous processors such as the CPU and the general-purpose graphics processing unit (GPGPU) brings new opportunities to optimize data sharing among them. Previous work introduced the LLC buffer, which uses part of the LLC storage as a FIFO buffer to enable data sharing between CPU and GPGPU with negligible management overhead. However, the baseline LLC buffer has limited capacity and can deadlock when the buffer is full. It also relies on inefficient CPU kernel relaunch and high-overhead atomic operations on the GPGPU for global synchronization. These limitations motivate us to enable back memory and global synchronization on the baseline LLC buffer to make it more practical. The back memory divides the buffer storage into two levels, an LLC level and a memory level. While the two levels are managed as a single queue, the data storage in each level is managed as an individual circular buffer. Data are redirected to the memory level when the LLC level is full, and are loaded back to the LLC level when it has free space. An n-queen case study shows that the LLC buffer with back memory performs comparably to an LLC buffer with an infinite LLC level. In contrast, the LLC buffer without back memory exhibits a 10% performance degradation caused by buffer space contention. Global synchronization is enabled by peeking at the data about to be read from the buffer. Any request to read data placed in the LLC buffer after the global barrier is allowed only when all the threads have reached the barrier. We adopt breadth-first search (BFS) as a case study and compare the LLC buffer with an optimized implementation of BFS on GPGPU. The results show that the LLC buffer achieves a speedup of 1.70 on average. The global synchronization time on GPGPU and CPU is decreased to 38 and 60–5%, respectively.
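The two mechanisms described in the abstract can be illustrated with a small software model. The sketch below is only an assumption-laden toy, not the authors' hardware design: the class name `TwoLevelFifo`, the `BARRIER` marker, and the `thread_id` argument are all hypothetical, and Python deques stand in for the per-level circular buffers. It shows one way a single FIFO can spill into a memory level when the LLC level is full, refill the LLC level as space frees up, and hold back reads past a barrier until every thread has arrived.

```python
from collections import deque

# Hypothetical marker standing in for a global barrier entry in the queue.
BARRIER = object()


class TwoLevelFifo:
    """Toy model: one logical FIFO split into an LLC level and a memory level."""

    def __init__(self, llc_capacity, num_threads):
        self.llc_capacity = llc_capacity
        self.num_threads = num_threads
        self.llc = deque()     # head of the queue (fast, bounded level)
        self.memory = deque()  # overflow tail (back memory level)
        self.arrived = set()   # threads waiting at the barrier at the head

    def push(self, item):
        # Write to the LLC level only if nothing is queued in memory;
        # otherwise newer data would overtake older data.
        if not self.memory and len(self.llc) < self.llc_capacity:
            self.llc.append(item)
        else:
            self.memory.append(item)

    def _refill(self):
        # Load spilled data back whenever the LLC level has free space.
        while self.memory and len(self.llc) < self.llc_capacity:
            self.llc.append(self.memory.popleft())

    def pop(self, thread_id):
        """Return the next item, or None if empty or blocked at a barrier."""
        # Peek at the head: data behind a barrier is readable only after
        # all threads have reached that barrier.
        while self.llc and self.llc[0] is BARRIER:
            self.arrived.add(thread_id)
            if len(self.arrived) < self.num_threads:
                return None
            self.llc.popleft()  # every thread arrived: release the barrier
            self.arrived.clear()
            self._refill()
        if not self.llc:
            return None
        item = self.llc.popleft()
        self._refill()
        return item
```

In this model a reader that hits the barrier simply gets `None` and retries later; a real design would stall the request instead. Routing pushes to the memory level whenever it is non-empty is one simple way to keep the two levels behaving as a single queue.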

More From: The Journal of Supercomputing

Similar Papers

Co-Scheduling on Fused CPU-GPU Architectures With Shared Last Level Caches
Marvin Damschen ... Frank Mueller
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | VOL. 37
01 Nov 2018

Last level cache layout remapping for heterogeneous systems
Licheng Yu ... Xueqing Lou
Journal of Systems Architecture | VOL. 87
10 May 2018

Performance-Energy Considerations for Shared Cache Management in a Heterogeneous Multicore Processor
Anup Holey ... Vineeth Mekkat
ACM Transactions on Architecture and Code Optimization | VOL. 12
09 Mar 2015

Highly reliable and low-power nonvolatile cache memory with advanced perpendicular STT-MRAM for high-performance CPU
Hiroki Noguchi ... Kazutaka Ikegami
01 Jun 2014
