An Adaptive Clock Scheme Exploiting Instruction-Based Dynamic Timing Slack for a GPGPU Architecture

Tianyu Jia, Russ Joseph, Yijie Wei, Jie Gu

doi:10.1109/jssc.2020.2979451


Open Access



Journal: IEEE Journal of Solid-State Circuits	Publication Date: Aug 1, 2020
Citations: 16	License type: publisher-specific, author manuscript

Affiliation: Northwestern University

Abstract

This article presents an adaptive clock scheme that exploits instruction-based dynamic timing slack (DTS) in a general-purpose graphics processing unit (GPGPU) architecture. Using the developed transitional static timing analysis, the deterministic DTS of each instruction can be identified at every pipeline stage. A critical path (CP) messenger scheme is designed to monitor the runtime utilization of CPs. Both real-time information on the issued instructions and the CP messengers are used to determine the runtime DTS margin and to guide cycle-by-cycle adjustment of the clock period. To apply the proposed adaptive clock to the GPGPU, a hierarchical clocking scheme is built that comprises a global phase-locked loop (PLL) and a local delay-locked loop (DLL)-based clock generator inside each compute unit (CU). Each CU core thus has its own clock domain with adjustable local clocking. In addition, to exploit the error resilience of neural networks, an elastic pipeline clocking scheme is developed that redistributes the timing margin across pipeline stages for machine learning computations. Measurement results from the open-source GPGPU architecture implemented in a 65 nm CMOS process demonstrate that exploiting the deterministic instruction-based DTS yields up to an 18% performance improvement or, equivalently, a 30% energy saving. The proposed elastic pipeline clocking achieves an additional 8% energy saving with only a small accuracy degradation for neural network inference.
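The cycle-by-cycle period selection described above can be illustrated with a toy model. The sketch below is a simplification under stated assumptions, not the authors' implementation: it assumes a three-stage pipeline, a hypothetical table of per-instruction, per-stage path delays (`STAGE_DELAY_PS`, standing in for the transitional STA results), a hypothetical 50 ps tuning step and 500 ps floor for the local DLL-based clock generator, and a single boolean per cycle standing in for the CP messenger. Each cycle's period is set by the slowest occupied stage, rounded up to the generator step and never longer than the static worst-case period.

```python
from typing import Dict, List, Optional

# Hypothetical per-instruction, per-stage worst-case path delays in picoseconds,
# standing in for what a transitional STA might report. Illustrative values only.
STAGE_DELAY_PS: Dict[str, Dict[str, int]] = {
    "add": {"decode": 680, "execute": 800, "writeback": 600},
    "mul": {"decode": 700, "execute": 1000, "writeback": 600},
    "mov": {"decode": 640, "execute": 530, "writeback": 600},
}
STATIC_PERIOD_PS = 1000  # conventional worst-case (static) clock period
STEP_PS = 50             # assumed tuning step of the local DLL-based clock generator
MIN_PERIOD_PS = 500      # assumed shortest period the generator can produce

Occupancy = Dict[str, Optional[str]]  # pipeline stage -> instruction (None = bubble)

# Hypothetical trace: three instructions flowing through the three stages.
TRACE: List[Occupancy] = [
    {"decode": "add", "execute": None, "writeback": None},
    {"decode": "mul", "execute": "add", "writeback": None},
    {"decode": "mov", "execute": "mul", "writeback": "add"},
    {"decode": None, "execute": "mov", "writeback": "mul"},
    {"decode": None, "execute": None, "writeback": "mov"},
]


def cycle_period_ps(occupancy: Occupancy, cp_messenger: bool = False) -> int:
    """Pick one cycle's clock period from the instructions currently in flight."""
    if cp_messenger:
        # The (modeled) CP messenger flags a critical path: use the full period.
        return STATIC_PERIOD_PS
    # The slowest occupied stage bounds the period for this cycle.
    worst = max(
        (STAGE_DELAY_PS[instr][stage]
         for stage, instr in occupancy.items() if instr is not None),
        default=MIN_PERIOD_PS,
    )
    # Round up to the generator's step size (integer ceiling division).
    quantized = -(-worst // STEP_PS) * STEP_PS
    return max(MIN_PERIOD_PS, min(STATIC_PERIOD_PS, quantized))


def run(trace: List[Occupancy], cp_flags: Optional[List[bool]] = None) -> List[int]:
    """Return the per-cycle clock periods chosen for a pipeline trace."""
    flags = cp_flags if cp_flags is not None else [False] * len(trace)
    return [cycle_period_ps(occ, flag) for occ, flag in zip(trace, flags)]


if __name__ == "__main__":
    periods = run(TRACE)
    baseline = STATIC_PERIOD_PS * len(TRACE)
    print(periods)  # [700, 800, 1000, 600, 600]
    print(f"speedup vs. static clock: {baseline / sum(periods):.2f}x")  # 1.35x
```

In this toy trace the adaptive clock finishes in 3700 ps instead of 5000 ps. The numbers are purely illustrative and do not correspond to the measured 18% gain; a real design would also need guardbands for variation and must respect how quickly the clock generator can change its period.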
