Abstract

Most modern processors contain a vector accelerator or internal vector units for the fast computation of large workloads. However, accelerating applications with vector units is difficult because the underlying data parallelism must be exposed explicitly through vector-specific instructions. As a result, vector units are often underutilized or remain idle owing to the challenges of vector code generation. To address this underutilization of existing vector units, we propose the Vector Offloader, which executes scalar programs by treating the vector unit as a scalar operation unit. Using vector masking, an appropriate partition of the vector unit can be dedicated to scalar instructions. To utilize all execution units efficiently, including the vector unit, the Vector Offloader runs the target application concurrently on both the central processing unit (CPU) and the decoupled vector unit by offloading parts of the program to the vector unit. Furthermore, a profile-guided optimization technique determines the offloading ratio that balances the load between the CPU and the vector unit. We implemented the Vector Offloader on a RISC-V infrastructure with a Hwacha vector unit and evaluated it on the PolyBench benchmark suite. Experimental results show that the proposed technique achieves speedups of up to 1.31× over simple CPU-only execution in a field-programmable gate array (FPGA)-level evaluation.
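
To make the concurrent-execution idea concrete, here is a minimal C sketch of ratio-based loop partitioning between a CPU and a decoupled vector unit. It is an illustration only: `run_on_vector_unit` and `vector_add_split` are hypothetical names, the offload interface is given a plain CPU placeholder body so the sketch compiles and runs, and in the paper the ratio `r` is selected by profile-guided optimization rather than supplied by hand.

```c
/* Sketch of the Vector Offloader's load-balancing idea: split one
 * data-parallel loop between the CPU and a decoupled vector unit
 * according to an offloading ratio r.  All names are illustrative. */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for dispatching a loop range to the vector
 * unit; a real implementation would issue an asynchronous offload. */
static void run_on_vector_unit(float *c, const float *a, const float *b,
                               size_t begin, size_t end) {
    for (size_t i = begin; i < end; i++)    /* placeholder CPU loop */
        c[i] = a[i] + b[i];
}

/* r is the fraction of iterations offloaded (0..1); the paper derives
 * it from profiling rather than hard-coding it. */
static void vector_add_split(float *c, const float *a, const float *b,
                             size_t n, double r) {
    size_t split = (size_t)((1.0 - r) * n); /* CPU takes the first part */

    /* Issue the offloaded tail first so a real, non-blocking offload
     * would overlap with the CPU loop below. */
    run_on_vector_unit(c, a, b, split, n);

    for (size_t i = 0; i < split; i++)      /* CPU share */
        c[i] = a[i] + b[i];
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    vector_add_split(c, a, b, 8, 0.5);      /* offload half the loop */
    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```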

Highlights

  • As many emerging workloads—such as physics simulations and vision applications—have become more complex, vast, and diverse, they can no longer be handled efficiently by a simple, single central processing unit (CPU)

  • We propose a method of utilizing vector units as simple scalar processing units, similar to general arithmetic logic units (ALUs) inside CPU cores (see the sketch after this list)

  • Most modern processors are equipped with vector units to exploit data-level parallelism
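
The highlight on using vector units as scalar processing units can be sketched with the standard RISC-V Vector (RVV) C intrinsics. Note the assumption: the paper's implementation targets Hwacha, which has its own custom vector ISA, so RVV is used here only as a widely available approximation of the vector-masking idea.

```c
/* Sketch: execute one scalar add on a vector unit by masking off every
 * lane except lane 0.  Uses standard RVV intrinsics as a stand-in for
 * Hwacha's custom ISA.  Compile with an RVV-enabled toolchain, e.g.
 * clang --target=riscv64 -march=rv64gcv. */
#include <riscv_vector.h>
#include <stdint.h>

int32_t scalar_add_on_vector_unit(int32_t a, int32_t b) {
    size_t vl = __riscv_vsetvlmax_e32m1();  /* full vector length */

    /* One-hot mask: only element 0 is active. */
    vuint32m1_t idx   = __riscv_vid_v_u32m1(vl);
    vbool32_t   lane0 = __riscv_vmseq_vx_u32m1_b32(idx, 0, vl);

    /* Broadcast the scalar operands into vector registers. */
    vint32m1_t va = __riscv_vmv_v_x_i32m1(a, vl);
    vint32m1_t vb = __riscv_vmv_v_x_i32m1(b, vl);

    /* Masked add: only lane 0 computes, leaving the remaining lanes
     * free, which is the partitioning opportunity the paper exploits. */
    vint32m1_t vc = __riscv_vadd_vv_i32m1_m(lane0, va, vb, vl);

    /* Move lane 0 back to a scalar register. */
    return __riscv_vmv_x_s_i32m1_i32(vc);
}
```

Setting the vector length to 1 via vsetvl would have a similar effect; masking is shown here because it is the mechanism the abstract names.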

Introduction

As emerging workloads such as physics simulations and vision applications have become more complex, vast, and diverse, they can no longer be handled efficiently by a single central processing unit (CPU). To supplement limited single-thread performance, accelerators such as vector processors, graphics processing units (GPUs), and neural processing units (NPUs) are often employed. Vector processors have become an essential part of modern computing because they can handle large workloads effectively through tight integration with existing CPUs. Vector processing is a parallel computing technique that exploits data parallelism using the single-instruction multiple-data (SIMD) scheme; it requires SIMD-capable hardware and extensive software support to utilize that hardware. SIMD exploits data-level parallelism (DLP) to run data-parallel code regions with high energy and area efficiency [15]. However, because of their low programmability and inflexible memory access patterns, SIMD units are often underutilized or not used at all. Maven [17], an earlier version of Hwacha [12], was proposed as a hybrid architecture between traditional vector machines and GPUs.
