Abstract

Domain-specific architectures (DSAs), or hardware accelerators, are among the key innovations leading computer architecture into a new golden age. In a heterogeneous system, these tailored processors (accelerators) are managed by, and can work in parallel with, general-purpose CPUs through a high-speed input/output (I/O) bus or a System-on-Chip (SoC) bus. However, the high communication overhead makes such loosely coupled architectures unsuitable for small-scale or low-latency tasks. Integrating accelerators into the CPU pipeline as functional units can significantly reduce interaction latency. However, this harms the performance of the CPU microarchitecture and increases the design and verification complexity of the processor, so such tightly coupled architectures are suitable only for very simple tasks. Moreover, because specialized hardware accelerators and general-purpose CPUs follow different design principles, the speedup (or utilization) of a tightly coupled accelerator is limited.

In this paper, we propose TCADer, a novel tightly coupled accelerator design framework for heterogeneous systems based on hardware/software co-design. It has the following key features: (1) it provides a software runtime and a hardware integration environment with low communication overhead for various accelerators, which is especially effective for fine-grained offloading of small-scale or low-latency tasks; (2) it includes a coprocessor management unit that takes over complex memory accesses, so that the accelerator and the CPU operate independently while interacting with extremely low latency; and (3) it provides a lightweight runtime environment, called the fence model, that supports accelerators, which typically run in user mode. Using TCADer, we implement five different types of accelerators and integrate them into a real processor. Experimental results show that TCADer obtains the advantages of both tightly and loosely coupled integration methods while avoiding their disadvantages.
More specifically, compared with the loosely coupled method, TCADer reduces communication overhead by 98.2%, bringing it close to that of the tightly coupled method. Compared with the tightly coupled method, TCADer achieves a 2.76x speedup. TCADer is also programmable, and further evaluations show that it is well suited to small-scale tasks with frequent interaction and can also exploit potential fine-grained parallelism within computing tasks.
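A rough illustration of why lower communication overhead matters for fine-grained offloading uses a simplified first-order model. This model is an assumption for illustration and is not the paper's own analysis. Let t_cpu be a task's execution time on the CPU, t_comm a fixed per-offload communication overhead, and S > 1 the accelerator's speedup on the computation itself. The model ignores any overlap between CPU and accelerator work. Under it, offloading pays off only when:

```latex
\begin{aligned}
t_{\mathrm{comm}} + \frac{t_{\mathrm{cpu}}}{S} &< t_{\mathrm{cpu}} \\
\iff \quad t_{\mathrm{cpu}} &> t^{*} = t_{\mathrm{comm}} \cdot \frac{S}{S-1}
\end{aligned}
```

The break-even task size t* scales linearly with t_comm. Under this model, a 98.2% reduction in communication overhead therefore lowers t* to 1.8% of its loosely coupled value. For example, with S = 2 and t_comm taken as the loosely coupled overhead, the threshold falls from 2 t_comm to 0.036 t_comm. Smaller tasks then become worth offloading.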
