Abstract
The performance of a hardware accelerator controlled by direct memory access (DMA) is strongly influenced by the communication bandwidth to and from DRAM over on-chip buses. This paper proposes a novel performance estimation algorithm to optimize communication schemes (CSs), which are defined by the number of direct memory access controllers (DMACs) and the bank allocation of DRAM. To facilitate the optimization of CSs, a communication primitive (CP) is defined by the bank allocation and the set of activated DMACs. Using the communication bandwidths of CPs obtained from prior full-system simulations, the proposed algorithm predicts the communication performance of CSs more accurately than conventional performance estimation algorithms. When applied to convolutional neural networks (CNNs) and wireless communications (LDPC-coded MIMO-OFDM), the estimation error is measured to be no more than 6.4% and 5%, respectively. In addition, compared with conventional simulation-based approaches, the proposed estimation algorithm provides a speedup of two orders of magnitude. The algorithm is used to optimize the CS of the CNNs and to explore a design space characterized by bank interleaving, outstanding transactions, layer shape, tile size, and hardware frequency. The optimized CS is shown to improve communication performance by up to 68% for the third convolutional layer of AlexNet and by up to 60% for the MIMO of LDPC-coded MIMO-OFDM. In addition, the DRAM latency is minimized by setting the bank interleaving to the number of outstanding transactions. Moreover, the simulation results show that the optimum CS depends on the application, and that the use of an extra DMAC does not necessarily improve the communication performance.
Highlights
Recently, hardware accelerators have become one of the most effective solutions in many areas such as machine learning.
Once the communication bandwidths of the communication primitives (CPs) are obtained from prior simulations, the performance of any communication scheme (CS) can be estimated on a per direct memory access (DMA) interval basis. Importantly, this requires no additional simulations, since each CS can be expressed as a set of CPs.
To facilitate the optimization of CSs, we propose defining a CP by the number of activated direct memory access controllers (DMACs) and the per-DMAC bank allocation, so that each CS can be expressed as a set of CPs.
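The idea in the highlights above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a deliberately simple model in which a CS is a sequence of DMA intervals, each interval uses one CP, and the time of an interval is the bytes moved divided by that CP's pre-measured bandwidth. The names `DmaInterval`, `estimate_time`, and `cp_bandwidth`, as well as all bandwidth figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A CP is keyed by one bank allocation (a tuple of bank ids) per activated DMAC.
# The number of activated DMACs is the length of the outer tuple.
CP = Tuple[Tuple[int, ...], ...]


@dataclass(frozen=True)
class DmaInterval:
    cp: CP            # which DMACs are active and which banks each one uses
    total_bytes: int  # bytes moved by all active DMACs during this interval


def estimate_time(intervals: List[DmaInterval],
                  cp_bandwidth: Dict[CP, float]) -> float:
    """Estimate the communication time of a CS in seconds.

    Assumed model: each interval takes total_bytes / bandwidth of its CP,
    and the intervals run back to back. Bandwidths are in bytes per second.
    """
    total = 0.0
    for interval in intervals:
        total += interval.total_bytes / cp_bandwidth[interval.cp]
    return total


# Hypothetical CP bandwidths; in practice these would come from prior
# full-system simulations. These numbers are NOT results from the paper.
cp_bandwidth: Dict[CP, float] = {
    ((0, 1),): 2.0e9,         # one DMAC using banks 0 and 1
    ((0,), (1,)): 3.2e9,      # two DMACs, each on its own bank
    ((0, 1), (0, 1)): 2.5e9,  # two DMACs sharing banks 0 and 1
}

# Two candidate CSs that move the same 8 MB in different ways.
cs_a = [DmaInterval(((0, 1),), 4_000_000),
        DmaInterval(((0, 1),), 4_000_000)]
cs_b = [DmaInterval(((0,), (1,)), 8_000_000)]

t_a = estimate_time(cs_a, cp_bandwidth)
t_b = estimate_time(cs_b, cp_bandwidth)
best = "cs_b" if t_b < t_a else "cs_a"
```

Because the per-CP bandwidths are looked up rather than simulated, comparing many candidate CSs costs only a handful of table lookups and divisions per candidate, which is the source of the speedup the abstract describes.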
Summary
Recently, hardware accelerators have become one of the most effective solutions in many areas such as machine learning. This paper deals with the optimization of communication schemes (CSs) for DMA-controlled accelerators, which are characterized by the number of direct memory access controllers (DMACs) and the bank allocation of DRAM. To speed up the evaluation of CSs, we propose a performance estimation algorithm based on communication primitives (CPs), each of which is defined by the number of activated DMACs and the per-DMAC bank allocations. The proposed algorithm is generally applicable to any accelerator equipped with DMAC-controlled local memories; hardware accelerators for convolutional neural networks (CNNs) and wireless communications (LDPC-coded MIMO-OFDM) are considered in this paper as examples.