Abstract

Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational demands of deep learning algorithms. This article proposes TCX, a new instruction set extension for tensor computing that augments Reduced Instruction Set Computer (RISC) instructions with variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC instruction set architectures and provides software compatibility for scalable hardware implementations. We present a tensor accelerator implementation of the tensor extensions using an out-of-order RISC microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register renaming mechanism is described that allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements using tensor dimension registers. Implementations may balance data bandwidth and computation utilization for different types of tensor computations such as element-wise, depthwise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) using a 4,096 multiply-accumulate compute unit. It occupies 12.8 mm² and dissipates 0.46 W/TOPS in TSMC 28-nm technology.
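The quoted throughput figure can be cross-checked from the stated clock and MAC count. A minimal sketch, assuming the common convention that one multiply-accumulate counts as two operations (a multiply and an add):

```python
# Cross-check of the abstract's throughput and power figures.
# Assumption: 1 MAC = 2 operations (multiply + accumulate), a common
# convention in accelerator TOPS accounting.
clock_hz = 1e9      # 1 GHz clock, as stated
mac_units = 4096    # 4,096 multiply-accumulate compute units
ops_per_mac = 2     # assumed convention

tops = clock_hz * mac_units * ops_per_mac / 1e12
print(round(tops, 1))  # → 8.2 (matches the sustained 8.2 TOPS figure)

# The stated efficiency of 0.46 W/TOPS would then imply total power:
watts = 0.46 * tops
print(round(watts, 2))  # → 3.77
```

This is only a consistency check on the abstract's numbers; the paper's own measurement methodology may differ.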
