Abstract

Neural network processors and accelerators are domain-specific architectures designed to meet the high computational demands of deep learning algorithms. This paper proposes TCX, a new instruction set extension for tensor computing that combines RISC-style instructions with variable-length tensor extensions. It features a multidimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC ISAs and provides software compatibility across scalable hardware implementations. We present an implementation of a TCX tensor computing accelerator with an out-of-order microarchitecture. The accelerator scales from several hundred to tens of thousands of computation units. An optimized register renaming mechanism allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements based on tensor dimensions. Implementations may balance data bandwidth and computation utilization for different types of tensor computations, such as element-wise, depth-wise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss across several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera operations per second (TOPS) using a 4096 multiply-accumulate (MAC) compute unit, with up to 98.83% MAC utilization. It occupies 12.8 square millimeters and dissipates 0.46 Watts per TOPS in TSMC 28 nm technology.
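
As a rough illustration of how a variable-length tensor extension of this kind might be driven from software, the sketch below pairs a plain C matrix-multiplication reference with a hypothetical TCX-style instruction sequence in comments. The mnemonics (tdim, tld, tmmul, tst), register names (t0-t2, d0-d1), and operand forms are assumptions made for illustration only; the actual TCX instructions and encodings are defined in the full paper.

    /* Hypothetical sketch: the mnemonics and register names in the comments
     * are illustrative assumptions, not the instruction names defined by the
     * TCX paper. The C code is a plain scalar reference for the matrix
     * multiplication that such an instruction sequence would perform. */
    #include <stdio.h>

    #define M 4
    #define K 8
    #define N 4

    int main(void) {
        float a[M][K], b[K][N], c[M][N] = {{0}};

        /* Fill the inputs with simple test data. */
        for (int i = 0; i < M; i++)
            for (int k = 0; k < K; k++)
                a[i][k] = (float)(i + k);
        for (int k = 0; k < K; k++)
            for (int j = 0; j < N; j++)
                b[k][j] = (float)(k - j);

        /* Conceptually, a TCX-style sequence might look like:
         *   tdim  d0, M, K      ; dimension register describes the A tile
         *   tdim  d1, K, N      ; dimension register describes the B tile
         *   tld   t0, [a], d0   ; load A into a multidimensional register
         *   tld   t1, [b], d1   ; load B into a multidimensional register
         *   tmmul t2, t0, t1    ; tensor matrix multiplication
         *   tst   t2, [c]       ; store the result tile
         * The loops below are the scalar equivalent of that sequence. */
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < K; k++)
                    c[i][j] += a[i][k] * b[k][j];

        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }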
