High-Precision Floating-Point Research Articles

This paper proposes a novel architecture for computing X^Y-like functions based on the QH CORDIC (Quadruple-Step-Ahead Hyperbolic Coordinate Rotation Digital Computer) methodology. The proposed architecture converts the direct computation of X^Y into logarithm, multiplication, and exponential operations. The QH CORDIC methodology is a parallel variant of the traditional CORDIC algorithm. Traditional CORDIC suffers from long latency and large area, while the QH CORDIC has much lower latency. The functions ln x and e^x are computed with the QH CORDIC. Because the QH CORDIC has a limited range of convergence, this paper employs two specific techniques to enlarge that range for ln x and e^x, making it possible to handle high-precision floating-point inputs. A hardware model of the function X^Y using the QH CORDIC is presented in this paper. A reference circuit is designed and synthesized with the TSMC 65 nm standard cell library. The ASIC implementation results show that the proposed architecture has 30 more orders of magnitude of maximum relative error and average relative error than the state-of-the-art. In addition, the proposed architecture is superior to the state-of-the-art in terms of latency, word length, and energy efficiency (power × latency × period / efficient bits).

Deep neural network (DNN) quantization, which converts floating-point (FP) data in the network to integers (INT), is an effective way to shrink the model size for memory saving and to simplify the operations for compute acceleration. Recently, research on DNN quantization has developed from inference to training, laying a foundation for online training on accelerators. However, existing schemes that leave batch normalization (BN) untouched during training are mostly incomplete quantization that still adopts high-precision FP in some parts of the data paths. Currently, there is no solution that can use only low bit-width INT data during the whole training process of large-scale DNNs with acceptable accuracy. In this work, through decomposing all the computation steps in DNNs and fusing three special quantization functions to satisfy the different precision requirements, we propose a unified complete quantization framework termed “WAGEUBN” to quantize DNNs involving all data paths including W (Weights), A (Activation), G (Gradient), E (Error), U (Update), and BN. Moreover, the Momentum optimizer is also quantized to realize a completely quantized framework. Experiments on ResNet18/34/50 models demonstrate that WAGEUBN can achieve competitive accuracy on the ImageNet dataset. For the first time, the study of quantization in large-scale DNNs is advanced to the full 8-bit INT level. In this way, all the operations in training and inference can be bit-wise operations, pushing towards faster processing speed, decreased memory cost, and higher energy efficiency. Our thorough quantization framework has great potential for future efficient portable devices with online learning ability.
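
The basic FP-to-INT conversion that the abstract builds on can be illustrated with a generic symmetric 8-bit quantizer. This is a minimal sketch under assumed conventions (a single max-abs scale per tensor, clipping to [-127, 127]); it is not one of the three quantization functions used in WAGEUBN, and the function names are hypothetical.

```python
def max_abs_scale(values):
    """Assumed scale choice: the largest magnitude maps to 127."""
    m = max(abs(v) for v in values)
    return m / 127 if m > 0 else 1.0

def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers: round(v / scale), clipped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [qi * scale for qi in q]

weights = [0.5, -1.0, 0.25, 0.0]
scale = max_abs_scale(weights)
q = quantize_int8(weights, scale)
approx = dequantize(q, scale)
print(q, approx)
```

With this scheme the reconstruction error of each in-range value is at most half a quantization step, which is the precision budget that a fully quantized training pipeline has to work within.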

Articles published on High-Precision Floating-Point

Low-Power High Precision Floating-Point Divider With Bidimensional Linear Approximation

An 8.8 TFLOPS/W Floating-Point RRAM-Based Compute-in-Memory Macro Using Low Latency Triangle-Style Mantissa Multiplication

Low-Latency and Minor-Error Architecture for Parallel Computing X^Y-like Functions with High-Precision Floating-Point Inputs

Training high-performance and large-scale deep neural networks with full 8-bit integers

A new lower bound for the de Bruijn-Newman constant
