Abstract

Floating-point division is indispensable and increasingly important in many modern applications. To speed up floating-point division in practical microprocessors, this paper proposes a low-latency, multi-precision architecture for floating-point division that meets the IEEE-754 standard. The design has three parts: pre-configuration, mantissa division, and quotient normalization. For mantissa division, a Predict–Correct algorithm based on the fast division algorithm is employed, which yields more partial quotient bits per cycle without consuming much additional circuit area. A detailed analysis supports the guaranteed accuracy per cycle, with no restriction to specific parameters. Synthesis results using the TSMC 90 nm standard cell library show that the proposed architecture has ≈63.6% latency, ≈30.23% total time (latency × period), ≈31.8% total energy (power × latency × period), and ≈44.6% efficient average energy (power × latency × period / efficient length) overhead over the latest floating-point division structure. In terms of latency, the proposed division architecture is much faster than that of several classic processors.
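The three-part split named in the abstract (pre-configuration, mantissa division, quotient normalization) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware: the function name `fp_divide` and the constants are hypothetical, plain integer division stands in for the Predict–Correct mantissa divider, and only finite, nonzero double-precision inputs with normal results are handled.

```python
import math

MANT_BITS = 53   # significand width of IEEE-754 double precision
EXTRA_BITS = 2   # guard bits kept by the mantissa division (assumed value)


def fp_divide(x: float, y: float) -> float:
    """Software model of a three-stage floating-point divide (illustrative only)."""
    if x == 0.0 or y == 0.0 or not (math.isfinite(x) and math.isfinite(y)):
        raise ValueError("sketch handles finite, nonzero operands only")

    # Stage 1, pre-configuration: separate sign, exponent and integer mantissa.
    negative = (x < 0.0) != (y < 0.0)
    mx, ex = math.frexp(abs(x))            # abs(x) == mx * 2**ex, 0.5 <= mx < 1
    my, ey = math.frexp(abs(y))
    ix = int(math.ldexp(mx, MANT_BITS))    # exact integer mantissa
    iy = int(math.ldexp(my, MANT_BITS))

    # Stage 2, mantissa division: integer quotient with guard bits.
    # Plain floor division is a stand-in for the hardware divider.
    shift = MANT_BITS + EXTRA_BITS
    quotient, remainder = divmod(ix << shift, iy)
    # Append a sticky bit so the final rounding sees any discarded remainder.
    quotient = (quotient << 1) | (1 if remainder else 0)

    # Stage 3, normalization: round to 53 bits and restore the exponent.
    # float() rounds the integer to nearest-even; ldexp applies the exponent.
    magnitude = math.ldexp(float(quotient), (ex - ey) - (shift + 1))
    return -magnitude if negative else magnitude
```

For operands whose quotient is a normal double, this model should agree with the built-in `/` operator, for example `fp_divide(1.0, 3.0) == 1.0 / 3.0`.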

Highlights

  • Our research focuses on the architecture design of configurable precision FP arithmetic units

  • Fast division requires hardware with at least one look-up table of size 2m −1 × m bits and three multipliers, including a carry-assimilation multiplier of size (m + 1) × n for the divisor’s initial multiplications and a carry-save multiplier of size (m + 1) × m for the quotient

  • Division to multiplicative iterations rather than subtractive iterations [42], pre-scaling operands [43,44,45], using Fourier division [46,47], using alternative digit codes such as binary-coded decimal (BCD) digits instead of decimal and basic binary digits [48], cascading multiple stages of lower radix dividers [49], overlap…

  • Inspired by the fast division method [18], this paper proposes a Predict–Correct algorithm which will increase iteration speed by bringing about n more quotient bits than fast division without consuming much area


Summary

Introduction

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The literature describes many division algorithms, of which digit recurrence, functional iteration, variable latency, very high radix, and look-up table are five typical classes of implementation [3]. The Digital Equipment Corporation (DEC) Alpha 21164 [16] is one of the best examples of an implementation of the variable-latency class. It is found in [17] that the average number of quotient bits retired in one iteration varies from 2 to 3, depending on the stream of bits in the partial remainder. According to [20], the main difference between SRT and the very high-radix algorithm is that the latter has more complex divisor-multiple processing and quotient-digit selection hardware, which increases the cycle time and area. The proposed architecture is based on the very high-radix algorithm [18], which can produce well over 10 quotient bits in one clock cycle.
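To make the link between radix and iteration count concrete, the following is a minimal sketch of a generic radix-2^k digit-recurrence divider on unsigned integers. It is not the Predict–Correct algorithm or the fast division method of [18]: the function name `divide_radix_2k` is hypothetical, and exact integer division is used as a stand-in for the quotient-digit selection hardware, which in a real design would be an estimate from a table or multiplier.

```python
def divide_radix_2k(dividend: int, divisor: int, n_bits: int, k: int):
    """Divide an n_bits-wide dividend, retiring k quotient bits per iteration.

    Returns (quotient, remainder, iterations).
    """
    if divisor <= 0 or dividend < 0 or k <= 0:
        raise ValueError("expects dividend >= 0, divisor > 0, k > 0")
    if dividend >> n_bits:
        raise ValueError("dividend does not fit in n_bits")

    iterations = -(-n_bits // k)        # ceil(n_bits / k)
    total_bits = iterations * k         # dividend is zero-padded on the left
    digit_mask = (1 << k) - 1

    quotient = 0
    remainder = 0
    for i in range(iterations):
        # Bring down the next k dividend bits, most significant first.
        shift = total_bits - (i + 1) * k
        chunk = (dividend >> shift) & digit_mask
        remainder = (remainder << k) | chunk

        # Quotient-digit selection: exact here, estimated in hardware.
        digit = remainder // divisor    # always < 2**k because remainder < divisor * 2**k
        remainder -= digit * divisor
        quotient = (quotient << k) | digit

    return quotient, remainder, iterations
```

With a 16-bit dividend, `k = 1` takes 16 iterations while `k = 4` takes 4, which is the trade-off described above: a higher radix needs fewer cycles but a more complex digit-selection step per cycle.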

Background
Predict–Correct Algorithm with Accurate Quotient Approximation
Guaranteed Bits per Cycle Using Predict–Correct Algorithm
General Architecture and Main Parts
Part 1 PRECONFIG
Part 3 NORMALIZE
Design
Functional Verification
Related Work and Comparisons
Conclusions