Decimal Floating-point Unit Research Articles

Given the popularity of decimal arithmetic, hardware implementation of decimal operations has been an active research topic in recent decades. Besides the four basic operations, the square root can be implemented directly in hardware as an instruction, which improves the performance of the processor's decimal floating-point unit. Hardware decimal square rooters are usually built on either functional or digit-recurrence algorithms. Functional algorithms require a multiplication per iteration and therefore seem ill-suited to decimal square roots, given the high cost of decimal multipliers. Digit-recurrence square root algorithms, on the other hand, particularly SRT algorithms (named after their creators, Sweeney, Robertson, and Tocher), are simple and well suited to decimal arithmetic. With the aim of reducing the latency of the decimal square root operation while keeping the cost reasonable, this paper proposes an SRT algorithm and the corresponding hardware architecture to compute the decimal square root. The proposed fixed-point square root design requires n+3 cycles to compute an n-digit root; the synthesis results show an area cost of about 31K NAND2 and a cycle time of 40 FO4. These results show that the proposed decimal square root architecture is 14% faster than the fastest previous work (which uses a functional algorithm) while using about a quarter of the area.
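To make the digit-recurrence idea concrete, the sketch below computes a decimal square root one root digit per iteration. It is a plain restoring (non-redundant) recurrence written in software, not the SRT algorithm or hardware architecture proposed in the paper: an SRT design would use a redundant digit set and a residual estimate for digit selection, which this example does not attempt. The function name and the `extra_digits` parameter are illustrative assumptions.

```python
def digit_recurrence_sqrt(radicand, extra_digits=0):
    """Return floor(sqrt(radicand) * 10**extra_digits).

    Illustrative restoring digit recurrence in radix 10: each iteration
    consumes two radicand digits and produces exactly one root digit.
    """
    if radicand < 0:
        raise ValueError("radicand must be non-negative")

    # Scaling by 100**k shifts the root by k decimal digits.
    digits = str(radicand * 100 ** extra_digits)
    if len(digits) % 2:
        digits = "0" + digits  # pad so digits group into pairs

    root = 0       # root digits found so far
    remainder = 0  # partial remainder of the recurrence
    for i in range(0, len(digits), 2):
        remainder = remainder * 100 + int(digits[i:i + 2])
        # Select the largest digit d with (20*root + d)*d <= remainder.
        d = 9
        while (20 * root + d) * d > remainder:
            d -= 1
        remainder -= (20 * root + d) * d
        root = root * 10 + d
    return root


print(digit_recurrence_sqrt(144))    # 12
print(digit_recurrence_sqrt(2, 4))   # 14142, i.e. sqrt(2) ~ 1.4142
```

The digit-selection loop here tries up to ten candidates sequentially; replacing that exact comparison with a cheap selection on a redundant representation is what SRT-style methods are generally designed to do.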

The new generation of high-performance decimal floating-point units (DFUs) demands efficient implementations of parallel decimal multipliers. In this paper, we describe the architectures of two parallel decimal multipliers. Partial products are generated in parallel using signed-digit radix-10 or radix-5 recodings of the multiplier and a simplified set of multiplicand multiples. The partial products are reduced in a tree structure based on a decimal multioperand carry-save addition algorithm that uses unconventional (non-BCD) decimal-coded number systems. We further detail these techniques and present new improvements that reduce the latency of the previous designs, including optimized digit recoders for the generation of 2^n-tuples (and 5-tuples), decimal carry-save adders (CSAs) combining different decimal-coded operands, and carry-free adders implemented with specially designed bit counters. Moreover, we detail a design methodology that combines all these techniques to obtain efficient reduction trees with different area and delay trade-offs for any number of partial products generated. Evaluation results for 16-digit operands show that the proposed architectures have interesting area-delay figures compared to conventional Booth radix-4 and radix-8 parallel binary multipliers and outperform the figures of previous alternatives for decimal multiplication.
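As a rough illustration of signed-digit radix-10 recoding of the multiplier, the sketch below maps each decimal digit into the range -5 to 5 using a carry that depends only on whether the neighbouring lower digit is 5 or more. This is one common formulation, assumed here for illustration; it is a software model, not the recoder circuits described in the paper, and all function names are hypothetical.

```python
def sd_radix10_recode(multiplier):
    """Recode a non-negative integer into digits in [-5, 5].

    Returns the recoded digits least significant first. A digit >= 5 is
    replaced by (digit - 10) and sends a carry of 1 to the next position,
    so each output digit depends on only two input digits.
    """
    if multiplier < 0:
        raise ValueError("multiplier must be non-negative")
    digits = [int(c) for c in reversed(str(multiplier))]
    recoded = []
    carry_in = 0
    for d in digits:
        carry_out = 1 if d >= 5 else 0
        recoded.append(d - 10 * carry_out + carry_in)
        carry_in = carry_out
    recoded.append(carry_in)  # extra top digit, 0 or 1
    return recoded


def multiply_via_recoding(x, y):
    """Sum the partial products d * x * 10**i selected by the recoded digits."""
    return sum(d * x * 10 ** i for i, d in enumerate(sd_radix10_recode(y)))


print(sd_radix10_recode(879))            # [-1, -2, -1, 1]
print(multiply_via_recoding(1234, 879))  # 1084686
```

Because every recoded digit has magnitude at most 5, only the multiples X through 5X of the multiplicand are needed, with negative digits handled by negation, which is the sense in which the set of multiplicand multiples is simplified.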

Articles published on Decimal Floating-point Unit

Improving the area of fast parallel decimal multipliers

Decimal SRT Square Root: Algorithm and Architecture

Redundant Floating-Point Decimal CORDIC Algorithm

RADIX-10 PARALLEL DECIMAL MULTIPLIER

Improved Design of High-Performance Parallel Decimal Multipliers

Decimal floating-point support on the IBM System z10 processor

IBM POWER6 accelerators: VMX and DFU
