Abstract

Using parallel associative reduction, iterative refinement, and conservative early termination detection, we show how to use tree-reduce parallelism to compute correctly rounded floating-point sums in <inline-formula> <tex-math notation="LaTeX">$O(\log N)$</tex-math></inline-formula> depth. Our parallel solution shows how we can continue to exploit the scaling in transistor count to accelerate floating-point performance even when clock rates remain flat. Empirical evidence suggests our iterative algorithm only requires two tree-reduce passes to converge to the accurate sum in virtually all cases. Furthermore, we develop the hardware implementation of two residue-preserving IEEE-754 double-precision floating-point adders on a Virtex 6 FPGA that run at the same 250 MHz pipeline speed as a standard adder. One adder creates the residue by truncation, requires only 22 percent more area than the standard adder, and allows us to support directed-rounding modes and to lower the cost of round-to-nearest modes. The second adder creates the residue while directly producing a round-to-nearest sum at 48 percent more area than a standard adder.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.