Abstract

We present a fast algorithm, together with its low-level implementation, for correctly rounded arbitrary-precision floating-point summation. The arithmetic is the one used by the GNU MPFR library: radix 2; no subnormals; each variable (each input and the output) has its own precision. We also give the worst-case complexity of this algorithm and describe how the implementation is tested.

Highlights

  • In a floating-point system, the summation operation consists in evaluating the sum of several floating-point numbers

  • The IEEE 754 standard for floating-point arithmetic introduced the sum reduction operation in its 2008 revision [1, Clause 9.4], but it specifies nothing beyond the handling of special inputs and exceptions; the only finite result it specifies is that the sum of 0 numbers is +0

  • The IEEE 1788-2015 standard for interval arithmetic goes further by completely specifying this sum operation for IEEE 754 floating-point formats [2, Clause 12.12.12], requiring correct rounding and specifying the sign of an exact zero result, but in a way that is incompatible with IEEE 754-2008: in particular, the result of the sum of 0 numbers is −0 in the roundTowardNegative rounding direction

Summary

INTRODUCTION

In a floating-point system, the summation operation consists in evaluating the sum of several floating-point numbers. We deal with the extension of the summation operation to arbitrary precision in radix 2, where each number has its own precision and results must be correctly rounded, as with the GNU MPFR library [9], where this function is named mpfr_sum.

Comparison with the Demmel/Nguyen summation:

  • Demmel/Nguyen: parallel; fixed precision; error bound involving the maximum of the input numbers; reproducible: yes; floating point

  • mpfr_sum: sequential; arbitrary precision; correct rounding; reproducible: yes, implied by correct rounding; based on arrays of integers

For mpfr_sum, we need to handle cancellation in order to always get an accurate result, which is the main difficulty. The correct-rounding requirement mainly adds more subcases, but it does not introduce additional issues from a theoretical point of view: we will see that guaranteeing a correctly rounded result in the difficult cases (i.e., solving the Table Maker’s Dilemma) is equivalent to computing an accurate sum to a 1-bit target precision.

This paper is based on revision 11319 of sum.c in the trunk of the MPFR repository for MPFR 4.0 (not released yet).
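To make "correctly rounded" concrete, the following Python sketch defines the target result by brute force: it adds the inputs exactly as rationals and then rounds once to p significant bits in radix 2 (round to nearest, ties to even, unbounded exponent). This is only a reference definition for illustration, not the algorithm of the paper; the function names round_to_precision, exact_then_round_sum and naive_sum are hypothetical and do not come from MPFR.

```python
from fractions import Fraction


def round_to_precision(x: Fraction, p: int) -> Fraction:
    """Round x to p significant bits in radix 2 (nearest, ties to even).

    Assumption: the exponent is unbounded, so there is no overflow,
    no underflow and no subnormal range in this sketch.
    """
    if p < 1:
        raise ValueError("precision must be at least 1 bit")
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    a = abs(x)
    # Find the exponent e such that 2**(e-1) <= a < 2**e.
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if a >= Fraction(2) ** e:
        e += 1
    ulp = Fraction(2) ** (e - p)   # spacing of p-bit numbers around a
    scaled = a / ulp               # lies in [2**(p-1), 2**p)
    q = scaled.numerator // scaled.denominator
    r = scaled - q                 # discarded part, in [0, 1)
    if r > Fraction(1, 2) or (r == Fraction(1, 2) and q % 2 == 1):
        q += 1
    return sign * q * ulp


def exact_then_round_sum(values, p: int) -> Fraction:
    """Exact rational sum of the inputs, followed by a single rounding."""
    exact = sum((Fraction(v) for v in values), Fraction(0))
    return round_to_precision(exact, p)


def naive_sum(values) -> float:
    """Left-to-right double-precision accumulation, one rounding per step."""
    total = 0.0
    for v in values:
        total += v
    return total


inputs = [1e100, 1.0, -1e100]
print(naive_sum(inputs))                 # 0.0: the 1.0 is lost to cancellation
print(exact_then_round_sum(inputs, 53))  # 1: the correctly rounded result
```

The exact-rational approach is easy to check but its cost grows with the exponent range of the inputs, which is one reason a dedicated algorithm is needed in practice.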

OVERVIEW OF GMP AND GNU MPFR
NEW ALGORITHM AND IMPLEMENTATION
Preliminary Steps
Notes:
Back to the Generic Case
Complement and Rounding
Exponent Consideration
A Simplified Example
Worst-Case Complexity
TESTING
CONCLUSION