Abstract

Representation of physical quantities in terms of floating–point numbers allows one to cover a very wide dynamic range with a relatively small number of digits. Given this type of representation, roundoff errors are roughly proportional to the amplitude of the represented quantity. In contrast, roundoff errors with uniform quantization are bounded between ±q/2 and are not in any way proportional to the represented quantity. Floating–point is in most cases so advantageous over fixed–point number representation that it is rapidly becoming ubiquitous. The movement toward usage of floating–point numbers is accelerating as the speed of floating–point calculation is increasing and the cost of implementation is going down. For this reason, it is essential to have a method of analysis for floating–point quantization and floating–point arithmetic.

THE FLOATING–POINT QUANTIZER

Binary numbers have become accepted as the basis for all digital computation. We therefore describe floating–point representation in terms of binary numbers. Other number bases are completely possible, such as base 10 or base 16, but modern digital hardware is built on the binary base. The numbers in the table of Fig. 12.1 are chosen to provide a simple example. We begin by counting with nonnegative binary floating–point numbers as illustrated in Fig. 12.1. The counting starts with the number 0, represented here by 00000. Each number is multiplied by 2^E, where E is an exponent. Initially, let E = 0. Continuing the count, the next number is 1, represented by 00001, and so forth.
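The contrast drawn above can be illustrated numerically. The sketch below, a hypothetical illustration not taken from the text, rounds a value to a fixed number of mantissa bits (mimicking a floating–point quantizer) and compares it with a uniform quantizer of step q: the floating–point roundoff error grows with the magnitude of the input, while the uniform quantizer's error stays within ±q/2. The helper names `float_quantize` and `uniform_quantize` are assumptions for this example.

```python
import math

def float_quantize(x, mantissa_bits):
    # Hypothetical helper: round x to `mantissa_bits` bits of mantissa,
    # so the roundoff error scales with the magnitude of x.
    if x == 0:
        return 0.0
    e = math.floor(math.log2(abs(x)))      # exponent E of the input
    scale = 2.0 ** (e - mantissa_bits)     # weight of the last mantissa bit
    return round(x / scale) * scale

def uniform_quantize(x, q):
    # Uniform (fixed-point) quantizer with step q; error is bounded by q/2
    # regardless of the magnitude of x.
    return round(x / q) * q

# Floating-point roundoff error grows roughly in proportion to |x| ...
for x in (1.234567, 123.4567, 12345.67):
    err = abs(float_quantize(x, 8) - x)
    print(f"x = {x:>10}   floating-point error = {err:.6f}")

# ... while the uniform quantizer's error stays within +/- q/2.
q = 0.01
for x in (1.234567, 123.4567, 12345.67):
    err = abs(uniform_quantize(x, q) - x)
    print(f"x = {x:>10}   uniform error = {err:.6f}  (bound q/2 = {q/2})")
```

Running this shows the floating–point error increasing by roughly a factor of the dynamic range between the smallest and largest inputs, while the uniform error stays below 0.005 in every case.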
