| @cindex IEEE floating point |
| |
| This chapter describes functions for examining the representation of |
| floating point numbers and controlling the floating point environment of |
| your program. The functions described in this chapter are declared in |
| the header file @file{gsl_ieee_utils.h}. |
| |
| @menu |
| * Representation of floating point numbers:: |
| * Setting up your IEEE environment:: |
| * IEEE References and Further Reading:: |
| @end menu |
| |
| @node Representation of floating point numbers |
| @section Representation of floating point numbers |
| @cindex IEEE format for floating point numbers |
| @cindex bias, IEEE format |
| @cindex exponent, IEEE format |
| @cindex sign bit, IEEE format |
| @cindex mantissa, IEEE format |
| The IEEE Standard for Binary Floating-Point Arithmetic defines binary |
| formats for single and double precision numbers. Each number is composed |
| of three parts: a @dfn{sign bit} (@math{s}), an @dfn{exponent} |
| (@math{E}) and a @dfn{fraction} (@math{f}). The numerical value of the |
| combination @math{(s,E,f)} is given by the following formula, |
| @tex |
| \beforedisplay |
| $$ |
| (-1)^s (1 \cdot fffff\dots) 2^E |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| (-1)^s (1.fffff...) 2^E |
| @end example |
| |
| @end ifinfo |
| @noindent |
| @cindex normalized form, IEEE format |
| @cindex denormalized form, IEEE format |
| The sign bit is either zero or one. The exponent ranges from a minimum value |
| @c{$E_{min}$} |
| @math{E_min} |
| to a maximum value |
| @c{$E_{max}$} |
| @math{E_max} depending on the precision. The exponent is converted to an |
| unsigned number |
| @math{e}, known as the @dfn{biased exponent}, for storage by adding a |
| @dfn{bias} parameter, |
| @c{$e = E + \hbox{\it bias}$} |
| @math{e = E + bias}. |
| The sequence @math{fffff...} represents the digits of the binary |
| fraction @math{f}. The binary digits are stored in @dfn{normalized |
| form}, by adjusting the exponent to give a leading digit of @math{1}. |
| Since the leading digit is always 1 for normalized numbers it is |
| assumed implicitly and does not have to be stored. |
| Numbers smaller than |
| @c{$2^{E_{min}}$} |
| @math{2^(E_min)} |
| are be stored in @dfn{denormalized form} with a leading zero, |
| @tex |
| \beforedisplay |
| $$ |
| (-1)^s (0 \cdot fffff\dots) 2^{E_{min}} |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| (-1)^s (0.fffff...) 2^(E_min) |
| @end example |
| |
| @end ifinfo |
| @noindent |
| @cindex zero, IEEE format |
| @cindex infinity, IEEE format |
| This allows gradual underflow down to |
| @c{$2^{E_{min} - p}$} |
| @math{2^(E_min - p)} for @math{p} bits of precision. |
| A zero is encoded with the special exponent of |
| @c{$2^{E_{min}-1}$} |
| @math{2^(E_min - 1)} and infinities with the exponent of |
| @c{$2^{E_{max}+1}$} |
| @math{2^(E_max + 1)}. |
| |
| @noindent |
| @cindex single precision, IEEE format |
| The format for single precision numbers uses 32 bits divided in the |
| following way, |
| |
| @smallexample |
| seeeeeeeefffffffffffffffffffffff |
| |
| s = sign bit, 1 bit |
| e = exponent, 8 bits (E_min=-126, E_max=127, bias=127) |
| f = fraction, 23 bits |
| @end smallexample |
| |
| @noindent |
| @cindex double precision, IEEE format |
| The format for double precision numbers uses 64 bits divided in the |
| following way, |
| |
| @smallexample |
| seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff |
| |
| s = sign bit, 1 bit |
| e = exponent, 11 bits (E_min=-1022, E_max=1023, bias=1023) |
| f = fraction, 52 bits |
| @end smallexample |
| |
| @noindent |
| It is often useful to be able to investigate the behavior of a |
| calculation at the bit-level and the library provides functions for |
| printing the IEEE representations in a human-readable form. |
| |
| @comment float vs double vs long double |
| @comment (how many digits are available for each) |
| |
| @deftypefun void gsl_ieee_fprintf_float (FILE * @var{stream}, const float * @var{x}) |
| @deftypefunx void gsl_ieee_fprintf_double (FILE * @var{stream}, const double * @var{x}) |
| These functions output a formatted version of the IEEE floating-point |
| number pointed to by @var{x} to the stream @var{stream}. A pointer is |
| used to pass the number indirectly, to avoid any undesired promotion |
| from @code{float} to @code{double}. The output takes one of the |
| following forms, |
| |
| @table @code |
| @item NaN |
| the Not-a-Number symbol |
| |
| @item Inf, -Inf |
| positive or negative infinity |
| |
| @item 1.fffff...*2^E, -1.fffff...*2^E |
| a normalized floating point number |
| |
| @item 0.fffff...*2^E, -0.fffff...*2^E |
| a denormalized floating point number |
| |
| @item 0, -0 |
| positive or negative zero |
| |
| @comment @item [non-standard IEEE float], [non-standard IEEE double] |
| @comment an unrecognized encoding |
| @end table |
| |
| The output can be used directly in GNU Emacs Calc mode by preceding it |
| with @code{2#} to indicate binary. |
| @end deftypefun |
| |
| @deftypefun void gsl_ieee_printf_float (const float * @var{x}) |
| @deftypefunx void gsl_ieee_printf_double (const double * @var{x}) |
| These functions output a formatted version of the IEEE floating-point |
| number pointed to by @var{x} to the stream @code{stdout}. |
| @end deftypefun |
| |
| @noindent |
| The following program demonstrates the use of the functions by printing |
| the single and double precision representations of the fraction |
| @math{1/3}. For comparison the representation of the value promoted from |
| single to double precision is also printed. |
| |
| @example |
| @verbatiminclude examples/ieee.c |
| @end example |
| |
| @noindent |
| The binary representation of @math{1/3} is @math{0.01010101... }. The |
| output below shows that the IEEE format normalizes this fraction to give |
| a leading digit of 1, |
| |
| @smallexample |
| f= 1.01010101010101010101011*2^-2 |
| fd= 1.0101010101010101010101100000000000000000000000000000*2^-2 |
| d= 1.0101010101010101010101010101010101010101010101010101*2^-2 |
| @end smallexample |
| |
| @noindent |
| The output also shows that a single-precision number is promoted to |
| double-precision by adding zeros in the binary representation. |
| |
| @comment importance of using 1.234L in long double calculations |
| |
| @comment @example |
| @comment int main (void) |
| @comment @{ |
| @comment long double x = 1.0, y = 1.0; |
| |
| @comment x = x + 0.2; |
| @comment y = y + 0.2L; |
| |
| @comment printf(" d %.20Lf\n",x); |
| @comment printf("ld %.20Lf\n",y); |
| |
| @comment return 1; |
| @comment @} |
| |
| @comment d 1.20000000000000001110 |
| @comment ld 1.20000000000000000004 |
| @comment @end example |
| |
| |
| @node Setting up your IEEE environment |
| @section Setting up your IEEE environment |
| @cindex IEEE exceptions |
| @cindex precision, IEEE arithmetic |
| @cindex rounding mode |
| @cindex arithmetic exceptions |
| @cindex exceptions, IEEE arithmetic |
| @cindex division by zero, IEEE exceptions |
| @cindex underflow, IEEE exceptions |
| @cindex overflow, IEEE exceptions |
| The IEEE standard defines several @dfn{modes} for controlling the |
| behavior of floating point operations. These modes specify the important |
| properties of computer arithmetic: the direction used for rounding (e.g. |
| whether numbers should be rounded up, down or to the nearest number), |
| the rounding precision and how the program should handle arithmetic |
| exceptions, such as division by zero. |
| |
| Many of these features can now be controlled via standard functions such |
| as @code{fpsetround}, which should be used whenever they are available. |
| Unfortunately in the past there has been no universal API for |
| controlling their behavior---each system has had its own low-level way |
| of accessing them. To help you write portable programs GSL allows you |
| to specify modes in a platform-independent way using the environment |
| variable @code{GSL_IEEE_MODE}. The library then takes care of all the |
| necessary machine-specific initializations for you when you call the |
| function @code{gsl_ieee_env_setup}. |
| |
| @deftypefun void gsl_ieee_env_setup () |
| This function reads the environment variable @code{GSL_IEEE_MODE} and |
| attempts to set up the corresponding specified IEEE modes. The |
| environment variable should be a list of keywords, separated by |
| commas, like this, |
| |
| @display |
| @code{GSL_IEEE_MODE} = "@var{keyword},@var{keyword},..." |
| @end display |
| |
| @noindent |
| where @var{keyword} is one of the following mode-names, |
| |
| @itemize @asis |
| @item |
| @code{single-precision} |
| @item |
| @code{double-precision} |
| @item |
| @code{extended-precision} |
| @item |
| @code{round-to-nearest} |
| @item |
| @code{round-down} |
| @item |
| @code{round-up} |
| @item |
| @code{round-to-zero} |
| @item |
| @code{mask-all} |
| @item |
| @code{mask-invalid} |
| @item |
| @code{mask-denormalized} |
| @item |
| @code{mask-division-by-zero} |
| @item |
| @code{mask-overflow} |
| @item |
| @code{mask-underflow} |
| @item |
| @code{trap-inexact} |
| @item |
| @code{trap-common} |
| @end itemize |
| |
| If @code{GSL_IEEE_MODE} is empty or undefined then the function returns |
| immediately and no attempt is made to change the system's IEEE |
| mode. When the modes from @code{GSL_IEEE_MODE} are turned on the |
| function prints a short message showing the new settings to remind you |
| that the results of the program will be affected. |
| |
| If the requested modes are not supported by the platform being used then |
| the function calls the error handler and returns an error code of |
| @code{GSL_EUNSUP}. |
| |
| When options are specified using this method, the resulting mode is |
| based on a default setting of the highest available precision (double |
| precision or extended precision, depending on the platform) in |
| round-to-nearest mode, with all exceptions enabled apart from the |
| @sc{inexact} exception. The @sc{inexact} exception is generated |
| whenever rounding occurs, so it must generally be disabled in typical |
| scientific calculations. All other floating-point exceptions are |
| enabled by default, including underflows and the use of denormalized |
| numbers, for safety. They can be disabled with the individual |
| @code{mask-} settings or together using @code{mask-all}. |
| |
| The following adjusted combination of modes is convenient for many |
| purposes, |
| |
| @example |
| GSL_IEEE_MODE="double-precision,"\ |
| "mask-underflow,"\ |
| "mask-denormalized" |
| @end example |
| |
| @noindent |
| This choice ignores any errors relating to small numbers (either |
| denormalized, or underflowing to zero) but traps overflows, division by |
| zero and invalid operations. |
| @end deftypefun |
| |
| @noindent |
| To demonstrate the effects of different rounding modes consider the |
| following program which computes @math{e}, the base of natural |
| logarithms, by summing a rapidly-decreasing series, |
| @tex |
| \beforedisplay |
| $$ |
| e = 1 + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots |
| = 2.71828182846... |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| e = 1 + 1/2! + 1/3! + 1/4! + ... |
| = 2.71828182846... |
| @end example |
| @end ifinfo |
| |
| @example |
| @verbatiminclude examples/ieeeround.c |
| @end example |
| |
| @noindent |
| Here are the results of running the program in @code{round-to-nearest} |
| mode. This is the IEEE default so it isn't really necessary to specify |
| it here, |
| |
| @example |
| $ GSL_IEEE_MODE="round-to-nearest" ./a.out |
| i= 1 sum=1.000000000000000000 error=-1.71828 |
| i= 2 sum=2.000000000000000000 error=-0.718282 |
| .... |
| i=18 sum=2.718281828459045535 error=4.44089e-16 |
| i=19 sum=2.718281828459045535 error=4.44089e-16 |
| @end example |
| |
| @noindent |
| After nineteen terms the sum converges to within @c{$4 \times 10^{-16}$} |
| @math{4 \times 10^-16} of the correct value. |
| If we now change the rounding mode to |
| @code{round-down} the final result is less accurate, |
| |
| @example |
| $ GSL_IEEE_MODE="round-down" ./a.out |
| i= 1 sum=1.000000000000000000 error=-1.71828 |
| .... |
| i=19 sum=2.718281828459041094 error=-3.9968e-15 |
| @end example |
| |
| @noindent |
| The result is about |
| @c{$4 \times 10^{-15}$} |
| @math{4 \times 10^-15} |
| below the correct value, an order of magnitude worse than the result |
| obtained in the @code{round-to-nearest} mode. |
| |
| If we change to rounding mode to @code{round-up} then the series no |
| longer converges (the reason is that when we add each term to the sum |
| the final result is always rounded up. This is guaranteed to increase the sum |
| by at least one tick on each iteration). To avoid this problem we would |
| need to use a safer converge criterion, such as @code{while (fabs(sum - |
| oldsum) > epsilon)}, with a suitably chosen value of epsilon. |
| |
| Finally we can see the effect of computing the sum using |
| single-precision rounding, in the default @code{round-to-nearest} |
| mode. In this case the program thinks it is still using double precision |
| numbers but the CPU rounds the result of each floating point operation |
| to single-precision accuracy. This simulates the effect of writing the |
| program using single-precision @code{float} variables instead of |
| @code{double} variables. The iteration stops after about half the number |
| of iterations and the final result is much less accurate, |
| |
| @example |
| $ GSL_IEEE_MODE="single-precision" ./a.out |
| .... |
| i=12 sum=2.718281984329223633 error=1.5587e-07 |
| @end example |
| |
| @noindent |
| with an error of |
| @c{$O(10^{-7})$} |
| @math{O(10^-7)}, which corresponds to single |
| precision accuracy (about 1 part in @math{10^7}). Continuing the |
| iterations further does not decrease the error because all the |
| subsequent results are rounded to the same value. |
| |
| @node IEEE References and Further Reading |
| @section References and Further Reading |
| |
| The reference for the IEEE standard is, |
| |
| @itemize @asis |
| @item |
| ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic. |
| @end itemize |
| |
| @noindent |
| A more pedagogical introduction to the standard can be found in the |
| following paper, |
| |
| @itemize @asis |
| @item |
| David Goldberg: What Every Computer Scientist Should Know About |
| Floating-Point Arithmetic. @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 1 |
| (March 1991), pages 5--48. |
| |
| Corrigendum: @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 3 (September |
| 1991), page 413. and see also the sections by B. A. Wichmann and Charles |
| B. Dunham in Surveyor's Forum: ``What Every Computer Scientist Should |
| Know About Floating-Point Arithmetic''. @cite{ACM Computing Surveys}, |
| Vol.@: 24, No.@: 3 (September 1992), page 319. |
| @end itemize |
| |
| @noindent |
| |
| A detailed textbook on IEEE arithmetic and its practical use is |
| available from SIAM Press, |
| |
| @itemize @asis |
| @item |
| Michael L. Overton, @cite{Numerical Computing with IEEE Floating Point Arithmetic}, |
| SIAM Press, ISBN 0898715717. |
| @end itemize |
| |
| @noindent |
| |
| @comment to turn on math exception handling use __setfpucw, see |
| @comment /usr/include/i386/fpu_control.h |
| @comment e.g. |
| @comment #include <math.h> |
| @comment #include <stdio.h> |
| @comment #include <fpu_control.h> |
| @comment double f (double x); |
| @comment int main () |
| @comment { |
| @comment double a = 0; |
| @comment double y, z; |
| @comment __setfpucw(0x1372); |
| @comment mention extended vs double precision on Pentium, and how to get around |
| @comment it (-ffloat-store, or selective use of volatile) |
| |
| @comment On the alpha the option -mieee is needed with gcc |
| @comment In Digital's compiler the equivalent is -ieee, -ieee-with-no-inexact |
| @comment and -ieee-with-inexact, or -fpe1 or -fpe2 |