scroll identifier for mobile
main-content

## Über dieses Buch

This handbook is a definitive guide to the effective use of modern floating-point arithmetic, which has considerably evolved, from the frequently inconsistent floating-point number systems of early computing to the recent IEEE 754-2008 standard. Most of computational mathematics depends on floating-point numbers, and understanding their various implementations will allow readers to develop programs specifically tailored for the standard’s technical features. Algorithms for floating-point arithmetic are presented throughout the book and illustrated where possible by example programs which show how these techniques appear in actual coding and design.
The volume itself breaks its core topic into four parts: the basic concepts and history of floating-point arithmetic; methods of analyzing floating-point algorithms and optimizing them; implementations of IEEE 754-2008 in hardware and software; and useful extensions to the standard floating-point system, such as interval arithmetic, double- and triple-word arithmetic, operations on complex numbers, and formal verification of floating-point algorithms. This new edition updates chapters to reflect recent changes to programming languages and compilers and the new prevalence of GPUs in recent years. The revisions also add material on fused multiply-add instruction, and methods of extending the floating-point precision.
As supercomputing becomes more common, more numerical engineers will need to use number representation to account for trade-offs between various parameters, such as speed, accuracy, and energy consumption. The Handbook of Floating-Point Arithmetic is designed for students and researchers in numerical analysis, programmers of numerical algorithms, compiler designers, and designers of arithmetic operators.

## Inhaltsverzeichnis

### Chapter 1. Introduction

Representing and manipulating real numbers efficiently is required in many fields of science, engineering, finance, and more. Since the early years of electronic computing, many different ways of approximating real numbers on computers have been introduced. One can cite (this list is far from being exhaustive): fixed-point arithmetic, logarithmic [337, 585] and semi-logarithmic [444] number systems, intervals [428], continued fractions [349, 622], rational numbers [348] and possibly infinite strings of rational numbers [418], level-index number systems [100, 475], fixed-slash and floating-slash number systems [412], tapered floating-point arithmetic [432, 22], 2-adic numbers [623], and most recently unums and posits [228, 229].
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 2. Definitions and Basic Notions

As stated in the introduction, roughly speaking, a radix-β floating-point number x is a number of the form
$$\displaystyle{m \cdot \beta ^{e},}$$
where β is the radix of the floating-point system, m such that | m | < β is the significand of x, and e is its exponent.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 3. Floating-Point Formats and Environment

Our main focus in this chapter is the IEEE 754-2008 Standard for Floating-Point Arithmetic [267] , a revision and merge of the earlier IEEE 754-1985 [12] and IEEE 854-1987 [13] standards. A paper written in 1981 by Kahan, Why Do We Need a Floating-Point Standard? [315], depicts the rather messy situation of floating-point arithmetic before the 1980s. Anybody who takes the view that the current standard is too constraining and that circuit and system manufacturers could build much more efficient machines without it should read that paper and think about it. Even if there were at that time a few reasonably good environments, the various systems available then were so different that writing portable yet reasonably efficient numerical software was extremely difficult. For instance, as pointed out in [553], sometimes a programmer had to insert multiplications by 1. 0 to make a program work reliably.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 4. Basic Properties and Algorithms

In this chapter, we present some short yet useful algorithms and some basic properties that can be derived from specifications of floating-point arithmetic systems, such as the ones given in the successive IEEE 754 standards. Thanks to these standards, we now have an accurate definition of floating-point formats and operations. The behavior of a sequence of operations becomes at least partially for more details on this). We therefore can build algorithms and proofs that refer to these specifications.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 5. Enhanced Floating-Point Sums, Dot Products, and Polynomial Values

In this chapter, we focus on the computation of sums and dot products, and on the evaluation of polynomials in IEEE 754 floating-point arithmetic. Such calculations arise in many fields of numerical computing. Computing sums is required, e.g., in numerical integration and the computation of means and variances. Dot products appear everywhere in numerical linear algebra. Polynomials are used to approximate many functions (see Chapter 10).
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 6. Languages and Compilers

The previous chapters have given an overview of interesting properties and algorithms that can be built on an IEEE 754-compliant floating-point arithmetic. In this chapter, we discuss the practical issues encountered when trying to implement such algorithms in actual computers using actual programming languages. In particular, we discuss the relationship between standard compliance, portability, accuracy, and performance. This chapter is useful to programmers wishing to obtain a standard-compliant behavior from their programs, but it is also useful for understanding how performance may be improved by relaxing standard compliance and also what traps one may fall into.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 7. Algorithms for the Basic Operations

Among the many operations that the IEEE 754 standards specify (see Chapter 3), we will focus here and in the next two chapters on the five basic arithmetic operations: addition, subtraction, multiplication, division, and square root. We will also study the fused multiply-add (FMA) operator. We review here some of the known properties and algorithms used to implement each of those operators. Chapter 8 and Chapter 9 will detail some examples of actual implementations in, respectively, hardware and software.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 8. Hardware Implementation of Floating-Point Arithmetic

Chapter 7 has shown that operations on floating-point numbers are naturally expressed in terms of integer or fixed-point operations on the significand and the exponent. For instance, to obtain the product of two floating-point numbers, one basically multiplies the significands and adds the exponents. However, obtaining the correct rounding of the result may require considerable design effort and the use of nonarithmetic primitives such as leading-zero counters and shifters. This chapter details the implementation of these algorithms in hardware, using digital logic.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 9. Software Implementation of Floating-Point Arithmetic

The previous chapter has presented the basic paradigms used for implementing floating-point arithmetic in hardware. However, some processors may not have such dedicated hardware, mainly for cost reasons. When it is necessary to handle floating-point numbers on such processors, one solution is to implement floating-point arithmetic in software.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 10. Evaluating Floating-Point Elementary Functions

The elementary functions (the right term is “elementary transcendental functions”) are the most common mathematical functions: sine, cosine, tangent, and their inverses, exponentials and logarithms of radices e, 2, or 10, etc. They appear everywhere in scientific computing. Therefore, being able to evaluate them quickly and accurately is important for many applications. Many very different methods have been used for evaluating them: polynomial or rational approximations, shift-and-add algorithms, table-based methods, etc.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 11. Complex Numbers

Complex numbers naturally appear in many domains (such as electromagnetism, quantum mechanics, and relativity). It is of course always possible to express the various calculations that use complex numbers in terms of real numbers only. However, this will frequently result in programs that are larger and less clear. A good complex arithmetic would make numerical programs devoted to these problems easier to design, understand, and debug.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 12. Interval Arithmetic

The automation of the a posteriori analysis of floating-point error cannot be done in a perfect way (except possibly in straightforward or specific cases), yielding exactly the roundoff error. However, an approach based on interval arithmetic can provide results with a more or less satisfactory quality and with more or less efforts to obtain them. This is a historical reason for introducing interval arithmetic, as stated in the preface of R. Moore’s PhD dissertation [427]: “In a hour’s time a modern high-speed stored-program digital computer can perform arithmetic computations which would take a “hand-computer” equipped with a desk calculator five years to do. In setting out a five year computing project, a hand computer would be justifiably (and very likely gravely) concerned over the extent to which errors were going to accumulate—not mistakes, which he will catch by various checks on his work—but errors due to rounding” and discretization and truncation errors.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 13. Verifying Floating-Point Algorithms

While the previous chapters have made clear that it is common practice to verify floating-point algorithms with pen-and-paper proofs, this practice can lead to subtle bugs. Indeed, floating-point arithmetic introduces numerous special cases, and examining all the details would be tedious. As a consequence, the verification process tends to focus on the main parts of the correctness proof, so that it does not grow out of reach.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Chapter 14. Extending the Precision

Though satisfactory in most situations, the fixed-precision floating-point formats that are available in hardware or software in our computers may sometimes prove insufficient. There are reasonably rare cases when the binary64/decimal64 or binary128/decimal128 floating-point numbers of the IEEE 754 standard are too crude as approximations of the real numbers. Also, at the time of writing these lines, the binary128 and decimal128 formats are very seldom implemented in hardware.
Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres

### Backmatter

Weitere Informationen

## BranchenIndex Online

Die B2B-Firmensuche für Industrie und Wirtschaft: Kostenfrei in Firmenprofilen nach Lieferanten, Herstellern, Dienstleistern und Händlern recherchieren.

## Whitepaper

- ANZEIGE -

### Best Practices für die Mitarbeiter-Partizipation in der Produktentwicklung

Unternehmen haben das Innovationspotenzial der eigenen Mitarbeiter auch außerhalb der F&E-Abteilung erkannt. Viele Initiativen zur Partizipation scheitern in der Praxis jedoch häufig. Lesen Sie hier  - basierend auf einer qualitativ-explorativen Expertenstudie - mehr über die wesentlichen Problemfelder der mitarbeiterzentrierten Produktentwicklung und profitieren Sie von konkreten Handlungsempfehlungen aus der Praxis.