F107 and F161 High-Precision Floating-Point Data Types
This page describes my work on algorithms for extended-precision floating point.
The names "f107" and "f161" in the title are the typedef names I have chosen for these types. They reflect the amount of precision one expects to get from the datatypes 107 and 161 bits of mantissa respectively. For consistency I also declare typedefs "f24" and "f53" for the standard IEEE single and double.
In order to communicate clearly, the symbols ⊕, ⊖ and ⊗ will be used to represent the result (after rounding) of a machine's built-in hardware floating-point add, subtract and multiply operations. When +, - and × appear without a circle, I am referring to a normal exact operation. This use of symbols will probably be a bit easier for mathematicians than for computer programmers. (-:
The techniques described here have been in use at least as far back as 1971, when Dekker described them in a paper that has been widely cited by later authors.
The core insight of the technique is that native hardware-supported operations can be used to quickly compute the exact value of error resulting from roundoff in an addition (or subtraction) operation. A corollary to this insight is that a similar technique can be used to slice and combine the digits of partial sums and products, retaining any desired number of the most significant digits and discarding the rest.
The technique is efficient enough to use on anything that already has some kind of floating-point, including the tiny CPU cores in a massively parallel GPU.
The term overlap is used to refer to any digit(s) that have the same significance. For example, consider the addition 123+4.56. the digit "3" in 123 overlaps the digit "4" in 4.56, because both have a significance of 100. The same thing can also occur in binary digits.
The Knuth Relation
The method is expressed concisely by Knuth(, section 4.2.2, theorem C) with this formula:
u + v = (u ⊕ v) + ((u ⊖ u') ⊕ (v ⊖ v'')
u' = (u ⊕ v) ⊖ v
v'' = (u ⊕ v) ⊖ u'
This formula states that the sum on the left is always exactly equal to the one on the right. Each side of the equation has exactly one precise unrounded sum "+"; the right side also has several rounded sum and difference operations. In addition, there is no overlap in the two pieces being added on the right, although there might be overlap on the left. (I call this the Knuth Relation not because Knuth gets credit for inventing it, but because he is cited by many others in papers and in computer source code.)
A deconstruction, with an example, is useful to illustrate what is happening. For the example, imagine that we are using a decimal calculator that displays only 4 digits, and all calculations are rounded to 4 digits. Let u be 999.1 and v be 2.222. There are two digits of overlap; the sum would take six digits (4 plus 4 minus the two digits that overlap) but because of carry propagation the exact sum (which is 1001.322) requires seven.
u ⊕ v, the rounded sum produced by the hardware (In our example, this will be 1001)
u', the twice-rounded result of adding and then subtracting v from u. (In the example this will be 999.0)
v'', the amount of v that is included in the rounded sum u ⊕ v. (In the example, this is 2.000)
u ⊖ u', the amount of u that was lost in the calculation of u ⊕ v. (In our example, this is .1000)
v ⊖ v'', the amount of v that was lost in the calculation of u ⊕ v. (In our example, this is .2220)
((u ⊖ u') ⊕ (v ⊖ v''), the exact amount of the roundoff that occurred in the calculation of u ⊕ v. (In our example, this is .3220)
Together, the two values (u ⊕ v) and ((u ⊖ u') ⊕ (v ⊖ v'') have the same exact sum as u + v. Furthermore, these two parts have no "digits of overlap". They might have opposite sign.
Most of the cited authors prove that the formula correctly produces an exact result in all cases. There are three variations that need to be considered (depending on the sign of the roundoff and whether or not u is equal to u'), and I will leave it to the reader to discover the proof or consult the references.
Why 107 Bits and not 106=2×53 ?
The standard IEEE 754 64-bit floating-point type provides 53 bits of precision in the mantissa. When building a doubled-precision addition operation using the relation just described, we get 107 bits of mantissa precision one bit more than 2×53. To some this may be a surprise where does the extra bit come from?
The extra bit comes from the sign of the lower-precision component. To illustrate how it works, imagine that the upper component is limited to being one of the four values (4,5,6,7), and the lower component has four possible values plus a sign, or can be zero: (-4/8, -3/8, -2/8, -1/8, 0, 1/8, 2/8, 3/8, 4/8).
4 - 4/8
5 + 1/8
5 + 2/8
5 + 3/8
5 + 4/8
6 - 3/8
6 - 2/8
6 - 1/8
6 + 0
6 + 1/8
6 + 2/8
7 + 4/8
Counting up all the combinations, there are 33≈25 values that can be expressed. The sign bit contributes half of these values; without it the fractional part would have to go in increments of 1/4 and we would only get 16 or 17 values.
Building Blocks for the 107-bit Operations
TWO-SUM and add_112
A practical application of the above formula is a set of operations that will take two single-precision quantities that might or might not overlap, and produce two single-precision quantities that do not overlap and whose precise sum is equal to the precise sum of the original two quantities.
This requires six rounded operations, and is called TWO-SUM in most of the literature. I call it add_112 for consistency with the names of other routines described below.
// add two doubles, return a two-component (f107) result.
add_112(a: f53; b: f53)
t0 = a ⊕ b
v = t0 ⊖ a
t1 = (b ⊖ v) ⊕ (a ⊖ (t0 ⊖ v))
QUICK-TWO-SUM and normalize_22
The above formula works for any pair of addends a and b, regardless of which is larger. If you already know which is larger, you can compute the result using only three operations instead of six. For reasons that will be more clear below, I call it normalize_22.
// renormalize two components, with two-component (f107) output.
normalize_22(t0: f53; t1: f53)
s0 = t0 ⊕ t1
s1 = t1 ⊖ (s0 ⊖ t0)
SPLIT and split27
SPLIT is another building block, used for multiplication. It is described by Priest,, and proven by Shewchuk. It is possible to split a value into two parts that have less than or equal to half the precision, even when the original type has an odd number of bits because the lower half can be of opposite sign.
// split an f53 into two components hi and lo, each with 26 digits of
|// precision (note that about half of the time, hi and lo will be of|
|// opposite sign)|
y = (227+1) ⊗ x
z = y ⊖ x
hi = y ⊖ z
lo = x ⊖ hi
The size of the two parts hi and lo will be determined by the constant used in the first calculation. In general, it should be 2s+1 where s is at least half of the precision (in other words, divide the precision by 2, and round up if the precision is odd). For f53, s is 27 as shown above. For f24 (IEEE single) s should be 12; the constant is 212+1, hi will have 12 bits and lo will have 11. For f64 (the IEEE extended or "long double" type supported by Pentium-compatible processors) s should be 32, hi will have 32 bits and lo will have 31.
My f107 implementation in C++ includes:
- the basic operations (+, -, ×, /)
- faster versions of x2, 2N and xN for integer N
- absolute value, floor
- square root
- sine, cosine, tangent
- conversion to/from ASCII string
- exp, log, sinh, cosh, tanh
The source files are:
Here is source code for my f161 "triple-double" and f215 "quad-double" formats. These only has the four basic arithmetic functions (add, subtract, multiply, divide), plus an x^2 "square" function that is more efficient than x*x, and integer exponent. These are the only functions I have needed at these higher precisions.
See in particular Theorems B and C and exercise 21 in section 4.2.2 "Accuracy of Floating Point Arithmetic".
 Briggs, K. (1998) Doubledouble floating point arithmetic, web pages with source code and documentation. Originally at http://keithbriggs.info/software.html, now no longer online, buta mirror of the original 1998 version is at: The doubledouble homepage
 Li, X., et al. (2000). Design, implementation and testing of extended and mixed precision BLAS. Lawrence Berkeley National Laboratory. Technical Report LBNL-45991
 Hida, Y., et al. (2007) Library for double-double and quad-double arithmetic Similar to the 2000 paper, but removes the description of how to compile and test the library on different systems. DD and QD implementation is available for download.
 Andrew Thall (2006). Extended-Precision Floating-Point Numbers for GPU Computation. one-page layout summary of  (for SIGGRAPH 2006) showing Mandelbrot calculations rendered in 48-bit precision on a GPU that only provides 23-bit precision. PDF here
 Andrew Thall (2007, amended 2009). Extended-Precision Floating-Point Numbers for GPU Computation. Adapts the double and quad algorithms from Shewchuk, Hida et al. to the Nvidia Cg language for GPUs (achieving roughly 96 bits of mantissa precision). Includes high-precision sqrt, sin, cos, ln and exp functions. PDF here
This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2019 Apr 03. s.11