By Wolfgang Keller
Draft
Originally published 2023-08-15
Last modified 2024-05-01
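In IEEE 754, a binary normal (non-denormalized) 16/32/64-bit floating-point number consists of

- 1 sign bit,
- \(n_e\) exponent bits,
- \(n_s\) significand bits,

where \(1 + n_e+n_s = n\) with \(n\): number of bits.

Let

- \(s \in \{0, 1\}\) be the value of the sign bit,
- \(e\) be the exponent bits read as an unsigned integer, where \(1 \leq e \leq 2^{n_e}-2\),
- \(f\) be the significand bits read as an unsigned integer, where \(0 \leq f \leq 2^{n_s}-1\).

Then the represented number is:

\begin{equation}
(-1)^s \cdot 2^{e-(2^{n_e-1}-1)} \cdot \left(1 + \frac{f}{2^{n_s}}\right)
\label{eq:normal}
\end{equation}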
The number \(2^{n_e-1}-1\) occurring in the exponent of \(2^{e-(2^{n_e-1}-1)}\) in \(\eqref{eq:normal}\) is called the exponent bias. For binary32 (\(n_e = 8\)) the bias is \(127\); for binary64 (\(n_e = 11\)) it is \(1023\).
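
To make the role of the exponent bias concrete, here is a minimal Python sketch that splits a binary32 value into its three fields and evaluates the formula for normal numbers exactly. The names `decode_binary32` and `normal_value` are made up for this illustration, and the field names are an assumption of the sketch: `s` is the sign bit, `e` the exponent bits read as an unsigned integer, and `f` the significand bits read as an unsigned integer.

```python
import struct
from fractions import Fraction


def decode_binary32(x):
    """Round x to binary32 and return the fields (s, e, f) as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8 exponent bits, still biased
    f = bits & 0x7FFFFF       # 23 significand bits
    return s, e, f


def normal_value(s, e, f, n_e=8, n_s=23):
    """Exact value of a normal number; only valid for 1 <= e <= 2**n_e - 2."""
    bias = 2 ** (n_e - 1) - 1
    return (-1) ** s * Fraction(2) ** (e - bias) * (1 + Fraction(f, 2 ** n_s))


for x in (1.0, -0.5, 2.0):
    s, e, f = decode_binary32(x)
    print(x, (s, e, f), normal_value(s, e, f))
```

For `-0.5` the fields come out as `(1, 126, 0)`: the stored exponent 126 minus the bias 127 gives the true exponent \(-1\). Passing `n_e=11, n_s=52` evaluates the same formula for binary64 fields.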
Interesting side effects of this encoding are the following:
Let us tabulate some important floating-point numbers in the binary32 and binary64 formats:
TODO
binary32:

Description | Number | Value | Binary encoding | Hexadecimal encoding |
---|---|---|---|---|
\(+1.0\) | \(2^0\) | +1*10^0 | 0 01111111 00000000000000000000000 | 3F 80 00 00 |
\(-1.0\) | \(-2^0\) | -1*10^0 | 1 01111111 00000000000000000000000 | BF 80 00 00 |
\(+0.5\) | \(2^{-1}=\frac{1}{2}\) | +5*10^-1 | 0 01111110 00000000000000000000000 | 3F 00 00 00 |
\(-0.5\) | \(-2^{-1}=-\frac{1}{2}\) | -5*10^-1 | 1 01111110 00000000000000000000000 | BF 00 00 00 |
\(+2.0\) | \(2^1\) | +2*10^0 | 0 10000000 00000000000000000000000 | 40 00 00 00 |
\(-2.0\) | \(-2^1\) | -2*10^0 | 1 10000000 00000000000000000000000 | C0 00 00 00 |
\(+0\) | | +0 | 0 00000000 00000000000000000000000 | 00 00 00 00 |
\(-0\) | | -0 | 1 00000000 00000000000000000000000 | 80 00 00 00 |
largest number less than \(1\) | ||||
smallest number larger than \(1\) | ||||
smallest positive normal number | ||||
smallest integral \(w\) such that all normal \(w' \geq w\) are integral | ||||
largest number that is smaller than \(w\) | ||||
smallest positive integral \(x\) such that \(x+1\) cannot be represented | ||||
\(x-1\) | ||||
\(x+2\) | ||||
largest normal number | ||||
smallest positive subnormal number | \(2^{-149} = \frac{1}{713\_623\_846\_352\_979\_940\_529\_142\_984\_724\_747\_568\_191\_373\_312}\) | +1.40129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125*10^-45 | 0 00000000 00000000000000000000001 | 00 00 00 01 |
largest subnormal number | | | 0 00000000 11111111111111111111111 | 00 7F FF FF |
\(+\infty\) | | +Inf | 0 11111111 00000000000000000000000 | 7F 80 00 00 |
\(-\infty\) | | -Inf | 1 11111111 00000000000000000000000 | FF 80 00 00 |
sNaN with minimum possible value for \(f\) (typical encoding) | | sNaN | 0 11111111 00000000000000000000001 | 7F 80 00 01 |
sNaN with maximum possible value for \(f\) | | | 0 11111111 01111111111111111111111 | 7F BF FF FF |
qNaN with minimum possible value for \(f\) | | | 0 11111111 10000000000000000000000 | 7F C0 00 00 |
qNaN with typical encoding | | qNaN | 0 11111111 10000000000000000000001 | 7F C0 00 01 |
qNaN with maximum possible value for \(f\) | | | 0 11111111 11111111111111111111111 | 7F FF FF FF |
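
The binary32 encodings above can be cross-checked, and the rows that are still empty explored, with a few lines of Python. This is only a sketch: `to_bits32`, `from_bits32` and `show32` are helper names invented here, and the approach relies on the fact that for positive finite numbers, adding 1 to the bit pattern gives the next larger representable number.

```python
import math
import struct


def to_bits32(x):
    """Bit pattern of x rounded to binary32, as an unsigned 32-bit integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


def from_bits32(bits):
    """Value of the binary32 number with the given bit pattern (as a Python float)."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x


def show32(bits):
    """Format a bit pattern like the table: sign, exponent, significand | hex bytes."""
    b = format(bits, "032b")
    raw = bits.to_bytes(4, "big")
    return f"{b[0]} {b[1:9]} {b[9:]} | " + " ".join(f"{byte:02X}" for byte in raw)


one = to_bits32(1.0)
print(show32(to_bits32(-0.5)))    # 1 01111110 00000000000000000000000 | BF 00 00 00
print(show32(one - 1), from_bits32(one - 1))   # largest number less than 1
print(show32(one + 1), from_bits32(one + 1))   # smallest number larger than 1
print(from_bits32(0x00000001) == 2.0 ** -149)  # smallest positive subnormal number
print(math.isinf(from_bits32(0x7F800000)), math.isnan(from_bits32(0x7FC00001)))
```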

binary64:

Description | Number | Value | Binary encoding | Hexadecimal encoding |
---|---|---|---|---|
\(+1.0\) | \(2^0\) | +1*10^0 | 0 01111111111 0000000000000000000000000000000000000000000000000000 | 3F F0 00 00 00 00 00 00 |
\(-1.0\) | \(-2^0\) | -1*10^0 | 1 01111111111 0000000000000000000000000000000000000000000000000000 | BF F0 00 00 00 00 00 00 |
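\(+0.5\) | \(2^{-1}=\frac{1}{2}\) | +5*10^-1 | 0 01111111110 0000000000000000000000000000000000000000000000000000 | 3F E0 00 00 00 00 00 00 |
\(-0.5\) | \(-2^{-1}=-\frac{1}{2}\) | -5*10^-1 | 1 01111111110 0000000000000000000000000000000000000000000000000000 | BF E0 00 00 00 00 00 00 |
\(+2.0\) | \(2^1\) | +2*10^0 | 0 10000000000 0000000000000000000000000000000000000000000000000000 | 40 00 00 00 00 00 00 00 |
\(-2.0\) | \(-2^1\) | -2*10^0 | 1 10000000000 0000000000000000000000000000000000000000000000000000 | C0 00 00 00 00 00 00 00 |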
\(+0\) | | +0 | 0 00000000000 0000000000000000000000000000000000000000000000000000 | 00 00 00 00 00 00 00 00 |
\(-0\) | | -0 | 1 00000000000 0000000000000000000000000000000000000000000000000000 | 80 00 00 00 00 00 00 00 |
largest number less than \(1\) | ||||
smallest number larger than \(1\) | ||||
smallest positive normal number | ||||
smallest integral \(w\) such that all normal \(w' \geq w\) are integral | ||||
largest number that is smaller than \(w\) | ||||
smallest positive integral \(x\) such that \(x+1\) cannot be represented | ||||
\(x-1\) | ||||
\(x+2\) | ||||
largest normal number | ||||
smallest positive subnormal number | | | 0 00000000000 0000000000000000000000000000000000000000000000000001 | 00 00 00 00 00 00 00 01 |
largest subnormal number | | | 0 00000000000 1111111111111111111111111111111111111111111111111111 | 00 0F FF FF FF FF FF FF |
\(+\infty\) | | +Inf | 0 11111111111 0000000000000000000000000000000000000000000000000000 | 7F F0 00 00 00 00 00 00 |
\(-\infty\) | | -Inf | 1 11111111111 0000000000000000000000000000000000000000000000000000 | FF F0 00 00 00 00 00 00 |
sNaN with minimum possible value for \(f\) (typical encoding) | | sNaN | 0 11111111111 0000000000000000000000000000000000000000000000000001 | 7F F0 00 00 00 00 00 01 |
sNaN with maximum possible value for \(f\) | | | 0 11111111111 0111111111111111111111111111111111111111111111111111 | 7F F7 FF FF FF FF FF FF |
qNaN with minimum possible value for \(f\) | | | 0 11111111111 1000000000000000000000000000000000000000000000000000 | 7F F8 00 00 00 00 00 00 |
qNaN with typical encoding | | qNaN | 0 11111111111 1000000000000000000000000000000000000000000000000001 | 7F F8 00 00 00 00 00 01 |
qNaN with maximum possible value for \(f\) | | | 0 11111111111 1111111111111111111111111111111111111111111111111111 | 7F FF FF FF FF FF FF FF |
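
The integer-related rows of the binary64 table (the threshold \(w\) from which on all numbers are integral, and the first integer \(x\) whose successor \(x+1\) is not representable) can be explored the same way. The sketch below assumes that Python's `float` is binary64, which holds on common platforms; `to_bits64`, `from_bits64` and `show64` are hypothetical helper names, and the candidates \(w = 2^{52}\) and \(x = 2^{53}\) are assumptions of this sketch that the printed checks are meant to support.

```python
import struct


def to_bits64(x):
    """Bit pattern of a binary64 float as an unsigned 64-bit integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits


def from_bits64(bits):
    """binary64 float with the given bit pattern."""
    (x,) = struct.unpack(">d", struct.pack(">Q", bits))
    return x


def show64(bits):
    """Format a bit pattern like the table: sign, exponent, significand | hex bytes."""
    b = format(bits, "064b")
    raw = bits.to_bytes(8, "big")
    return f"{b[0]} {b[1:12]} {b[12:]} | " + " ".join(f"{byte:02X}" for byte in raw)


w = 2.0 ** 52  # candidate for the integrality threshold
x = 2.0 ** 53  # candidate for the first integer with an unrepresentable successor

below_w = from_bits64(to_bits64(w) - 1)  # predecessor of w
above_x = from_bits64(to_bits64(x) + 1)  # successor of x

print(below_w == w - 0.5)  # True: just below w the spacing is still 0.5
print(x + 1.0 == x)        # True: x + 1 is rounded back to x
print(above_x == x + 2.0)  # True: the next representable number is x + 2
print(show64(to_bits64(w)))
print(show64(to_bits64(x)))
```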