Floating point tables and links

By Wolfgang Keller
Draft
Originally published 2023-08-15
Last modified 2024-05-01

Basics

In IEEE 754, a binary non-denormalized 16/32/64 bit floating point number consists of

\(1\) sign bit,
\(n_e\) exponent bits,
\(n_s\) significand bits (the fractional part), which are explicitly stored. Note that the significand precision is \(n_s+1\) bits (the \(n_s\) stored bits plus an implicit leading 1).

where \(1 + n_e + n_s = n\), with \(n\) denoting the total number of bits.

For binary16 (half-precision) numbers, the values for \(n_e\) and \(n_s\) are:
- \(n_e = 5\)
- \(n_s = 10\)
For binary32 (single-precision) numbers, the values for \(n_e\) and \(n_s\) are:
- \(n_e = 8\)
- \(n_s = 23\)
For binary64 (double-precision) numbers, the values for \(n_e\) and \(n_s\) are:
- \(n_e = 11\)
- \(n_s = 52\)
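
As a minimal sketch of this layout, the following Python snippet splits a bit pattern into its sign, exponent and fraction fields. The names `FORMATS`, `split_fields` and `float_to_bits32` are hypothetical helpers introduced only for illustration; the snippet assumes that the sign bit is the most significant bit, followed by the exponent bits and then the fraction bits.

```python
import struct

# Field widths (n_e, n_s) as listed above, keyed by the total bit count n.
FORMATS = {16: (5, 10), 32: (8, 23), 64: (11, 52)}


def split_fields(bits: int, n: int) -> tuple[int, int, int]:
    """Split an n-bit pattern into its (sign, exponent, fraction) fields."""
    n_e, n_s = FORMATS[n]
    assert 1 + n_e + n_s == n
    fraction = bits & ((1 << n_s) - 1)           # lowest n_s bits
    exponent = (bits >> n_s) & ((1 << n_e) - 1)  # next n_e bits
    sign = bits >> (n - 1)                       # most significant bit
    return sign, exponent, fraction


def float_to_bits32(x: float) -> int:
    """Bit pattern of x after rounding to binary32 (hypothetical helper)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(split_fields(float_to_bits32(1.0), 32))  # (0, 127, 0)
```

Passing the raw pattern as an integer keeps the helper independent of the host's native float type, so it also works for binary16 patterns such as `0x3C00`.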

Let

\(s\) denote the value of the sign bit (\(s \in \{0,1\}\))
\(e\) denote the value of the exponent (\(e \in \{0, \ldots, 2^{n_e}-1\}\))
\(f\) denote the value of the fraction (\(f \in \{0, \ldots, 2^{n_s}-1\}\))

Then:

If \(e \in \{1, \ldots, 2^{n_e}-2\}\), a normal value is encoded: \begin{equation} (-1)^s \cdot 2^{e-(2^{n_e-1}-1)} \cdot (1 + 2^{-n_s} \cdot f). \label{eq:normal} \end{equation}
If \(e = 0\) and …
- … \(f = 0\), \(\pm 0\) is encoded.
- … \(f \neq 0\), a subnormal number is encoded: \begin{equation} (-1)^s \cdot 2^{-(2^{n_e-1}-2)} \cdot 2^{-n_s} \cdot f = (-1)^s \cdot 2^{-(2^{n_e-1} + n_s) + 2} \cdot f. \label{eq:subnormal} \end{equation}
If \(e = 2^{n_e}-1\) and …
- … \(f = 0\), \(\pm \infty\) is encoded.
- … \(f \neq 0\), a NaN is encoded, either a signaling NaN (sNaN) or a quiet NaN (qNaN). The IEEE 754-2008 and IEEE 754-2019 standards define the following requirement for distinguishing the two, where \(f_{n_s-1}\) denotes the most significant bit of \(f\):
  - if \(f_{n_s-1} = 0\), the NaN is signaling (sNaN),
  - if \(f_{n_s-1} = 1\), the NaN is quiet (qNaN).

The number \(2^{n_e-1}-1\) occurring in the exponent of \(2^{e-(2^{n_e-1}-1)}\) in \(\eqref{eq:normal}\) is called the exponent bias.
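
The case distinction above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `decode` is hypothetical, exact values are returned as `fractions.Fraction` objects, and infinities and NaNs are returned as plain strings because they have no rational value. Since a `Fraction` has no signed zero, the sign of \(\pm 0\) is lost in this sketch.

```python
from fractions import Fraction


def decode(s: int, e: int, f: int, n_e: int, n_s: int):
    """Return the exact value encoded by (s, e, f), or a string for inf/NaN."""
    bias = 2 ** (n_e - 1) - 1
    sign = -1 if s else 1
    if e == 2 ** n_e - 1:  # all exponent bits are set
        if f == 0:
            return "-inf" if s else "+inf"
        # Most significant fraction bit: 1 means quiet, 0 means signaling.
        return "qNaN" if (f >> (n_s - 1)) & 1 else "sNaN"
    if e == 0:
        # Zero (f == 0) or subnormal (f != 0): no implicit leading 1,
        # and the exponent is fixed at -(bias - 1).
        return sign * Fraction(2) ** (-(bias - 1)) * Fraction(f, 2 ** n_s)
    # Normal number: implicit leading 1 and biased exponent.
    return sign * Fraction(2) ** (e - bias) * (1 + Fraction(f, 2 ** n_s))


print(decode(0, 126, 0, 8, 23))                         # 1/2
print(decode(0, 0, 1, 8, 23) == Fraction(1, 2 ** 149))  # True
print(decode(0, 255, 1, 8, 23))                         # sNaN
```

Using exact rational arithmetic avoids any rounding by the host's own floating point type, which matters when decoding formats other than the native one.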

Interesting side effects of this encoding are the following:

TODO

Important floating point numbers

Let us tabulate some important floating point numbers in the binary32 and binary64 formats:

Simple normal numbers:
- \(\pm 1.0\)
- \(\pm 0.5\)
- \(\pm 2.0\)
Zero: \(\pm 0\)
Normal numbers:
- the largest number less than \(1\): \(1 - 2^{-(n_s+1)}\)
- the smallest number larger than \(1\): \(1 + 2^{-n_s}\)
- the smallest positive normal number: \(2^{-(2^{n_e-1}-2)}\)
- the smallest integral number \(w\) such that all normal floating point numbers \(w' \geq w\) are integral: \(w = 2^{n_s}\)
- the largest (normal) number that is smaller than \(w\): \(2^{n_s} - \frac{1}{2}\)
- the smallest positive integral (normal) number \(x\) such that \(x+1\) cannot be represented as floating point number of the respective type: \(x = 2^{n_s+1}\)
- \(x-1\): \(2^{n_s+1} - 1\)
- \(x+2\): \(2^{n_s+1} + 2\)
- the largest normal number: \(2^{2^{n_e-1}-1} \cdot (2 - 2^{-n_s})\)
Subnormal numbers:
- the smallest positive subnormal number: \(2^{-(2^{n_e-1} + n_s) + 2}\)
- the largest subnormal number: \(2^{-(2^{n_e-1} + n_s) + 2} \cdot (2^{n_s}-1)\)
Infinity: \(\pm \infty\)
NaNs (with \(s\) set to \(0\)):
- sNaNs:
  - sNaN with minimum possible value for \(f\) (typical encoding on most processors, such as x86 and ARM processors)
  - sNaN with maximum possible value for \(f\)
- qNaNs:
  - qNaN with minimum possible value for \(f\)
  - qNaN with typical encoding on most processors, such as x86 and ARM processors
  - qNaN with maximum possible value for \(f\)
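
Several of the numbers in this list can be computed generically from \(n_e\) and \(n_s\). The following Python sketch does so with exact rational arithmetic. The closed forms are derived from the encoding formulas in the Basics section and should be treated as assumptions; the function name `important_numbers` and the dictionary keys are hypothetical. Because Python's `float` is a binary64 number on common platforms, the results for \(n_e = 11\), \(n_s = 52\) can be cross-checked against `math.nextafter` (Python 3.9 or later) and `sys.float_info`.

```python
import math
import sys
from fractions import Fraction


def important_numbers(n_e: int, n_s: int) -> dict[str, Fraction]:
    """Exact values of some notable numbers of a binary format (assumed closed forms)."""
    bias = 2 ** (n_e - 1) - 1
    two = Fraction(2)
    return {
        "largest_below_one": 1 - two ** -(n_s + 1),
        "smallest_above_one": 1 + two ** -n_s,
        "smallest_normal": two ** (1 - bias),
        "w": two ** n_s,        # from here on, all values are integral
        "x": two ** (n_s + 1),  # x + 1 is not representable
        "largest_normal": two ** bias * (2 - two ** -n_s),
        "smallest_subnormal": two ** (1 - bias - n_s),
        "largest_subnormal": two ** (1 - bias - n_s) * (2 ** n_s - 1),
    }


d = important_numbers(11, 52)  # binary64
print(d["largest_below_one"] == Fraction(math.nextafter(1.0, 0.0)))  # True
print(d["largest_normal"] == Fraction(sys.float_info.max))           # True
print(float(d["x"]) + 1 == float(d["x"]))                            # True
```

The last line illustrates the defining property of \(x\): adding \(1\) to it yields a value that is rounded back to \(x\).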

binary16 format

TODO

binary32 format

Description	Number	Value	Binary encoding	Hexadecimal encoding
\(+1.0\)	\(2^0\)	+1*10^0	0 01111111 00000000000000000000000	3F 80 00 00
\(-1.0\)	\(-2^0\)	-1*10^0	1 01111111 00000000000000000000000	BF 80 00 00
\(+0.5\)	\(2^{-1}=\frac{1}{2}\)	+5*10^-1	0 01111110 00000000000000000000000	3F 00 00 00
\(-0.5\)	\(-2^{-1}=-\frac{1}{2}\)	-5*10^-1	1 01111110 00000000000000000000000	BF 00 00 00
\(+2.0\)	\(2^1\)	+2*10^0	0 10000000 00000000000000000000000	40 00 00 00
\(-2.0\)	\(-2^1\)	-2*10^0	1 10000000 00000000000000000000000	C0 00 00 00
\(+0\)		+0	0 00000000 00000000000000000000000	00 00 00 00
\(-0\)		-0	1 00000000 00000000000000000000000	80 00 00 00
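largest number less than \(1\)	\(1-2^{-24}\)	+9.99999940395355224609375*10^-1	0 01111110 11111111111111111111111	3F 7F FF FF
smallest number larger than \(1\)	\(1+2^{-23}\)	+1.00000011920928955078125*10^0	0 01111111 00000000000000000000001	3F 80 00 01
smallest positive normal number	\(2^{-126}\)		0 00000001 00000000000000000000000	00 80 00 00
smallest integral \(w\) such that all normal \(w' \geq w\) are integral	\(w = 2^{23}\)	+8.388608*10^6	0 10010110 00000000000000000000000	4B 00 00 00
largest number that is smaller than \(w\)	\(2^{23}-\frac{1}{2}\)	+8.3886075*10^6	0 10010101 11111111111111111111111	4A FF FF FF
smallest positive integral \(x\) such that \(x+1\) cannot be represented	\(x = 2^{24}\)	+1.6777216*10^7	0 10010111 00000000000000000000000	4B 80 00 00
\(x-1\)	\(2^{24}-1\)	+1.6777215*10^7	0 10010110 11111111111111111111111	4B 7F FF FF
\(x+2\)	\(2^{24}+2\)	+1.6777218*10^7	0 10010111 00000000000000000000001	4B 80 00 01
largest normal number	\(2^{127} \cdot (2-2^{-23})\)	+3.4028234663852885981170418348451692544*10^38	0 11111110 11111111111111111111111	7F 7F FF FF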
smallest positive subnormal number	\(2^{-149} = \frac{1}{713\_623\_846\_352\_979\_940\_529\_142\_984\_724\_747\_568\_191\_373\_312}\)	+1.40129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125*10^-45	0 00000000 00000000000000000000001	00 00 00 01
largest subnormal number			0 00000000 11111111111111111111111	00 7F FF FF
\(+\infty\)		+Inf	0 11111111 00000000000000000000000	7F 80 00 00
\(-\infty\)		-Inf	1 11111111 00000000000000000000000	FF 80 00 00
sNaN with minimum possible value for \(f\) (typical encoding)		sNaN	0 11111111 00000000000000000000001	7F 80 00 01
sNaN with maximum possible value for \(f\)			0 11111111 01111111111111111111111	7F BF FF FF
qNaN with minimum possible value for \(f\)			0 11111111 10000000000000000000000	7F C0 00 00
qNaN with typical encoding		qNaN	0 11111111 10000000000000000000001	7F C0 00 01
qNaN with maximum possible value for \(f\)			0 11111111 11111111111111111111111	7F FF FF FF
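
The hexadecimal encodings in this table can be checked with Python's `struct` module. The following is a minimal sketch, assuming that the hexadecimal column lists the bytes in big-endian order (most significant byte first); the helper names `from_hex32` and `to_hex32` are hypothetical.

```python
import math
import struct


def from_hex32(hex_bytes: str) -> float:
    """Interpret big-endian hex bytes such as '3F 80 00 00' as a binary32 value."""
    return struct.unpack(">f", bytes.fromhex(hex_bytes))[0]


def to_hex32(x: float) -> str:
    """Round x to binary32 and return its big-endian bytes as hex."""
    return struct.pack(">f", x).hex(" ").upper()


print(from_hex32("3F 80 00 00"))                 # 1.0
print(to_hex32(-0.5))                            # BF 00 00 00
print(from_hex32("00 00 00 01") == 2.0 ** -149)  # True
print(math.isnan(from_hex32("7F C0 00 00")))     # True
```

Note that Python converts the unpacked value to its own binary64 `float`, so this round trip is suitable for checking numeric values and infinities, but it is not a reliable way to inspect NaN payloads or to distinguish sNaNs from qNaNs.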

binary64 format