eth-summaries/semester3/spca/parts/01_c/06_floating-point/02_representation.tex
2026-01-16 07:29:07 +01:00


\subsubsection{Floating Point Representation}
Floating point numbers instead use the representation:
$$
a = \underbrace{(-1)^s}_{\text{Sign}} \cdot \underbrace{M}_{\text{Mantissa}} \cdot \underbrace{2^E}_{\text{Exponent}}
$$
Single precision and double precision floating point numbers encode these $3$ parameters in separate bit fields $s$, $e$, $m$:
\begin{center}
Single Precision:
\begin{tabular}{|c|c|c|}
\hline
$31$: Sign & $30-23$: Exponent & $22-0$: Mantissa \\
\hline
\end{tabular} \\
Bias: $127$, Exponent range: $[-126, 127]$
\end{center}
\begin{center}
Double Precision:
\begin{tabular}{|c|c|c|}
\hline
$63$: Sign & $62-52$: Exponent & $51-0$: Mantissa \\
\hline
\end{tabular}\\
Bias: $1023$, Exponent range: $[-1022, 1023]$
\end{center}
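Most of the extra bits in $64$b floating point numbers go to the mantissa ($52$ instead of $23$ bits), while the exponent only grows from $8$ to $11$ bits. Note that double precision is necessary to represent all $32$b signed integers exactly, since single precision only offers $24$ significant bits, and that not all $64$b signed integers can be represented exactly in either format.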
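
As a small worked check of this claim (the specific integers are chosen here for illustration), consider the first positive integer that single precision cannot store exactly. Single precision has $23$ stored mantissa bits plus one implicit leading bit, so $24$ significant bits in total:
$$
2^{24} + 1 = 16\,777\,217 = 1.\underbrace{0\ldots0}_{23 \text{ zeros}}1_2 \cdot 2^{24}
$$
This needs $24$ bits after the binary point, one more than the mantissa field offers, so the value has to be rounded. Double precision has $53$ significant bits, which is enough for every $32$b integer, but by the same argument $2^{53} + 1$ is not representable, so not all $64$b integers fit.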
\newpage
The way these bit fields are interpreted \textit{differs} based on the exponent field $e$:
\begin{enumerate}
\item \textbf{Normalized Values}: Exponent bit field $e$ is neither all $1$s nor all $0$s.\\
In this case, $E$ is read in \textit{biased} form: $E = e - b$. The bias is $b=2^{k-1}-1$, where $k$ is the number of bits reserved for $e$. This produces the exponent range $E \in [-(b-1), b]$.\\
The mantissa field $m$ is interpreted as $M = 1.m_{n-1}\ldots m_1 m_0$ in binary (i.e. $M = 1 + 0.m_{n-1}\ldots m_1 m_0$), where $n$ is the number of bits reserved for $m$. The leading $1$ is implicit and not stored.
\item \textbf{Denormalized Values}: Exponent bit field $e$ is all $0$s.\\
In this case, the exponent is fixed to $E = 1 - b$ (instead of $E = e - b = -b$).\\
The mantissa field $m$ is interpreted as $M = 0.m_{n-1}\ldots m_1 m_0$ in binary (without the implicit leading $1$). In particular, $m = 0$ represents $\pm 0$.
\item \textbf{Special Values}: Exponent bit field $e$ is all $1$s.\\
$m = 0$ represents infinity, which is signed using $s$.\\
$m \neq 0$ is \verb|NaN|, regardless of the value of $s$ and of which bits in $m$ are set.
\end{enumerate}
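The three cases can be traced on concrete single precision bit patterns ($k = 8$, $n = 23$, $b = 127$). The patterns below are picked purely for illustration:
\begin{itemize}
    \item Normalized: $s = 1$, $e = 10000001_2 = 129$, $m = 1010\ldots0_2$. Then $E = 129 - 127 = 2$ and $M = 1.101_2 = 1.625$, so $a = (-1)^1 \cdot 1.625 \cdot 2^2 = -6.5$.
    \item Denormalized: $s = 0$, $e = 00000000_2$, $m = 0\ldots01_2$. Then $E = 1 - 127 = -126$ and $M = 2^{-23}$, so $a = 2^{-23} \cdot 2^{-126} = 2^{-149}$, the smallest positive single precision value.
    \item Special: $s = 1$, $e = 11111111_2$, $m = 0$ gives $a = -\infty$. Setting any bit in $m$ with the same $e$ gives \verb|NaN|.
\end{itemize}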
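\content{Why is the bias chosen this way?} It allows a smooth transition between denormalized and normalized values: because denormalized values use $E = 1 - b$ rather than $E = -b$, they share the exponent of the smallest normalized values, so no gap opens between the two ranges.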
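
To see the transition in numbers, a short derivation from the definitions above (with $n$ mantissa bits and bias $b$) compares the largest denormalized value with the smallest normalized one:
$$
a_{\text{max denorm}} = (1 - 2^{-n}) \cdot 2^{1-b}, \qquad a_{\text{min norm}} = 1 \cdot 2^{1-b}
$$
Their difference is $2^{-n} \cdot 2^{1-b}$, which is exactly the spacing between neighboring denormalized values. For single precision this is $2^{-23} \cdot 2^{-126} = 2^{-149}$. If denormalized values hypothetically used $E = -b$ instead, the largest one would be $(1 - 2^{-n}) \cdot 2^{-b}$, leaving a gap of roughly $2^{-b}$ below the smallest normalized value.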