
\subsubsection{Floating Point Representation}
Floating point numbers instead use the representation:
$$
    a = \underbrace{(-1)^s}_\text{Sign} \cdot \underbrace{M}_\text{Mantissa} \cdot \underbrace{2^E}_\text{Exponent}
$$

Single-precision and double-precision floating point numbers encode these three parameters in separate bit fields $s$, $e$ and $m$:

\begin{center}
    Single Precision:
    \begin{tabular}{|c|c|c|}
        \hline
        $31$: Sign & $30$--$23$: Exponent & $22$--$0$: Mantissa \\
        \hline
    \end{tabular} \\
    Bias: $127$, Exponent range: $[-126, 127]$
\end{center}
\begin{center}
    Double Precision:
    \begin{tabular}{|c|c|c|}
        \hline
        $63$: Sign & $62$--$52$: Exponent & $51$--$0$: Mantissa \\
        \hline
    \end{tabular}\\
    Bias: $1023$, Exponent range: $[-1022, 1023]$
\end{center}

Most of the extra bits in $64$b floating point numbers go to the mantissa ($52$ instead of $23$ bits), while the exponent only grows from $8$ to $11$ bits. Note that single precision cannot represent all $32$b signed integers exactly, so double precision is necessary for that, and neither format can represent all $64$b signed integers exactly.
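
As a small illustration (assuming the normalized interpretation with an implicit leading $1$, so that single precision carries $23 + 1 = 24$ significant bits and double precision $52 + 1 = 53$), the smallest positive integers that cannot be represented exactly are:
$$
    2^{24} + 1 = 16\,777\,217 \ \text{(single precision)}, \qquad 2^{53} + 1 = 9\,007\,199\,254\,740\,993 \ \text{(double precision)}
$$
Each needs one more significant bit than its format provides. Near $2^{24}$, adjacent single precision values are $2$ apart, so $16\,777\,217$ falls between the representable neighbours $16\,777\,216$ and $16\,777\,218$. Since $2^{24} < 2^{31} < 2^{53} < 2^{63}$, this matches the statement on $32$b and $64$b signed integers.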

\newpage

The way these bit fields are interpreted \textit{differs} based on the exponent field $e$:

\begin{enumerate}
    \item \textbf{Normalized Values}: Exponent bit field $e$ is neither all $1$s nor all $0$s.\\
          In this case, $E$ is read in \textit{biased} form: $E = e - b$. The bias is $b=2^{k-1}-1$, where $k$ is the number of bits reserved for $e$. This produces the exponent range $E \in [-(b-1), b]$.\\
          The mantissa field $m$ is interpreted with an implicit leading $1$: $M = 1 + 0.m_{n-1}\ldots m_1 m_0 = 1.m_{n-1}\ldots m_1 m_0$ (in binary), where $n$ is the number of bits reserved for $m$.
    \item \textbf{Denormalized Values}: Exponent bit field $e$ is all $0$s.\\
          In this case, the exponent is fixed to $E = 1 - b$ (instead of $E = e - b = -b$).\\
          The mantissa field $m$ is interpreted as $M = 0.m_{n-1}\ldots m_1 m_0$ (in binary, without the implicit leading $1$).
    \item \textbf{Special Values}: Exponent bit field $e$ is all $1$s.\\
          $m = 0$ represents infinity, with the sign given by $s$.\\
          $m \neq 0$ represents \verb|NaN|, regardless of the exact value of $m$ or of $s$.
\end{enumerate}
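
As a worked example of the normalized case (the value $-6.5$ is chosen here purely for illustration), take single precision with $b = 127$ and $n = 23$:
$$
    -6.5 = (-1)^1 \cdot 1.625 \cdot 2^2 = (-1)^1 \cdot 1.101_2 \cdot 2^2
$$
This gives $s = 1$ and $E = 2$, so $e = E + b = 129 = 10000001_2$. The field $m$ holds the fractional bits $101$ followed by $20$ zeros:
$$
    \underbrace{1}_{s} \ \underbrace{10000001}_{e} \ \underbrace{10100000000000000000000}_{m} = \texttt{0xC0D00000}
$$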

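\content{Why is the Bias chosen this way?} Reading denormalized values with $E = 1 - b$ (rather than $E = -b$) allows a smooth transition between denormalized and normalized values: the largest denormalized value lies exactly one spacing step below the smallest normalized value.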
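
To make this concrete for single precision ($b = 127$, $n = 23$), the two values at the boundary can be derived from the rules above:
$$
    \text{largest denormalized: } (1 - 2^{-23}) \cdot 2^{-126}, \qquad \text{smallest normalized: } 1.0 \cdot 2^{-126}
$$
Their difference is $2^{-23} \cdot 2^{-126} = 2^{-149}$, which is exactly the spacing between adjacent denormalized values. Had denormalized values used $E = -b = -127$ instead, the largest one would be $(1 - 2^{-23}) \cdot 2^{-127}$, leaving a gap of roughly $2^{-127}$ below the smallest normalized value.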