[SPCA] restructure

This commit is contained in:
RobinB27
2026-01-15 17:04:16 +01:00
parent 26849b84d3
commit a656f3b4b0
2 changed files with 2 additions and 0 deletions

\newpage
\subsection{Floating Point}
Floating point numbers are a representation of real numbers.
Though there are many ways to accomplish this, \textit{IEEE Standard 754} is used practically everywhere, including in \verb|x86|. This standard is a little more complicated than fractional binary numbers, but has a few numeric advantages, especially for representing very large (or very small) numbers.
\hlurl{float.exposed}\ is an excellent website for understanding floating point by example.
\subsubsection{Fractional Binary Numbers}
We can represent any real number (with a finite decimal representation) as:
$$
d=\sum_{i=-n}^{m}10^i\cdot d_i \qquad\qquad \underbrace{d_m d_{m-1} \cdots d_1 d_0\ .\ d_{-1} d_{-2} \cdots d_{-(n-1)} d_{-n}}_{d_i \text{ is the } i \text{-th digit of } d \text{ (neg. indices indicate decimals)}}
$$
We can use the same idea for Base $2$ as well:
$$
b=\sum_{i=-n}^{m} 2^i \cdot b_i \qquad\qquad b_m b_{m-1} \cdots b_1 b_0\ .\ b_{-1} b_{-2} \cdots b_{-(n-1)} b_{-n}
$$
To get an intuition for this representation, looking at some examples is helpful:
\begin{multicols}{2}
A few observations:
\begin{enumerate}
\item Shifting the binary point left: Division by $2$
\item Shifting the binary point right: Multiplication by $2$
\item Numbers of the form $0.111\ldots$ are just below $1.0$
\item Some numbers with a finite representation in Base $10$ are infinite in Base $2$, e.g. $\frac{1}{5} = 0.2_{10} = 0.\overline{0011}_2$
\end{enumerate}
\columnbreak
\renewcommand{\arraystretch}{1.2}
\begin{center}
\begin{tabular}{lcl}
\textbf{Binary} & \textbf{Fraction} & \textbf{Decimal} \\
\hline
$0.0$ & $\frac{0}{2}$ & $0.0$ \\
$0.01$ & $\frac{1}{4}$ & $0.25$ \\
$0.010$ & $\frac{2}{8}$ & $0.25$ \\
$0.0011$ & $\frac{3}{16}$ & $0.1875$ \\
$0.00110$ & $\frac{6}{32}$ & $0.1875$ \\
$0.001101$ & $\frac{13}{64}$ & $0.203125$ \\
$0.0011010$ & $\frac{26}{128}$ & $0.203125$ \\
$0.00110101$ & $\frac{51}{256}$ & $0.19921875$ \\
\end{tabular}
\end{center}
\renewcommand{\arraystretch}{1.0}
\end{multicols}
A major issue with this representation is that very large (or very small) numbers require very long representations.\\
E.g.\ $a_{10} = 5 \cdot 2^{100}$ has the representation $a_2 = 101\underbrace{000000000000000\ldots}_{100 \text{ Zeros}}\ $. Floating Point is designed to address this.
\subsubsection{Floating Point Representation}
Floating point numbers instead use the representation:
$$
a = \underbrace{(-1)^s}_\text{Sign} \cdot \underbrace{M}_\text{Mantissa} \cdot \underbrace{2^E}_\text{Exponent}
$$
Single precision and Double precision floating point numbers store the $3$ parameters in separate bit fields $s, e, m$:
\begin{center}
Single Precision:
\begin{tabular}{|c|c|c|}
\hline
$31$: Sign & $30-23$: Exponent & $22-0$: Mantissa \\
\hline
\end{tabular} \\
Bias: $127$, Exponent range: $[-126, 127]$
\end{center}
\begin{center}
Double Precision:
\begin{tabular}{|c|c|c|}
\hline
$63$: Sign & $62-52$: Exponent & $51-0$: Mantissa \\
\hline
\end{tabular}\\
Bias: $1023$, Exponent range: $[-1022, 1023]$
\end{center}
Most of the extra precision in $64$b floating point numbers is allocated to the mantissa. Note that double precision is needed to represent all $32$b signed integers exactly, and that neither format can represent all $64$b signed integers.
\newpage
The way these bit fields are interpreted \textit{differs} based on the exponent field $e$:
\begin{enumerate}
\item \textbf{Normalized Values}: Exponent bit field $e$ is neither all $1$s nor all $0$s.\\
In this case, $E$ is read in \textit{biased} form: $E = e - b$. The bias is $b=2^{k-1}-1$, where $k$ is the number of bits reserved for $e$. This produces the exponent range $E \in [-(b-1), b]$.\\
The mantissa field $m$ is interpreted as $M = 0.m_{n-1}\ldots m_1 m_0 + 1$, where $n$ is the number of bits reserved for $m$.
\item \textbf{Denormalized Values}: Exponent bit field $e$ is all $0$s.\\
In this case, $E$ is read in \textit{biased} form $E = 1 - b$. (Instead of $E = e - b$)\\
The mantissa field $m$ is interpreted as $M = 0.m_{n-1}\ldots m_1 m_0$ (without adding $1$)
\item \textbf{Special Values}: Exponent bit field $e$ is all $1$s.\\
$m = 0$ represents infinity, which is signed using $s$.\\
$m \neq 0$ represents \verb|NaN|, regardless of the specific bit pattern in $m$ or the sign bit $s$.
\end{enumerate}
\content{Why is the Bias chosen this way?} It allows smooth transitions between normalized and denormalized values.
\subsubsection{Properties}
The advantage of having denormalized values is that $0$ can be represented by the all-zeros bit field. Further, denormalized values are equally spaced near $0$, whereas the spacing between normalized values grows as they move away from $0$.
\content{Example} $8$b Floating Point table to visualize the different cases.
$$
8\text{b precision Floating Point:}\quad \underbrace{0}_s \underbrace{0000}_e \underbrace{000}_m
$$
\renewcommand{\arraystretch}{1.2}
\begin{center}
\begin{tabular}{llllll}
\hline
Case & $s$ & $e$ & $m$ & $E$ & Value \\
\hline
\multirow{6}{*}{Denormalized}
& 0 & 0000 & 000 & $-6$ & $0$ \\
& 0 & 0000 & 001 & $-6$ & $\frac{1}{8}\cdot\frac{1}{64}=\frac{1}{512}$ \\
& 0 & 0000 & 010 & $-6$ & $\frac{2}{8}\cdot\frac{1}{64}=\frac{2}{512}$ \\
& & & $\vdots$ & & $\vdots$ \\
& 0 & 0000 & 110 & $-6$ & $\frac{6}{8}\cdot\frac{1}{64}=\frac{6}{512}$ \\
& 0 & 0000 & 111 & $-6$ & $\frac{7}{8}\cdot\frac{1}{64}=\frac{7}{512}$ \\
\hline
\multirow{9}{*}{Normalized}
& 0 & 0001 & 000 & $-6$ & $\frac{8}{8}\cdot\frac{1}{64}=\frac{8}{512}$ \\
& 0 & 0001 & 001 & $-6$ & $\frac{9}{8}\cdot\frac{1}{64}=\frac{9}{512}$ \\
& & & $\vdots$ & & $\vdots$ \\
& 0 & 0110 & 110 & $-1$ & $\frac{14}{8}\cdot\frac{1}{2}=\frac{14}{16}$ \\
& 0 & 0110 & 111 & $-1$ & $\frac{15}{8}\cdot\frac{1}{2}=\frac{15}{16}$ \\
& 0 & 0111 & 000 & $0$ & $\frac{8}{8}\cdot 1 = 1$ \\
& 0 & 0111 & 001 & $0$ & $\frac{9}{8}\cdot 1 = \frac{9}{8}$ \\
& 0 & 0111 & 010 & $0$ & $\frac{10}{8}\cdot 1 = \frac{10}{8}$ \\
& & & $\vdots$ & & $\vdots$ \\
& 0 & 1110 & 110 & $7$ & $\frac{14}{8}\cdot 128 = 224$ \\
& 0 & 1110 & 111 & $7$ & $\frac{15}{8}\cdot 128 = 240$ \\
\hline
Special
& 0 & 1111 & 000 & n/a & $\infty$ \\
\hline
\end{tabular}
\end{center}
\renewcommand{\arraystretch}{1.0}
\newpage
\subsubsection{Rounding}
The basic idea of Floating Point operations is:
\begin{enumerate}
\item Compute exact result
\item Round, so it fits the desired precision
\end{enumerate}
\textit{IEEE Standard 754} specifies $4$ rounding modes: \textit{Towards Zero, Round Down, Round Up, Nearest Even}.
The default is \textit{Nearest Even}\footnote{Changing the rounding mode is usually hard to do without using Assembly.}, which rounds to the closer representable number, like ordinary rounding, but picks the neighbor whose last bit is even when the value is exactly in the middle.
Rounding can be defined using 3 different bits from the \textit{exact} number: $G, R, S$
$$
a = 1.BB\ldots BB\underbrace{G}_\text{Guard}\underbrace{R}_\text{Round}\underbrace{XX\ldots XX}_\text{Sticky}
$$
\begin{enumerate}
\item \textbf{Guard Bit} $G$ is the least significant bit that is \textit{kept} in the result
\item \textbf{Round Bit} $R$ is the first bit that is cut off
\item \textbf{Sticky Bit} $S$ is the logical OR of all remaining cut-off bits.
\end{enumerate}
Based on these bits, whether to increment the kept mantissa can be decided:
$$
R \land S \implies \text{ Round up} \qquad\qquad
G \land R \land \lnot S \implies \text{ Round up (to even)}
$$
\content{Example} Rounding exact $8$b results to $8$b precision floating point ($3$b mantissa):
\renewcommand{\arraystretch}{1.2}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\textbf{Value} & \textbf{Significand} & \textbf{GRS} & \textbf{Incr?} & \textbf{Rounded} \\
\hline
$128$ & $1.000|0000$ & $000$ & N & $1.000$ \\
$13$ & $1.101|0000$ & $100$ & N & $1.101$ \\
$17$ & $1.000|1000$ & $010$ & N & $1.000$ \\
$19$ & $1.001|1000$ & $110$ & Y & $1.010$ \\
$138$ & $1.000|1010$ & $011$ & Y & $1.001$ \\
$63$ & $1.111|1100$ & $111$ & Y & $10.000$ \\
\hline
\end{tabular}
\end{center}
\renewcommand{\arraystretch}{1.0}
\textbf{Post-Normalization}: Rounding may cause overflow. In this case: Shift right once and increment exponent.
\newpage
\subsubsection{Operations}
\content{Multiplication} is straightforward: all $3$ parameters can be operated on separately:
$$
(-1)^{s_1}M_1 \cdot 2^{E_1} \cdot (-1)^{s_2} M_2 \cdot 2^{E_2} \quad = \quad (-1)^{s_1 \oplus s_2} (M_1 \cdot M_2) 2^{E_1 + E_2}
$$
\textbf{Post-Normalization}:
\begin{enumerate}
\item If $M \geq 2$, shift $M$ right and increment $E$
\item If $E$ out of range, overflow (set to $\infty$)
\item Round $M$ to fit desired precision.
\end{enumerate}
\content{Addition} is more complicated: (Assumption: $E_1 \geq E_2$)
$$
(-1)^{s_1}M_1 \cdot 2^{E_1} + (-1)^{s_2} M_2 \cdot 2^{E_2} \quad = \quad (-1)^{s'} M' \cdot 2^{E_1}
$$
$s', M'$ are the result of a signed align \& add.\\
To align, $(-1)^{s_2}M_2$ is shifted right by $E_1-E_2$, so that both mantissas refer to the common exponent $E_1$; then the two are added.
\textbf{Post-Normalization}:
\begin{enumerate}
\item if $M \geq 2$, shift $M$ right, increment $E$
\item if $M \leq 1$, shift $M$ left $k$, decrement $E$ by $k$
\item Overflow $E$ if out of range (set to $\infty$)
\item Round $M$ to desired precision
\end{enumerate}
\subsubsection{Mathematical Properties}
Floating point is \textit{almost} an Abelian Group.
\begin{itemize}
\item \textbf{Closed} under Addition, Multiplication (But may generate \verb|NaN|, $\pm \infty$)
\item \textbf{Commutative}
\item \textbf{Not Associative} (Overflow \& Rounding)
\item $0$ and $1$ \textbf{are Identities} (additive and multiplicative)
\item \textbf{Additive Inverse} (Except $\pm \infty$ and \verb|NaN|)
\item \textbf{Monotonicity} (Except for $\pm \infty$ and \verb|NaN|)
\item \textbf{Not Distributive} (Overflow \& Rounding)
\end{itemize}
\subsubsection{Floating Point in C}
C99 guarantees \verb|float| and \verb|double|; \verb|long double| is implementation-defined (on \verb|x86| it is usually the $80$b extended-precision format, not quadruple precision).
Casting/Conversion between integer types and \verb|float| or \verb|double| converts the \textit{value} and therefore changes the bit representation in most cases (an exception is $0$, whose representation is all zeros in both).