[SPCA] restructure

This commit is contained in:
RobinB27
2026-01-15 17:04:16 +01:00
parent 26849b84d3
commit a656f3b4b0
2 changed files with 2 additions and 0 deletions

\newpage
\subsection{Floating Point}
Floating point numbers are a representation of real numbers.
Though there are many ways to accomplish this, \textit{IEEE Standard 754} is used practically everywhere, including in \verb|x86|. This standard is a little more complicated than fractional binary numbers, but has a few numeric advantages, especially for representing very large (or very small) numbers.
\hlurl{float.exposed}\ is an excellent website for understanding floating point by example.
\subsubsection{Fractional Binary Numbers}
We can represent any real number (with a finite decimal representation) as:
$$
d=\sum_{i=-n}^{m}10^i\cdot d_i \qquad\qquad \underbrace{d_m d_{m-1} \cdots d_1 d_0\ .\ d_{-1} d_{-2} \cdots d_{-(n-1)} d_{-n}}_{d_i \text{ is the } i \text{-th digit of } d \text{ (neg. indices indicate decimals)}}
$$
We can use the same idea for Base $2$ as well:
$$
b=\sum_{i=-n}^{m} 2^i \cdot b_i \qquad\qquad b_m b_{m-1} \cdots b_1 b_0\ .\ b_{-1} b_{-2} \cdots b_{-(n-1)} b_{-n}
$$
To get an intuition for this representation, looking at some examples is helpful:
\begin{multicols}{2}
A few observations:
\begin{enumerate}
\item Shifting the binary point left: Division by $2$
\item Shifting the binary point right: Multiplication by $2$
\item Numbers of the form $0.111\ldots$ are just below $1.0$
\item Some numbers with a finite representation in Base $10$ are infinite in Base $2$, e.g. $\frac{1}{5} = 0.2_{10} = 0.\overline{0011}_2$
\end{enumerate}
\columnbreak
\renewcommand{\arraystretch}{1.2}
\begin{center}
\begin{tabular}{lcl}
\textbf{Binary} & \textbf{Fraction} & \textbf{Decimal} \\
\hline
$0.0$ & $\frac{0}{2}$ & $0.0$ \\
$0.01$ & $\frac{1}{4}$ & $0.25$ \\
$0.010$ & $\frac{2}{8}$ & $0.25$ \\
$0.0011$ & $\frac{3}{16}$ & $0.1875$ \\
$0.00110$ & $\frac{6}{32}$ & $0.1875$ \\
$0.001101$ & $\frac{13}{64}$ & $0.203125$ \\
$0.0011010$ & $\frac{26}{128}$ & $0.203125$ \\
$0.00110101$ & $\frac{51}{256}$ & $0.19921875$ \\
\end{tabular}
\end{center}
\renewcommand{\arraystretch}{1.0}
\end{multicols}
A major issue with this representation is that very large (or very small) numbers require very long representations.\\
E.g.\ $a_{10} = 5 \cdot 2^{100}$ has the representation $a_2 = 101\underbrace{000000000000000\ldots}_{100 \text{ Zeros}}\ $. Floating Point is designed to address this.
\subsubsection{Floating Point Representation}
Floating point numbers instead use the representation:
$$
a = \underbrace{(-1)^s}_\text{Sign} \cdot \underbrace{M}_\text{Mantissa} \cdot \underbrace{2^E}_\text{Exponent}
$$
Single precision and Double precision floating point numbers store the $3$ parameters in separate bit fields $s, e, m$:
\begin{center}
Single Precision:
\begin{tabular}{|c|c|c|}
\hline
$31$: Sign & $30-23$: Exponent & $22-0$: Mantissa \\
\hline
\end{tabular} \\
Bias: $127$, Exponent range: $[-126, 127]$
\end{center}
\begin{center}
Double Precision:
\begin{tabular}{|c|c|c|}
\hline
$63$: Sign & $62-52$: Exponent & $51-0$: Mantissa \\
\hline
\end{tabular}\\
Bias: $1023$, Exponent range: $[-1022, 1023]$
\end{center}
Most of the extra precision in $64$b floating point numbers is allocated to the mantissa. Note that double precision is needed to represent all $32$b signed integers exactly, and that neither format can represent all $64$b signed integers.
\newpage
The way these bit fields are interpreted \textit{differs} based on the exponent field $e$:
\begin{enumerate}
\item \textbf{Normalized Values}: Exponent bit field $e$ is neither all $1$s nor all $0$s.\\
In this case, $E$ is read in \textit{biased} form: $E = e - b$. The bias is $b=2^{k-1}-1$, where $k$ is the number of bits reserved for $e$. This produces the exponent range $E \in [-(b-1), b]$.\\
The mantissa field $m$ is interpreted as $M = 0.m_{n-1}\ldots m_1 m_0 + 1$, where $n$ is the number of bits reserved for $m$.
\item \textbf{Denormalized Values}: Exponent bit field $e$ is all $0$s.\\
In this case, $E$ is read in \textit{biased} form $E = 1 - b$. (Instead of $E = e - b$)\\
The mantissa field $m$ is interpreted as $M = 0.m_{n-1}\ldots m_1 m_0$ (without adding $1$)
\item \textbf{Special Values}: Exponent bit field $e$ is all $1$s.\\
$m = 0$ represents infinity, which is signed using $s$.\\
$m \neq 0$ represents \verb|NaN|, regardless of the specific bit pattern in $m$ or the sign bit $s$.
\end{enumerate}
\content{Why is the Bias chosen this way?} It allows smooth transitions between normalized and denormalized values.
\subsubsection{Properties}
The advantage of having denormalized values is that $0$ can be represented by the all-zeros bit field. Further, denormalized values are equally spaced near $0$, whereas the spacing between normalized values grows as they move away from $0$.
\content{Example} $8$b Floating Point table to visualize the different cases.
$$
8\text{b precision Floating Point:}\quad \underbrace{0}_s \underbrace{0000}_e \underbrace{000}_m
$$
\renewcommand{\arraystretch}{1.2}
\begin{center}
\begin{tabular}{llllll}
\hline
Case & $s$ & $e$ & $m$ & $E$ & Value \\
\hline
\multirow{6}{*}{Denormalized}
& 0 & 0000 & 000 & $-6$ & $0$ \\
& 0 & 0000 & 001 & $-6$ & $\frac{1}{8}\cdot\frac{1}{64}=\frac{1}{512}$ \\
& 0 & 0000 & 010 & $-6$ & $\frac{2}{8}\cdot\frac{1}{64}=\frac{2}{512}$ \\
& & & $\vdots$ & & $\vdots$ \\
& 0 & 0000 & 110 & $-6$ & $\frac{6}{8}\cdot\frac{1}{64}=\frac{6}{512}$ \\
& 0 & 0000 & 111 & $-6$ & $\frac{7}{8}\cdot\frac{1}{64}=\frac{7}{512}$ \\
\hline
\multirow{9}{*}{Normalized}
& 0 & 0001 & 000 & $-6$ & $\frac{8}{8}\cdot\frac{1}{64}=\frac{8}{512}$ \\
& 0 & 0001 & 001 & $-6$ & $\frac{9}{8}\cdot\frac{1}{64}=\frac{9}{512}$ \\
& & & $\vdots$ & & $\vdots$ \\
& 0 & 0110 & 110 & $-1$ & $\frac{14}{8}\cdot\frac{1}{2}=\frac{14}{16}$ \\
& 0 & 0110 & 111 & $-1$ & $\frac{15}{8}\cdot\frac{1}{2}=\frac{15}{16}$ \\
& 0 & 0111 & 000 & $0$ & $\frac{8}{8}\cdot 1 = 1$ \\
& 0 & 0111 & 001 & $0$ & $\frac{9}{8}\cdot 1 = \frac{9}{8}$ \\
& 0 & 0111 & 010 & $0$ & $\frac{10}{8}\cdot 1 = \frac{10}{8}$ \\
& & & $\vdots$ & & $\vdots$ \\
& 0 & 1110 & 110 & $7$ & $\frac{14}{8}\cdot 128 = 224$ \\
& 0 & 1110 & 111 & $7$ & $\frac{15}{8}\cdot 128 = 240$ \\
\hline
Special
& 0 & 1111 & 000 & n/a & $\infty$ \\
\hline
\end{tabular}
\end{center}
\renewcommand{\arraystretch}{1.0}
\newpage
\subsubsection{Rounding}
The basic idea of Floating Point operations is:
\begin{enumerate}
\item Compute exact result
\item Round, so it fits the desired precision
\end{enumerate}
\textit{IEEE Standard 754} specifies $4$ rounding modes: \textit{Towards Zero, Round Down, Round Up, Nearest Even}.
The default is \textit{Nearest Even}\footnote{Changing the rounding mode is usually hard to do without using Assembly.}, which rounds to the closer representable number, like ordinary rounding, but picks the neighbor whose last bit is even when the value is exactly in the middle.
Rounding can be defined using 3 different bits from the \textit{exact} number: $G, R, S$
$$
a = 1.BB\ldots BB\underbrace{G}_\text{Guard}\underbrace{R}_\text{Round}\underbrace{XX\ldots XX}_\text{Sticky}
$$
\begin{enumerate}
\item \textbf{Guard Bit} $G$ is the least significant bit that is \textit{kept} in the result
\item \textbf{Round Bit} $R$ is the first bit that is cut off
\item \textbf{Sticky Bit} $S$ is the logical OR of all remaining cut-off bits.
\end{enumerate}
Based on these bits, whether to increment the kept mantissa can be decided:
$$
R \land S \implies \text{ Round up} \qquad\qquad
G \land R \land \lnot S \implies \text{ Round up (to even)}
$$
\content{Example} Rounding exact $8$b results to $8$b precision floating point ($3$b mantissa):
\renewcommand{\arraystretch}{1.2}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\textbf{Value} & \textbf{Significand} & \textbf{GRS} & \textbf{Incr?} & \textbf{Rounded} \\
\hline
$128$ & $1.000|0000$ & $000$ & N & $1.000$ \\
$13$ & $1.101|0000$ & $100$ & N & $1.101$ \\
$17$ & $1.000|1000$ & $010$ & N & $1.000$ \\
$19$ & $1.001|1000$ & $110$ & Y & $1.010$ \\
$138$ & $1.000|1010$ & $011$ & Y & $1.001$ \\
$63$ & $1.111|1100$ & $111$ & Y & $10.000$ \\
\hline
\end{tabular}
\end{center}
\renewcommand{\arraystretch}{1.0}
\textbf{Post-Normalization}: Rounding may cause overflow. In this case: Shift right once and increment exponent.
\newpage
\subsubsection{Operations}
\content{Multiplication} is straightforward: all $3$ parameters can be operated on separately:
$$
(-1)^{s_1}M_1 \cdot 2^{E_1} \cdot (-1)^{s_2} M_2 \cdot 2^{E_2} \quad = \quad (-1)^{s_1 \oplus s_2} (M_1 \cdot M_2) 2^{E_1 + E_2}
$$
\textbf{Post-Normalization}:
\begin{enumerate}
\item If $M \geq 2$, shift $M$ right and increment $E$
\item If $E$ out of range, overflow (set to $\infty$)
\item Round $M$ to fit desired precision.
\end{enumerate}
\content{Addition} is more complicated: (Assumption: $E_1 \geq E_2$)
$$
(-1)^{s_1}M_1 \cdot 2^{E_1} + (-1)^{s_2} M_2 \cdot 2^{E_2} \quad = \quad (-1)^{s'} M' \cdot 2^{E_1}
$$
$s', M'$ are the result of a signed align \& add.\\
To align, $(-1)^{s_2}M_2$ is shifted right by $E_1-E_2$, so that both mantissas refer to the common exponent $E_1$; then the two are added.
\textbf{Post-Normalization}:
\begin{enumerate}
\item if $M \geq 2$, shift $M$ right, increment $E$
\item if $M \leq 1$, shift $M$ left $k$, decrement $E$ by $k$
\item Overflow $E$ if out of range (set to $\infty$)
\item Round $M$ to desired precision
\end{enumerate}
\subsubsection{Mathematical Properties}
Floating point is \textit{almost} an Abelian Group.
\begin{itemize}
\item \textbf{Closed} under Addition, Multiplication (But may generate \verb|NaN|, $\pm \infty$)
\item \textbf{Commutative}
\item \textbf{Not Associative} (Overflow \& Rounding)
\item $0$ and $1$ \textbf{are Identities} (additive and multiplicative)
\item \textbf{Additive Inverse} (Except $\pm \infty$ and \verb|NaN|)
\item \textbf{Monotonicity} (Except for $\pm \infty$ and \verb|NaN|)
\item \textbf{Not Distributive} (Overflow \& Rounding)
\end{itemize}
\subsubsection{Floating Point in C}
C99 guarantees \verb|float| and \verb|double|; \verb|long double| is implementation-defined (on \verb|x86| it is usually the $80$b extended-precision format, not quadruple precision).
Casting/Conversion between integer types and \verb|float| or \verb|double| converts the \textit{value} and therefore changes the bit representation in most cases (an exception is $0$, whose representation is all zeros in both).