diff --git a/semester3/spca/parts/01_c/05_floating-point/00_intro.tex b/semester3/spca/parts/01_c/05_floating-point/00_intro.tex
deleted file mode 100644
index e69de29..0000000
diff --git a/semester3/spca/parts/01_c/06_floating-point/00_intro.tex b/semester3/spca/parts/01_c/06_floating-point/00_intro.tex
new file mode 100644
index 0000000..674fa60
--- /dev/null
+++ b/semester3/spca/parts/01_c/06_floating-point/00_intro.tex
@@ -0,0 +1,246 @@
+\newpage
+\subsection{Floating Point}
+
+Floating point numbers are a representation of real numbers.
+
+Though there are many ways to accomplish this, \textit{IEEE Standard 754} is used practically everywhere, including \verb|x86|. This standard is a little more complicated than fractional binary numbers, but has a few numeric advantages, especially for representing very large (or very small) numbers.
+
+\hlurl{float.exposed}\ is an excellent website for understanding floating point by example.
+
+\subsubsection{Fractional Binary Numbers}
+
+We can represent any real number (with a finite decimal representation) as:
+$$
+    d=\sum_{i=-n}^{m}10^i\cdot d_i \qquad\qquad \underbrace{d_m d_{m-1} \cdots d_1 d_0\ .\ d_{-1} d_{-2} \cdots d_{-(n-1)} d_{-n}}_{d_i \text{ is the } i \text{-th digit of } d \text{ (neg. indices indicate decimals)}}
+$$
+We can use the same idea for Base $2$ as well:
+$$
+    b=\sum_{i=-n}^{m} 2^i \cdot b_i \qquad\qquad b_m b_{m-1} \cdots b_1 b_0\ .\ b_{-1} b_{-2} \cdots b_{-(n-1)} b_{-n}
+$$
+To get an intuition for this representation, looking at some examples is helpful:
+\begin{multicols}{2}
+
+A few observations:
+\begin{enumerate}
+    \item Shifting the dot right: Multiplication by $2$
+    \item Shifting the dot left: Division by $2$
+    \item Numbers of the form $0.111\ldots$ are just below $1.0$
+    \item Some numbers representable in finite Base $10$ are infinite in Base $2$, e.g.
$\frac{1}{5} = 0.2_{10} = 0.\overline{0011}_2$
+\end{enumerate}
+
+\newcolumn
+
+\renewcommand{\arraystretch}{1.2}
+\begin{center}
+    \begin{tabular}{lcl}
+        \textbf{Binary} & \textbf{Fraction} & \textbf{Decimal} \\
+        \hline
+        $0.0$ & $\frac{0}{2}$ & $0.0$ \\
+        $0.01$ & $\frac{1}{4}$ & $0.25$ \\
+        $0.010$ & $\frac{2}{8}$ & $0.25$ \\
+        $0.0011$ & $\frac{3}{16}$ & $0.1875$ \\
+        $0.00110$ & $\frac{6}{32}$ & $0.1875$ \\
+        $0.001101$ & $\frac{13}{64}$ & $0.203125$ \\
+        $0.0011010$ & $\frac{26}{128}$ & $0.203125$ \\
+        $0.00110101$ & $\frac{51}{256}$ & $0.19921875$ \\
+    \end{tabular}
+\end{center}
+\renewcommand{\arraystretch}{1.0}
+
+\end{multicols}
+
+A major issue with this representation is that very large (respectively very small) numbers require a large representation.\\
+E.g.\ $a_{10} = 5 \cdot 2^{100}$ has the representation $a_2 = 101\underbrace{000000000000000\ldots}_{100 \text{ Zeros}}\ $. Floating Point is designed to address this.
+
+\subsubsection{Floating Point Representation}
+Floating point numbers instead use the representation:
+$$
+    a = \underbrace{(-1)^s}_\text{Sign} \cdot \underbrace{M}_\text{Mantissa} \cdot \underbrace{2^E}_\text{Exponent}
+$$
+
+Single precision and Double precision floating point numbers store the $3$ parameters in separate bit fields $s, e, m$:
+
+\begin{center}
+    Single Precision:
+    \begin{tabular}{|c|c|c|}
+        \hline
+        $31$: Sign & $30-23$: Exponent & $22-0$: Mantissa \\
+        \hline
+    \end{tabular} \\
+    Bias: $127$, Exponent range: $[-126, 127]$
+\end{center}
+\begin{center}
+    Double Precision:
+    \begin{tabular}{|c|c|c|}
+        \hline
+        $63$: Sign & $62-52$: Exponent & $51-0$: Mantissa \\
+        \hline
+    \end{tabular}\\
+    Bias: $1023$, Exponent range: $[-1022, 1023]$
+\end{center}
+
+Most of the extra precision in $64$b floating point numbers is allocated to the mantissa. Note that double precision is necessary to represent all $32$b signed Integers exactly, and that not all $64$b signed Integers can be represented exactly in either format.
+
+\newpage
+
+The way these bit fields are interpreted \textit{differs} based on the exponent field $e$:
+
+\begin{enumerate}
+    \item \textbf{Normalized Values}: Exponent bit field $e$ is neither all $1$s nor all $0$s.\\
+    In this case, $E$ is read in \textit{biased} form: $E = e - b$. The bias is $b=2^{k-1}-1$, where $k$ is the number of bits reserved for $e$. This produces the exponent range $E \in [-(b-1), b]$.\\
+    The mantissa field $m$ is interpreted as $M = 1 + 0.m_{n-1}\ldots m_1 m_0$ (with an implicit leading $1$), where $n$ is the number of bits reserved for $m$
+    \item \textbf{Denormalized Values}: Exponent bit field $e$ is all $0$s.\\
+    In this case, $E$ is fixed to $E = 1 - b$. (Instead of $E = e - b$)\\
+    The mantissa field $m$ is interpreted as $M = 0.m_{n-1}\ldots m_1 m_0$ (without the implicit leading $1$)
+    \item \textbf{Special Values}: Exponent bit field $e$ is all $1$s.\\
+    $m = 0$ represents infinity, which is signed using $s$.\\
+    $m \neq 0$ is \verb|NaN|, regardless of $s$ or the particular nonzero pattern in $m$.
+\end{enumerate}
+
+\content{Why is the Bias chosen this way?} It allows a smooth transition between denormalized and normalized values: the denormalized exponent $E = 1 - b$ equals the smallest normalized exponent, so the largest denormalized value lies directly below the smallest normalized one.
+
+\subsubsection{Properties}
+
+The advantage of having denormalized values is that $0$ can be represented as the bit field with all $0$s. Further, denormalized values are equally spaced near $0$, whereas the spacing between normalized values grows with their distance from $0$.
+
+\content{Example} $8$b Floating Point table to visualize the different cases.
+
+$$
+    8\text{b precision Floating Point:}\quad \underbrace{0}_s \underbrace{0000}_e \underbrace{000}_m
+$$
+
+\renewcommand{\arraystretch}{1.2}
+\begin{center}
+    \begin{tabular}{llllll}
+        \hline
+        Case & $s$ & $e$ & $m$ & $E$ & Value \\
+        \hline
+        \multirow{6}{*}{Denormalized}
+        & 0 & 0000 & 000 & $-6$ & $0$ \\
+        & 0 & 0000 & 001 & $-6$ & $\frac{1}{8}\cdot\frac{1}{64}=\frac{1}{512}$ \\
+        & 0 & 0000 & 010 & $-6$ & $\frac{2}{8}\cdot\frac{1}{64}=\frac{2}{512}$ \\
+        & & & $\vdots$ & & $\vdots$ \\
+        & 0 & 0000 & 110 & $-6$ & $\frac{6}{8}\cdot\frac{1}{64}=\frac{6}{512}$ \\
+        & 0 & 0000 & 111 & $-6$ & $\frac{7}{8}\cdot\frac{1}{64}=\frac{7}{512}$ \\
+        \hline
+        \multirow{9}{*}{Normalized}
+        & 0 & 0001 & 000 & $-6$ & $\frac{8}{8}\cdot\frac{1}{64}=\frac{8}{512}$ \\
+        & 0 & 0001 & 001 & $-6$ & $\frac{9}{8}\cdot\frac{1}{64}=\frac{9}{512}$ \\
+        & & & $\vdots$ & & $\vdots$ \\
+        & 0 & 0110 & 110 & $-1$ & $\frac{14}{8}\cdot\frac{1}{2}=\frac{14}{16}$ \\
+        & 0 & 0110 & 111 & $-1$ & $\frac{15}{8}\cdot\frac{1}{2}=\frac{15}{16}$ \\
+        & 0 & 0111 & 000 & $0$ & $\frac{8}{8}\cdot 1 = 1$ \\
+        & 0 & 0111 & 001 & $0$ & $\frac{9}{8}\cdot 1 = \frac{9}{8}$ \\
+        & 0 & 0111 & 010 & $0$ & $\frac{10}{8}\cdot 1 = \frac{10}{8}$ \\
+        & & & $\vdots$ & & $\vdots$ \\
+        & 0 & 1110 & 110 & $7$ & $\frac{14}{8}\cdot 128 = 224$ \\
+        & 0 & 1110 & 111 & $7$ & $\frac{15}{8}\cdot 128 = 240$ \\
+        \hline
+        Special
+        & 0 & 1111 & 000 & n/a & $\infty$ \\
+        \hline
+    \end{tabular}
+\end{center}
+\renewcommand{\arraystretch}{1.0}
+
+\newpage
+
+\subsubsection{Rounding}
+
+The basic idea of Floating Point operations is:
+\begin{enumerate}
+    \item Compute the exact result
+    \item Round it, so it fits the desired precision
+\end{enumerate}
+
+\textit{IEEE Standard 754} specifies $4$ rounding modes: \textit{Towards Zero, Round Down, Round Up, Nearest Even}.
+
+The default used is \textit{Nearest Even}\footnote{Changing the rounding mode at runtime requires library support (e.g.\ C99's \verb|<fenv.h>|) or Assembly.}, which rounds up/down depending on which number is closer, like regular rounding, but picks the nearest even number if the value lies exactly in the middle.
+
+Rounding can be decided using $3$ bits taken from the \textit{exact} number: $G, R, S$
+$$
+    a = 1.BB\ldots BB\underbrace{G}_\text{Guard}\underbrace{R}_\text{Round}\underbrace{XX\ldots XX}_\text{Sticky}
+$$
+
+\begin{enumerate}
+    \item \textbf{Guard Bit} $G$ is the least significant bit that is kept (the LSB of the rounded result)
+    \item \textbf{Round Bit} $R$ is the first bit that is cut off
+    \item \textbf{Sticky Bit} $S$ is the logical OR of all remaining cut off bits.
+\end{enumerate}
+
+Based on these bits the rounding can be decided:
+
+$$
+    R \land S \implies \text{ Round up} \qquad\qquad
+    G \land R \land \lnot S \implies \text{ Round up (tie, broken towards even)}
+$$
+
+\content{Example} Rounding $8$b precise results to $8$b precision floating point ($3$b mantissa):
+
+\renewcommand{\arraystretch}{1.2}
+\begin{center}
+    \begin{tabular}{|c|c|c|c|c|}
+        \hline
+        \textbf{Value} & \textbf{Fraction} & \textbf{GRS} & \textbf{Incr?} & \textbf{Rounded} \\
+        \hline
+        $128$ & $1.000|0000$ & $000$ & N & $1.000$ \\
+        $13$ & $1.101|0000$ & $100$ & N & $1.101$ \\
+        $17$ & $1.000|1000$ & $010$ & N & $1.000$ \\
+        $19$ & $1.001|1000$ & $110$ & Y & $1.010$ \\
+        $138$ & $1.000|1010$ & $011$ & Y & $1.001$ \\
+        $63$ & $1.111|1100$ & $111$ & Y & $10.000$ \\
+        \hline
+    \end{tabular}
+\end{center}
+\renewcommand{\arraystretch}{1.0}
+
+\textbf{Post-Normalization}: Rounding may cause the mantissa to overflow. In this case: Shift right once and increment the exponent.
+
+\newpage
+
+\subsubsection{Operations}
+
+\content{Multiplication} is straightforward, all $3$ parameters can be operated on separately:
+$$
+    (-1)^{s_1}M_1 \cdot 2^{E_1} \cdot (-1)^{s_2} M_2 \cdot 2^{E_2} \quad = \quad (-1)^{s_1 \oplus s_2} (M_1 \cdot M_2) 2^{E_1 + E_2}
+$$
+\textbf{Post-Normalization}:
+\begin{enumerate}
+    \item If $M \geq 2$, shift $M$ right and increment $E$
+    \item If $E$ out of range, overflow (set to $\infty$)
+    \item Round $M$ to fit the desired precision.
+\end{enumerate}
+
+\content{Addition} is more complicated: (Assumption: $E_1 \geq E_2$)
+$$
+    (-1)^{s_1}M_1 \cdot 2^{E_1} + (-1)^{s_2} M_2 \cdot 2^{E_2} \quad = \quad (-1)^{s'} M' \cdot 2^{E_1}
+$$
+$s', M'$ are the result of a signed align \& add.\\
+This means $(-1)^{s_2}M_2$ is shifted right by $E_1-E_2$ to align the binary points, and then added to $(-1)^{s_1}M_1$.
+
+\textbf{Post-Normalization}:
+\begin{enumerate}
+    \item if $M \geq 2$, shift $M$ right, increment $E$
+    \item if $M < 1$, shift $M$ left by $k$ positions until normalized, decrement $E$ by $k$
+    \item Overflow if $E$ out of range (set to $\infty$)
+    \item Round $M$ to the desired precision
+\end{enumerate}
+
+\subsubsection{Mathematical Properties}
+
+Floating point is \textit{almost} an Abelian Group.
+\begin{itemize}
+    \item \textbf{Closed} under Addition, Multiplication (but may generate \verb|NaN|, $\pm \infty$)
+    \item \textbf{Commutative}
+    \item \textbf{Not Associative} (Overflow \& Rounding)
+    \item $0, 1$ \textbf{are Identities} (for Addition and Multiplication respectively)
+    \item \textbf{Additive Inverse} (Except $\pm \infty$ and \verb|NaN|)
+    \item \textbf{Monotonicity} (Except for $\pm \infty$ and \verb|NaN|)
+    \item \textbf{Not Distributive} (Overflow \& Rounding)
+\end{itemize}
+
+\subsubsection{Floating Point in C}
+
+C99 guarantees \verb|float| and \verb|double|; \verb|long double| is implementation-defined (on \verb|x86| it is typically $80$b extended precision, not quadruple precision).
+
+Casting/Conversion between Integer types and \verb|float|, \verb|double| \textit{changes} the bit representation in most cases (an exception:
$0$ stays the same) \ No newline at end of file diff --git a/semester3/spca/spca-summary.pdf b/semester3/spca/spca-summary.pdf deleted file mode 100644 index 15470fc..0000000 Binary files a/semester3/spca/spca-summary.pdf and /dev/null differ diff --git a/semester3/spca/spca-summary.tex b/semester3/spca/spca-summary.tex index 21e9c1b..274ddb2 100644 --- a/semester3/spca/spca-summary.tex +++ b/semester3/spca/spca-summary.tex @@ -9,6 +9,7 @@ \usepackage{lmodern} \setFontType{sans} + \newcommand{\lC}{\texttt{C}} \newcommand{\content}[1]{\shade{blue}{#1}} \newcommand{\warn}[1]{\bg{orange}{#1}}