
\newpage
\subsubsection{Rounding}

Conceptually, every floating point operation proceeds in two steps:
\begin{enumerate}
    \item Compute the exact result
    \item Round it so that it fits the desired precision
\end{enumerate}

\textit{IEEE Standard 754} specifies $4$ rounding modes: \textit{Towards Zero}, \textit{Round Down} (towards $-\infty$), \textit{Round Up} (towards $+\infty$) and \textit{Nearest Even}.

The default mode is \textit{Nearest Even}\footnote{Changing the rounding mode is usually hard to do without using Assembly.}. Like regular rounding, it rounds to whichever of the two neighbouring representable values is closer. If the exact result lies exactly halfway between them, it picks the one whose least significant bit is $0$ (the even one).
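
As a decimal illustration of the four modes (rounding to whole numbers; the sample values are chosen only for illustration and are unrelated to any particular binary format):

\begin{center}
    \begin{tabular}{|l|c|c|c|c|c|}
        \hline
        \textbf{Mode}          & $1.40$ & $1.60$ & $1.50$ & $2.50$ & $-1.50$ \\
        \hline
        Towards Zero           & $1$    & $1$    & $1$    & $2$    & $-1$    \\
        Round Down ($-\infty$) & $1$    & $1$    & $1$    & $2$    & $-2$    \\
        Round Up ($+\infty$)   & $2$    & $2$    & $2$    & $3$    & $-1$    \\
        Nearest Even           & $1$    & $2$    & $2$    & $2$    & $-2$    \\
        \hline
    \end{tabular}
\end{center}

Only \textit{Nearest Even} sends the two halfway cases $1.50$ and $2.50$ to the same even value $2$, so ties are rounded up about as often as they are rounded down.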

Rounding can be defined using 3 different bits from the \textit{exact} number: $G, R, S$
$$
    a = 1.B_1B_2\ldots B_{n - 2}B_{n - 1}\underbrace{G}_\text{Guard}\underbrace{R}_\text{Round}
    \underbrace{X_1X_2\ldots X_{k - 1}X_k}_\text{Sticky}
$$
where $n$ is the number of mantissa bits of the target format (e.g. $n = 3$ for the $8$-bit floating point format from the example above).

\begin{enumerate}
    \item \textbf{Guard Bit} $G$ is the least significant bit that is kept in the result (i.e. it is $B_n$), read before any increment
    \item \textbf{Round Bit} $R$ is the first bit that is cut off by rounding
    \item \textbf{Sticky Bit} $S$ is the logical OR of all remaining cut off bits $X_i$.
\end{enumerate}

Based on these bits the rounding can be decided (we increment the rounded part if the expression evaluates to true):
\hrmvspace
\begin{align*}
    \text{Round up: } R \land S
     &  &
    \text{Round to even: } G \land R \land \lnot S
\end{align*}

\drmvspace
Note that the round-to-even condition only applies if the sticky bit is not set, i.e. if the exact result lies exactly halfway between two representable values. If the sticky bit is set, the round-up condition is used instead.
A simple way to implement the combined condition is:
\mint{c}+(sticky && round) || (!sticky && round && guard)+
Since \mintinline{c}+&&+ and \mintinline{c}+||+ short-circuit in C, evaluation stops as soon as the outcome is known. The expression is logically equivalent to the shorter \mintinline{c}+round && (sticky || guard)+.
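
The condition above can be sketched as a small C function. This is only an illustration under the assumption that the exact result is available as an unsigned integer whose lowest \texttt{drop} bits are to be removed. The name \texttt{round\_nearest\_even} and its parameters are hypothetical and not taken from the lecture.

```c
#include <stdint.h>

/* Hypothetical helper: removes the lowest `drop` bits of `value`
 * (1 <= drop <= 31) using round-to-nearest-even.
 * The result can be one bit wider than the kept part if the
 * increment carries out (e.g. 1111 + 1 = 10000). */
uint32_t round_nearest_even(uint32_t value, unsigned drop)
{
    uint32_t kept   = value >> drop;               /* bits that survive          */
    uint32_t guard  = kept & 1u;                   /* G: LSB of the kept part    */
    uint32_t round  = (value >> (drop - 1)) & 1u;  /* R: first dropped bit       */
    uint32_t sticky =                              /* S: OR of remaining bits    */
        (value & ((1u << (drop - 1)) - 1u)) != 0;

    /* Increment if above the halfway point, or exactly halfway with odd LSB. */
    if ((sticky && round) || (!sticky && round && guard))
        kept += 1u;

    return kept;
}
```

For instance, passing the bit pattern $10011000$ with \texttt{drop = 4} corresponds to $1.001|1000$: $G = 1, R = 1, S = 0$, so the kept part $1001$ is incremented to $1010$.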

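\content{Example} Rounding exact $8$-bit values to the $8$-bit floating point format ($3$ fraction bits, i.e. a $4$-bit significand including the leading $1$). The bar marks where the fraction is cut off: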

\renewcommand{\arraystretch}{1.2}
\begin{center}
    \begin{tabular}{|c|c|c|c|c|}
        \hline
        \textbf{Value} & \textbf{Fraction} & \textbf{GRS} & \textbf{Incr?} & \textbf{Rounded} \\
        \hline
        $128$          & $1.000|0000$      & $000$        & N              & $1.000$          \\
        $13$           & $1.101|0000$      & $100$        & N              & $1.101$          \\
        $17$           & $1.000|1000$      & $010$        & N              & $1.000$          \\
        $19$           & $1.001|1000$      & $110$        & Y              & $1.010$          \\
        $138$          & $1.000|1010$      & $011$        & Y              & $1.001$          \\
        $63$           & $1.111|1100$      & $111$        & Y              & $10.000$         \\
        \hline
    \end{tabular}
\end{center}
\renewcommand{\arraystretch}{1.0}


\textbf{Post-Normalization}: The increment may carry out of the significand, as for $63$ above, where $1.111$ becomes $10.000$. In this case, shift the significand right once and increment the exponent.
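
Putting the steps together, the table above can be reproduced with the following C sketch. It assumes a nonzero $8$-bit integer input and ignores exponent range limits. The type \texttt{struct rounded} and the function \texttt{round\_to\_3\_fraction\_bits} are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Hypothetical result type: the value is (sig / 8) * 2^exp, where
 * sig in [8, 15] holds the bits 1.xxx (leading 1 plus 3 fraction bits). */
struct rounded {
    uint32_t sig;
    int      exp;
};

/* Rounds an 8-bit integer to 1.xxx * 2^exp using round-to-nearest-even. */
struct rounded round_to_3_fraction_bits(uint8_t v)
{
    struct rounded r = { v, 0 };
    if (v == 0)
        return r;                   /* zero has no leading 1: sig = 0, exp = 0 */

    /* Normalize: shift left until bit 7 is the leading 1 (1.xxxxxxx). */
    uint32_t m = v;
    int exp = 7;
    while (!(m & 0x80u)) {
        m <<= 1;
        exp--;
    }

    /* Keep the top 4 bits (1.xxx) and derive G, R, S from the rest. */
    uint32_t kept   = m >> 4;
    uint32_t guard  = kept & 1u;
    uint32_t round  = (m >> 3) & 1u;
    uint32_t sticky = (m & 0x7u) != 0;
    if ((sticky && round) || (!sticky && round && guard))
        kept++;

    /* Post-normalization: 10.000 -> shift right once, increment exponent. */
    if (kept & 0x10u) {
        kept >>= 1;
        exp++;
    }

    r.sig = kept;
    r.exp = exp;
    return r;
}
```

For the input $63$, the increment yields $10.000 \cdot 2^5$, which post-normalization turns into $1.000 \cdot 2^6 = 64$.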