Mirror of https://github.com/janishutz/eth-summaries.git (synced 2026-03-14 10:50:05 +01:00)
[SPCA] Additions to floating point
@@ -5,4 +5,4 @@ Floating point numbers are a representation of real numbers.
 
 Though there are many ways to accomplish this, \textit{IEEE Standard 754} is used practically everywhere, also in \verb|x86|. This standard is a little more complicated than fractional binary numbers, but has a few numeric advantages, especially for representing very large (very small) numbers.
 
-\hlurl{float.exposed}\ is an excellent website to understand floating point by example.
+\hlhref{https://float.exposed}{float.exposed}\ is an excellent website to understand floating point by example.
@@ -13,22 +13,26 @@ The default used is \textit{Nearest Even}\footnote{Changing the rounding mode is
 
 Rounding can be defined using 3 different bits from the \textit{exact} number: $G, R, S$
 $$
-a = 1.BB\ldots BB\underbrace{G}_\text{Guard}\underbrace{R}_\text{Round}\underbrace{XX\ldots XX}_\text{Sticky}
+a = 1.B_1B_2\ldots B_{n - 2}B_{n - 1}\underbrace{G}_\text{Guard}\underbrace{R}_\text{Round}
+\underbrace{X_1X_2\ldots X_{k - 1}X_k}_\text{Sticky}
 $$
+where $n$ is the number of bits in the mantissa of the format (e.g. $3$ as in the above example of an $8$bit floating point number).
 
 \begin{enumerate}
-\item \textbf{Guard Bit} $G$ is the least significant bit of the (rounded) result
+\item \textbf{Guard Bit} $G$ is the least significant bit of the (rounded) result (i.e. it is $B_n$)
 \item \textbf{Round Bit} $R$ is the $1$st bit cut off after rounding
-\item \textbf{Sticky Bit} $S$ is the logical OR of all remaining cut off bits.
+\item \textbf{Sticky Bit} $S$ is the logical OR of all remaining cut off bits $X_i$.
 \end{enumerate}
 
-Based on these bits the rounding can be decided:
+Based on these bits the rounding can be decided (we increment the rounded part if the expression evaluates to true):
+\hrmvspace
-$$
-R \land S \implies \text{ Round up} \qquad\qquad
-G \land R \land \lnot S \implies \text{ Round to even}
-$$
+\begin{align*}
+\text{Round up: } R \land S
+& &
+\text{Round to even: } G \land R \land \lnot S
+\end{align*}
 
+\drmvspace
 \content{Example} Rounding $8$b precise results to $8$b precision floating point ($4$b mantissa):
 
 \renewcommand{\arraystretch}{1.2}
Binary file not shown.
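The rounding rule from the second hunk (increment the kept part if $R \land S$ or $G \land R \land \lnot S$) can be sketched in C as follows. This is only an illustration and not part of the commit: the function name `round_nearest_even` and its parameters are hypothetical, and the sketch works on a plain unsigned integer holding the exact bits rather than on a real IEEE 754 encoding.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the summary): drop the lowest `drop` bits
 * of `exact`, rounding the kept part to nearest, ties to even.
 */
uint32_t round_nearest_even(uint32_t exact, unsigned drop)
{
    assert(drop >= 1 && drop < 32);

    uint32_t kept = exact >> drop;                 /* bits that survive rounding */
    unsigned g = kept & 1u;                        /* Guard: LSB of the kept part */
    unsigned r = (exact >> (drop - 1)) & 1u;       /* Round: first bit cut off */
    unsigned s = (exact & ((1u << (drop - 1)) - 1u)) != 0; /* Sticky: OR of the rest */

    /* Round up: R and S. Round to even: G and R and not S. */
    if ((r && s) || (g && r && !s)) {
        kept += 1;
    }
    return kept;
}
```

For instance, with `drop = 3` the input `0x14` (binary `10100`, i.e. 2.5 in units of the kept part) is a tie and stays at the even value 2, while `0x1C` (binary `11100`, i.e. 3.5) is a tie that rounds up to the even value 4. The sketch does not handle the case where the increment carries out of the mantissa, which in a real floating point format would require renormalising and adjusting the exponent.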