[SPCA] Virtual memory, Caches updated

This commit is contained in:
2026-01-24 15:04:32 +01:00
parent 28fb91c42d
commit ff86cef50c
17 changed files with 319 additions and 108 deletions

\subsection{Caches}
Processor speed generally improves faster than memory speed does. Optimizing memory accesses is therefore necessary, and caches are one such method.
\inlineex The importance of caches can quickly be seen when looking at the memory hierarchy:
\begin{tabular}{lllll}
\toprule
\textbf{Cache type} &
\textbf{What is cached?} &
\textbf{Where is it cached?} &
\textbf{Latency (cycles)} &
\textbf{Managed by} \\
\midrule
Registers &
4/8-byte words &
CPU core &
0 &
Compiler \\
TLB &
Address translations &
On-chip TLB &
0 &
Hardware \\
L1 cache &
64-byte blocks &
On-chip L1 &
1 &
Hardware \\
L2 cache &
64-byte blocks &
On-chip L2 &
10 &
Hardware \\
Virtual memory &
4\,kB page &
Main memory (RAM) &
100 &
Hardware + OS \\
Buffer cache &
4\,kB sectors &
Main memory &
100 &
OS \\
Network buffer cache &
Parts of files &
Local disk, SSD &
1{,}000{,}000 &
SMB/NFS client \\
Browser cache &
Web pages &
Local disk &
10{,}000{,}000 &
Web browser \\
Web cache &
Web pages &
Remote server disks &
1{,}000{,}000{,}000 &
Web proxy server \\
\bottomrule
\end{tabular}

\subsubsection{Performance Metrics}
\begin{itemize}[noitemsep]
\item \bi{Miss Rate} is the fraction of memory references that are not found in the cache. Defined as $\displaystyle \frac{\text{misses}}{\text{accesses}} = 1 - \text{hit rate}$
and is typically 3-10\% in L1 caches and less than 1\% in L2 caches.
\item \bi{Hit Time} is the time required to deliver a line in the cache to the processor and typically is 1-2 cycles for L1 cache and 5-20 cycles for L2 caches.
\item \bi{Miss penalty} is the additional time required when a cache miss occurs, typically around 50-200 cycles.
\end{itemize}
Judging by these numbers, hitting the cache makes a huge difference: the speed difference can easily exceed a factor of 100.
Note also that with a miss penalty of 100 cycles and a cache hit time of 1 cycle, a $99\%$ hit rate is twice as good as a $97\%$ hit rate:
\begin{itemize}[noitemsep]
\item \bi{97\% hits:} $1 \text{ cycle} + 0.03 \cdot 100 \text{ cycles} = 4\text{ cycles}$
\item \bi{99\% hits:} $1 \text{ cycle} + 0.01 \cdot 100 \text{ cycles} = 2\text{ cycles}$
\end{itemize}
Thus, we always use the \textit{miss rate} instead of the hit rate.
For a multi-level cache, we start with the last-level cache, compute its miss penalty, and combine that with the next higher level, and so on (example with a 2-level cache):
\rmvspace
\begin{align*}
\text{MissPenaltyL2} & = \text{DRAMaccessTime} + \frac{\text{BlockSize}}{\text{Bandwidth}} \\
\text{MissPenaltyL1} & = \text{HitTimeL2} + \text{MissRateL2} \cdot \text{MissPenaltyL2}\\
\text{AverageMemoryAccessTime} &= \text{HitTimeL1} + \text{MissRateL1} \cdot \text{MissPenaltyL1}
\end{align*}
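As a sketch, these formulas can be evaluated directly. The cycle counts below (DRAM access time, bandwidth, hit times, miss rates) are made-up example values for illustration, not numbers from the course:

```python
# Two-level AMAT computation following the formulas above.
# All parameter values are hypothetical examples.
def miss_penalty_l2(dram_access_time, block_size, bandwidth):
    # Fixed DRAM latency plus the time to transfer one block
    return dram_access_time + block_size / bandwidth

def miss_penalty_l1(hit_time_l2, miss_rate_l2, penalty_l2):
    return hit_time_l2 + miss_rate_l2 * penalty_l2

def amat(hit_time_l1, miss_rate_l1, penalty_l1):
    return hit_time_l1 + miss_rate_l1 * penalty_l1

penalty_l2 = miss_penalty_l2(dram_access_time=100, block_size=64, bandwidth=16)   # 104 cycles
penalty_l1 = miss_penalty_l1(hit_time_l2=10, miss_rate_l2=0.01, penalty_l2=penalty_l2)
print(amat(hit_time_l1=1, miss_rate_l1=0.05, penalty_l1=penalty_l1))              # ~1.55 cycles
```

Note how the low L2 miss rate keeps the L1 miss penalty close to the L2 hit time.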

\newpage
\subsubsection{Cache misses}
\begin{itemize}[noitemsep]
\item \bi{Compulsory / Cold miss} Occurs on the first access of a block (there can't be any data there yet)
\item \bi{Conflict miss} The cache may be large enough, but multiple blocks may map to the same cache set,
e.g. referencing blocks 0, 8, 0, 8, \dots would miss every time if both blocks map to the same cache line.
This is common behaviour in caches with low associativity.
\item \bi{Capacity miss} The set of active cache blocks (the working set) is larger than the cache
\item \bi{Coherency miss} See section \ref{sec:hw-multicore}.
These happen if cache lines have to be invalidated in multiprocessor scenarios to preserve sequential consistency, etc.
\end{itemize}
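The conflict-miss pattern from above can be reproduced with a minimal direct-mapped model that tracks only tags (no data); the geometry $S = 8$, $E = 1$ is an assumed example:

```python
# Minimal direct-mapped cache model (tags only), illustrating the
# conflict-miss pattern 0, 8, 0, 8, ... described above.
# Assumed geometry: S = 8 sets, E = 1 line per set.
S = 8
lines = [None] * S  # lines[s] holds the tag currently cached in set s

def access(block):
    s, tag = block % S, block // S
    hit = lines[s] == tag
    lines[s] = tag      # on a miss, the new tag evicts the old one
    return hit

# Blocks 0 and 8 both map to set 0 and keep evicting each other:
results = [access(b) for b in [0, 8, 0, 8, 0, 8]]
print(results)  # every access is a miss
```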

\subsubsection{Cache organization}
\content{Structure} Caches can be defined using $S, E, B$ such that $S \cdot E \cdot B = \text{Cache Size}$.\\
$S = 2^s$ is the number of sets, $E = 2^e$ is the number of lines per set, and $B=2^b$ is the number of bytes per cache block / cache line\footnote{
The terms \textit{cache block} and \textit{cache line} will be used interchangeably and sometimes shortened to just \textit{block} or \textit{line}}.
Each block has the following structure where \texttt{v} is the valid bit:
\begin{center}
Cache line:
\begin{tabular}{|c|c|c|}
\hline
v & tag & data ($B = 2^b$ bytes per block) \\
\hline
\end{tabular}
\end{center}
\content{Address} The cache address can be separated into fields which dictate the cache location:
\begin{center}
Address:
\begin{tabular}{|c|c|c|}
\hline
tag & set index & block offset \\
\hline
\end{tabular}
\end{center}
Since we have $S=2^s$ sets and $B=2^b$ bytes per block, we need $s$ bits for the set index, $b$ bits for block offset.
The remaining part (tag) is stored with the cache block and needs to match for a cache hit.
Do note that the cache address is the same as the physical memory address; we just refer to it as the cache address when talking about the interpretation the cache uses for it.
See section \ref{sec:hw-cache-set} for more information.
Higher cache associativity (often) helps reduce the number of conflict misses by increasing the number of lines available for each block.
At the same cache size, however, it reduces the number of available cache sets, which in turn may increase the number of conflict misses.
\inlinedef \textbf{Direct-mapped} i.e. $E = 1$ (1 cache line per set only).
\inlinedef \textbf{$N$-way Set-Associative} i.e. $E = N$ ($N$ cache lines per set, in this course we primarily covered $N = 2$).
% ────────────────────────────────────────────────────────────────────
\subsubsection{Determining cache set from memory address}
\label{sec:hw-cache-set}
The cache set a memory address is mapped to does not depend (directly) on the associativity of the cache\footnote{
The only dependence on associativity is that, at the same cache size, the number of sets is reduced if the associativity is increased}.
To compute the cache set that a memory request maps to, we need to know the number of sets and the line size.
We can then compute the set it maps to using \texttt{x \% S}, where \texttt{x = addr >> b} is the block number in the memory.
Thus, in the cache address, the tag corresponds to \texttt{y = x / S}, so it indicates which actual memory location is stored in the cache
(as multiple different memory block numbers will map to the same cache set).
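These computations can be sketched directly; the geometry below ($B = 64$ bytes per block, so $b = 6$, and $S = 128$ sets, so $s = 7$) is an assumed example:

```python
# Splitting a memory address into tag / set index / block offset,
# following x = addr >> b and set = x % S from above.
# Assumed geometry: b = 6 (64-byte blocks), s = 7 (128 sets).
b, s = 6, 7
S = 1 << s

def split(addr):
    offset = addr & ((1 << b) - 1)
    x = addr >> b          # block number in memory
    set_index = x % S      # equivalently (addr >> b) & (S - 1)
    tag = x // S           # equivalently addr >> (b + s)
    return tag, set_index, offset

print(split(0x12345))
```

Because $S$ and $B$ are powers of two, the modulo and division reduce to bit masking and shifting in hardware.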

\newpage
\subsubsection{Memory reads}
\content{In Direct-Mapped caches}
\begin{enumerate}[noitemsep]
\item We find the cache set and check if the valid bit is set. If it is not, skip to step 3 and treat the tag comparison as failed.
\item Check if the stored tag is equal to the requested tag.
\item If it matches, return the block at the correct offset; else evict the line, fetch the correct one (will be $B$ bytes), and return the block at the correct offset.
\end{enumerate}
\content{In $2$-way Set-Associative caches}
\begin{enumerate}[noitemsep]
\item Find the corresponding cache set and compare the tag to both blocks.
\item If one matches, check its valid bit. If none match, go to step 5.
\item If valid, return the block at the correct offset.
\item If invalid, evict the line, fetch the correct one. Return to step 3
\item If no match, evict one of the two lines (preferring an invalid line if there is one; otherwise choose via a replacement policy like LRU (Least Recently Used) or at random),
fetch the requested line, and go to step 3.
\end{enumerate}
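The 2-way lookup with LRU replacement can be sketched as follows; valid bits and data are simplified away by representing each set as an ordered list of at most $E$ tags (most recently used last), and the geometry $S = 4$, $E = 2$ is an assumed example:

```python
# Sketch of a 2-way set-associative read with LRU replacement,
# mirroring the steps above. Assumed geometry: S = 4 sets, E = 2 ways.
from collections import defaultdict

S, E = 4, 2
sets = defaultdict(list)   # sets[s] = list of cached tags, MRU last

def read(block):
    s, tag = block % S, block // S
    ways = sets[s]
    if tag in ways:        # tag matches one of the two lines: hit
        ways.remove(tag)
        ways.append(tag)   # mark as most recently used
        return True
    if len(ways) == E:
        ways.pop(0)        # evict the least recently used line
    ways.append(tag)       # fetch the requested line
    return False           # miss

# Blocks 0 and 4 share set 0 but can now coexist in the two ways:
results = [read(b) for b in [0, 4, 0, 4]]
print(results)  # two cold misses, then hits
```

Contrast this with the direct-mapped case, where the same access pattern would miss on every reference.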

\subsubsection{Memory writes}
Memory writes are at least as slow as memory reads, and sometimes slower.
Thus, they also have to be cached to improve performance.
Again, there are a few options for handling write caching; we will cover the two most prevalent ones:
\begin{itemize}
\item \bi{Write-through} Here, the data is immediately written to main memory.
The obvious benefit is that main memory is always up-to-date, but we gain no speed from caching the write; it is thus very slow.
\item \bi{Write-back} We defer the write until the line is replaced (or sometimes other conditions or an explicit cache flush).
The obvious benefit is the increased speed, as we can benefit from the low cache access latency.
We do however need a \textit{dirty bit} to indicate that the cache line differs from main memory.
This introduces additional complexity, especially in multi-core situations, which is why in that case
a write-through mode is often enabled for the variables that need atomicity.
\end{itemize}
Another question that arises is what to do on a \textit{write-miss}?
\begin{itemize}
\item \bi{Write-allocate} The data is loaded into the cache, which is beneficial if more writes to the same location follow suit.
It is, however, harder to implement and may evict an existing value from the cache.
This is commonly seen with write-back caches.
\item \bi{No-write-allocate} This writes to the memory immediately, is easier to implement, but again slower, especially if the value is later re-read.
This is commonly seen with write-through caches.
\end{itemize}
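The write-back, write-allocate combination can be sketched for a single cache line with a dirty bit; addresses and values below are made up for illustration:

```python
# Toy write-back, write-allocate behaviour for a single cache line,
# as described above: writes stay in the cache until eviction.
memory = {0x100: 0}
line = {"tag": None, "dirty": False, "value": None}

def evict():
    if line["dirty"]:                  # deferred write-back happens only here
        memory[line["tag"]] = line["value"]
    line.update(tag=None, dirty=False, value=None)

def write(addr, value):
    if line["tag"] != addr:            # write miss: allocate the line first
        evict()
        line.update(tag=addr, value=memory.get(addr, 0))
    line["value"] = value
    line["dirty"] = True               # line now differs from main memory

write(0x100, 42)
before = memory[0x100]                 # still 0: the write is deferred
evict()
after = memory[0x100]                  # 42 once the dirty line is written back
print(before, after)
```

A write-through cache would instead update `memory` inside `write` itself and never need the dirty bit.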

\subsubsection{Writing fast code}
Improving the speed of code can be done by optimizing its locality (both temporal and spatial).
Computing a matrix product or iterating over an array can be much faster depending on how you access the array and how it is stored.
A common tactic to improve the throughput of these kinds of operations is to make sure the access order matches the storage order
(i.e. for a row-major matrix, iterate over the elements of a row in the inner loop),
or to do blocked multiplication, where you multiply small blocks at a time.
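The loop-order tactic can be illustrated on a small matrix product; note that in pure Python the interpreter overhead hides the cache effect, so this is only a sketch of the access pattern that matters in C or with large arrays:

```python
# Loop-order locality sketch for C = A * B with row-major storage.
# In the naive ijk order the inner loop strides down a column of B
# (poor spatial locality); the ikj order below walks rows of B and C.
n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]

C = [[0] * n for _ in range(n)]
for i in range(n):
    for k in range(n):          # ikj: inner loop accesses B and C row-wise
        a = A[i][k]
        for j in range(n):
            C[i][j] += a * B[k][j]

# Reference result via the naive ijk order, for comparison.
R = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]
print(C == R)  # same result, different memory access pattern
```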

\subsubsection{Cache Addressing Schemes}
\label{sec:hw-addressing-schemes}
The cache can see either the virtual or the physical address, and the tag and index do \textit{not} both need to use the same kind of address.
If this seems confusing, first read and understand section \ref{sec:hw-virt-mem}.
\begin{center}
\begin{tabular}{c|c|c}
\hline
\textbf{Indexing} & \textbf{Tagging} & \textbf{Code} \\
\hline
Virtually Indexed & Virtually Tagged & VV \\
Virtually Indexed & Physically Tagged & VP \\
Physically Indexed & Virtually Tagged & PV \\
Physically Indexed & Physically Tagged & PP
\end{tabular}
\end{center}