mirror of
https://github.com/janishutz/eth-summaries.git
synced 2026-03-14 10:50:05 +01:00
[SPCA] Virtual memory, Caches updated
This commit is contained in:
@@ -1,3 +1,2 @@
|
||||
% This chapter was (almost) entirely written by Robin Bacher
|
||||
Hardware was only touched on briefly in this course, and knowledge of a previous Computer Architecture course is assumed.
|
||||
This section is similarly incomplete, only focusing on the specific topics covered in the lecture.
|
||||
|
||||
@@ -1,28 +1,8 @@
|
||||
\subsection{Caches}
|
||||
|
||||
Processors generally improve quicker than memory speed does. Therefore, optimizing memory is necessary, and this is what Caches do.
|
||||
Processors generally improve quicker than memory speed does. Therefore, optimizing memory is necessary, and caches are one such method.
|
||||
|
||||
\content{Structure} Caches can be defined using $S, E, B$ s.t. $S \cdot E \cdot B = \text{Cache Size}$.\\
|
||||
$S = 2^s$ is the set count, $E = 2^e$ is the lines per set, and $B=2^b$ is the bytecount per cache block.
|
||||
|
||||
\content{Address} Using the above, the address can be separated into fields which dictate the cache location:
|
||||
|
||||
\begin{center}
|
||||
Address:
|
||||
\begin{tabular}{|c|c|c|}
|
||||
\hline
|
||||
tag & set index & block offset \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
|
||||
Since we have $S=2^s$ sets and $B=2^b$ bytes per block, we need $s$ bits for the set index, $b$ bits for block offset. The remaining part (tag) is stored with the cache block and needs to match for a cache hit.
|
||||
|
||||
\inlinedef \textbf{Direct-mapped} i.e. $E=1$ (1 cache line per set only).
|
||||
|
||||
\inlinedef \textbf{2-way Set-Associative} i.e. $E=2$ (2 cache lines per set).
|
||||
|
||||
\content{Example} The importance of caches can quickly be seen when looking at the memory hierarchy:
|
||||
\inlineex The importance of caches can quickly be seen when looking at the memory hierarchy:
|
||||
|
||||
\begin{tabular}{llllll}
|
||||
\toprule
|
||||
@@ -87,19 +67,3 @@ Since we have $S=2^s$ sets and $B=2^b$ bytes per block, we need $s$ bits for the
|
||||
Web proxy server \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
|
||||
\subsubsection{Cache Addressing Schemes}
|
||||
|
||||
The cache can see either the virtual or physical address, and the tag and index do \textit{not} need to both use the physical/virtual address.
|
||||
|
||||
\begin{center}
|
||||
\begin{tabular}{c|c|c}
|
||||
\hline
|
||||
\textbf{Indexing} & \textbf{Tagging} & \textbf{Code} \\
|
||||
\hline
|
||||
Virtually Indexed & Virtually Tagged & VV \\
|
||||
Virtually Indexed & Physically Tagged & VP \\
|
||||
Physically Indexed & Virtually Tagged & PV \\
|
||||
Physically Indexed & Physically Tagged & PP
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
22
semester3/spca/parts/03_hw/03_caches/01_perf-metrics.tex
Normal file
22
semester3/spca/parts/03_hw/03_caches/01_perf-metrics.tex
Normal file
@@ -0,0 +1,22 @@
|
||||
\subsubsection{Performance Metrics}
|
||||
\begin{itemize}[noitemsep]
|
||||
\item \bi{Miss Rate} is the fraction of memory references that are not found in the cache. Defined as $\displaystyle \frac{\text{misses}}{\text{accesses}} = 1 - \text{hit rate}$
|
||||
and is typically 3-10\% in L1 caches and less than 1\% in L2 caches.
|
||||
\item \bi{Hit Time} is the time required to deliver a line in the cache to the processor and typically is 1-2 cycles for L1 cache and 5-20 cycles for L2 caches.
|
||||
\item \bi{Miss penalty} is the additional time required when a cache miss occurs. Typically around 50-200 cycles
|
||||
\end{itemize}
|
||||
Judging by these numbers, it makes a huge difference if we hit the cache and the speed difference can easily exceed a factor of 100x.
|
||||
Of note is as well that a $99\%$ hit rate is twice as good as a $97\%$ hit rate with a miss penalty of 100 cycles and a cache hit time of 1 cycle:
|
||||
\begin{itemize}[noitemsep]
|
||||
\item \bi{97\% hits:} $1 \text{ cycle} + 0.03 \cdot 100 \text{ cycles} = 4\text{ cycles}$
|
||||
\item \bi{99\% hits:} $1 \text{ cycle} + 0.01 \cdot 100 \text{ cycles} = 2\text{ cycles}$
|
||||
\end{itemize}
|
||||
Thus, we always use \textit{miss rate} instead of hit rate.
|
||||
|
||||
For a multi-level cache, we start with the last level cache and compute its miss penalty and combine that with the next higher level and so on (example with 2 level cache)
|
||||
\rmvspace
|
||||
\begin{align*}
|
||||
\text{MissPenaltyL2} & = \text{DRAMaccessTime} + \frac{\text{BlockSize}}{\text{Bandwidth}} \\
|
||||
\text{MissPenaltyL1} & = \text{HitTimeL2} + \text{MissRateL2} \cdot \text{MissPenaltyL2}\\
|
||||
\text{AverageMemoryAccessTime} &= \text{HitTimeL1} + \text{MissRateL1} \cdot \text{MissPenaltyL1}
|
||||
\end{align*}
|
||||
11
semester3/spca/parts/03_hw/03_caches/02_misses.tex
Normal file
11
semester3/spca/parts/03_hw/03_caches/02_misses.tex
Normal file
@@ -0,0 +1,11 @@
|
||||
\newpage
|
||||
\subsubsection{Cache misses}
|
||||
\begin{itemize}[noitemsep]
|
||||
\item \bi{Compulsory / Cold miss} Occurs on the first access of a block (there can't be any data there yet)
|
||||
\item \bi{Conflict miss} The cache may be large enough, but multiple lines may map to the current block,
|
||||
e.g. referencing blocks 0, 8, 0, 8, \dots would miss every time if they are both mapped to the same cache line.
|
||||
This is the typical behaviour for most caches.
|
||||
\item \bi{Capacity miss} The number of active cache blocks is larger than the cache
|
||||
\item \bi{Coherency miss} See section \ref{sec:hw-multicore}.
|
||||
They happen if cache lines have to be invalidated in multiprocessor scenarios to preserve sequential consistency, etc.
|
||||
\end{itemize}
|
||||
51
semester3/spca/parts/03_hw/03_caches/03_organization.tex
Normal file
51
semester3/spca/parts/03_hw/03_caches/03_organization.tex
Normal file
@@ -0,0 +1,51 @@
|
||||
\subsubsection{Cache organization}
|
||||
\content{Structure} Caches can be defined using $S, E, B$ such that $S \cdot E \cdot B = \text{Cache Size}$.\\
|
||||
$S = 2^s$ is the set count, $E = 2^e$ is the lines per set, and $B=2^b$ is the number of bytes per cache block / cache line\footnote{
|
||||
The terms \textit{cache block} and \textit{cache line} will be used interchangeably and sometimes shortened to just \textit{block} or \textit{line}}.
|
||||
|
||||
Each block has the following structure where \texttt{v} is the valid bit:
|
||||
\begin{center}
|
||||
Cache line:
|
||||
\begin{tabular}{|c|c|c|}
|
||||
\hline
|
||||
v & tag & data ($B = 2^b$ bytes per block) \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
|
||||
\content{Address} The cache address can be separated into fields which dictate the cache location:
|
||||
\begin{center}
|
||||
Address:
|
||||
\begin{tabular}{|c|c|c|}
|
||||
\hline
|
||||
tag & set index & block offset \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
Since we have $S=2^s$ sets and $B=2^b$ bytes per block, we need $s$ bits for the set index, $b$ bits for block offset.
|
||||
The remaining part (tag) is stored with the cache block and needs to match for a cache hit.
|
||||
|
||||
Do note that the cache address is the same as the physical memory address, we just refer to it as cache address when talking about the interpretation the cache uses for it.
|
||||
See section \ref{sec:hw-cache-set} for more information.
|
||||
|
||||
|
||||
Higher cache associativity helps reduce the number of conflict misses (often) by increasing the number of lines available for each block.
|
||||
It however, at the same cache size, reduces the number of available cache sets, which in turn may increase the number of conflict misses.
|
||||
|
||||
\inlinedef \textbf{Direct-mapped} i.e. $E = 1$ (1 cache line per set only).
|
||||
|
||||
\inlinedef \textbf{$N$-way Set-Associative} i.e. $E = N$ ($N$ cache lines per set, in this course we primarily covered $N = 2$).
|
||||
|
||||
|
||||
% ────────────────────────────────────────────────────────────────────
|
||||
|
||||
\subsubsection{Determining cache set from memory address}
|
||||
\label{sec:hw-cache-set}
|
||||
The cache set a memory address is mapped to does not depend (directly) on the associativity of the cache\footnote{
|
||||
The only dependence it has on the associativity is that with the same cache size, the number of sets is reduced if the associativity is increased}.
|
||||
To be able to compute the cache block that a memory request maps to, we need to know the number of sets and the line size.
|
||||
|
||||
We can then compute the set it maps to using \texttt{x \% S}, where \texttt{x = addr >> b} is the block number in the memory.
|
||||
|
||||
Thus, in the cache address, the tag corresponds to \texttt{y = x / S}, so it indicates which actual memory location is stored in the cache
|
||||
(as multiple different memory block numbers will map to the same cache set).
|
||||
19
semester3/spca/parts/03_hw/03_caches/04_reads.tex
Normal file
19
semester3/spca/parts/03_hw/03_caches/04_reads.tex
Normal file
@@ -0,0 +1,19 @@
|
||||
\newpage
|
||||
\subsubsection{Memory reads}
|
||||
\content{In Direct-Mapped caches}
|
||||
\begin{enumerate}[noitemsep]
|
||||
\item We find the cache set and check if the valid bit is set. If it is not set, skip to step 3 and treat the lookup as a miss.
|
||||
\item Check if the tag is equal to the requested tag.
|
||||
\item If true, return the block at the correct offset, else evict the line and fetch the correct line (will be $B$ bytes) and return the block at the correct offset.
|
||||
\end{enumerate}
|
||||
|
||||
|
||||
\content{In $2$-way Set-Associative caches}
|
||||
\begin{enumerate}[noitemsep]
|
||||
\item Find the corresponding cache set and compare the tag to both blocks.
|
||||
\item If one matches, check its valid bit. If none match, go to step 5
|
||||
\item If valid, return the block at the correct offset.
|
||||
\item If invalid, evict the line, fetch the correct one. Return to step 3
|
||||
\item If no match, evict one of the two (choose using a replacement policy like LRU (Least Recently Used) or randomly if the other block is not invalid),
|
||||
fetch requested line, go to step 3.
|
||||
\end{enumerate}
|
||||
22
semester3/spca/parts/03_hw/03_caches/05_writes.tex
Normal file
22
semester3/spca/parts/03_hw/03_caches/05_writes.tex
Normal file
@@ -0,0 +1,22 @@
|
||||
\subsubsection{Memory writes}
|
||||
Memory writes are just as slow, if not sometimes slower than memory reads.
|
||||
Thus, they also have to be cached to improve performance.
|
||||
Again, there are a few options to handle write caching and we will cover the two most prevalent ones:
|
||||
\begin{itemize}
|
||||
\item \bi{Write-through} Here, the data is immediately written to main memory.
|
||||
The obvious benefit is that the data is always up-to-date in the memory as well, but we do not gain any speed from doing that, it is thus very slow.
|
||||
\item \bi{Write-back} We defer the write until the line is replaced (or sometimes other conditions or an explicit cache flush).
|
||||
The obvious benefit is the increased speed, as we can benefit from the low cache access latency.
|
||||
We do however need a \textit{dirty bit} to indicate that the cache line differs from main memory.
|
||||
This introduces additional complexity, especially in multi-core situations, which is why in that case,
|
||||
often a write-through mode is enabled for the variables that need atomicity.
|
||||
\end{itemize}
|
||||
|
||||
Another question that arises is what to do on a \textit{write-miss}?
|
||||
\begin{itemize}
|
||||
\item \bi{Write-allocate} The data is loaded into the cache and is beneficial if more writes to the location follow suit.
|
||||
It is however harder to implement and it may evict an existing value from the cache.
|
||||
This is commonly seen with write-back caches.
|
||||
\item \bi{No-write-allocate} This writes to the memory immediately, is easier to implement, but again slower, especially if the value is later re-read.
|
||||
This is commonly seen with write-through caches.
|
||||
\end{itemize}
|
||||
@@ -0,0 +1,8 @@
|
||||
\subsubsection{Writing fast code}
|
||||
Improving the speed of the code can be done by optimizing the code's locality (both temporal and spatial locality).
|
||||
|
||||
Computing a matrix product or iterating over a list can be much faster depending on the way you access the array and how it is stored.
|
||||
|
||||
A common tactic to improve the throughput for these kinds of operations is to make sure that the order of operations is correct
|
||||
(i.e. for a row-major matrix to iterate over the elements of a row in the inner loop),
|
||||
or to do block multiplication where you multiply small blocks at once.
|
||||
@@ -0,0 +1,17 @@
|
||||
\subsubsection{Cache Addressing Schemes}
|
||||
\label{sec:hw-addressing-schemes}
|
||||
|
||||
The cache can see either the virtual or physical address, and the tag and index do \textit{not} need to both use the physical/virtual address.
|
||||
If this seems confusing, first read and understand section \ref{sec:hw-virt-mem}.
|
||||
|
||||
\begin{center}
|
||||
\begin{tabular}{c|c|c}
|
||||
\hline
|
||||
\textbf{Indexing} & \textbf{Tagging} & \textbf{Code} \\
|
||||
\hline
|
||||
Virtually Indexed & Virtually Tagged & VV \\
|
||||
Virtually Indexed & Physically Tagged & VP \\
|
||||
Physically Indexed & Virtually Tagged & PV \\
|
||||
Physically Indexed & Physically Tagged & PP
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
@@ -1,66 +0,0 @@
|
||||
\newpage
|
||||
\subsection{Virtual Memory}
|
||||
|
||||
Conceptually, Assembly operations treat memory as a very large contiguous array of memory: Each byte has an individual address.
|
||||
\begin{minted}{gas}
|
||||
movl (%rcx), %eax # Refers to a Virtual Address
|
||||
\end{minted}
|
||||
In truth of course, this is an abstraction for the memory hierarchy. Actual allocation is done by the compiler \& OS.
|
||||
|
||||
The main advantages are:
|
||||
\begin{itemize}
|
||||
\item Efficient use of (limited) RAM: Keep only active areas of virtual address space in memory
|
||||
\item Simplifies memory management for programmers
|
||||
\item Isolates address spaces: Processes can't interfere with other processes
|
||||
\end{itemize}
|
||||
|
||||
\subsubsection{Address Translation}
|
||||
|
||||
Address translation happens in a dedicated hardware component: The Memory Management Unit (MMU).
|
||||
|
||||
Virtual and Physical Addresses share the same structure, buth the VPN is usually far longer than the PPN, since the virtual space is far bigger. Offsets match.
|
||||
|
||||
\begin{multicols}{2}
|
||||
\begin{center}
|
||||
Virtual:
|
||||
\begin{tabular}{|c|c|}
|
||||
\hline
|
||||
V. Page Number & V. Page Offset \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
\newcolumn
|
||||
\begin{center}
|
||||
Physical:
|
||||
\begin{tabular}{|c|c|}
|
||||
\hline
|
||||
P. Page Number & P. Page Offset \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
\end{multicols}
|
||||
|
||||
The Page Table (Located at a special Page Table Base Register (PTBR)) contains the mapping $\text{VPN} \mapsto \text{PPN}$. Page Table Entries (PTE) are cached in the L1 cache like any other memory word.
|
||||
|
||||
The Translation Lookaside Buffer (TLB) is a small hardware cache inside the MMU, which is faster than an L1 hit.\footnote{In practice, most address translations actually hit the TLB.}
|
||||
|
||||
\content{Example} We consider $N=14$ bit virtual addresses and $M=12$ bit physical addresses. The offset takes $6$ bits.\footnote{The images in this example are from the SPCA lecture notes for FS25.}
|
||||
|
||||
If we assume a TLB with $16$ entries, and $4$ way associativity, the VPN translates like this:
|
||||
|
||||
\begin{center}
|
||||
\includegraphics[width=0.7\linewidth]{images/VPN-to-TLB.png}
|
||||
\end{center}
|
||||
|
||||
Similarly, if we assume a direct-mapped $16$ line cache with $4$ byte blocks:
|
||||
|
||||
\begin{center}
|
||||
\includegraphics[width=0.65\linewidth]{images/PPN-to-Cache.png}
|
||||
\end{center}
|
||||
|
||||
Multi-Level page tables add further steps to this process: Instead of a PT we have a Page Directory Table which contains the addresses of separate Page Tables. The top of the VPN is used to index into each of these, which technically allows any depth of page tables.
|
||||
|
||||
\subsubsection{x86 Virtual Memory}
|
||||
|
||||
In \verb|x86-64| Virtual Addresses are $48$ bits long, yielding an address space of $256$TB.\\
|
||||
Physical Addresses are $52$ bits, with $40$ bit PPNs, yielding a page size of $4KB$.
|
||||
35
semester3/spca/parts/03_hw/04_virtual-memory/00_intro.tex
Normal file
35
semester3/spca/parts/03_hw/04_virtual-memory/00_intro.tex
Normal file
@@ -0,0 +1,35 @@
|
||||
\newpage
|
||||
\subsection{Virtual Memory}
|
||||
\label{sec:hw-virt-mem}
|
||||
|
||||
Conceptually, Assembly operations treat memory as a very large contiguous array of memory: Each byte has an individual address.
|
||||
\begin{minted}{gas}
|
||||
movl (%rcx), %eax # Refers to a Virtual Address
|
||||
\end{minted}
|
||||
% FIXME: I don't fully agree with this, the compiler is not a thing that has a connection to pure assembly
|
||||
In truth of course, this is an abstraction for the memory hierarchy. Actual allocation is done by the compiler \& OS.
|
||||
% PROPOSED CHANGE (along these lines):
|
||||
% While that is convenient for the programmer, this is of course not reality and the physical address space is smaller.
|
||||
% The physical memory is used as a ``cache'' for the virtual memory, as the virtual memory pages are loaded into memory by the OS dynamically.
|
||||
|
||||
|
||||
The main advantages are:
|
||||
\begin{itemize}[noitemsep]
|
||||
\item Efficient use of (limited) RAM: Keep only active areas of virtual address space in memory
|
||||
\item Simplifies memory management for programmers
|
||||
\item Isolates address spaces: Processes can't interfere with other processes
|
||||
\end{itemize}
|
||||
|
||||
|
||||
The reason virtual memory is even feasible is that most programs have great locality.
|
||||
The performance is good as long as the total virtual memory that is actively being used does not exceed the available physical memory.
|
||||
If that happens, we speak of \bi{Thrashing}, where the performance degrades significantly due to the large number of swaps occurring.
|
||||
|
||||
Another benefit of virtual memory is that we can use the automated virtual to physical mapping to simplify memory allocation and management,
|
||||
since we don't have to manually seek out free physical pages anymore.
|
||||
Additionally, that serves as protection, as the OS can choose to allow certain processes to share memory,
|
||||
whilst it can disallow others to access the data by updating the page tables correctly.
|
||||
For that, the page table entries (PTEs) are extended to also include some permissions.
|
||||
|
||||
As touched on already, this also allows for memory sharing, which can become useful for dynamic linking.
|
||||
These are just some of the benefits of virtual memory.
|
||||
@@ -0,0 +1,69 @@
|
||||
\subsubsection{Address Translation}
|
||||
|
||||
Address translation happens in a dedicated hardware component: The Memory Management Unit (MMU).
|
||||
Thus, the CPU can use the virtual memory addresses and does not have to worry about translating to the physical addresses.
|
||||
|
||||
Virtual and Physical Addresses share the same structure, but the VPN (Virtual Page Number) is usually far longer than the PPN (Physical Page Number),
|
||||
since the virtual space is far bigger. The offsets match however.
|
||||
|
||||
\begin{multicols}{2}
|
||||
\begin{center}
|
||||
Virtual:
|
||||
\begin{tabular}{|c|c|}
|
||||
\hline
|
||||
V. Page Number & V. Page Offset \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
\newcolumn
|
||||
\begin{center}
|
||||
Physical:
|
||||
\begin{tabular}{|c|c|}
|
||||
\hline
|
||||
P. Page Number & P. Page Offset \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
\end{multicols}
|
||||
|
||||
The Page Table (PT) (Located at a special Page Table Base Register (PTBR)) contains the mapping $\text{VPN} \mapsto \text{PPN}$.
|
||||
Page Table Entries (PTE) are cached in the L1 cache like any other memory word.
|
||||
|
||||
The Translation Lookaside Buffer (TLB) is a small hardware cache inside the MMU that is used to accelerate the PT lookups, which is typically faster than an L1 cache hit.
|
||||
The PT is usually stored in memory\footnote{In practice, most address translations actually hit the TLB.} and contains for each address the corresponding physical address,
|
||||
as well as a valid bit, which indicates if the page is in memory.
|
||||
|
||||
If a page is not in memory a page fault is triggered, which transfers control to the OS, which then loads the page into memory,
|
||||
updates the page table and returns control back to the process. This and the inverse (i.e. unloading pages from memory onto the disk) is often referred to as \textit{swapping}.
|
||||
|
||||
Due to the slowness of disks, page sizes are usually fairly large (typically 4--8KB, in some cases up to 4MB).
|
||||
The replacement policy algorithms are highly sophisticated and too complicated to be implemented in hardware and are thus usually handled by the operating system.
|
||||
|
||||
\content{Address Translation with page hit} The CPU requests a virtual memory address from the MMU.
|
||||
It fetches the PTE from memory and sends the physical address to the memory system, which sends the data to the CPU.
|
||||
|
||||
\content{Address Translation with page fault} During the check of the valid bit of the page table, the MMU finds that it is not set.
|
||||
It thus triggers a page fault exception, which is then handled and a victim page (if necessary) is then picked and evicted (and if the dirty flag is set, it is paged out to disk).
|
||||
The handler then loads the new page into memory and updates the PT and the original instruction is then restarted on the CPU and the address translation will then succeed.
|
||||
|
||||
As already touched on, the TLB can be used to speed up translation.
|
||||
\content{Address Translation with TLB Hit} When checking the TLB for the entry, the MMU finds the entry and we save one memory access.
|
||||
With the PPN retrieved, the memory system sends the data to the CPU.
|
||||
|
||||
\content{Address Translation with TLB Miss} This works similar to the case when there is no TLB, as the TLB returns a miss signal for the request.
|
||||
Only that the PTE that is returned is inserted into the TLB via replacement policy (if applicable). The data is then fetched from the physical address and sent to the CPU.
|
||||
|
||||
|
||||
\content{Example} We consider $N=14$ bit virtual addresses and $M=12$ bit physical addresses. The offset takes $6$ bits.\footnote{The images in this example are from the SPCA lecture notes for FS25.}
|
||||
|
||||
If we assume a TLB with $16$ entries, and $4$ way associativity, the VPN translates like this: \scriptsize(where \texttt{TLBT = Tag} and \texttt{TLBI = Set})\normalsize
|
||||
|
||||
\begin{center}
|
||||
\includegraphics[width=0.7\linewidth]{images/VPN-to-TLB.png}
|
||||
\end{center}
|
||||
|
||||
Similarly, if we assume a direct-mapped $16$ line cache with $4$ byte blocks: \scriptsize(where \texttt{CT = Tag}, \texttt{CI = Set} and \texttt{CO = Offset})\normalsize
|
||||
|
||||
\begin{center}
|
||||
\includegraphics[width=0.65\linewidth]{images/PPN-to-Cache.png}
|
||||
\end{center}
|
||||
@@ -0,0 +1,6 @@
|
||||
\subsubsection{Multilevel Page Tables}
|
||||
\content{Motivation} For a 48-bit Virtual Address Space with $4$KB ($= 2^{12}$ bytes) page size, the size of a flat page table is $2^{48} / 2^{12} \cdot 2^3 = 2^{39}$ bytes
|
||||
(that is 512 GB). The $2^{3}$ bytes is the size of the page table entry (8 bytes).
|
||||
|
||||
Multi-Level page tables add further steps to this process: Instead of a PT we have a Page Directory Table (PDE) which contains the addresses of separate Page Tables.
|
||||
The top of the VPN is used to index into each of these, which technically allows any depth of page tables.
|
||||
41
semester3/spca/parts/03_hw/04_virtual-memory/03_x86.tex
Normal file
41
semester3/spca/parts/03_hw/04_virtual-memory/03_x86.tex
Normal file
@@ -0,0 +1,41 @@
|
||||
\subsubsection{x86 Virtual Memory}
|
||||
|
||||
In \verb|x86-64| Virtual Addresses are $48$ bits long, yielding an address space of $256$TB.\\
|
||||
Physical Addresses are $52$ bits, with $40$ bit PPNs, yielding a page size of $4KB$ (we thus have $64$ bit PTEs).
|
||||
|
||||
On the slides, they are again using (as far as we can tell) a Skylake CPU (Core 6000 series, could also be Kaby Lake, Core 7000 series).
|
||||
|
||||
On that architecture, the TLB contained the 40 bit PPN, a 32 bit TLB Tag, as well as
|
||||
\begin{itemize}
|
||||
\item a valid bit (\texttt{V})
|
||||
\item a global bit (\texttt{G}, copied from PDE / PTE and prevents eviction)
|
||||
\item a supervisor-only bit (\texttt{S}, i.e. only accessible to OS, copied from PDE / PTE)
|
||||
\item a writable bit (\texttt{W}, page is writable, copied from PDE / PTE)
|
||||
\item a dirty bit (\texttt{D}, PTE has been marked dirty (i.e. modified vs memory))
|
||||
\end{itemize}
|
||||
There are a number of flags set and there are also a significant number of bits available for systems programmers to use on \texttt{x86}.
|
||||
Since they are highly unlikely to be exam-relevant, we will only point out that there are a lot of them (including setting supervisor mode, read/write mode, dirty, etc).
|
||||
To view them all, find them in the lecture slides of lecture 20, pages 87 through 90.
|
||||
|
||||
For many years, cache sizes have been stagnant, this was due to the limited number of bits that could be used to efficiently determine if a line is available in the cache.
|
||||
Today, there are techniques to overcome that limitation and we have seen fairly substantial increases in cache sizes since
|
||||
(primarily from Team Red, starting with the AMD Ryzen 7 5800X3D).
|
||||
|
||||
This was caused by the fact that only 6 bits from the PPO were used to determine the set and not more to improve performance,
|
||||
as the cache indexing could occur during address translation.
|
||||
|
||||
\content{Addressing Schemes Revisited}
|
||||
Returning to the Addressing Schemes from section \ref{sec:hw-addressing-schemes}, it becomes evident that this is the key to solving the issue just touched on.
|
||||
If we virtually tag and virtually index the address, we have the issue that there may exist multiple PAs for each VA (i.e. it is context dependent).
|
||||
To circumvent that issue, an ASID (Address Space Identifier) is added to the tag.
|
||||
|
||||
The Virtually Indexed, Physically Tagged scheme is what we have just seen and is commonly used for L1 caches.
|
||||
|
||||
The Physically Indexed, Physically Tagged scheme is the solution to the cache size restriction.
|
||||
It however suffers from slower access times, as the address translation has to complete before the cache line can be identified.
|
||||
|
||||
\content{Write buffers} It is also common to have write buffers (which act like FIFO queue).
|
||||
It enables slower cache operations to complete that are typically associated with writing.
|
||||
|
||||
\content{Large pages} Simply a lower number of bits remain in the PPN and some bits are ignored (we increased the ``offset-portion'' to 21 bits = 2MB).
|
||||
We can increase that to a page size of 1GB, if we increase the ``offset-portion'' further, to 30 bits to be precise.
|
||||
@@ -1,5 +1,6 @@
|
||||
\newpage
|
||||
\subsection{Multi-Core}
|
||||
\label{sec:hw-multicore}
|
||||
\subsubsection{Background}
|
||||
In the early days of computer hardware it was fairly easy to get higher performance due to the rapid advances in transistor technology.
|
||||
However, today, what is known as Moore's Law (i.e. that the transistor count of integrated circuits doubles every two years) no longer yields such easy performance gains.
|
||||
|
||||
Binary file not shown.
@@ -87,6 +87,7 @@ If there are changes and you'd like to update this summary, please open a pull r
|
||||
\end{center}
|
||||
|
||||
|
||||
% ── x86 assembly ────────────────────────────────────────────────────
|
||||
\newpage
|
||||
\section{x86 Assembly}
|
||||
\input{parts/00_asm/00_intro.tex}
|
||||
@@ -137,7 +138,7 @@ If there are changes and you'd like to update this summary, please open a pull r
|
||||
\input{parts/01_c/06_floating-point/06_math-properties_in-c.tex}
|
||||
|
||||
|
||||
|
||||
% ── GCC ─────────────────────────────────────────────────────────────
|
||||
\newpage
|
||||
\section{The gcc toolchain}
|
||||
\input{parts/02_toolchain/00_intro.tex}
|
||||
@@ -152,8 +153,19 @@ If there are changes and you'd like to update this summary, please open a pull r
|
||||
\input{parts/03_hw/00_intro.tex}
|
||||
\input{parts/03_hw/01_code-optim.tex}
|
||||
\input{parts/03_hw/02_vec-op.tex}
|
||||
\input{parts/03_hw/03_caches.tex}
|
||||
\input{parts/03_hw/04_virtual-memory.tex}
|
||||
\input{parts/03_hw/03_caches/00_intro.tex}
|
||||
\input{parts/03_hw/03_caches/01_perf-metrics.tex}
|
||||
\input{parts/03_hw/03_caches/02_misses.tex}
|
||||
\input{parts/03_hw/03_caches/03_organization.tex}
|
||||
\input{parts/03_hw/03_caches/04_reads.tex}
|
||||
\input{parts/03_hw/03_caches/05_writes.tex}
|
||||
\input{parts/03_hw/03_caches/06_optimizations.tex}
|
||||
\input{parts/03_hw/03_caches/07_addressing-schemes.tex}
|
||||
\input{parts/03_hw/04_virtual-memory/00_intro.tex}
|
||||
\input{parts/03_hw/04_virtual-memory/01_address-translation.tex}
|
||||
\input{parts/03_hw/04_virtual-memory/02_multilevel.tex}
|
||||
\input{parts/03_hw/04_virtual-memory/03_x86.tex}
|
||||
% \input{parts/03_hw/04_virtual-memory/}
|
||||
\input{parts/03_hw/05_exceptions.tex}
|
||||
\input{parts/03_hw/06_multicore/00_background.tex}
|
||||
\input{parts/03_hw/06_multicore/01_limitations.tex}
|
||||
|
||||
Reference in New Issue
Block a user