[SPCA] Virtual memory, Caches updated

This commit is contained in:
2026-01-24 15:04:32 +01:00
parent 28fb91c42d
commit ff86cef50c
17 changed files with 319 additions and 108 deletions


@@ -0,0 +1,35 @@
\newpage
\subsection{Virtual Memory}
\label{sec:hw-virt-mem}
Conceptually, Assembly operations treat memory as one very large contiguous array of bytes: Each byte has an individual address.
\begin{minted}{gas}
movl (%rcx), %eax # Refers to a Virtual Address
\end{minted}
While that is convenient for the programmer, it is of course an abstraction and the physical address space is smaller.
The physical memory is used as a ``cache'' for the virtual memory, as the virtual memory pages are loaded into memory by the OS dynamically.
The main advantages are:
\begin{itemize}[noitemsep]
\item Efficient use of (limited) RAM: Keep only active areas of virtual address space in memory
\item Simplifies memory management for programmers
\item Isolates address spaces: Processes can't interfere with other processes
\end{itemize}
The reason virtual memory is even feasible is that most programs have great locality.
The performance is good as long as the total virtual memory that is actively being used does not exceed the available physical memory.
If that happens, we speak of \bi{Thrashing}, where the performance degrades significantly due to the large number of swaps occurring.
Another benefit of virtual memory is that the automated virtual-to-physical mapping simplifies memory allocation and management,
since we no longer have to seek out free physical pages manually.
It also serves as protection: by updating the page tables accordingly, the OS can allow certain processes to share memory,
whilst disallowing others to access the data.
For that, the page table entries (PTEs) are extended to also include some permissions.
As touched on already, this also allows for memory sharing, which can become useful for dynamic linking.
These are just some of the benefits of virtual memory.


@@ -0,0 +1,69 @@
\subsubsection{Address Translation}
Address translation happens in a dedicated hardware component: The Memory Management Unit (MMU).
Thus, the CPU can use the virtual memory addresses and does not have to worry about translating to the physical addresses.
Virtual and Physical Addresses share the same structure, but the VPN (Virtual Page Number) is usually far longer than the PPN (Physical Page Number),
since the virtual space is far bigger. The offsets match however.
\begin{multicols}{2}
\begin{center}
Virtual:
\begin{tabular}{|c|c|}
\hline
V. Page Number & V. Page Offset \\
\hline
\end{tabular}
\end{center}
\columnbreak
\begin{center}
Physical:
\begin{tabular}{|c|c|}
\hline
P. Page Number & P. Page Offset \\
\hline
\end{tabular}
\end{center}
\end{multicols}
The Page Table (PT), whose base address is held in a special Page Table Base Register (PTBR), contains the mapping $\text{VPN} \mapsto \text{PPN}$.
Page Table Entries (PTE) are cached in the L1 cache like any other memory word.
The Translation Lookaside Buffer (TLB) is a small hardware cache inside the MMU that accelerates PT lookups; a TLB hit is typically faster than an L1 cache hit.
The PT is usually stored in memory\footnote{In practice, most address translations actually hit the TLB.} and contains for each virtual page the corresponding physical page number,
as well as a valid bit, which indicates whether the page is in memory.
If a page is not in memory, a page fault is triggered, which transfers control to the OS, which then loads the page into memory,
updates the page table and returns control to the process. This and the inverse (i.e. unloading pages from memory onto the disk) is often referred to as \textit{swapping}.
Due to the slowness of disks, page sizes are fairly large, typically 4--8KB and in some cases up to 4MB.
The replacement policy algorithms are highly sophisticated and too complicated to be implemented in hardware and are thus usually handled by the operating system.
\content{Address Translation with page hit} The CPU requests a virtual memory address from the MMU.
It fetches the PTE from memory and sends the physical address to the memory system, which sends the data to the CPU.
\content{Address Translation with page fault} During the check of the valid bit of the page table entry, the MMU finds that it is not set.
It thus triggers a page fault exception, which is then handled and a victim page (if necessary) is then picked and evicted (and if the dirty flag is set, it is paged out to disk).
The handler then loads the new page into memory and updates the PT and the original instruction is then restarted on the CPU and the address translation will then succeed.
As already touched on, the TLB can be used to speed up translation.
\content{Address Translation with TLB Hit} When checking the TLB for the entry, the MMU finds it and we save one memory access.
With the PPN retrieved, the memory system sends the data to the CPU.
\content{Address Translation with TLB Miss} This works similarly to the case without a TLB, as the TLB returns a miss signal for the request.
The difference is that the PTE fetched from memory is then inserted into the TLB, evicting an entry according to the replacement policy if necessary. The data is then fetched from the physical address and sent to the CPU.
\content{Example} We consider $N=14$ bit virtual addresses and $M=12$ bit physical addresses. The offset takes $6$ bits.\footnote{The images in this example are from the SPCA lecture notes for FS25.}
If we assume a TLB with $16$ entries, and $4$ way associativity, the VPN translates like this: \scriptsize(where \texttt{TLBT = Tag} and \texttt{TLBI = Set})\normalsize
\begin{center}
\includegraphics[width=0.7\linewidth]{images/VPN-to-TLB.png}
\end{center}
Similarly, if we assume a direct-mapped $16$ line cache with $4$ byte blocks: \scriptsize(where \texttt{CT = Tag}, \texttt{CI = Set} and \texttt{CO = Offset})\normalsize
\begin{center}
\includegraphics[width=0.65\linewidth]{images/PPN-to-Cache.png}
\end{center}


@@ -0,0 +1,6 @@
\subsubsection{Multilevel Page Tables}
\content{Motivation} For a 48-bit Virtual Address Space with $4$KB ($= 2^{12}$ bytes) page size, a single flat page table would require $2^{48} / 2^{12} \cdot 2^3 = 2^{39}$ bytes
(that is 512 GB). The factor $2^{3}$ bytes is the size of one page table entry (8 bytes).
Multi-Level page tables add further steps to this process: Instead of a single PT we have a Page Directory, whose entries (PDEs) contain the addresses of separate Page Tables.
The top bits of the VPN index into the directory and subsequent bit groups into each lower level, which technically allows any depth of page tables.


@@ -0,0 +1,41 @@
\subsubsection{x86 Virtual Memory}
In \verb|x86-64| Virtual Addresses are $48$ bits long, yielding an address space of $256$TB.\\
Physical Addresses are $52$ bits, with $40$ bit PPNs, yielding a page size of $4$KB (we thus have $64$ bit PTEs).
On the slides, they are again using (as far as we can tell) a Skylake CPU (Core 6000 series, could also be Kaby Lake, Core 7000 series).
On that architecture, the TLB contained the 40 bit PPN, a 32 bit TLB Tag, as well as
\begin{itemize}
\item a valid bit (\texttt{V})
\item a global bit (\texttt{G}, copied from PDE / PTE and prevents eviction)
\item a supervisor-only bit (\texttt{S}, i.e. only accessible to OS, copied from PDE / PTE)
\item a writable bit (\texttt{W}, page is writable, copied from PDE / PTE)
\item a dirty bit (\texttt{D}, PTE has been marked dirty (i.e. modified vs memory))
\end{itemize}
The PTEs contain a number of flags, and there are also a significant number of bits available for systems programmers to use on \texttt{x86}.
Since they are highly unlikely to be exam-relevant, we will only point out that there are a lot of them (including setting supervisor mode, read/write mode, dirty, etc).
To view them all, find them in the lecture slides of lecture 20, pages 87 through 90.
For many years, L1 cache sizes were stagnant. The reason: only the 6 bits of the page offset above the line offset were used to determine the set (and no more),
so that cache indexing could already occur during address translation; this limits the number of sets.
Today, there are techniques to overcome that limitation, and we have since seen fairly substantial increases in cache sizes
(primarily from Team Red, starting with the AMD Ryzen 7 5800X3D).
\content{Addressing Schemes Revisited}
Returning to the Addressing Schemes from section \ref{sec:hw-addressing-schemes}, it becomes evident that this is the key to solving the issue just touched on.
If we virtually tag and virtually index the address, we have the issue that there may exist multiple PAs for each VA (i.e. the mapping is context dependent).
To circumvent that issue, an ASID (Address Space Identifier) is added to the tag.
The Virtually Indexed, Physically Tagged scheme is what we have just seen and is commonly used for L1 caches.
The Physically Indexed, Physically Tagged scheme is the solution to the cache size restriction.
It however suffers from slower access times, as the address translation has to complete before the cache line can be identified.
\content{Write buffers} It is also common to have write buffers (which act like a FIFO queue).
They allow the slower cache operations typically associated with writes to complete in the background.
\content{Large pages} With large pages, fewer bits remain in the PPN and the low PPN bits are instead treated as part of the offset (the ``offset-portion'' grows to 21 bits, yielding 2MB pages).
We can increase that to a page size of 1GB, if we increase the ``offset-portion'' further, to 30 bits to be precise.