diff --git a/semester3/spca/code-examples/03_hw/02_cas.c b/semester3/spca/code-examples/03_hw/02_cas.c
deleted file mode 100644
index e69de29..0000000
diff --git a/semester3/spca/code-examples/03_hw/02_mcs.c b/semester3/spca/code-examples/03_hw/02_mcs.c
new file mode 100644
index 0000000..e473acb
--- /dev/null
+++ b/semester3/spca/code-examples/03_hw/02_mcs.c
@@ -0,0 +1,28 @@
+#include <stddef.h> // for NULL
+
+struct qnode {
+    struct qnode *next;
+    int locked;
+};
+typedef struct qnode *lock_t;
+
+void acquire( lock_t *lock, struct qnode *local ) {
+    local->next = NULL;
+    struct qnode *prev = XCHG( lock, local );
+    if ( prev ) { // queue was non-empty
+        local->locked = 1;
+        prev->next = local;
+        while ( local->locked )
+            ; // spin
+    }
+}
+
+void release( lock_t *lock, struct qnode *local ) {
+    if ( local->next == NULL ) {
+        if ( CAS( lock, local, NULL ) )
+            return;
+        while ( local->next == NULL )
+            ; // spin
+    }
+    local->next->locked = 0;
+}
diff --git a/semester3/spca/parts/03_hw/06_multicore/04_sync.tex b/semester3/spca/parts/03_hw/06_multicore/04_sync.tex
index 42051c0..9bd8818 100644
--- a/semester3/spca/parts/03_hw/06_multicore/04_sync.tex
+++ b/semester3/spca/parts/03_hw/06_multicore/04_sync.tex
@@ -15,7 +15,43 @@ Since we most commonly do not read a value of \texttt{0} in the lock memory loca
 \inputcodewithfilename{c}{}{code-examples/03_hw/01_tas.c}
 A word of caution: Do not use TAS to check if a value has changed outside a lock.
-It will most likely not not work in \lC\ and almost certainly not in \texttt{Java} or any higher level languages
+It will most likely not work in \lC\ and almost certainly not in \texttt{Java} or any higher-level language.
+In systems code, however, it works well, as TAS is a hardware instruction.
+The processor won't reorder instructions past a TAS, and the compiler should not either;
+both are told so via memory fences and the \texttt{volatile} keyword.
+
+However, be aware that the code depends on the processor architecture and its memory consistency model.
-\content{CAS} (Compare-and-Swap)
+\content{CAS} (Compare-and-Swap) first loads the current value, checks whether it equals the ``old'' value,
+and only if it does sets it to the ``new'' value.
+The typical signature of a CAS function is \texttt{CAS(location, old, new)}, and it typically returns the old value.
+
+In comparison to TAS, CAS is commonly used for lock-free programming, whereas TAS is almost an integral part of programming with locks.
+As already covered in parallel programming, we can use CAS to read a value, do a computation with it, and commit it back only if the value is unchanged,
+or else restart our computation.
+
+Another option is to use a reference counter, which by default is set to 1 on an ``untouched'' data structure.
+The global pointer points to the current memory location and should only be modified using CAS.
+
+When a reader reads the data, the ref counter is incremented (\texttt{ref1 = 2}).
+Then, if a writer comes along, it reads and increments the ref counter (\texttt{ref1 = 3}),
+then copies the data and decrements the ref counter of the original data (\texttt{ref1 = 2}, \texttt{ref2 = 1} by convention).
+It then uses CAS to swap the global pointer to the new data and decrements the ref count again (\texttt{ref1 = 1}).
+When the reader now finishes, it decrements the ref counter of the original data once more (\texttt{ref1 = 0}) and the data is deleted.
+
+\content{ABA problem} occurs if one process (or processor) does not see an intermediate change $B$ because a second change wrote back the same value, $A$,
+as it was before the intermediate change.
+It is caused by the fact that CAS cannot tell you whether the variable has been overwritten, only whether its value differs.
+
+An easy way to solve this is to simply use more bits and make e.g.\ the upper few bits an always-incrementing counter.
+
+\content{DCAS} (Double CAS) is an alternative to CAS that compares two memory locations and only updates if both match.
+However, it proved not to be useful enough over normal CAS and was subsequently dropped by most ISAs,
+as everything you could achieve with DCAS could also be achieved efficiently with just CAS.
+
+\content{In \texttt{x86}} There are many options, the easiest being to prefix an instruction with \texttt{lock}, which locks the data buses.
+If you only want TAS, you can use the \texttt{xchg} instruction, which is implicitly locked when given a memory operand.
+The \texttt{lock xadd} instruction executes an atomic fetch-and-add.
+\texttt{CMPXCHG}, \texttt{CMPXCHG8B} and \texttt{CMPXCHG16B} implement \texttt{32 bit}, \texttt{64 bit} and \texttt{128 bit} CAS, respectively,
+where the \texttt{128 bit} version is exclusive to \texttt{x86-64}.
diff --git a/semester3/spca/parts/03_hw/06_multicore/05_smp.tex b/semester3/spca/parts/03_hw/06_multicore/05_smp.tex
index e69de29..3f421b4 100644
--- a/semester3/spca/parts/03_hw/06_multicore/05_smp.tex
+++ b/semester3/spca/parts/03_hw/06_multicore/05_smp.tex
@@ -0,0 +1,30 @@
+\subsubsection{Symmetric Multiprocessing}
+SMP allows multiple cores to access the same memory.
+Each core may still have a separate cache, which is a de-facto requirement for SMP to work, as otherwise the memory becomes an even more serious bottleneck.
+
+However, even with all the cache optimizations, the memory is still the bottleneck in SMP.
+The MOESI protocol can alleviate some of the slowness by enabling reads to be serviced by other caches,
+but that can in turn slow those caches down.
+Additionally, memory accesses can stall the current processor, as well as other processors that are accessing data.
+
+So, to reduce idle times, we would like to keep issuing instructions to the Functional Units (FUs), but Instruction Level Parallelism (ILP) is limited due to data dependencies.
+
+This is where SMT (Simultaneous Multithreading) comes into play.
+As of 2025, only AMD's consumer hardware features SMT; Intel has abandoned it with Arrow Lake (Core Ultra series).
+On most platforms, especially Intel platforms up to that point, SMT was often referred to as Hyper-Threading.
+It exploits the fact that other threads likely have operations ready (e.g.\ served by the cache) with which we can keep the FUs busy.
+
+There are three main ways of achieving SMT:
+\begin{itemize}
+    \item \bi{Thread IDs} There are multiple independent instruction streams, each carrying its own thread ID for later reassignment.
+    \item \bi{Fine-grained MT} On every instruction dispatch, a thread to dispatch from is picked.
+    \item \bi{Coarse-grained MT} When a memory stall is encountered, switch to another thread.
+\end{itemize}
+Of note is that CPU cores with multithreading appear to the OS as multiple cores; in the common configuration of AMD and Intel (up to Arrow Lake),
+each physical core appears as two logical cores, which is also where the ``Thread'' count on CPU spec sheets comes from.
+
+As good as this idea sounds, it isn't necessarily faster, as threads might compete for cache, which is one reason Intel has abandoned SMT altogether
+while still seeing quite competitive performance.
+Then again, SMT is very cheap to implement in terms of extra transistors, although the performance gain is typically in the 10--20\% range, if that.
+
+Finally, the performance gain strongly depends on the workload. Scientific computing rarely benefits from it, as compute is often the limiting factor;
+web servers, for example, benefit strongly, as they are commonly severely memory constrained.
diff --git a/semester3/spca/parts/03_hw/06_multicore/06_numa.tex b/semester3/spca/parts/03_hw/06_multicore/06_numa.tex
index e69de29..92647f3 100644
--- a/semester3/spca/parts/03_hw/06_multicore/06_numa.tex
+++ b/semester3/spca/parts/03_hw/06_multicore/06_numa.tex
@@ -0,0 +1,28 @@
+\subsubsection{Non-Uniform Memory Access}
+In a typical early SMP architecture, more cores were provided, but not necessarily more cache, or the cache was even completely shared between the cores.
+
+This is where the NUMA concept comes in, restricting each core, or group of cores, to a subset of the memory.
+This is especially advantageous in large data centers, where there might be hundreds of CPUs with terabytes of memory.
+
+Initially, NUMA was restricted to data center use, but with the Zen microarchitecture, AMD brought the concept to the consumer market,
+where larger and more performant CPUs with higher core counts are split into multiple CCXs (Core CompleXes), each with cache per core and cache per CCX.
+Then there is the Infinity Fabric, a CCX interconnect that allows the cores of the other CCX(s) to access the data in the cache of the current CCX.
+
+However, that is not a full NUMA implementation, as the CCXs still share one memory bus.
+In a multi-CPU deployment (i.e.\ with multiple populated CPU sockets), the CPUs often have their own memory and memory controller, as well as an interconnect
+allowing them to communicate with the other sockets to access the data in their memory.
+In such a deployment, a CPU socket with its own memory is referred to as a NUMA node.
+
+Let's look at an example with two AMD EPYC 7742 CPUs, each having 64 cores. Each of these CPUs features 128 PCIe Gen 4 lanes,
+64 of which can be used for the Infinity Fabric CPU-to-CPU interconnect (which uses the PCIe interface).
+It is comparatively slow, maxing out at approximately a theoretical 128 GB/s in one direction.
+Compare that to the roughly 205 GB/s bandwidth of eight channels of DDR4 at 3200 MT/s ($8 \times 25.6$ GB/s).
+
+\content{Cache coherence} We can no longer snoop on the bus, as there is no bus anymore.
+One solution is to emulate a bus, which enables something akin to snooping, but without the shared medium:
+each node sends a message to all other nodes and waits for a reply from all of them before proceeding.
+This is the way AMD's Infinity Fabric works (or used to work).
+
+Another option to circumvent the issue is to use a cache directory, which stores, per cache line, the ID of the node it originated from,
+plus one bit per node indicating whether the line is present in that node.
+It is primarily efficient if the lines are not widely shared or if there are lots of NUMA nodes.
diff --git a/semester3/spca/parts/03_hw/06_multicore/07_optim.tex b/semester3/spca/parts/03_hw/06_multicore/07_optim.tex
index e69de29..4583916 100644
--- a/semester3/spca/parts/03_hw/06_multicore/07_optim.tex
+++ b/semester3/spca/parts/03_hw/06_multicore/07_optim.tex
@@ -0,0 +1,11 @@
+\subsubsection{Performance and Optimization}
+Depending on the interconnect layout, accessing data from certain nodes is \textit{much} slower than from others.
+This especially happens if there is no direct path from the current node to the other node and the access needs to pass through other nodes.
+
+It is also of utmost importance to keep threads on the same CPU to improve locality, which improves cache coverage, and to pack data that is used together into a single cache line.
+Otherwise, the cache line ``ping-pongs'' between cores.
+
+\content{MCS locks} There are many ways to improve performance on multiprocessor systems. The MCS lock is one of the best locking schemes for multiprocessors.
+It tries to solve the issue that if a cache line contains a lock, it is continuously invalidated, and that dominates the interconnect traffic.
+The solution is that a processor enqueues itself on a list of waiting processors and then spins on its \bi{own entry} in the list.
+\inputcodewithfilename{c}{}{code-examples/03_hw/02_mcs.c}
diff --git a/semester3/spca/spca-summary.pdf b/semester3/spca/spca-summary.pdf
index c59d2bc..68b09a9 100644
Binary files a/semester3/spca/spca-summary.pdf and b/semester3/spca/spca-summary.pdf differ