mirror of
https://github.com/janishutz/eth-summaries.git
synced 2026-03-14 10:50:05 +01:00
[SPCA] Finish multicore
28
semester3/spca/code-examples/03_hw/02_mcs.c
Normal file
@@ -0,0 +1,28 @@
#include <stdlib.h>

struct qnode {
    struct qnode *next;
    int locked;
};
typedef struct qnode *lock_t;

void acquire( lock_t *lock, struct qnode *local ) {
    local->next = NULL;
    struct qnode *prev = XCHG( lock, local );
    if ( prev ) { // queue was non-empty
        local->locked = 1;
        prev->next = local;
        while ( local->locked )
            ; // spin
    }
}

void release( lock_t *lock, struct qnode *local ) {
    if ( local->next == NULL ) {
        if ( CAS( lock, local, NULL ) )
            return;
        while ( local->next == NULL )
            ; // spin
    }
    local->next->locked = 0;
}
@@ -15,7 +15,43 @@ Since we most commonly do not read a value of \texttt{0} in the lock memory loca
\inputcodewithfilename{c}{}{code-examples/03_hw/01_tas.c}

A word of caution: Do not use TAS to check whether a value has changed outside a lock.
It will most likely not work in \lC\ and almost certainly not in \texttt{Java} or any higher-level language.
In systems code however, it works well, as TAS is a hardware instruction.
The processor won't reorder instructions past a TAS, and the compiler should not either;
both are instructed not to via memory fences and the \texttt{volatile} keyword.

However, be aware that the code depends on the processor architecture and its memory consistency model.
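As a hedged sketch of the idea (this is not the course's TAS file, which is included above), a test-and-set spin lock can be written with portable C11 atomics; the flag operation compiles down to the hardware TAS or exchange instruction and carries the required fences:

```c
#include <stdatomic.h>

/* Minimal test-and-set spin lock (a sketch, not the course listing).
 * atomic_flag_test_and_set atomically sets the flag and returns the
 * previous value; its default seq_cst ordering doubles as the fence. */
typedef atomic_flag tas_lock_t;
#define TAS_LOCK_INIT ATOMIC_FLAG_INIT

static void tas_acquire(tas_lock_t *l) {
    while (atomic_flag_test_and_set(l))
        ;                       /* spin until the previous value was 0 */
}

static void tas_release(tas_lock_t *l) {
    atomic_flag_clear(l);       /* store 0: the lock is free again */
}
```

Contending threads each loop on the flag; only the one that reads a previous value of 0 proceeds.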
\content{CAS} (Compare-and-Swap) first loads the current value, then checks if it is the same as the ``old'' value,
and only if it is sets it to the ``new'' value.
The typical function signature of a CAS function is \texttt{CAS(location, old, new)} and it typically returns the old value.

In comparison to TAS, CAS is commonly used for lock-free programming, whereas TAS is almost an integral part of programming with locks.
As already covered in parallel programming, we can use CAS to read a value, do a computation with it, then commit it back only if the value was unchanged,
or else restart our computation.
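The read-compute-commit-or-retry pattern can be sketched with C11 atomics (a sketch; the course's CAS macro plays the role of the compare-exchange call here):

```c
#include <stdatomic.h>

/* Lock-free add via a CAS retry loop. If another thread changed *loc
 * between our load and our CAS, the CAS fails, `old` is refreshed with
 * the current value, and the computation is redone with it. */
static void atomic_add_retry(atomic_int *loc, int delta) {
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, old + delta))
        ;   /* retry with the refreshed value of `old` */
}
```

The same loop shape works for any pure computation on the value, not just addition.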

Another option is to use a reference counter, which by default is set to 1 on an ``untouched'' data structure.
The global pointer points to the current memory location and should only be modified using CAS.

When a reader reads the data, the counter is incremented (\texttt{ref1 = 2}).
Then, if a writer comes along, it reads and increments the ref counter (\texttt{ref1 = 3}),
then copies the data and decrements the ref counter of the original data (\texttt{ref1 = 2}, \texttt{ref2 = 1} by convention).
It then uses CAS to swap the global pointer to the new data and decrements the ref count again (\texttt{ref1 = 1}).
When the reader now finishes, it decrements the ref counter of the original data once more (\texttt{ref1 = 0}) and the data is deleted.
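The counting protocol above can be sketched as follows (names and layout are illustrative, not from the course material; note the load-then-increment window in the reader, which real implementations close with hazard pointers or RCU — the sketch only illustrates the counting steps):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct versioned {
    atomic_int ref;     /* starts at 1: the global pointer's reference */
    int value;
};

static _Atomic(struct versioned *) global_ptr;  /* modified only via CAS */

static struct versioned *reader_pin(void) {
    struct versioned *p = atomic_load(&global_ptr);
    atomic_fetch_add(&p->ref, 1);           /* e.g. ref1 = 2 in the text */
    return p;
}

static void unpin(struct versioned *p) {
    if (atomic_fetch_sub(&p->ref, 1) == 1)  /* dropped to 0: last user */
        free(p);
}

static void writer_update(int new_value) {
    struct versioned *old = reader_pin();   /* ref1 = 3 while we copy */
    struct versioned *copy = malloc(sizeof *copy);
    atomic_init(&copy->ref, 1);             /* ref2 = 1 by convention */
    copy->value = new_value;                /* copy + modify, collapsed */
    unpin(old);                             /* copy taken: ref1 = 2 */
    struct versioned *expected = old;
    if (atomic_compare_exchange_strong(&global_ptr, &expected, copy))
        unpin(old);   /* global no longer references old: ref1 = 1 */
    else
        free(copy);   /* lost a race; a full version would retry */
}
```

A reader that pinned the old version keeps reading it safely; its final unpin is what actually frees the old data.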

\content{ABA problem} occurs if one process (or processor) does not see an intermediate change $B$, because a second change wrote back the same value, $A$,
as it was before the intermittent change.
This problem is caused by the fact that CAS doesn't tell you whether the variable had been overwritten, only whether the value is different.

An easy way to solve this is to simply use more bits and make, e.g., the upper few bits an always-incrementing counter.
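A hedged sketch of that fix: pack a 32-bit value and a 32-bit version counter into one 64-bit word and bump the counter on every write, so two writes that restore the same value still change the word and a stale CAS fails as it should:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Upper 32 bits: always-incrementing version; lower 32 bits: value. */
static inline uint64_t pack(uint32_t version, uint32_t value) {
    return ((uint64_t)version << 32) | value;
}

/* CAS that bumps the version on success. Returns 1 on success, 0 if
 * the word changed -- in value OR version -- since `expected` was read. */
static int versioned_cas(_Atomic uint64_t *loc, uint64_t expected,
                         uint32_t new_value) {
    uint64_t desired = pack((uint32_t)(expected >> 32) + 1, new_value);
    return atomic_compare_exchange_strong(loc, &expected, desired);
}
```

An observer holding a stale snapshot now fails its CAS even after an A-to-B-to-A sequence, because the version differs.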

\content{DCAS} (Double CAS) is an alternative to CAS, where you compare two memory locations and only update if both match.
However, this proved not to be useful over normal CAS and was subsequently dropped by most ISAs,
as everything you could achieve with DCAS could also be achieved really fast with just CAS.

\content{In \texttt{x86}} There are many options, the easiest one being to prefix an instruction with \texttt{lock}, which locks the data buses.
If you only want to use TAS, you can use the \texttt{xchg} instruction.
The \texttt{lock xadd} instruction executes an atomic fetch-and-add.
The \texttt{CMPXCHG}, \texttt{CMPXCHG8B} and \texttt{CMPXCHG16B} instructions implement \texttt{32 bit}, \texttt{64 bit} and \texttt{128 bit} CAS, respectively,
where the \texttt{128 bit} version is exclusive to \texttt{x86-64}.
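As a sketch of the lock-prefixed fetch-and-add (x86-64 with GCC/Clang inline assembly only; portable code should prefer the compiler's atomic builtins instead):

```c
#include <stdint.h>

/* Atomic fetch-and-add via x86's `lock xadd`. xadd stores old+delta to
 * *loc and leaves the old value in delta's register; the lock prefix
 * makes the read-modify-write atomic. */
static inline int32_t fetch_add_x86(int32_t *loc, int32_t delta) {
    __asm__ __volatile__("lock xaddl %0, %1"
                         : "+r"(delta), "+m"(*loc)
                         :
                         : "memory");
    return delta;   /* previous value of *loc */
}
```

The `"memory"` clobber keeps the compiler from reordering memory accesses across the instruction, mirroring the fence behaviour discussed above.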
@@ -0,0 +1,30 @@
\subsubsection{Symmetric Multiprocessing}
SMP allows multiple cores to access the same memory.
It still allows each core to have a separate cache, which is a de-facto requirement for SMP to work, as otherwise the memory becomes an even more serious bottleneck.

However, even with all the cache optimizations, the memory is still the bottleneck in SMP.
The MOESI protocol can alleviate some of the slowness by enabling reads to be serviced by other caches,
but that can in turn slow those caches down.
Additionally, memory accesses can stall the current processor, as well as other processors accessing the data.

So, to reduce idle times, we would like to keep issuing instructions to the Functional Units (FUs), but Instruction Level Parallelism (ILP) is limited by data dependencies.

This is where SMT (Simultaneous Multithreading) comes into play. As of 2025, only AMD's consumer hardware features SMT; Intel abandoned it with Arrow Lake (the Core Ultra series).
On most platforms, especially Intel platforms up to that point, SMT was often referred to as Hyperthreading.
It exploits the fact that other threads likely have operations that can be served from cache, so the FUs can be kept busy.

There are three main ways of achieving SMT:
\begin{itemize}
    \item \bi{Thread IDs} There are multiple independent instruction streams, each tagged with its own thread ID for later reassignment.
    \item \bi{Fine-grained MT} On every instruction dispatch, a thread to dispatch from is picked.
    \item \bi{Coarse-grained MT} When a memory stall is encountered, switch to another thread.
\end{itemize}
Of note is that CPU cores with multithreading appear to the OS as multiple cores; in the common case of what AMD and Intel (up to Arrow Lake) do (did),
each core appears as two logical cores, which is also where the ``Thread'' count on CPU spec sheets comes from.

As nice as this idea sounds, it isn't necessarily faster, as threads might compete for cache, which is why Intel abandoned SMT altogether,
and it is seeing quite competitive performance regardless.
Then again, SMT is very cheap to implement in terms of extra transistors, but the performance gain is typically only in the 10--20\% range, if that.

Finally, the performance gain strongly depends on the workload. Scientific computing rarely benefits, as compute is often the limiting factor;
web servers, for example, benefit strongly, as they are commonly severely memory constrained.
@@ -0,0 +1,28 @@

\subsubsection{Non-Uniform Memory Access}
In a typical early SMP architecture, more cores were provided, but not necessarily more cache, or the cache was even completely shared between the cores.

This is where the NUMA concept comes in, restricting each core, or group of cores, to a subset of the memory.
This is especially advantageous in large data centers, where there might be hundreds of CPUs with terabytes of memory.

Initially, NUMA was restricted to data center use, but with the Zen microarchitecture, AMD brought the concept to the consumer market,
where larger and more performant CPUs with higher core counts are split into multiple CCXs (Core CompleXes), each with cache per core and cache per CCX.
Then, there is the Infinity Fabric, a CCX interconnect, allowing the cores of the other CCX(s) to access the data in the cache of the current CCX.

However, that is not a full NUMA implementation, as the CCXs still share one memory bus.
In a multi-CPU deployment (i.e. with multiple CPU sockets each containing a CPU), the CPUs often have their own memory and memory controller, as well as an interconnect
allowing them to communicate with the other sockets to access the data in their memory.
In such a deployment, a CPU socket with its own memory is referred to as a NUMA node.

Let's look at an example with two AMD EPYC 7742 CPUs, each having 64 cores. Each of these CPUs features 128 PCIe Gen 4 lanes,
64 of which can be used for the Infinity Fabric CPU-to-CPU interconnect (which uses the PCIe interface).
It is comparatively slow, maxing out at a theoretical 128 GB/s in one direction.
Compare that to the roughly 205 GB/s bandwidth of eight channels of DDR4 at 3200 MT/s.

\content{Cache coherence} We can no longer snoop on the bus, as there is no bus anymore.
One solution is to emulate a bus, which enables something akin to snooping, but without the shared bus:
each node sends a message to all other nodes and waits for a reply from all of them before proceeding.
This is the way AMD's Infinity Fabric works (or used to work).

Another option to circumvent the issue is to use a Cache Directory, where for each line we store the data, the ID of the node it originated from,
plus one bit per node indicating whether the line is present in that node.
It is primarily efficient if lines are not widely shared or if there are lots of NUMA nodes.
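One directory entry per cache line can be sketched as follows (field names and the 8-node size are illustrative assumptions, not from the course material):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of one cache-directory entry for an 8-node system. The
 * presence bitmap has one bit per NUMA node, as described above. */
struct dir_entry {
    uint8_t owner;     /* ID of the node the line originated from */
    uint8_t presence;  /* bit i set <=> node i holds a copy of the line */
};

static bool node_has_line(const struct dir_entry *e, unsigned node) {
    return (e->presence >> node) & 1u;
}

static void record_sharer(struct dir_entry *e, unsigned node) {
    e->presence |= (uint8_t)(1u << node);
}
```

On an invalidation, only the nodes whose presence bit is set need a message, which is why the scheme pays off when lines are not widely shared.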
@@ -0,0 +1,11 @@

\subsubsection{Performance and Optimization}
Depending on the interconnect layout, accessing data from certain nodes is \textit{much} slower than from others.
This especially happens if there is no direct path from the current node to the other node and the access needs to pass through other nodes.

It is also of utmost importance to keep threads on the same CPU to improve locality (which improves cache coverage) and to pack related data into a single cache line.
Otherwise, the cache line ``ping-pongs'' between cores.
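Keeping a thread on one CPU can be done with the Linux-specific affinity API (a sketch; the helper name is illustrative):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU so the scheduler does not
 * migrate it across cores or NUMA nodes. Returns 0 on success. */
static int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

Pinning each worker thread near the memory it touches keeps accesses on the local NUMA node and avoids the ping-pong effect described above.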

\content{MCS locks} There are many ways to improve performance on multiprocessor systems. The MCS lock is one of the best locking systems for multiprocessors.
It tries to solve the issue that if a cache line contains a lock, it is continuously invalidated, and that invalidation traffic dominates the interconnect.
The solution is that a processor enqueues itself on a list of waiting processors and then spins on its \bi{own entry} in the list.
\inputcodewithfilename{c}{}{code-examples/03_hw/02_mcs.c}
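As a hedged usage sketch: the listing's XCHG and CAS are mapped here onto GCC/Clang builtins (an assumption; the course defines them elsewhere), and explicit atomic loads/stores replace the plain spins so the loops survive compiler optimisation. Each thread brings its own stack-allocated queue node:

```c
#include <pthread.h>
#include <stddef.h>

/* Assumed stand-ins for the listing's primitives: */
#define XCHG(lock, node) __atomic_exchange_n((lock), (node), __ATOMIC_ACQ_REL)
#define CAS(lock, oldv, newv) __sync_bool_compare_and_swap((lock), (oldv), (newv))

struct qnode {
    struct qnode *next;
    int locked;
};
typedef struct qnode *lock_t;

static void acquire(lock_t *lock, struct qnode *local) {
    local->next = NULL;
    struct qnode *prev = XCHG(lock, local);
    if (prev) {                                   /* queue was non-empty */
        local->locked = 1;
        __atomic_store_n(&prev->next, local, __ATOMIC_RELEASE);
        while (__atomic_load_n(&local->locked, __ATOMIC_ACQUIRE))
            ;                                     /* spin on our OWN entry */
    }
}

static void release(lock_t *lock, struct qnode *local) {
    if (__atomic_load_n(&local->next, __ATOMIC_ACQUIRE) == NULL) {
        if (CAS(lock, local, NULL))               /* nobody queued behind us */
            return;
        while (__atomic_load_n(&local->next, __ATOMIC_ACQUIRE) == NULL)
            ;                                     /* successor mid-enqueue */
    }
    __atomic_store_n(&local->next->locked, 0, __ATOMIC_RELEASE);
}

/* Usage: threads contend on one lock, each spinning only on its own node. */
static lock_t mcs = NULL;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    struct qnode me;                 /* this thread's entry in the queue */
    for (int i = 0; i < 100000; i++) {
        acquire(&mcs, &me);
        counter++;                   /* protected by the lock */
        release(&mcs, &me);
    }
    return NULL;
}
```

Because each waiter spins on its own qnode, releasing the lock invalidates only the successor's cache line instead of one line shared by all waiters.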