[SPCA] Finish multicore

This commit is contained in:
2026-01-22 13:06:24 +01:00
parent f7eeac5470
commit adac7e17d3
7 changed files with 135 additions and 2 deletions


@@ -0,0 +1,28 @@
#include <stddef.h>

/* Atomic primitives, here mapped to GCC/Clang builtins:
 * XCHG atomically swaps in a new value and returns the old one,
 * CAS returns true iff the swap from old to new happened. */
#define XCHG(ptr, val)      __atomic_exchange_n((ptr), (val), __ATOMIC_ACQ_REL)
#define CAS(ptr, old, new)  __sync_bool_compare_and_swap((ptr), (old), (new))

struct qnode {
    struct qnode *volatile next;
    volatile int locked;
};
typedef struct qnode *lock_t;
void acquire( lock_t *lock, struct qnode *local ) {
    local->next = NULL;
    struct qnode *prev = XCHG( lock, local ); // append self to the tail
    if ( prev ) { // queue was non-empty
        local->locked = 1;
        prev->next = local; // link in behind our predecessor
        while ( local->locked )
            ; // spin on our own node only
    }
}
void release( lock_t *lock, struct qnode *local ) {
    if ( local->next == NULL ) {
        // still the tail: try to empty the queue and be done
        if ( CAS( lock, local, NULL ) )
            return;
        // a successor is mid-enqueue: wait for it to link in
        while ( local->next == NULL )
            ; // spin
    }
    local->next->locked = 0; // hand the lock to the successor
}


@@ -15,7 +15,43 @@ Since we most commonly do not read a value of \texttt{0} in the lock memory loca
\inputcodewithfilename{c}{}{code-examples/03_hw/01_tas.c}
A word of caution: do not use TAS to check whether a value has changed outside a lock.
It will most likely not work in \lC\ and almost certainly not in \texttt{Java} or any higher-level language.
In systems code, however, it works well, as TAS is a hardware instruction.
The processor will not reorder instructions past a TAS, and the compiler should not either;
compilers are told so via memory fences and the \texttt{volatile} keyword.
However, be aware that such code depends on the processor architecture and its memory consistency model.
\content{CAS} (Compare-and-Swap) first loads the current value, then checks whether it equals the ``old'' value,
and only if it does, sets it to the ``new'' value.
The typical function signature of a CAS function is \texttt{CAS(location, old, new)}, and it typically returns the old value.
Compared to TAS, CAS is commonly used for lock-free programming, whereas TAS is almost an integral part of programming with locks.
As already covered in parallel programming, we can use CAS to read a value, do a computation with it, and commit it back only if the value is unchanged,
or else restart our computation.
Another option is to use a ref counter, which by default is set to 1 on an ``untouched'' data structure.
The global pointer points to the current memory location and should only be modified using CAS.
When a reader reads the data, it increments the counter (\texttt{ref1 = 2}).
If a writer then comes along, it reads and increments the ref counter (\texttt{ref1 = 3}),
copies the data, and decrements the ref counter of the original (\texttt{ref1 = 2}; the new copy starts at \texttt{ref2 = 1} by convention).
It then uses CAS to swap the global pointer to the new data and decrements the original's ref count again (\texttt{ref1 = 1}).
When the reader now finishes, it decrements the ref counter of the original data once more (\texttt{ref1 = 0}) and the data is deleted.
\content{ABA problem} occurs if one process (or processor) does not see an intermediate change to $B$ because a second change wrote back the same value, $A$,
as before the intermediate change.
It is caused by the fact that CAS cannot tell you whether the variable has been overwritten, only whether its value differs.
An easy way to solve this is to use more bits and make, e.g., the upper few bits an always-incrementing counter.
\content{DCAS} (Double CAS) is a variant of CAS that compares two memory locations and only updates if both match.
However, it proved not to be useful enough over normal CAS and was subsequently dropped by most ISAs,
as everything achievable with DCAS can also be achieved efficiently with plain CAS.
\content{In \texttt{x86}} there are many options, the easiest being to prefix an instruction with \texttt{lock}, which locks the data buses.
If you only want to use TAS, you can use the \texttt{xchg} instruction.
The \texttt{lock xadd} instruction executes an atomic fetch-and-add.
\texttt{CMPXCHG}, \texttt{CMPXCHG8B} and \texttt{CMPXCHG16B} implement \texttt{32 bit}, \texttt{64 bit} and \texttt{128 bit} CAS, respectively,
where the \texttt{128 bit} version is exclusive to \texttt{x86-64}.


@@ -0,0 +1,30 @@
\subsubsection{Symmetric Multiprocessing}
SMP allows multiple cores to access the same memory.
Each core may still have a separate cache, which is a de-facto requirement for SMP to work, as otherwise the memory becomes an even more serious bottleneck.
However, even with all the cache optimizations, the memory is still the bottleneck in SMP.
The MOESI protocol can alleviate some of the slowness by enabling reads to be serviced by other caches,
but that can again slow that cache down.
Additionally, memory accesses can stall the current processor, as well as other processors while accessing data.
So, to reduce idle times, we would like to issue instructions to the Functional Units (FUs), but Instruction Level Parallelism (ILP) is limited due to data dependencies.
This is where SMT (Simultaneous Multithreading) comes into play. As of 2025, only AMD's consumer hardware still features SMT; Intel abandoned it with Arrow Lake (Core Ultra series).
On most platforms, especially Intel platforms up to that point, SMT was often referred to as Hyperthreading.
It exploits the fact that while one thread stalls, other threads likely have operations ready (often served by the cache), so the FUs can be kept busy.
There are three main ways of achieving SMT:
\begin{itemize}
\item \bi{Thread IDs} There are multiple independent instruction streams with each having their own thread IDs for later reassignment.
\item \bi{Fine-grained MT} On every instruction dispatch, a thread to dispatch from is picked.
\item \bi{Coarse-grained MT} When a memory stall is encountered, switch to another thread.
\end{itemize}
Of note is that CPU cores with multithreading appear to the OS as multiple cores; in the common case of AMD and Intel (up to Arrow Lake),
each physical core appears as two logical cores, which is also where the ``Thread'' count on CPU spec sheets comes from.
As good as this idea sounds, it is not necessarily faster, as threads may compete for cache; this is one reason Intel abandoned SMT altogether,
and they are seeing quite competitive performance nonetheless.
Then again, SMT is very cheap to implement in terms of extra transistors, while the performance gain is typically in the 10--20\% range, if that.
Finally, the performance gain strongly depends on the workload: scientific computing rarely benefits, as compute is often the limiting factor,
whereas web servers, for example, benefit strongly, as they are commonly severely memory constrained.


@@ -0,0 +1,28 @@
\subsubsection{Non-Uniform Memory Access}
In a typical early SMP architecture, more cores were provided, but not necessarily more cache, or even completely shared cache between the cores.
This is where the NUMA concept comes in, restricting each core, or group of cores, to a subset of the memory.
This is specifically advantageous in large data centers, where there might be hundreds of CPUs with terabytes of memory.
Initially, NUMA was restricted to data center use, but with the Zen microarchitecture, AMD brought the concept to the consumer market,
where larger and more performant CPUs with larger core counts are split into multiple CCXs (Core CompleXes), each with cache per core and cache per CCX.
Then, there is the Infinity Fabric, which is a CCX interconnect, allowing the cores from the other CCX(s) to access the data in the cache of the current CCX.
However, that is not a full NUMA implementation, as they still share one memory bus.
In a multi-CPU deployment (i.e. with multiple sockets, each containing a CPU), the CPUs often have their own memory and memory controller, as well as an interconnect,
allowing them to communicate with the other sockets to access their data in the memory.
In such a deployment, a CPU socket with its own memory is referred to as a NUMA Node.
Let's look at an example with two AMD EPYC 7742 CPUs, each having 64 cores. Each of these CPUs features 128 PCIe Gen 4 lanes,
64 of which could be used for the Infinity Fabric CPU to CPU interconnect (which uses the PCIe Interface).
It is comparatively slow, maxing out at approximately a theoretical 128 GB/s in one direction.
Compare that to the roughly 205 GB/s memory bandwidth of DDR4 at 3200 MT/s across the CPU's eight channels.
\content{Cache coherence} We can no longer snoop on the bus, as there is no bus anymore.
One solution is to emulate a bus, which enables something akin to snooping, but without the shared medium:
each node sends a message to all other nodes and waits for a reply from all of them before proceeding.
This is the way AMD's Infinity Fabric works (or used to work).
Another option is to use a Cache Directory, which stores, for each cache line, the ID of the node it originated from,
plus one bit per node indicating whether the line is present in that node.
This is primarily efficient if lines are not widely shared or if there are lots of NUMA nodes.


@@ -0,0 +1,11 @@
\subsubsection{Performance and Optimization}
Depending on the interconnect layout, accessing data from certain nodes is \textit{much} slower than from others.
This especially happens if there is no direct path to the other node from the current node and the access needs to pass through other nodes.
It is also of utmost importance to keep threads on the same CPU to improve locality, which improves cache coverage, and to try to pack related data into a single cache line.
Otherwise, the cache line ``ping-pongs'' between cores.
\content{MCS locks} There are many ways to improve performance on multiprocessor systems. The MCS lock is one of the best locking mechanisms for multiprocessors.
It tries to solve the issue that if a cache line contains a lock, it is continuously invalidated, and that invalidation traffic dominates the interconnect.
The solution to the problem is that a processor enqueues itself on a list of waiting processors and then spins on its \bi{own entry} in the list.
\inputcodewithfilename{c}{}{code-examples/03_hw/02_mcs.c}

Binary file not shown.