mirror of
https://github.com/janishutz/eth-summaries.git
synced 2026-03-14 10:50:05 +01:00
[SPCA] Finish multicore
28
semester3/spca/code-examples/03_hw/02_mcs.c
Normal file
@@ -0,0 +1,28 @@
#include <stdlib.h>

struct qnode {
    struct qnode *next;
    int locked;
};
typedef struct qnode *lock_t;

void acquire( lock_t *lock, struct qnode *local ) {
    local->next = NULL;
    struct qnode *prev = XCHG( lock, local );
    if ( prev ) { // queue was non-empty
        local->locked = 1;
        prev->next = local;
        while ( local->locked )
            ; // spin
    }
}

void release( lock_t *lock, struct qnode *local ) {
    if ( local->next == NULL ) {
        if ( CAS( lock, local, NULL ) )
            return;
        while ( local->next == NULL )
            ; // spin
    }
    local->next->locked = 0;
}
@@ -15,7 +15,43 @@ Since we most commonly do not read a value of \texttt{0} in the lock memory loca
\inputcodewithfilename{c}{}{code-examples/03_hw/01_tas.c}

A word of caution: Do not use TAS to check whether a value has changed outside a lock.
It will most likely not work in \lC\ and almost certainly not in \texttt{Java} or any higher-level language.
In systems code however, it works well, as TAS is a hardware instruction.
The processor won't reorder instructions past a TAS, and the compiler should not either;
both are instructed not to via memory fences and the \texttt{volatile} keyword.

However, be aware that the code depends on the processor architecture and its memory consistency model.
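As a hedged sketch of the idea (this is not the course's TAS file, which is included above), a test-and-set spin lock can be written with portable C11 atomics; the flag operation compiles down to the hardware TAS or exchange instruction and carries the required fences:

```c
#include <stdatomic.h>

/* Minimal test-and-set spin lock (a sketch, not the course listing).
 * atomic_flag_test_and_set atomically sets the flag and returns the
 * previous value; its default seq_cst ordering doubles as the fence. */
typedef atomic_flag tas_lock_t;
#define TAS_LOCK_INIT ATOMIC_FLAG_INIT

static void tas_acquire(tas_lock_t *l) {
    while (atomic_flag_test_and_set(l))
        ;                       /* spin until the previous value was 0 */
}

static void tas_release(tas_lock_t *l) {
    atomic_flag_clear(l);       /* store 0: the lock is free again */
}
```

Contending threads each loop on the flag; only the one that reads a previous value of 0 proceeds.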
\content{CAS} (Compare-and-Swap) first loads the current value, then checks if it is the same as the ``old'' value,
and only if it is sets it to the ``new'' value.
The typical function signature of a CAS function is \texttt{CAS(location, old, new)} and it typically returns the old value.

In comparison to TAS, CAS is commonly used for lock-free programming, whereas TAS is almost an integral part of programming with locks.
As already covered in parallel programming, we can use CAS to read a value, do a computation with it, then commit it back only if the value was unchanged,
or else restart our computation.
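The read-compute-commit-or-retry pattern can be sketched with C11 atomics (a sketch; the course's CAS macro plays the role of the compare-exchange call here):

```c
#include <stdatomic.h>

/* Lock-free add via a CAS retry loop. If another thread changed *loc
 * between our load and our CAS, the CAS fails, `old` is refreshed with
 * the current value, and the computation is redone with it. */
static void atomic_add_retry(atomic_int *loc, int delta) {
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, old + delta))
        ;   /* retry with the refreshed value of `old` */
}
```

The same loop shape works for any pure computation on the value, not just addition.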

Another option is to use a reference counter, which by default is set to 1 on an ``untouched'' data structure.
The global pointer points to the current memory location and should only be modified using CAS.

When a reader reads the data, the counter is incremented (\texttt{ref1 = 2}).
Then, if a writer comes along, it reads and increments the ref counter (\texttt{ref1 = 3}),
then copies the data and decrements the ref counter of the original data (\texttt{ref1 = 2}, \texttt{ref2 = 1} by convention).
It then uses CAS to swap the global pointer to the new data and decrements the ref count again (\texttt{ref1 = 1}).
When the reader now finishes, it decrements the ref counter of the original data once more (\texttt{ref1 = 0}) and the data is deleted.
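The counting protocol above can be sketched as follows (names and layout are illustrative, not from the course material; note the load-then-increment window in the reader, which real implementations close with hazard pointers or RCU — the sketch only illustrates the counting steps):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct versioned {
    atomic_int ref;     /* starts at 1: the global pointer's reference */
    int value;
};

static _Atomic(struct versioned *) global_ptr;  /* modified only via CAS */

static struct versioned *reader_pin(void) {
    struct versioned *p = atomic_load(&global_ptr);
    atomic_fetch_add(&p->ref, 1);           /* e.g. ref1 = 2 in the text */
    return p;
}

static void unpin(struct versioned *p) {
    if (atomic_fetch_sub(&p->ref, 1) == 1)  /* dropped to 0: last user */
        free(p);
}

static void writer_update(int new_value) {
    struct versioned *old = reader_pin();   /* ref1 = 3 while we copy */
    struct versioned *copy = malloc(sizeof *copy);
    atomic_init(&copy->ref, 1);             /* ref2 = 1 by convention */
    copy->value = new_value;                /* copy + modify, collapsed */
    unpin(old);                             /* copy taken: ref1 = 2 */
    struct versioned *expected = old;
    if (atomic_compare_exchange_strong(&global_ptr, &expected, copy))
        unpin(old);   /* global no longer references old: ref1 = 1 */
    else
        free(copy);   /* lost a race; a full version would retry */
}
```

A reader that pinned the old version keeps reading it safely; its final unpin is what actually frees the old data.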

\content{ABA problem} occurs if one process (or processor) does not see an intermediate change $B$, because a second change wrote back the same value, $A$,
as it was before the intermittent change.
This problem is caused by the fact that CAS doesn't tell you whether the variable had been overwritten, only whether the value is different.

An easy way to solve this is to simply use more bits and make, e.g., the upper few bits an always-incrementing counter.
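A hedged sketch of that fix: pack a 32-bit value and a 32-bit version counter into one 64-bit word and bump the counter on every write, so two writes that restore the same value still change the word and a stale CAS fails as it should:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Upper 32 bits: always-incrementing version; lower 32 bits: value. */
static inline uint64_t pack(uint32_t version, uint32_t value) {
    return ((uint64_t)version << 32) | value;
}

/* CAS that bumps the version on success. Returns 1 on success, 0 if
 * the word changed -- in value OR version -- since `expected` was read. */
static int versioned_cas(_Atomic uint64_t *loc, uint64_t expected,
                         uint32_t new_value) {
    uint64_t desired = pack((uint32_t)(expected >> 32) + 1, new_value);
    return atomic_compare_exchange_strong(loc, &expected, desired);
}
```

An observer holding a stale snapshot now fails its CAS even after an A-to-B-to-A sequence, because the version differs.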

\content{DCAS} (Double CAS) is an alternative to CAS, where you compare two memory locations and only update if both match.
However, this proved not to be useful over normal CAS and was subsequently dropped by most ISAs,
as everything you could achieve with DCAS could also be achieved really fast with just CAS.

\content{In \texttt{x86}} There are many options, the easiest one being to prefix an instruction with \texttt{lock}, which locks the data buses.
If you only want to use TAS, you can use the \texttt{xchg} instruction.
The \texttt{lock xadd} instruction executes an atomic fetch-and-add.
The \texttt{CMPXCHG}, \texttt{CMPXCHG8B} and \texttt{CMPXCHG16B} instructions implement \texttt{32 bit}, \texttt{64 bit} and \texttt{128 bit} CAS, respectively,
where the \texttt{128 bit} version is exclusive to \texttt{x86-64}.
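As a sketch of the lock-prefixed fetch-and-add (x86-64 with GCC/Clang inline assembly only; portable code should prefer the compiler's atomic builtins instead):

```c
#include <stdint.h>

/* Atomic fetch-and-add via x86's `lock xadd`. xadd stores old+delta to
 * *loc and leaves the old value in delta's register; the lock prefix
 * makes the read-modify-write atomic. */
static inline int32_t fetch_add_x86(int32_t *loc, int32_t delta) {
    __asm__ __volatile__("lock xaddl %0, %1"
                         : "+r"(delta), "+m"(*loc)
                         :
                         : "memory");
    return delta;   /* previous value of *loc */
}
```

The `"memory"` clobber keeps the compiler from reordering memory accesses across the instruction, mirroring the fence behaviour discussed above.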
@@ -0,0 +1,30 @@
\subsubsection{Symmetric Multiprocessing}
SMP allows multiple cores to access the same memory.
It still allows each core to have a separate cache, which is a de-facto requirement for SMP to work, as otherwise the memory becomes an even more serious bottleneck.

However, even with all the cache optimizations, the memory is still the bottleneck in SMP.
The MOESI protocol can alleviate some of the slowness by enabling reads to be serviced by other caches,
but that can in turn slow those caches down.
Additionally, memory accesses can stall the current processor, as well as other processors accessing the data.

So, to reduce idle times, we would like to keep issuing instructions to the Functional Units (FUs), but Instruction Level Parallelism (ILP) is limited by data dependencies.

This is where SMT (Simultaneous Multithreading) comes into play. As of 2025, only AMD's consumer hardware features SMT; Intel abandoned it with Arrow Lake (the Core Ultra series).
On most platforms, especially Intel platforms up to that point, SMT was often referred to as Hyperthreading.
It exploits the fact that other threads likely have operations that can be served from cache, so the FUs can be kept busy.

There are three main ways of achieving SMT:
\begin{itemize}
    \item \bi{Thread IDs} There are multiple independent instruction streams, each tagged with its own thread ID for later reassignment.
    \item \bi{Fine-grained MT} On every instruction dispatch, a thread to dispatch from is picked.
    \item \bi{Coarse-grained MT} When a memory stall is encountered, switch to another thread.
\end{itemize}
Of note is that CPU cores with multithreading appear to the OS as multiple cores; in the common case of what AMD and Intel (up to Arrow Lake) do (did),
each core appears as two logical cores, which is also where the ``Thread'' count on CPU spec sheets comes from.

As nice as this idea sounds, it isn't necessarily faster, as threads might compete for cache, which is why Intel abandoned SMT altogether,
and it is seeing quite competitive performance regardless.
Then again, SMT is very cheap to implement in terms of extra transistors, but the performance gain is typically only in the 10--20\% range, if that.

Finally, the performance gain strongly depends on the workload. Scientific computing rarely benefits, as compute is often the limiting factor;
web servers, for example, benefit strongly, as they are commonly severely memory constrained.
@@ -0,0 +1,28 @@

\subsubsection{Non-Uniform Memory Access}
In a typical early SMP architecture, more cores were provided, but not necessarily more cache, or the cache was even completely shared between the cores.

This is where the NUMA concept comes in, restricting each core, or group of cores, to a subset of the memory.
This is especially advantageous in large data centers, where there might be hundreds of CPUs with terabytes of memory.

Initially, NUMA was restricted to data center use, but with the Zen microarchitecture, AMD brought the concept to the consumer market,
where larger and more performant CPUs with higher core counts are split into multiple CCXs (Core CompleXes), each with cache per core and cache per CCX.
Then, there is the Infinity Fabric, a CCX interconnect, allowing the cores of the other CCX(s) to access the data in the cache of the current CCX.

However, that is not a full NUMA implementation, as the CCXs still share one memory bus.
In a multi-CPU deployment (i.e. with multiple CPU sockets each containing a CPU), the CPUs often have their own memory and memory controller, as well as an interconnect
allowing them to communicate with the other sockets to access the data in their memory.
In such a deployment, a CPU socket with its own memory is referred to as a NUMA node.

Let's look at an example with two AMD EPYC 7742 CPUs, each having 64 cores. Each of these CPUs features 128 PCIe Gen 4 lanes,
64 of which can be used for the Infinity Fabric CPU-to-CPU interconnect (which uses the PCIe interface).
It is comparatively slow, maxing out at a theoretical 128 GB/s in one direction.
Compare that to the roughly 205 GB/s bandwidth of eight channels of DDR4 at 3200 MT/s.

\content{Cache coherence} We can no longer snoop on the bus, as there is no bus anymore.
One solution is to emulate a bus, which enables something akin to snooping, but without the shared bus:
each node sends a message to all other nodes and waits for a reply from all of them before proceeding.
This is the way AMD's Infinity Fabric works (or used to work).

Another option to circumvent the issue is to use a Cache Directory, where for each line we store the data, the ID of the node it originated from,
plus one bit per node indicating whether the line is present in that node.
It is primarily efficient if lines are not widely shared or if there are lots of NUMA nodes.
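One directory entry per cache line can be sketched as follows (field names and the 8-node size are illustrative assumptions, not from the course material):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of one cache-directory entry for an 8-node system. The
 * presence bitmap has one bit per NUMA node, as described above. */
struct dir_entry {
    uint8_t owner;     /* ID of the node the line originated from */
    uint8_t presence;  /* bit i set <=> node i holds a copy of the line */
};

static bool node_has_line(const struct dir_entry *e, unsigned node) {
    return (e->presence >> node) & 1u;
}

static void record_sharer(struct dir_entry *e, unsigned node) {
    e->presence |= (uint8_t)(1u << node);
}
```

On an invalidation, only the nodes whose presence bit is set need a message, which is why the scheme pays off when lines are not widely shared.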
@@ -0,0 +1,11 @@

\subsubsection{Performance and Optimization}
Depending on the interconnect layout, accessing data from certain nodes is \textit{much} slower than from others.
This especially happens if there is no direct path from the current node to the other node and the access needs to pass through other nodes.

It is also of utmost importance to keep threads on the same CPU to improve locality (which improves cache coverage) and to pack related data into a single cache line.
Otherwise, the cache line ``ping-pongs'' between cores.
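Keeping a thread on one CPU can be done with the Linux-specific affinity API (a sketch; the helper name is illustrative):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU so the scheduler does not
 * migrate it across cores or NUMA nodes. Returns 0 on success. */
static int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

Pinning each worker thread near the memory it touches keeps accesses on the local NUMA node and avoids the ping-pong effect described above.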

\content{MCS locks} There are many ways to improve performance on multiprocessor systems. The MCS lock is one of the best locking systems for multiprocessors.
It tries to solve the issue that if a cache line contains a lock, it is continuously invalidated, and that invalidation traffic dominates the interconnect.
The solution is that a processor enqueues itself on a list of waiting processors and then spins on its \bi{own entry} in the list.
\inputcodewithfilename{c}{}{code-examples/03_hw/02_mcs.c}
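As a hedged usage sketch: the listing's XCHG and CAS are mapped here onto GCC/Clang builtins (an assumption; the course defines them elsewhere), and explicit atomic loads/stores replace the plain spins so the loops survive compiler optimisation. Each thread brings its own stack-allocated queue node:

```c
#include <pthread.h>
#include <stddef.h>

/* Assumed stand-ins for the listing's primitives: */
#define XCHG(lock, node) __atomic_exchange_n((lock), (node), __ATOMIC_ACQ_REL)
#define CAS(lock, oldv, newv) __sync_bool_compare_and_swap((lock), (oldv), (newv))

struct qnode {
    struct qnode *next;
    int locked;
};
typedef struct qnode *lock_t;

static void acquire(lock_t *lock, struct qnode *local) {
    local->next = NULL;
    struct qnode *prev = XCHG(lock, local);
    if (prev) {                                   /* queue was non-empty */
        local->locked = 1;
        __atomic_store_n(&prev->next, local, __ATOMIC_RELEASE);
        while (__atomic_load_n(&local->locked, __ATOMIC_ACQUIRE))
            ;                                     /* spin on our OWN entry */
    }
}

static void release(lock_t *lock, struct qnode *local) {
    if (__atomic_load_n(&local->next, __ATOMIC_ACQUIRE) == NULL) {
        if (CAS(lock, local, NULL))               /* nobody queued behind us */
            return;
        while (__atomic_load_n(&local->next, __ATOMIC_ACQUIRE) == NULL)
            ;                                     /* successor mid-enqueue */
    }
    __atomic_store_n(&local->next->locked, 0, __ATOMIC_RELEASE);
}

/* Usage: threads contend on one lock, each spinning only on its own node. */
static lock_t mcs = NULL;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    struct qnode me;                 /* this thread's entry in the queue */
    for (int i = 0; i < 100000; i++) {
        acquire(&mcs, &me);
        counter++;                   /* protected by the lock */
        release(&mcs, &me);
    }
    return NULL;
}
```

Because each waiter spins on its own qnode, releasing the lock invalidates only the successor's cache line instead of one line shared by all waiters.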