[SPCA] Restructuring, finish memory management in C, start dynamic memory management section

This commit is contained in:
2026-01-07 11:54:37 +01:00
parent 2afa0ff161
commit 2c9921c6d1
23 changed files with 248 additions and 66 deletions


@@ -0,0 +1,12 @@
\subsection{Basics}
\texttt{C} uses a syntax very similar to that of many other programming languages, like \texttt{Java}, \texttt{JavaScript} and many more\dots
To be precise, it is \textit{them} that use the \texttt{C} syntax, not the other way around. So:
\inputcodewithfilename{c}{code-examples/00_c/00_basics/}{00_intro.c}
In \texttt{C} we refer to the implementation of a function as a \bi{(function) definition} (correspondingly, \textit{variable definition}, if the variable is initialized)
and to the specification of the function signature (or of a variable, without initializing it) as the \bi{(function) declaration} (or, correspondingly, \textit{variable declaration}).
\texttt{C} code is usually split into source files, ending in \texttt{.c} (which contain all function definitions, as well as the declarations of local functions and variables)
and header files, ending in \texttt{.h}, usually sharing the filename of the source file, which contain the external declarations.
By convention, neither function definitions nor variable definitions are placed in \texttt{.h} files, but there is nothing preventing you from putting them there.
\inputcodewithfilename{c}{code-examples/00_c/00_basics/}{01_func.h}
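As a minimal sketch of the declaration/definition split described above (the names \texttt{add} and \texttt{call\_count} as well as the file names in the comments are hypothetical and not taken from the course code), both parts are shown in one block here, although they would normally live in two files:

```c
/* --- would live in a header, e.g. a hypothetical add.h --- */
int add(int a, int b);      /* function declaration: signature only */
extern int call_count;      /* variable declaration: no storage is reserved here */

/* --- would live in the matching source file, e.g. add.c --- */
int call_count = 0;         /* variable definition (initialized) */

int add(int a, int b)       /* function definition: has a body */
{
    call_count++;
    return a + b;
}
```

Any other source file that includes the header can then call \texttt{add(2, 3)} without seeing its body.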


@@ -0,0 +1,14 @@
\newpage
\subsubsection{Control Flow}
Many of the control-flow structures of \texttt{C} can be found in the code snippet below.
A note of caution when using \texttt{goto}: It is almost never a good idea (it can lead to unexpected behaviour, is hard to maintain, etc).
Where it is very handy, however, is error recovery (and cleanup code) and early termination of multiple nested loops (jumping out of all of them at once).
So, for example, if you have to run multiple functions to set something up and one of them fails,
you can jump to a label and have all the cleanup code that follows it execute.
And because labels are (as in Assembly) simply skipped over during execution, you can write very nice cleanup code.
We can also use \texttt{continue} and \texttt{break} statements similarly to \texttt{Java}; they do not, however, accept labels.
(Reminder: \texttt{continue} skips the rest of the loop body and goes to the next iteration)
\inputcodewithfilename{c}{code-examples/00_c/00_basics/}{01_func.c}
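The cleanup pattern described above could look roughly like the following sketch. The function \texttt{sum\_with\_scratch} and its buffers are made up for illustration; the point is only that a failure jumps to the label which releases exactly what has been acquired so far, and that the success path simply falls through the labels.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: sums n ints using two scratch buffers.
   Returns 0 on success and -1 on failure. */
int sum_with_scratch(const int *src, size_t n, long *out)
{
    int rc = -1;
    int *copy = NULL;
    long *partial = NULL;

    if (src == NULL || out == NULL || n == 0)
        goto done;                      /* nothing acquired yet */

    copy = malloc(n * sizeof *copy);
    if (copy == NULL)
        goto done;

    partial = malloc(n * sizeof *partial);
    if (partial == NULL)
        goto free_copy;                 /* only copy must be released */

    memcpy(copy, src, n * sizeof *copy);
    partial[0] = copy[0];
    for (size_t i = 1; i < n; i++)
        partial[i] = partial[i - 1] + copy[i];
    *out = partial[n - 1];
    rc = 0;

    free(partial);                      /* success path falls through the labels */
free_copy:
    free(copy);
done:
    return rc;
}
```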


@@ -0,0 +1,86 @@
\newpage
\subsubsection{Declarations}
We have already seen a few examples of how \texttt{C} handles declarations.
Conceptually, they are similar to those of most other \texttt{C}-like programming languages, including \texttt{Java}, and scoping works the same way.
\inputcodewithfilename{c}{code-examples/00_c/00_basics/}{02_declarations.c}
\newpage
A peculiarity of \texttt{C} is that the size of each data type is not fixed by the language, but depends on the platform (hardware and compiler) the code is compiled for.
\rmvspace
\begin{fullTable}{llll}{\texttt{C} data type & typical 32-bit & ia32 & x86-64}{Comparison of byte-sizes for each datatype on different architectures}
\texttt{char} & 1 & 1 & 1 \\
\texttt{short} & 2 & 2 & 2 \\
\texttt{int} & 4 & 4 & 4 \\
\texttt{long} & 4 & 4 & 8 \\
\texttt{long long} & 8 & 8 & 8 \\
\texttt{float} & 4 & 4 & 4 \\
\texttt{double} & 8 & 8 & 8 \\
\texttt{long double} & 8 & 10/12 & 16 \\
\end{fullTable}
\drmvspace
\warn{Type format} Be aware, however, that this table uses the \texttt{LP64} data model for the x86-64 sizes,
which is the model all UNIX systems use (i.e. Linux, BSD, Darwin (the Mac kernel)).
64-bit Windows, however, uses \texttt{LLP64}, i.e. \texttt{int} and \texttt{long} have the same size (32 bits) and \texttt{long long} and pointers are 64 bits wide.
\content{Integers} By default, integers in \lC\ are \texttt{signed}; to declare an unsigned integer, use \texttt{unsigned int}.
Since it is hard and annoying to remember the number of bytes in each data type, \texttt{C99} introduced the fixed-width integer types,
which are made available by including \texttt{stdint.h} and are of the form \texttt{int<bit count>\_t} and \texttt{uint<bit count>\_t},
where we substitute \texttt{<bit count>} with the number of bits (which has to correspond to a valid type, of course, typically 8, 16, 32 or 64).
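A small sketch of how the \texttt{stdint.h} types might be used (the helper names \texttt{square\_i32} and \texttt{low\_byte} are hypothetical):

```c
#include <stdint.h>

/* int32_t is exactly 32 bits and int64_t exactly 64 bits wide on every
   platform that provides them, regardless of how wide int or long are. */
int64_t square_i32(int32_t x)
{
    return (int64_t)x * x;      /* widen first so the product cannot overflow */
}

/* uint8_t holds exactly the values 0..255 */
uint8_t low_byte(uint32_t x)
{
    return (uint8_t)(x & 0xFFu);
}
```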
\content{Booleans} Another notable difference of \texttt{C} compared to other languages is that \texttt{C} originally had no \texttt{boolean} type.
By convention, an integer (usually an \texttt{int}) is used to represent it, where any non-zero value means \texttt{true} and \texttt{0} means \texttt{false}.
Since boolean values are quite handy, the \texttt{!} operator for negation turns any non-zero value of any integer type into \texttt{0} and \texttt{0} into \texttt{1}.
\texttt{C99} added support for a \texttt{bool} type via \texttt{stdbool.h}, which however is still an integer type.
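To illustrate the integer nature of truth values, here is a short sketch (the function names \texttt{normalize} and \texttt{is\_set} are hypothetical):

```c
#include <stdbool.h>

/* Double negation maps any non-zero value to 1 and keeps 0 as 0 */
int normalize(int x)
{
    return !!x;
}

/* Converting to bool also yields exactly 0 or 1, whatever the integer was */
bool is_set(int flags, int mask)
{
    return flags & mask;
}
```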
\content{Implicit casts} Notably, \texttt{C} doesn't have a very rigid type system and lower bit-count types are implicitly converted to higher bit-count data types, i.e.
if you add a \texttt{short} and an \texttt{int}, the \texttt{short} is converted to \texttt{int} (bits 16-31 are filled with copies of the sign bit, so the value is preserved) and the two are added.
Explicit casting between almost all types is also supported.
Some casts force a change of bit representation, but most won't (notably, casts to and from \texttt{float}-like types change the representation; casts to \texttt{void} simply discard the value).
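A brief sketch of both kinds of conversion (the function names are hypothetical): the implicit widening keeps the value, including its sign, while the explicit cast from a floating-point type produces a different bit pattern and drops the fractional part.

```c
/* s is implicitly converted to int (value-preserving) before the addition */
int add_short_int(short s, int i)
{
    return s + i;
}

/* Explicit cast from double to int: truncates towards zero */
int truncate_to_int(double d)
{
    return (int)d;
}
```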
\content{Expressions} In \lC, an assignment is itself an expression that yields the assigned value, see above code block for an example.
\content{Void} The \texttt{void} type has \bi{no} value and is used for untyped pointers and for declaring functions with no return value.
\content{Structs} Are like classes in OOP, but they contain no logic.
We can copy a struct by assignment and structs behave just like everything else in \texttt{C} when used as an argument for functions,
in that they are passed by value and not by reference.
You can of course also pass a struct by reference (like any other data type) by setting the parameter to type \texttt{struct mystruct * name} and then calling the function using
\texttt{func(\&test)}, assuming \texttt{test} is the name of your struct.
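The difference between the two ways of passing a struct can be sketched as follows (\texttt{struct point} and both functions are hypothetical names chosen for this example):

```c
struct point {
    int x;
    int y;
};

/* Receives a copy: the caller's struct is not modified */
void shift_copy(struct point p)
{
    p.x += 10;
}

/* Receives a pointer: the caller's struct is modified */
void shift_in_place(struct point *p)
{
    p->x += 10;     /* p->x is shorthand for (*p).x */
}
```

After \texttt{struct point a = \{1, 2\};}, calling \texttt{shift\_copy(a)} leaves \texttt{a.x} at 1, whereas \texttt{shift\_in\_place(\&a)} changes it to 11.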
\content{Typedef} We can define a custom type using \texttt{typedef <type it represents> <name of the new type>}.
You may also use \texttt{typedef} on structs using \texttt{typedef struct <struct tag> <name of the new alias>},
you can thus instead of e.g. \verb|struct list_el my_list;| write \verb|list my_list;|, if you have used \verb|typedef struct list_el list;| before.
It is even possible to do this:
\drmvspace
\begin{code}{c}
typedef struct list_el {
    unsigned long val;
    struct list_el *next;
} list_el;
struct list_el my_list;
list_el my_other_list;
\end{code}
\rmvspace
\content{Namespaces}
\lC\ has a few different namespaces, i.e. the same name can be used once in each namespace (e.g. you can have both \texttt{struct a} and \texttt{int a}).
The following namespaces were covered:
\rmvspace
\begin{itemize}[noitemsep]
\item Label names (used for \texttt{goto})
\item Tags (for \texttt{struct}, \texttt{union} and \texttt{enum})
\item Member names (one namespace for each \texttt{struct} and \texttt{union})
\item Mostly everything else (variable names, function names, \texttt{typedef} names, enumeration constants, etc)
\end{itemize}
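As a deliberately confusing sketch of these namespaces (do not write code like this in practice), the same identifier \texttt{item} is used below as a tag, a member name, an ordinary variable name and a label without any conflict. The function name \texttt{namespace\_demo} is hypothetical.

```c
struct item {           /* 'item' in the tag namespace */
    int item;           /* 'item' in the member namespace of struct item */
};

int namespace_demo(void)
{
    struct item item = { 7 };   /* 'item' as an ordinary identifier */
    goto item;                  /* 'item' in the label namespace */
item:
    return item.item;           /* variable 'item', member 'item' */
}
```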


@@ -0,0 +1,46 @@
\newpage
\subsubsection{Operators}
The list of operators in \lC\ is similar to that of \texttt{Java}, etc.
In Table \ref{tab:c-operators}, you can see an overview of the operators, sorted by precedence in descending order.
You may notice that the \verb|&| and \verb|*| operators appear twice. The higher-precedence occurrence is the address-of operator and the dereference operator, respectively,
and the lower-precedence one is \texttt{bitwise and} and \texttt{multiplication}, respectively.
Very low precedence belongs to the boolean operators \verb|&&| and \texttt{||}, as well as the ternary operator and the assignment operators.
\begin{table}[h!]
\begin{tables}{ll}{Operator & Associativity}
\texttt{() [] -> .} & Left-to-right \\
\verb|! ~ ++ -- + - * & (type) sizeof| & Right-to-left \\
\verb|* / %| & Left-to-right \\
\verb|+ -| & Left-to-right \\
\verb|<< >>| & Left-to-right \\
\verb|< <= >= >| & Left-to-right \\
\verb|== !=| & Left-to-right \\
\verb|&| (bitwise and) & Left-to-right \\
\verb|^| (bitwise xor) & Left-to-right \\
\texttt{|} (bitwise or) & Left-to-right \\
\verb|&&| (boolean and) & Left-to-right \\
\texttt{||} (boolean or) & Left-to-right \\
\texttt{? :} (ternary) & Right-to-left \\
\verb!= += -= *= /= %= &= ^= |= <<= >>=! & Right-to-left \\
\verb|,| & Left-to-right \\
\end{tables}
\caption{\lC\ operators ordered in descending order by precedence}
\label{tab:c-operators}
\end{table}
\shade{blue}{Associativity}
\begin{itemize}
\item Left-to-right: $A + B + C \mapsto (A + B) + C$
\item Right-to-left: \texttt{A += B += C} $\mapsto$ \texttt{A += (B += C)}
\end{itemize}
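Both associativity rules can be checked with a small sketch (function names are hypothetical):

```c
/* Right-to-left: a += b += c is parsed as a += (b += c) */
int chained_assign(void)
{
    int a = 1, b = 2, c = 3;
    a += b += c;            /* b becomes 5, then a becomes 6 */
    return a * 10 + b;      /* encodes both results as 65 */
}

/* Left-to-right: 20 - 5 - 3 is parsed as (20 - 5) - 3 */
int left_to_right(void)
{
    return 20 - 5 - 3;      /* 12, not 20 - (5 - 3) = 18 */
}
```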
As it should be, boolean and as well as boolean or support short-circuit evaluation (early termination).
The ternary operator works as in other programming languages: \verb|result = expr ? res_true : res_false;|
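A typical use of early termination is guarding a dereference, as in this sketch (the function names \texttt{is\_positive} and \texttt{max\_of} are hypothetical):

```c
#include <stddef.h>

/* The right operand is only evaluated if p is not NULL,
   so the dereference is safe. */
int is_positive(const int *p)
{
    return p != NULL && *p > 0;
}

/* Ternary operator: picks one of two values based on a condition */
int max_of(int a, int b)
{
    return a > b ? a : b;
}
```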
As previously touched on, an assignment is also an expression, i.e. the following works
\mint{c}|printf("%s", x = foo(y)); // prints output of foo(y) and x has that value|
Pre-increment (\texttt{++i}, new value returned) and post-increment (\texttt{i++}, old value returned) are also supported by \lC.
\lC\ has an \texttt{assert} macro (from \texttt{assert.h}), but do not use it for error handling. The basic syntax is \texttt{assert( expr );}
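A sketch of an appropriate use of \texttt{assert} (the function \texttt{clamp} is a hypothetical example): it documents a condition that can only be violated by a programming mistake, not by bad input at runtime. Note that all \texttt{assert} checks are removed when the code is compiled with \texttt{NDEBUG} defined, which is one reason not to rely on them for error handling.

```c
#include <assert.h>

/* Restricts x to the range [lo, hi] */
int clamp(int x, int lo, int hi)
{
    assert(lo <= hi);   /* caller bug if violated, not a runtime error */
    if (x < lo)
        return lo;
    if (x > hi)
        return hi;
    return x;
}
```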


@@ -0,0 +1,8 @@
\newpage
\subsubsection{Arrays}
\label{sec:c-arrays}
The \lC\ compiler does not do any array bound checks! Thus, always check array bounds yourself.
Unlike in some other programming languages, arrays do \bi{not} have dynamic length.
The snippet below already includes some pointer arithmetic tricks. The variable \texttt{data} is a pointer to the first element of the array.
\inputcodewithfilename{c}{code-examples/00_c/00_basics/}{03_arrays.c}
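Since nobody checks the bounds for us, one possible approach is to pass the length along and check it manually, as in this sketch (the function \texttt{get\_or\_default} is hypothetical):

```c
#include <stddef.h>

/* Returns arr[idx] if idx is within bounds, otherwise fallback.
   The length has to be passed explicitly, since a pointer carries
   no size information. */
int get_or_default(const int *arr, size_t len, size_t idx, int fallback)
{
    return idx < len ? arr[idx] : fallback;
}
```

For a real array (not a pointer), the number of elements can be computed as \texttt{sizeof(arr) / sizeof(arr[0])}.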


@@ -0,0 +1,6 @@
\subsubsection{Strings}
\lC\ doesn't have a \texttt{string} data type, but rather, strings are represented (when using \texttt{ASCII}) as \texttt{char} arrays,
with the length of the array being at least $n + 1$ (where $n$ is the number of characters of the string).
The extra element is the termination character, called the \texttt{null character}, denoted \verb|\0|.
To determine the actual length of the string (as the array may be larger than the string), we can use \verb|strlen(str)| from \texttt{string.h}, or \verb|strnlen(str, maxlen)| (POSIX), which reads at most \verb|maxlen| characters.
\inputcodewithfilename{c}{code-examples/00_c/00_basics/}{04_strings.c}
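What \texttt{strlen} does can be sketched in a few lines (\texttt{count\_chars} is a hypothetical re-implementation for illustration only): it walks the array until it finds the null character.

```c
#include <stddef.h>

/* Counts the characters before the terminating '\0' */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

For the literal \texttt{"hello"} this gives 5, while \texttt{sizeof("hello")} is 6 because of the terminator.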


@@ -0,0 +1,39 @@
\subsubsection{Integers in C}
As a reminder, integers are encoded as follows in big endian notation, with $x_i$ being the $i$-th bit and $w$ being the number of bits used to represent the number:
\begin{itemize}[noitemsep]
\item \bi{Unsigned}: $\displaystyle \sum_{i = 0}^{w - 1} x_i \cdot 2^i$
\item \bi{Signed}: $\displaystyle -x_{w - 1} \cdot 2^{w - 1} + \sum_{i = 0}^{w - 2} x_i \cdot 2^i$ (two's complement notation, with $x_{w - 1}$ being the sign-bit)
\end{itemize}
The minimum number representable is $0$ and $-2^{w - 1}$, respectively, whereas the maximum number representable is $2^w - 1$ and $2^{w - 1} - 1$.
\verb|limits.h| defines constants for the minimum and maximum values of different types, e.g. \verb|ULONG_MAX| or \verb|LONG_MAX| and \verb|LONG_MIN|.
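The two encodings can be evaluated directly with a small sketch (the functions \texttt{decode\_unsigned} and \texttt{decode\_signed} are hypothetical and assume $1 \leq w \leq 32$): the unsigned value sums all $w$ bits, whereas the signed value gives the top bit the weight $-2^{w - 1}$.

```c
#include <stdint.h>

/* Unsigned value of the lowest w bits: sum of x_i * 2^i for i = 0..w-1 */
int64_t decode_unsigned(uint32_t bits, unsigned w)
{
    int64_t v = 0;
    for (unsigned i = 0; i < w; i++)
        if ((bits >> i) & 1u)
            v += (int64_t)1 << i;
    return v;
}

/* Two's complement value of the lowest w bits:
   bits 0..w-2 count positively, bit w-1 counts as -2^(w-1) */
int64_t decode_signed(uint32_t bits, unsigned w)
{
    int64_t v = 0;
    for (unsigned i = 0; i + 1 < w; i++)
        if ((bits >> i) & 1u)
            v += (int64_t)1 << i;
    if ((bits >> (w - 1)) & 1u)
        v -= (int64_t)1 << (w - 1);
    return v;
}
```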
We can use the shift operators to multiply and divide by powers of two. Shift operations are usually \textit{much} cheaper than multiplication and division.
Left shift (\texttt{u << k} in \lC) always fills with zeros and throws away the extra bits on the left (equivalent to multiplication by $2^k$),
whereas right shift (\texttt{u >> k} in \lC) is always logical for unsigned values (fill with zeros, unsigned division by $2^k$), but implementation-defined for negative signed values:
either arithmetic (fill with most significant bit, division by $2^k$; this however rounds towards $-\infty$ instead of towards zero, see below)
or logical (fill with zeros).
Signed division using arithmetic right shifts thus rounds incorrectly when the number is $< 0$ (division in \lC\ rounds towards zero).
Instead, we compute $s / 2^k = (s + (2^k - 1)) \texttt{ >> } k$ for $s < 0$ and $s / 2^k = s \texttt{ >> } k$ for $s \geq 0$.
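The bias trick for negative numbers can be sketched as follows. The function \texttt{div\_pow2} is hypothetical and assumes that the compiler implements right shift of negative values as an arithmetic shift (which is implementation-defined, but what GCC and Clang do) and that $k$ is smaller than the bit width of \texttt{int} minus one.

```c
/* Divides s by 2^k, rounding towards zero like the / operator.
   Assumes arithmetic right shift for negative values. */
int div_pow2(int s, unsigned k)
{
    int bias = (s < 0) ? (1 << k) - 1 : 0;  /* 2^k - 1 for negative s */
    return (s + bias) >> k;
}
```

For example, \texttt{-7 >> 1} would give $-4$ with an arithmetic shift, whereas the biased version gives $-3$, matching \texttt{-7 / 2}.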
\bi{In expressions that mix signed and unsigned values of the same width, the signed values are implicitly converted to unsigned.}
This can lead to all sorts of nasty exploits (e.g. provide $-1$ as the length argument to \texttt{memcpy} and watch it burn; this was an actual exploit in FreeBSD).
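The effect on comparisons can be sketched like this (both function names are hypothetical). The first function spells out the conversion that the compiler would apply silently when evaluating \texttt{s < u}; the second one handles negative values explicitly.

```c
/* Mirrors what happens for 's < u': s is converted to unsigned first,
   so -1 turns into UINT_MAX and the comparison is false. */
int naive_less(int s, unsigned int u)
{
    unsigned int converted = s;     /* same conversion the compiler applies */
    return converted < u;
}

/* A negative s is smaller than every unsigned value */
int safe_less(int s, unsigned int u)
{
    return s < 0 || (unsigned int)s < u;
}
```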
\fhlc{Cyan}{Addition \& Subtraction}
A nice property of the two's complement notation is that addition and subtraction work exactly the same as in normal notation, due to over- and underflow.
This also obviously means that it implements modular arithmetic, i.e.
\mrmvspace
\begin{align*}
\texttt{Add}_w (u, v) = (u + v) \text{ mod } 2^w \ \text{ and } \ \texttt{Sub}_w (u, v) = (u - v) \text{ mod } 2^w
\end{align*}
\mrmvspace
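For $w = 8$, the modular behaviour can be observed with a short sketch (the function names \texttt{add8} and \texttt{sub8} are hypothetical):

```c
#include <stdint.h>

/* Add_8(u, v) = (u + v) mod 2^8: the cast drops everything above bit 7 */
uint8_t add8(uint8_t u, uint8_t v)
{
    return (uint8_t)(u + v);
}

/* Sub_8(u, v) = (u - v) mod 2^8 */
uint8_t sub8(uint8_t u, uint8_t v)
{
    return (uint8_t)(u - v);
}
```

Here $250 + 10 = 260$ wraps around to $4$ and $3 - 5 = -2$ wraps around to $254$.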
\fhlc{Cyan}{Multiplication \& Division}
Unsigned multiplication with addition forms a commutative ring.
Again, it is doing modular arithmetic and
\begin{align*}
\texttt{UMult}_w (u, v) = u \cdot v \text{ mod } 2^w
\end{align*}
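The same holds for multiplication, as this sketch for $w = 8$ shows (the function name \texttt{umult8} is hypothetical):

```c
#include <stdint.h>

/* UMult_8(u, v) = (u * v) mod 2^8: only the low 8 bits of the product survive */
uint8_t umult8(uint8_t u, uint8_t v)
{
    return (uint8_t)(u * v);
}
```

For example, $16 \cdot 17 = 272$ is reduced to $272 - 256 = 16$.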


@@ -0,0 +1,53 @@
\newpage
\subsubsection{Pointers}
When a program is loaded, the OS creates the virtual address space for the process, inspects the executable and loads the data to the right places in the address space,
before other preparations like final linking and relocation are done.
Stack-based languages (supporting recursion) allocate the stack in frames that contain local variables, return information and temporary space.
When a procedure is entered, a stack frame is allocated and any necessary setup code is executed (like moving the stack pointer, see later). % TODO: Link to correct section
When a procedure returns, any necessary cleanup code is executed and the stack frame is deallocated, before execution in the previous frame continues.
\bi{In \lC\ a pointer is a variable whose value is the memory address of another variable.}
Of note is that the addresses stored in pointers (e.g. when printing \texttt{p} after \texttt{type * p = \&x;}) will usually differ between runs of the program,
because the (Linux) kernel randomizes the address space layout to prevent some common exploits.
\inputcodewithfilename{c}{code-examples/00_c/00_basics/}{05_pointers.c}
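The classic first use of pointers is a function that modifies variables of its caller. A minimal sketch (the function name \texttt{swap} is chosen for this example):

```c
/* Exchanges the values stored at the two addresses */
void swap(int *a, int *b)
{
    int tmp = *a;   /* *a reads the value the pointer refers to */
    *a = *b;        /* writing through the pointer changes the caller's variable */
    *b = tmp;
}
```

It is called with the addresses of the variables, e.g. \texttt{swap(\&x, \&y)}.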
\newpage
\begin{scriptsize}
Some pointer arithmetic has already appeared in section \ref{sec:c-arrays}, but the same kind of content with a better explanation can be found here.
\end{scriptsize}
\content{Pointer Arithmetic} Note that when doing pointer arithmetic, adding $1$ will move the pointer by \texttt{sizeof(type)} bytes.
The compiler lets you use pointer arithmetic on whatever pointer you'd like (as long as it's not a null pointer).
This means, you \textit{can} treat any location in memory as an array.
The issue is just that you are likely to overwrite something, and that something might be something critical (like a stack pointer),
thus you will get \bi{undefined} behaviour! (This is by the way a common concept in \lC: if something isn't easy to make more flexible,
the docs simply state that one gets undefined behaviour if one does not do as they say
(for example, if you pass \texttt{free} a pointer that is not the start of a \texttt{malloc}'d block, you get undefined behaviour), so\dots RTFM!)
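The step size of pointer arithmetic can be made visible by comparing the addresses as byte pointers, as in this sketch (both function names are hypothetical):

```c
#include <stddef.h>

/* Distance in bytes between p and p + 1 for an int pointer */
size_t step_in_bytes(void)
{
    int pair[2] = {0, 0};
    int *p = pair;
    return (size_t)((char *)(p + 1) - (char *)p);
}

/* Walks over n ints using pointer arithmetic only */
int sum_array(const int *first, size_t n)
{
    int sum = 0;
    for (const int *p = first; p < first + n; p++)
        sum += *p;
    return sum;
}
```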
As already seen in the section arrays (section \ref{sec:c-arrays}), we can use pointer arithmetic for accessing array elements.
The array name is treated as a pointer to the first element of the array, except when:
\begin{itemize}[noitemsep]
\item it is operand of \texttt{sizeof} (return value is $n \cdot \texttt{sizeof(type)}$ with $n$ the number of elements)
\item its address is taken (then \texttt{\&a} and \texttt{a} hold the same address, but \texttt{\&a} has type pointer-to-array)
\item it is a string literal used to initialize a \texttt{char} array. In contrast, if we declare a pointer \texttt{char *b = "String";} to a string literal,
the \texttt{"String"} is stored in read-only memory and if we modify it through the pointer, we get undefined behaviour
\end{itemize}
\shade{purple}{Fun fact}: \texttt{A[i]} is defined to be equivalent to \texttt{*(A + i)}, so the compiler always treats it that way.
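The exceptions and the fun fact above can be checked with a short sketch (all function names are hypothetical):

```c
#include <stddef.h>

/* sizeof on an array yields the size of the whole array */
size_t array_bytes(void)
{
    int a[5];
    return sizeof a;
}

/* a[i], *(a + i) and even i[a] all denote the same element */
int index_equiv(void)
{
    int a[3] = {10, 20, 30};
    return a[2] == *(a + 2) && 2[a] == a[2];
}

/* A char array initialized from a literal is a modifiable copy */
char first_after_edit(void)
{
    char a[] = "String";
    a[0] = 's';
    return a[0];
}
```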
\content{Function arguments} Another important aspect is passing by value or by reference.
You can pass every data type by reference. You cannot, however, pass an array by value (as an array is treated as a pointer, see above).
\content{Body-less loops}
\rmvspace
\begin{code}{c}
int x = 0;
while ( x++ < 10 ); // This is (of course) not a useful snippet, but shows the concept
\end{code}
\content{Function pointers}
A function can be passed as an argument to another function using the typical address syntax with the \verb|&| symbol (the bare function name works as well). The corresponding parameter is declared using
\verb|type (* name)(type arg1, ...)|
and the function is called using \verb|(*name)(arg1, ...)|.
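Put together, this could look like the following sketch (the functions \texttt{plus}, \texttt{times} and \texttt{apply} are hypothetical names for this example):

```c
int plus(int a, int b)
{
    return a + b;
}

int times(int a, int b)
{
    return a * b;
}

/* op is a pointer to any function taking two ints and returning an int */
int apply(int (*op)(int, int), int a, int b)
{
    return (*op)(a, b);     /* op(a, b) would work as well */
}
```

A call then looks like \texttt{apply(\&plus, 2, 3)} or, equivalently, \texttt{apply(plus, 2, 3)}.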