\newsection
\section{Probability}
\subsection{Basics}
\begin{definition}[]{Discrete Sample Space}
A discrete sample space consists of a set $\Omega$ of \textit{elementary events} $\omega_i$. Each of these elementary events has a probability assigned to it, such that $0 \leq \Pr[\omega_i] \leq 1$ and
\[ \sum_{\omega \in \Omega} \Pr[\omega] = 1 \]
We call $E \subseteq \Omega$ an \textit{event}. The probability $\Pr[E]$ of said event is given by
\[ \Pr[E] := \sum_{\omega \in E} \Pr[\omega] \]
If $E$ is an event, we call $\overline{E} := \Omega \backslash E$ the \textit{complementary event}.
\end{definition}
\begin{lemma}[]{Events}
For two events $A, B$, we have:
\begin{multicols}{2}
\begin{enumerate}
\item $\Pr[\emptyset] = 0, \Pr[\Omega] = 1$
\item $0 \leq \Pr[A] \leq 1$
\item $\Pr[\overline{A}] = 1 - \Pr[A]$
\item If $A \subseteq B$, then $\Pr[A] \leq \Pr[B]$
\end{enumerate}
\end{multicols}
\end{lemma}
\begin{theorem}[]{Addition law}
If events $A_1, \ldots, A_n$ are pairwise disjoint (i.e. $\forall (i \neq j) : A_i \cap A_j = \emptyset$), we have (for countably infinitely many events, $n = \infty$)
\[ \Pr\left[ \bigcup_{i = 1}^{n} A_i \right] = \sum_{i = 1}^{n} \Pr[A_i] \]
\end{theorem}
\newpage
\label{sec:prob-basics}
\setcounter{all}{5}
The theorem below is known as the Inclusion-Exclusion-Principle (in German: ``Siebformel'') and is the general case of the addition law, where the events don't have to be disjoint.
\begin{theorem}[]{Inclusion/Exclusion}
Let $A_1, \ldots, A_n$ be events, for $n \geq 2$. Then we have
\begin{align*}
\Pr\left[ \bigcup_{i = 1}^{n} A_i \right] & = \sum_{l = 1}^{n} (-1)^{l + 1} \sum_{1 \leq i_1 < \ldots < i_l \leq n} \Pr[A_{i_1} \cap \ldots \cap A_{i_l}] \\
& = \sum_{i = 1}^{n} \Pr[A_i] - \sum_{1 \leq i_1 < i_2 \leq n} \Pr[A_{i_1} \cap A_{i_2}] + \sum_{1\leq i_1 < i_2 < i_3 \leq n} \Pr[A_{i_1} \cap A_{i_2} \cap A_{i_3}] -\ldots \\
& + (-1)^{n + 1} \cdot \Pr[A_1 \cap \ldots \cap A_n]
\end{align*}
\end{theorem}
What is going on here? We add the probabilities of all intersections of an odd number of events (an even number of $\cap$-symbols) and subtract those of all intersections of an even number of events (an odd number of $\cap$-symbols).

\fhlc{Cyan}{Use:} This is useful for all kinds of counting problems where some elements are counted repeatedly, like counting the number of integers divisible by at least one element of a list of integers (see Code-Expert Task 04).

Of note here is that a sum such as $\displaystyle\sum_{1 \leq i_1 < i_2 \leq n} \Pr[A_{i_1} \cap A_{i_2}]$ runs over all ways of choosing two of the $n$ events and intersecting them.
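To make the counting use case concrete, here is a small Python sketch (not from the script; the helper name \texttt{count\_divisible} is made up) that counts the integers in $[1, n]$ divisible by at least one number from a list, adding intersections of odd size and subtracting those of even size; each intersection size is $\lfloor n / \mathrm{lcm} \rfloor$:
\begin{verbatim}
from itertools import combinations
from math import lcm               # Python 3.9+

def count_divisible(n, divisors):
    """Count integers in [1, n] divisible by at least one divisor,
    via inclusion-exclusion over all non-empty subsets."""
    total = 0
    for l in range(1, len(divisors) + 1):
        sign = (-1) ** (l + 1)     # add odd-sized, subtract even-sized
        for subset in combinations(divisors, l):
            # |A_{i1} cap ... cap A_{il}| = floor(n / lcm(subset))
            total += sign * (n // lcm(*subset))
    return total

# 50 + 33 + 20 - 16 - 10 - 6 + 3 = 74 multiples of 2, 3 or 5 up to 100
assert count_divisible(100, [2, 3, 5]) == 74
\end{verbatim}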
If $\Omega = A_1 \cup \ldots \cup A_n$ and $\Pr[\omega] = \frac{1}{|\Omega|}$, we get
\[ \left|\bigcup_{i = 1}^{n}A_i\right| = \sum_{l = 1}^{n} (-1)^{l + 1} \sum_{1 \leq i_1 < \ldots < i_l \leq n} |A_{i_1} \cap \ldots \cap A_{i_l}| \]
Since for $n \geq 4$ the Inclusion-Exclusion formulas become increasingly long and complex, we can use a simple approximation, called the \textbf{Union Bound}, also known as \textit{Boole's inequality}.
\begin{corollary}[]{Union Bound}
For events $A_1, \ldots, A_n$ we have (for infinite sequences of events, $n = \infty$)
\[ \Pr\left[ \bigcup_{i = 1}^{n} A_i \right] \leq \sum_{i = 1}^{n} \Pr[A_i] \]
\end{corollary}
\vspace{1cm}
\begin{center}
\fbox{\textbf{Laplace principle}: We can assume that all outcomes are equally likely if nothing speaks against it}
\end{center}
\vspace{1cm}
Therefore, we have $\Pr[\omega] = \displaystyle \frac{1}{|\Omega|}$ and for any event $E$, we get $\displaystyle \Pr[E] = \frac{|E|}{|\Omega|}$
\newpage
\subsection{Conditional Probability}
\setcounter{all}{8}
\begin{definition}[]{Conditional Probability}
Let $A, B$ be events, with $\Pr[B] > 0$. The \textit{conditional probability} $\Pr[A|B]$ of $A$ given $B$ is defined as
\[ \Pr[A|B] := \frac{\Pr[A \cap B]}{\Pr[B]} \]
We may also rewrite the above as
\[ \Pr[A \cap B] = \Pr[B|A] \cdot \Pr[A] = \Pr[A|B] \cdot \Pr[B] \]
\end{definition}
\setcounter{all}{10}
\begin{theorem}[]{Multiplication law}
Let $A_1, \ldots, A_n$ be events. If $\Pr[A_1 \cap \ldots \cap A_n] > 0$, we have
\[ \Pr[A_1 \cap \ldots \cap A_n] = \Pr[A_1] \cdot \Pr[A_2|A_1] \cdot \Pr[A_3|A_1 \cap A_2] \cdot \ldots \cdot \Pr[A_n|A_1 \cap \ldots \cap A_{n - 1}] \]
\end{theorem}
The proof of the above theorem is based on the definition of conditional probability: if we rewrite $\Pr[A_1] = \frac{\Pr[A_1]}{1}$ and apply the definition to $\Pr[A_2 | A_1]$ and all subsequent factors, the product telescopes to $\Pr[A_1 \cap \ldots \cap A_n]$.

\fhlc{Cyan}{Use:} The law of total probability is used, as the name implies, to calculate the total probability of an event $B$ over all possible ways in which $B$ can occur.
\setcounter{all}{13}
\begin{theorem}[]{Law of total probability}
Let $A_1, \ldots, A_n$ be pairwise disjoint events and let $B \subseteq A_1 \cup \ldots \cup A_n$. We then have
\[ \Pr[B] = \sum_{i = 1}^{n} \Pr[B|A_i] \cdot \Pr[A_i] \]
The same applies for $n = \infty$; then $B \subseteq \bigcup_{i = 1}^{\infty} A_i$
\end{theorem}
Using the previous theorem, we get Bayes' Theorem
\setcounter{all}{15}
\begin{theorem}[]{Bayes' Theorem}
Let $A_1, \ldots, A_n$ be pairwise disjoint events and let $B \subseteq A_1 \cup \ldots \cup A_n$ be an event with $\Pr[B] > 0$. Then for each $i = 1, \ldots, n$, we have
\[ \Pr[A_i|B] = \frac{\Pr[A_i \cap B]}{\Pr[B]} = \frac{\Pr[B|A_i] \cdot \Pr[A_i]}{\sum_{j = 1}^{n} \Pr[B|A_j] \cdot \Pr[A_j]} \]
The same applies for $n = \infty$; then $B \subseteq \bigcup_{i = 1}^{\infty} A_i$
\end{theorem}
\fhlc{Cyan}{Use:} Bayes' Theorem is commonly used to calculate probabilities on different branches of an experiment, or in other words, to reverse conditional probabilities. The sum in the denominator represents all possible paths to the event, summed up.

\inlineex \hspace{0mm} Assume we want to find the probability that event $X$ happened given that event $Y$ happened. \textbf{Important:} Event $X$ happened \textit{before} event $Y$ happened and we do \textit{not} know the probability of $X$. The probability we are looking for is therefore $\Pr[X|Y]$.
But we don't actually know that probability either, so we can use Bayes' Theorem to restate the problem in terms of probabilities we can determine (more) easily.
\newpage
\subsection{Independence}
\setcounter{all}{18}
\fancydef{Independence of two events} Two events $A$ and $B$ are called \textbf{independent} if
\[ \Pr[A \cap B] = \Pr[A] \cdot \Pr[B] \]
\setcounter{all}{22}
\begin{definition}[]{Independence}
Events $A_1, \ldots, A_n$ are called \textit{independent}, if for all subsets $I \subseteq \{1, \ldots, n\}$ with $I = \{i_1, \ldots, i_k\}$ and $|I| = k$, we have that
\[ \Pr[A_{i_1} \cap \ldots \cap A_{i_k}] = \Pr[A_{i_1}] \cdot \ldots \cdot \Pr[A_{i_k}] \]
\end{definition}
The same in simpler terms: the events $A_1, \ldots, A_n$ are independent if for \textit{every} selection of some of these events, the probability that all selected events occur equals the product of their individual probabilities. Note that checking this for the full intersection $A_1 \cap \ldots \cap A_n$ alone is not enough.
\begin{lemma}[]{Independence}
Events $A_1, \ldots, A_n$ are independent if and only if for all $(s_1, \ldots, s_n) \in \{0, 1\}^n$ we have
\[ \Pr[A_1^{s_1} \cap \ldots \cap A_n^{s_n}] = \Pr[A_1^{s_1}] \cdot \ldots \cdot \Pr[A_n^{s_n}] \]
where $A_i^{0} = \overline{A_i}$ (i.e. $s_i = 0$) and $A_i^{1} = A_i$ (i.e. $s_i = 1$)
\end{lemma}
Here, $\{0, 1\}^n$ is the set of $n$-bit binary vectors; each vector selects, for every event, whether the event itself or its complement appears in the intersection. The $s_i$ are thus straightforward to understand: they simply indicate whether we consider the event or its complement.

\fancylemma{Let $A$, $B$ and $C$ be independent events. Then, $A\cap B$ and $C$ as well as $A \cup B$ and $C$ are independent}

In this lecture, we are always going to assume that we can use actual random numbers, not just pseudo-random numbers generated by PRNGs (Pseudo Random Number Generators).
\newsection
\subsection{Random Variables}
\setcounter{all}{25}
\begin{definition}[]{Random Variable}
A \textit{random variable} is a function $\mathcal{X}: \Omega \rightarrow \R$ that maps the sample space to the real numbers. The range $W_{\mathcal{X}} := \mathcal{X}(\Omega) = \{ x \in \R : \exists\, \omega \in \Omega \text{ with } \mathcal{X}(\omega) = x \}$ is either \textit{finite} or \textit{countably infinite}, depending on the countability of $\Omega$
\end{definition}
\begin{scriptsize}
\textit{For those who don't have an intuition for what a random variable actually is: See Section \ref{sec:random-var-details}.}
\end{scriptsize}

Often when looking at random variables, we are interested in the probabilities with which $\mathcal{X}$ takes certain values. For the corresponding event, we write either $\mathcal{X}^{-1}(x_i)$ or, more intuitively, ``$\mathcal{X} = x_i$''. Analogously, we have (short: $\Pr[``\mathcal{X} \leq x_i'']$ as $\Pr[\mathcal{X} \leq x_i]$)
\[ \Pr[``\mathcal{X} \leq x_i''] = \sum_{x \in W_{\mathcal{X}} : x \leq x_i} \Pr[``\mathcal{X} = x''] = \Pr[\{ \omega \in \Omega : \mathcal{X}(\omega) \leq x_i \}] \]
From this notation, we easily get two real functions. We call $f_{\mathcal{X}}: \R \rightarrow [0, 1]$ with $x \mapsto \Pr[\mathcal{X} = x]$ the \textbf{\textit{probability mass function}} (PMF, Dichtefunktion) of $\mathcal{X}$, which maps a real number to the probability that the random variable takes this value.
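As a small illustration (plain Python, not part of the script; exact probabilities kept as fractions), the PMF of the sum of two fair dice can be computed directly from the sample space, exactly as in the definition:
\begin{verbatim}
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice
omega = list(product(range(1, 7), repeat=2))
pr = Fraction(1, len(omega))

# Random variable X(w) = sum of the two dice; f_X(x) = Pr[X = x]
f_X = Counter()
for w in omega:
    f_X[w[0] + w[1]] += pr

print(f_X[7])              # 1/6, the most likely sum
print(sum(f_X.values()))   # 1, as required of every PMF
\end{verbatim}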
The \textbf{\textit{cumulative distribution function}} (CDF, Verteilungsfunktion) of $\mathcal{X}$ is the function which maps a real number to the probability that the value taken by the random variable is lower than, or equal to, that real number. It often suffices to state the PMF of the random variable, since we can easily derive the CDF from it:
\[ F_{\mathcal{X}} : \R \rightarrow [0, 1], \mediumhspace x \mapsto \Pr[\mathcal{X} \leq x] = \sum_{x' \in W_{\mathcal{X}} : x' \leq x} \Pr[\mathcal{X} = x'] = \sum_{x' \in W_{\mathcal{X}} : x' \leq x} f_{\mathcal{X}}(x') \]
\subsubsection{Expected value}
\setcounter{all}{27}
\begin{definition}[]{Expected Value}
The \textit{expected value} $\E[\mathcal{X}]$ describes the average value the random variable $\mathcal{X}$ takes. We define it as
\[ \E[\mathcal{X}] := \sum_{x \in W_{\mathcal{X}}} x \cdot \Pr[\mathcal{X} = x] \]
only if the sum converges absolutely; otherwise, the \textit{expected value} is undefined. The condition holds trivially for finite sample spaces.
\end{definition}
\begin{scriptsize}
In this lecture, only random variables whose expected value exists are covered, so that condition does not need to be checked here
\end{scriptsize}
\setcounter{all}{29}
As an alternative to the above definition over the elements of the range of the random variable, we can also sum over the sample space:
\begin{lemma}[]{Expected Value}
If $\mathcal{X}$ is a random variable, we have
\[ \E[\mathcal{X}] = \sum_{\omega \in \Omega} \mathcal{X}(\omega) \cdot \Pr[\omega] \]
\end{lemma}
If the range of the random variable consists only of non-negative integers, we can calculate the expected value with the following formula
\begin{theorem}[]{Expected Value}
Let $\mathcal{X}$ be a random variable with $W_{\mathcal{X}} \subseteq \N_0$. We then have
\begin{align*}
\E[\mathcal{X}] = \sum_{i = 1}^{\infty} \Pr[\mathcal{X} \geq i]
\end{align*}
\end{theorem}
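A quick sanity check of this formula in Python (assuming a fair die, $W_{\mathcal{X}} = \{1, \ldots, 6\}$; not part of the script):
\begin{verbatim}
from fractions import Fraction

values = range(1, 7)           # fair die, W_X = {1, ..., 6}
pr = Fraction(1, 6)

# Definition: E[X] = sum over x of x * Pr[X = x]
e_def = sum(x * pr for x in values)

# Theorem: E[X] = sum over i >= 1 of Pr[X >= i]
e_tail = sum(Fraction(len([x for x in values if x >= i]), 6)
             for i in range(1, 7))

assert e_def == e_tail == Fraction(7, 2)   # both give 3.5
\end{verbatim}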
\newpage
\fhlc{Cyan}{Conditional Random Variables}
\begin{definition}[]{Conditional Random Variable}
Let $\mathcal{X}$ be a random variable and let $A$ be an event with $\Pr[A] > 0$. Then
\begin{align*}
\Pr[(\mathcal{X} | A) \leq x] = \Pr[\mathcal{X} \leq x | A] = \frac{\Pr[\{\omega \in A : \mathcal{X}(\omega) \leq x\}]}{\Pr[A]}
\end{align*}
\end{definition}
\begin{theorem}[]{Expected Value (Conditional)}
Let $\mathcal{X}$ be a random variable. For pairwise disjoint events $A_1, \ldots, A_n$ with $A_1 \cup \ldots \cup A_n = \Omega$ and $\Pr[A_1], \ldots, \Pr[A_n] > 0$, we have (analogously for $n = \infty$)
\begin{align*}
\E[\mathcal{X}] = \sum_{i = 1}^{n} \E[\mathcal{X} | A_i] \cdot \Pr[A_i]
\end{align*}
\end{theorem}
\fhlc{Cyan}{Linearity of the expected value}
We can calculate the expected value of a sum of any number of random variables $\mathcal{X}_1, \ldots, \mathcal{X}_n : \Omega \rightarrow \R$ simply by summing the expected values of the individual random variables $\mathcal{X}_i$
\begin{theorem}[]{Linearity of expected value}
Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be random variables and $\mathcal{X} := a_1 \mathcal{X}_1 + \ldots + a_n \mathcal{X}_n + b$ for any $a_1, \ldots, a_n, b \in \R$. Then we have
\begin{align*}
\E[\mathcal{X}] = a_1 \cdot \E[\mathcal{X}_1] + \ldots + a_n \cdot \E[\mathcal{X}_n] + b
\end{align*}
\end{theorem}
In the simplest case, for two random variables $\mathcal{X}$ and $\mathcal{Y}$, we have $\E[\mathcal{X} + \mathcal{Y}] = \E[\mathcal{X}] + \E[\mathcal{Y}]$
\setcounter{all}{35}
\begin{definition}[]{Indicator Variable}
We use \textit{indicator variables} to express the probability that an event $A$ occurs through an expected value.
For an event $A \subseteq \Omega$ the accompanying indicator variable $\mathcal{X}_A$ is given by
\begin{align*}
\mathcal{X}_A(\omega) := \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{else } \end{cases}
\end{align*}
For the expected value of $\mathcal{X}_A$ we have: $\E[\mathcal{X}_A] = \Pr[A]$
\end{definition}
Indicator variables yield a fairly simple proof of the Inclusion-Exclusion-Principle; see Example 2.36 in the script.

\fhlc{Cyan}{Use:} We use indicator variables for experiments where we perform a certain action numerous times; thanks to the linearity of the expected value, this works whether or not the individual iterations depend on the previous outcomes.
\newpage
\subsubsection{Variance}
Even though two random variables may have the same expected value, they can still be significantly different. The variance describes the dispersion of the results, i.e. how far the values deviate from the expected value on average (measured in squared distance).
\setcounter{all}{39}
\begin{definition}[]{Variance}
For a random variable $\mathcal{X}$ with $\mu = \E[\mathcal{X}]$, the \textit{variance} $\text{Var}[\mathcal{X}]$ is given by
\begin{align*}
\text{Var}[\mathcal{X}] := \E[(\mathcal{X} - \mu)^2] = \sum_{x \in W_{\mathcal{X}}} (x - \mu)^2 \cdot \Pr[\mathcal{X} = x]
\end{align*}
$\sigma := \sqrt{\text{Var}[\mathcal{X}]}$ is called the \textit{standard deviation} of $\mathcal{X}$
\end{definition}
\begin{theorem}[]{Variance (easier)}
For any random variable $\mathcal{X}$ we have
\[ \text{Var}[\mathcal{X}] = \E[\mathcal{X}^2] - \E[\mathcal{X}]^2 \]
\end{theorem}
We also have
\begin{theorem}[]{Variance}
For any random variable $\mathcal{X}$ and $a, b \in \R$ we have
\[ \text{Var}[a \cdot \mathcal{X} + b] = a^2 \cdot \text{Var}[\mathcal{X}] \]
\end{theorem}
The expected value and the variance are special cases of the moments of a random variable.
\begin{definition}[]{Moment}
The \textbf{\textit{$k$th moment}} of a random variable $\mathcal{X}$ is $\E[\mathcal{X}^k]$, whereas $\E[(\mathcal{X} - \E[\mathcal{X}])^k]$ is called the \textbf{\textit{$k$th central moment}}.
\end{definition}
\shade{gray}{Note} The expected value is thus the first moment and the variance the second central moment.
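A small Python check (not part of the script) that the definition of the variance and the formula $\text{Var}[\mathcal{X}] = \E[\mathcal{X}^2] - \E[\mathcal{X}]^2$ agree, again for a fair die:
\begin{verbatim}
from fractions import Fraction

values = range(1, 7)                   # fair die
pr = Fraction(1, 6)
ex  = sum(x * pr for x in values)      # first moment E[X] = 7/2
ex2 = sum(x * x * pr for x in values)  # second moment E[X^2] = 91/6

var_direct = sum((x - ex) ** 2 * pr for x in values)  # definition
var_moment = ex2 - ex ** 2                            # E[X^2] - E[X]^2

assert var_direct == var_moment == Fraction(35, 12)
\end{verbatim}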
\subsubsection{Intuition}
\label{sec:random-var-details}
If you struggle to imagine what a random variable $\mathcal{X}$ is, or what, for example, $\mathcal{X}^2$ is, read on. As definition 3.25 states, a random variable is a function, which is why people tend to get confused: it is not a variable in the usual sense. With that in mind, an expression like $\mathcal{X}^2$ makes much more sense, as it is simply the square of the function's value, which in turn makes theorem 3.40 much more approachable, given the definition of the expected value.

Of note is that remembering the summation formulas for the variance (or knowing how to derive them) is handy for the exam, as that formula is not listed on the cheat-sheet provided by the teaching team as of FS25. Deriving it is very easy though: simply apply the definition of the expected value to the initial definition of the variance, which is listed on the cheat-sheet.
% Page 126 (actual)
\newpage
\subsection{Discrete distributions}
\subsubsection{Bernoulli-Distribution}
A random variable $\mathcal{X}$ with $W_{\mathcal{X}} = \{0, 1\}$ is called \textit{\textbf{Bernoulli distributed}} if and only if its probability mass function is of the form
\[ f_{\mathcal{X}}(x) = \begin{cases} p & x = 1 \\ 1 - p & x = 0 \\ 0 & \text{else} \end{cases} \]
The parameter $p$ is called the probability of success (Erfolgswahrscheinlichkeit). The Bernoulli distribution is used to describe boolean events (that either occur or don't). It is the trivial case of the binomial distribution with $n = 1$. If a random variable $\mathcal{X}$ is Bernoulli distributed, we write
\[ \mathcal{X} \sim \text{Bernoulli}(p) \]
and we have
\[ \E[\mathcal{X}] = p \hspace{1cm} \text{and} \hspace{1cm} \text{Var}[\mathcal{X}] = p(1 - p) \]
\subsubsection{Binomial Distribution}
If we perform a Bernoulli trial independently $n$ times (e.g. we flip a coin $n$ times), the number of successes is a random variable $\mathcal{X}$ that is called \textbf{\textit{binomially distributed}}, and we write
\[ \mathcal{X} \sim \text{Bin}(n, p) \]
Its probability mass function is $f_{\mathcal{X}}(i) = \binom{n}{i} p^i (1 - p)^{n - i}$ for $i \in \{0, \ldots, n\}$, and we have
\[ \E[\mathcal{X}] = np \hspace{1cm} \text{and} \hspace{1cm} \text{Var}[\mathcal{X}] = np(1 - p) \]
\subsubsection{Geometric Distribution}
If we have an experiment that is repeated until we achieve success, where the probability of success is $p$, the number of trials (which is described by the random variable $\mathcal{X}$) is \textbf{\textit{geometrically distributed}}. We write
\[ \mathcal{X} \sim \text{Geo}(p) \]
The probability mass function is given by
\[ f_{\mathcal{X}}(i) = \begin{cases} p(1 - p)^{i - 1} & \text{for } i \in \N \\ 0 & \text{else} \end{cases} \]
whilst the expected value and variance are given by
\[ \E[\mathcal{X}] = \frac{1}{p} \hspace{1cm} \text{and} \hspace{1cm} \text{Var}[\mathcal{X}] = \frac{1 - p}{p^2} \]
The cumulative distribution function is given by
\[ F_{\mathcal{X}}(n) = \Pr[\mathcal{X} \leq n] = \sum_{i = 1}^{n} \Pr[\mathcal{X} = i] = \sum_{i = 1}^{n} p(1 - p)^{i - 1} = 1 - (1 - p)^n \]
\shade{gray}{Note} Every trial in the geometric distribution is unaffected by the previous trials (memorylessness):
\setcounter{all}{45}
\begin{theorem}[]{Geometric Distribution}
If $\mathcal{X} \sim \text{Geo}(p)$, for all $s, t \in \N$ we have
\[ \Pr[\mathcal{X} \geq s + t \,|\, \mathcal{X} > s] = \Pr[\mathcal{X} \geq t] \]
\end{theorem}
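A minimal Python simulation of the defining experiment (not from the script; the helper \texttt{geometric\_trial} is made up), illustrating $\E[\mathcal{X}] = \frac{1}{p}$:
\begin{verbatim}
import random

def geometric_trial(p, rng):
    """Number of independent Bernoulli(p) trials up to and including
    the first success -- Geo(p) by definition."""
    count = 1
    while rng.random() >= p:
        count += 1
    return count

rng = random.Random(42)
p = 0.2
samples = [geometric_trial(p, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))   # close to E[X] = 1/p = 5
\end{verbatim}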
\newpage
\fhlc{Cyan}{Coupon Collector problem}
First, some theory regarding waiting for the $n$th success: the probability mass function is given by $f_{\mathcal{X}}(z) = \binom{z - 1}{n - 1} \cdot p^n \cdot (1- p)^{z - n}$, whereas the expected value is given by $\displaystyle\E[\mathcal{X}] = \sum_{i = 1}^{n} \E[\mathcal{X}_i] = \frac{n}{p}$, since the waiting time is the sum of $n$ geometric waiting times.

The coupon collector problem is a well-known problem where we want to collect all coupons on offer. How many coupons do we need to obtain on average to get one of each? We assume that each of the $n$ coupons is equally likely with every purchase and that getting a coupon doesn't depend on what coupons we already have (independence).

Let $\mathcal{X}$ be a random variable representing the number of purchases until the completion of the collection. We split up the time into separate phases, where $\mathcal{X}_i$ is the number of coupons needed to end phase $i$, which ends when we have found one of the $n - i + 1$ coupons not previously collected (i.e. we got a coupon we haven't gotten yet).

Logically, $\mathcal{X} = \sum_{i = 1}^{n} \mathcal{X}_i$. Each phase is a repeated experiment waiting for a success, so $\mathcal{X}_i$ is geometrically distributed with success probability $p = \frac{n - i + 1}{n}$, and we have $\E[\mathcal{X}_i] = \frac{n}{n - i + 1}$. With that, let's determine
\[ \E[\mathcal{X}] = \sum_{i = 1}^{n} \E[\mathcal{X}_i] = \sum_{i = 1}^{n} \frac{n}{n - i + 1} = n \cdot \sum_{i = 1}^{n} \frac{1}{i} = n \cdot H_n \]
where $H_n := \sum_{i = 1}^{n} \frac{1}{i}$ is the $n$th harmonic number, which we know (from Analysis) satisfies $H_n = \ln(n) +$\tco{1}, thus we have $\E[\mathcal{X}] = n \cdot \ln(n) +$\tco{n}. The idea of the transformation is to reverse the index $(n - i + 1)$, counting up instead of down, which massively simplifies the sum; we then extract the $n$ and use the asymptotics of $H_n$ to fully simplify.
\subsubsection{Poisson distribution}
The \textbf{\textit{Poisson distribution}} is applied when an individual event is unlikely, but the number of opportunities for it to occur is so large that we can still expect at least a few events to occur. We write
\[ \mathcal{X} \sim \text{Po}(\lambda) \]
An example would be the event that a person is involved in an accident during the next hour. The probability mass function is given by
\[ f_{\mathcal{X}}(i) = \begin{cases} \frac{e^{-\lambda}\lambda^i}{i!} & \text{for } i \in \N_0 \\ 0 & \text{else} \end{cases} \hspace{1cm} \text{and} \hspace{1cm} \E[\mathcal{X}] = \text{Var}[\mathcal{X}] = \lambda \]
\shade{cyan}{Using the Poisson distribution as limit of the binomial distribution} We can approximate the binomial distribution by the Poisson distribution if $n$ is large and $np$ is a small constant; in that case, $\lambda = \E[\mathcal{X}] = np$.
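A short Python sketch (not part of the script; the parameters are arbitrary) comparing the two probability mass functions for large $n$ and small $p$:
\begin{verbatim}
from math import comb, exp, factorial

n, p = 1000, 0.003                 # large n, small p, lambda = np = 3
lam = n * p

for i in range(7):
    binom   = comb(n, i) * p**i * (1 - p)**(n - i)   # Bin(n, p)
    poisson = exp(-lam) * lam**i / factorial(i)      # Po(lambda)
    print(i, round(binom, 5), round(poisson, 5))
# The two columns agree to roughly three decimal places.
\end{verbatim}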
\newpage
\subsection{Multiple random variables}
There are times when we are interested in the outcomes of multiple random variables simultaneously. For two random variables $\mathcal{X}$ and $\mathcal{Y}$, we evaluate probabilities of type
\[ \Pr[\mathcal{X} = x, \mathcal{Y} = y] = \Pr[\{ \omega \in \Omega : \mathcal{X}(\omega) = x, \mathcal{Y}(\omega) = y \}] \]
Here $\Pr[\mathcal{X} = x, \mathcal{Y} = y]$ is a shorthand notation for $\Pr[``\mathcal{X} = x'' \cap ``\mathcal{Y} = y'']$.

We define the \textit{joint probability mass function} $f_{\mathcal{X}, \mathcal{Y}}$ by
\[ f_{\mathcal{X}, \mathcal{Y}}(x, y) := \Pr[\mathcal{X} = x, \mathcal{Y} = y] \]
We can also get back to the individual probability mass function of each random variable:
\begin{align*}
f_{\mathcal{X}}(x) = \sum_{y \in W_{\mathcal{Y}}} f_{\mathcal{X}, \mathcal{Y}}(x, y) \hspace{1cm} \text{or} \hspace{1cm} f_{\mathcal{Y}}(y) = \sum_{x \in W_{\mathcal{X}}} f_{\mathcal{X}, \mathcal{Y}}(x, y)
\end{align*}
We hereby call $f_{\mathcal{X}}$ and $f_{\mathcal{Y}}$ the \textit{marginal densities} (Randdichte).

We define the \textbf{\textit{joint cumulative distribution function}} by
\begin{align*}
F_{\mathcal{X}, \mathcal{Y}}(x, y) := \Pr[\mathcal{X} \leq x, \mathcal{Y} \leq y] = \Pr[\{ \omega \in \Omega : \mathcal{X}(\omega) \leq x, \mathcal{Y}(\omega) \leq y \}] = \sum_{x' \leq x} \sum_{y' \leq y} f_{\mathcal{X}, \mathcal{Y}}(x', y')
\end{align*}
Again, we can use the marginal densities:
\begin{align*}
F_{\mathcal{X}}(x) = \sum_{x'\leq x} f_{\mathcal{X}}(x') = \sum_{x' \leq x} \sum_{y \in W_{\mathcal{Y}}} f_{\mathcal{X}, \mathcal{Y}}(x', y) \hspace{5mm} \text{and} \hspace{5mm} F_{\mathcal{Y}}(y) = \sum_{y'\leq y} f_{\mathcal{Y}}(y') = \sum_{y' \leq y} \sum_{x \in W_{\mathcal{X}}} f_{\mathcal{X}, \mathcal{Y}}(x, y')
\end{align*}
\subsubsection{Independence of random variables}
\setcounter{all}{52}
\begin{definition}[]{Independence}
Random variables $\mathcal{X}_1, \ldots, \mathcal{X}_n$ are called \textbf{\textit{independent}} if and only if for all $(x_1, \ldots, x_n) \in W_{\mathcal{X}_1} \times \ldots \times W_{\mathcal{X}_n}$ we have
\begin{align*}
\Pr[\mathcal{X}_1 = x_1, \ldots, \mathcal{X}_n = x_n] = \Pr[\mathcal{X}_1 = x_1] \cdot \ldots \cdot \Pr[\mathcal{X}_n = x_n]
\end{align*}
Or alternatively, using probability mass functions,
\begin{align*}
f_{\mathcal{X}_1, \ldots \mathcal{X}_n}(x_1, \ldots, x_n) = f_{\mathcal{X}_1}(x_1) \cdot \ldots \cdot f_{\mathcal{X}_n}(x_n)
\end{align*}
In words, this means that for independent random variables, their joint density is equal to the product of the individual marginal densities
\end{definition}
The following lemma shows that the above doesn't only hold for specific values, but also for sets
\begin{lemma}[]{Independence}
Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be independent random variables and let $S_1, \ldots, S_n \subseteq \R$ be arbitrary sets; then we have
\begin{align*}
\Pr[\mathcal{X}_1 \in S_1, \ldots, \mathcal{X}_n \in S_n] = \Pr[\mathcal{X}_1 \in S_1] \cdot \ldots \cdot \Pr[\mathcal{X}_n \in S_n]
\end{align*}
\end{lemma}
\begin{corollary}[]{Independence}
Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be independent random variables and let $I = \{i_1, \ldots, i_k\} \subseteq [n]$, then $\mathcal{X}_{i_1}, \ldots, \mathcal{X}_{i_k}$ are also independent
\end{corollary}
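A small Python sketch (not part of the script) that checks the definition on two fair dice: the first and the second die are independent, while the first die and the sum are not:
\begin{verbatim}
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice
pr = Fraction(1, len(omega))

joint_XY = Counter(); joint_XS = Counter()
f_X = Counter(); f_Y = Counter(); f_S = Counter()
for a, b in omega:
    joint_XY[a, b] += pr                 # X = first die, Y = second
    joint_XS[a, a + b] += pr             # S = X + Y
    f_X[a] += pr; f_Y[b] += pr; f_S[a + b] += pr

def independent(joint, f, g):
    # joint density == product of marginal densities everywhere?
    return all(joint[x, y] == f[x] * g[y] for x in f for y in g)

print(independent(joint_XY, f_X, f_Y))   # True:  X and Y independent
print(independent(joint_XS, f_X, f_S))   # False: X and X + Y are not
\end{verbatim}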
\begin{theorem}[]{Independence}
Let $f_1, \ldots, f_n$ be real-valued functions ($f_i : \R \rightarrow \R$ for $i = 1, \ldots, n$). If the random variables $\mathcal{X}_1, \ldots, \mathcal{X}_n$ are independent, then this also applies to $f_1(\mathcal{X}_1), \ldots, f_n(\mathcal{X}_n)$
\end{theorem}
\subsubsection{Composite random variables}
Using functions, we can combine multiple random variables over the same sample space.
\setcounter{all}{58}
\begin{theorem}[]{Two random variables}
For two independent random variables $\mathcal{X}$ and $\mathcal{Y}$, let $\mathcal{Z} := \mathcal{X} + \mathcal{Y}$. Then we have
\begin{align*}
f_{\mathcal{Z}}(z) = \sum_{x \in W_{\mathcal{X}}} f_{\mathcal{X}}(x) \cdot f_{\mathcal{Y}}(z - x)
\end{align*}
\end{theorem}
Analogously to the term used for power series, we call $f_{\mathcal{Z}}$ the ``convolution'' of $f_{\mathcal{X}}$ and $f_{\mathcal{Y}}$.
\subsubsection{Moments of composite random variables}
\setcounter{all}{60}
\begin{theorem}[]{Linearity of the expected value}
For random variables $\mathcal{X}_1, \ldots, \mathcal{X}_n$ and $\mathcal{X} := a_1 \mathcal{X}_1 + \ldots + a_n \mathcal{X}_n$ with $a_1, \ldots, a_n \in \R$ we have
\begin{align*}
\E[\mathcal{X}] = a_1 \E[\mathcal{X}_1] + \ldots + a_n \E[\mathcal{X}_n]
\end{align*}
\end{theorem}
There are no requirements in terms of independence of the random variables, unlike for the multiplicativity
\begin{theorem}[]{Multiplicativity of the expected value}
For independent random variables $\mathcal{X}_1, \ldots, \mathcal{X}_n$ we have
\begin{align*}
\E[\mathcal{X}_1 \cdot \ldots \cdot \mathcal{X}_n] = \E[\mathcal{X}_1] \cdot \ldots \cdot \E[\mathcal{X}_n]
\end{align*}
\end{theorem}
\begin{theorem}[]{Variance of multiple random variables}
For independent random variables $\mathcal{X}_1, \ldots, \mathcal{X}_n$ and $\mathcal{X} = \mathcal{X}_1 + \ldots + \mathcal{X}_n$ we have
\begin{align*}
\text{Var}[\mathcal{X}] = \text{Var}[\mathcal{X}_1] + \ldots + \text{Var}[\mathcal{X}_n]
\end{align*}
\end{theorem}
\subsubsection{Wald's Identity}
Wald's identity is used for cases where the number of summands is not a constant, commonly for algorithms that repeatedly call subroutines until a certain result is attained. The time complexity of such an algorithm can be analyzed by splitting up the algorithm into phases, where each phase is a call of the subroutine. The number of calls to the subroutine, and thus the number of phases, is then usually not deterministic but itself a random variable.
\setcounter{all}{65}
\begin{theorem}[]{Wald's Identity}
Let $\mathcal{N}$ and $\mathcal{X}$ be two independent random variables with $W_{\mathcal{N}} \subseteq \N$. Let
\begin{align*}
\mathcal{Z} := \sum_{i = 1}^{\mathcal{N}}\mathcal{X}_i
\end{align*}
where $\mathcal{X}_1, \mathcal{X}_2, \ldots$ are independent copies of $\mathcal{X}$. Then we have
\begin{align*}
\E[\mathcal{Z}] = \E[\mathcal{N}] \cdot \E[\mathcal{X}]
\end{align*}
\end{theorem}
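A quick Python simulation (not part of the script; the chosen distributions are arbitrary) illustrating Wald's identity with $\mathcal{N} \sim \text{Geo}(0.5)$ and $\mathcal{X} \sim \text{Bernoulli}(0.3)$:
\begin{verbatim}
import random

rng = random.Random(0)

def geometric(p):                  # N ~ Geo(p), E[N] = 1/p
    n = 1
    while rng.random() >= p:
        n += 1
    return n

def z():
    """Z = sum of N independent copies of X ~ Bernoulli(0.3),
    where N ~ Geo(0.5) is drawn independently of the X_i."""
    n = geometric(0.5)
    return sum(1 if rng.random() < 0.3 else 0 for _ in range(n))

samples = [z() for _ in range(200_000)]
print(sum(samples) / len(samples))  # close to E[N]*E[X] = 2*0.3 = 0.6
\end{verbatim}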
\newpage
\subsection{Approximating probabilities}
Since it can be very expensive to calculate the exact probabilities in some cases, we will now cover some tools that allow us to bound probabilities from above or below.
\subsubsection{Markov's \& Chebyshev's inequalities}
\setcounter{all}{67}
\begin{theorem}[]{Markov's inequality}
Let $\mathcal{X}$ be a random variable that takes only non-negative values. Then for all real $t > 0$, we have
\begin{align*}
\Pr[\mathcal{X} \geq t] \leq \frac{\E[\mathcal{X}]}{t} \Longleftrightarrow \Pr[\mathcal{X} \geq t \cdot \E[\mathcal{X}]] \leq \frac{1}{t}
\end{align*}
\end{theorem}
Markov's inequality is fairly straightforward to prove, and it already allows us to make some useful statements; for instance, in the coupon collector problem, the probability that we need more than $100\, n \ln(n)$ purchases is at most about $\frac{1}{100}$.

The following inequality usually gives a much more precise bound than Markov's inequality
\begin{theorem}[]{Chebyshev's inequality}
Let $\mathcal{X}$ be a random variable and let $t > 0$ be real. Then we have
\begin{align*}
\Pr[|\mathcal{X} - \E[\mathcal{X}]| \geq t] \leq \frac{\text{Var}[\mathcal{X}]}{t^2} \Longleftrightarrow \Pr[|\mathcal{X} - \E[\mathcal{X}]| \geq t \cdot \sqrt{\text{Var}[\mathcal{X}]}] \leq \frac{1}{t^2}
\end{align*}
\end{theorem}
A common tactic is to bound the original probability $\Pr[\mathcal{X} \geq t]$ by $\Pr[|\mathcal{X} - \E[\mathcal{X}]| \geq t - \E[\mathcal{X}]]$ (for $t > \E[\mathcal{X}]$) and then apply Chebyshev's inequality with $t' := t - \E[\mathcal{X}]$
\subsubsection{Chernoff bounds}
The Chernoff bounds are specifically designed for sums of independent Bernoulli variables
\setcounter{all}{70}
\begin{theorem}[]{Chernoff bounds}
Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be independent Bernoulli-distributed random variables with $\Pr[\mathcal{X}_i = 1] = p_i$ and $\Pr[\mathcal{X}_i = 0] = 1 - p_i$. Then we have for $\mathcal{X} := \sum_{i = 1}^{n} \mathcal{X}_i$
\begin{enumerate}[label=(\roman*)]
\item $\Pr[\mathcal{X} \geq (1 + \delta)\E[\mathcal{X}]] \leq e^{-\frac{1}{3}\delta^2\E[\mathcal{X}]}$ \largehspace for all $0 < \delta \leq 1$
\item $\Pr[\mathcal{X} \leq (1 - \delta)\E[\mathcal{X}]] \leq e^{-\frac{1}{2}\delta^2\E[\mathcal{X}]}$ \largehspace for all $0 < \delta \leq 1$
\item $\Pr[\mathcal{X} \geq t] \leq 2^{-t}$ \largehspace for $t \geq 2e\E[\mathcal{X}]$
\end{enumerate}
\end{theorem}
We determine the $\delta$ in the inequality by solving $t = (1 + \delta)\E[\mathcal{X}]$ or, for the second bound, $t = (1 - \delta)\E[\mathcal{X}]$. For the third bound, no $\delta$ is required
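A small Python experiment (not part of the script; the parameters are arbitrary) comparing bound (i) with the empirical tail probability for $\mathcal{X} \sim \text{Bin}(1000, \frac{1}{2})$ and $\delta = 0.1$; the bound holds, but is rather loose here:
\begin{verbatim}
import random
from math import exp

rng = random.Random(1)
n, p, delta = 1000, 0.5, 0.1
mu = n * p                          # E[X] for X = sum of n Bernoulli(p)

trials = 2000
hits = sum(
    sum(rng.random() < p for _ in range(n)) >= (1 + delta) * mu
    for _ in range(trials)
)
print(hits / trials)                # empirical Pr[X >= (1+delta)E[X]]
print(exp(-delta**2 * mu / 3))      # Chernoff bound (i): ~0.189
\end{verbatim}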
\newpage
\subsection{Randomized Algorithms}
In contrast to \textit{deterministic} algorithms, the output here is \textbf{\textit{not}} guaranteed to be the same for the same input data across reruns. While this can be an issue in some cases, it usually allows us to reduce time complexity \textit{significantly}; for several problems, randomized algorithms are much simpler and faster than the best known deterministic ones. The problem with \textit{true} randomness is that it is hardly attainable inside computers; some kind of predictability will always be there in some form or another, especially if the random number generator is algorithm-based and not based on random events from the outside. In this course, we will nevertheless assume that the random number generators provided by programming languages actually provide independent random numbers.

In the realm of randomized algorithms, one differentiates between two approaches
\begin{tables}{lcc}{Types of randomized algorithms}
& \shade{ForestGreen}{Monte-Carlo-Algorithm} & \shade{ForestGreen}{Las-Vegas-Algorithm} \\
Always correct & \ding{55} & \ding{51} \\
Deterministic runtime & \ding{51} & \ding{55} \\
\end{tables}
While this (always correct, random runtime) is the normal case for Las-Vegas-Algorithms, we can also consider the following variant: let the algorithm terminate once a certain runtime bound has been exceeded and return something like ``???'' if it has not found a correct answer just yet. We will most commonly use this definition of a Las-Vegas-Algorithm in this course
\subsubsection{Reduction of error}
\setcounter{all}{72}
\begin{theorem}[]{Error reduction Las-Vegas-Algorithm}
Let $\mathcal{A}$ be a Las-Vegas-Algorithm with $\Pr[\mathcal{A}(I) \text{ correct}] \geq \varepsilon$.
Then, for all $\delta > 0$ we call $\mathcal{A}_{\delta}$ an algorithm that calls $\mathcal{A}$ until we either get a result that is not ``???'' or we have executed $N = \varepsilon^{-1} \ln(\delta^{-1})$ times. For $\mathcal{A}_{\delta}$ we then have
\begin{align*}
\Pr[\mathcal{A}_{\delta}(I) \text{ correct}] \geq 1 - \delta
\end{align*}
\end{theorem}
For Monte-Carlo-Algorithms, repetition likewise makes the probability of error decrease rapidly. It is, however, not easy to determine whether or not an answer is correct, unless the algorithm only outputs two different values \textit{and} we know that one of these values is \textit{always} correct
\setcounter{all}{74}
\begin{theorem}[]{Error reduction}
Let $\mathcal{A}$ be a randomized algorithm, outputting either \textsc{Yes} or \textsc{No}, where
\begin{align*}
\Pr[\mathcal{A}(I) = \textsc{Yes}] & = 1 \mediumhspace \text{if $I$ is a \textsc{Yes}-instance} \\
\Pr[\mathcal{A}(I) = \textsc{No}] & \geq \varepsilon \mediumhspace \text{if $I$ is a \textsc{No}-instance}
\end{align*}
Then, for all $\delta > 0$ we call $\mathcal{A}_{\delta}$ the algorithm that calls $\mathcal{A}$ until either \textsc{No} is returned or until we get \textsc{Yes} $N = \varepsilon^{-1} \ln(\delta^{-1})$ times. Then for all instances $I$ we have
\begin{align*}
\Pr[\mathcal{A}_{\delta}(I) \text{ correct}] \geq 1 - \delta
\end{align*}
\end{theorem}
This can also be mirrored (with the roles of \textsc{Yes} and \textsc{No} swapped), and its usage is very straightforward.
\newpage
If we however have Monte-Carlo-Algorithms with two-sided errors, i.e. there is error in both directions, we have
\begin{theorem}[]{Two-Sided Error reduction}
Let $\varepsilon > 0$ and $\mathcal{A}$ be a randomized algorithm that always outputs either \textsc{Yes} or \textsc{No}, where
\begin{align*}
\Pr[\mathcal{A}(I) \text{ correct}] \geq \frac{1}{2} + \varepsilon
\end{align*}
Then, for all $\delta > 0$ we call $\mathcal{A}_{\delta}$ the algorithm that calls $\mathcal{A}$ $N = 4 \varepsilon^{-2} \ln(\delta^{-1})$ independent times and then returns the majority answer. Then we have
\begin{align*}
\Pr[\mathcal{A}_{\delta}(I) \text{ correct}] \geq 1 - \delta
\end{align*}
\end{theorem}
For randomized algorithms for optimization problems, like the calculation of a largest possible stable set as seen in Example 2.37 in the script, we usually only consider whether they achieve the desired outcome.
\begin{theorem}[]{Optimization problem algorithms}
Let $\varepsilon > 0$ and $\mathcal{A}$ be a randomized algorithm for a maximization problem, for which
\begin{align*}
\Pr[\mathcal{A}(I) \geq f(I)] \geq \varepsilon
\end{align*}
Then, for all $\delta > 0$ we call $\mathcal{A}_{\delta}$ the algorithm that calls $\mathcal{A}$ $N = 4 \varepsilon^{-2} \ln(\delta^{-1})$ independent times and then returns the \textit{best} result. Then we have
\begin{align*}
\Pr[\mathcal{A}_{\delta}(I) \geq f(I)] \geq 1 - \delta
\end{align*}
\end{theorem}
For minimization problems, analogously, we can replace $\geq f(I)$ with $\leq f(I)$
% P154
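A toy Python sketch (not from the script; \texttt{noisy\_algorithm} is a made-up stand-in) of the majority-vote amplification for a two-sided-error algorithm with $\varepsilon = 0.1$:
\begin{verbatim}
import random

rng = random.Random(7)

def noisy_algorithm(truth):
    """Stand-in for a two-sided-error Monte Carlo algorithm that is
    correct with probability 1/2 + eps (here eps = 0.1)."""
    return truth if rng.random() < 0.6 else not truth

def amplified(truth, N):
    """Run N independent times and return the majority answer."""
    yes = sum(noisy_algorithm(truth) for _ in range(N))
    return yes > N / 2

runs = 10_000
print(sum(noisy_algorithm(True) for _ in range(runs)) / runs)  # ~0.60
print(sum(amplified(True, 101) for _ in range(runs)) / runs)   # close to 1
\end{verbatim}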
\subsubsection{Sorting and selecting}
The QuickSort algorithm is a well-known example of a Las-Vegas algorithm. It \textit{always} sorts correctly, but its runtime depends on the selection of the pivot elements, which happens randomly.
\begin{recall}[]{QuickSort}
As covered in the Algorithms \& Data Structures lecture, here are some important facts
\begin{itemize}
\item Time complexity: \tcl{n \log(n)}, \tct{n \log(n)}, \tco{n^2}
\item Performance depends on the selection of the pivot: the closer its \textit{value} is to the median, the better; its current position in the array is irrelevant.
\item In the algorithm below, \textit{ordering} (i.e. partitioning) refers to the operation where all elements smaller than the pivot element are moved to its left and all larger ones to its right.
\end{itemize}
\end{recall}
\begin{algorithm}
\caption{\textsc{QuickSort}}
\begin{algorithmic}[1]
\Procedure{QuickSort}{$A$, $l$, $r$}
\If{$l < r$}
\State $p \gets$ \textsc{Uniform}($\{l, l + 1, \ldots, r\}$) \Comment{Choose pivot element randomly}
\State $t \gets$ \textsc{Partition}($A$, $l$, $r$, $p$) \Comment{Return index of pivot element (after ordering)}
\State \Call{QuickSort}{$A$, $l$, $t - 1$} \Comment{Sort to the left of pivot}
\State \Call{QuickSort}{$A$, $t + 1$, $r$} \Comment{Sort to the right of pivot; the pivot itself is already in place}
\EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
\newcommand{\qsv}{\mathcal{T}_{l, r}}
We call $\qsv$ the random variable describing the number of comparisons executed during the execution of \textsc{QuickSort}($A, l, r$). To prove that the average-case time complexity in fact is \tct{n \log(n)}, we need to show that
\begin{align*}
\E[\qsv] \leq 2(n + 1) \ln(n) + \text{\tco{n}}
\end{align*}
which can be achieved using the linearity of the expected value and an induction proof. (Script: p. 154)

\fhlc{Cyan}{Selection problem}
For this problem, we want to find the $k$-th smallest value in a sequence $A[1], \ldots, A[n]$. An easy option would be to simply sort the sequence and then return the $k$-th element of the sorted array. The only problem: sorting costs \tco{n \log(n)}. The \textsc{QuickSelect} algorithm solves the problem in expected time \tco{n}
\begin{algorithm}
\caption{\textsc{QuickSelect}}
\begin{algorithmic}[1]
\Procedure{QuickSelect}{$A, l, r, k$}
\State $p \gets \textsc{Uniform}(\{l, l + 1, \ldots, r\})$ \Comment{Choose pivot element randomly}
\State $t \gets \textsc{Partition}(A, l, r, p)$
\If{$t = l + k - 1$}
\State \Return{$A[t]$} \Comment{Found element searched for}
\ElsIf{$t > l + k - 1$}
\State \Return{\Call{QuickSelect}{$A, l, t - 1, k$}} \Comment{Searched element is to the left}
\Else
\State \Return{\Call{QuickSelect}{$A, t + 1, r, k - (t - l + 1)$}} \Comment{To the right; $t - l + 1$ elements are ruled out}
\EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
\subsubsection{Primality test}
Deterministically testing for primality is very expensive if we use a simple algorithm, namely \tco{\sqrt{n}}. There are nowadays deterministic algorithms that achieve polynomial time (in the number of bits of $n$), but they are very complex. Thus, randomized algorithms to the rescue, as they are much easier to implement and also much faster. With the right precautions, they can also be very accurate, see theorem 2.74 for example.

A simple randomized algorithm would be to randomly pick a number in the interval $[2, \sqrt{n}]$ and check if that number is a divisor of $n$. The problem: the probability that a single attempt finds such a \textit{certificate} for the compositeness of $n$ is very low; for $n = pq$ with primes $p \approx q$, only about one in $\sqrt{n}$ candidates is a divisor.
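A minimal Python sketch of this naive test (not from the script; helper names made up), illustrating the low success probability for $n = 101 \cdot 103$:
\begin{verbatim}
import random
from math import isqrt

def naive_composite_test(n, rng):
    """One attempt of the naive test: pick a random candidate divisor
    in [2, sqrt(n)] and check it."""
    a = rng.randint(2, isqrt(n))
    return n % a == 0              # True = certificate: n is composite

rng = random.Random(3)
n = 101 * 103                      # composite, sqrt(n) ~ 102
hits = sum(naive_composite_test(n, rng) for _ in range(100_000))
print(hits / 100_000)              # ~0.01: only 101 divides n among
                                   # the roughly 100 candidates
\end{verbatim}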
Looking back at modular arithmetic in Discrete Maths, we find a solution to the problem:
\begin{theorem}[]{Fermat's little theorem}
If $n \in \N$ is prime, then for all numbers $0 < a < n$ we have
\begin{align*}
a^{n - 1} \equiv 1 \texttt{ mod } n
\end{align*}
\end{theorem}
Using exponentiation by squaring, we can calculate $a^{n - 1} \texttt{ mod } n$ in \tco{k^3}, where $k$ is the number of bits of $n$.
\begin{algorithm}
\caption{\textsc{Miller-Rabin-Primality-Test}}\label{alg:miller-rabin-primality-test}
\begin{algorithmic}[1]
\Procedure{Miller-Rabin-Primality-Test}{$n$}
\If{$n = 2$}
\State \Return \texttt{true}
\ElsIf{$n$ even or $n = 1$}
\State \Return \texttt{false}
\EndIf
\State Choose $a \in \{2, 3, \ldots, n - 1\}$ randomly
\State Calculate $k, d \in \Z$ with $n - 1 = d2^k$ and $d$ odd \Comment{See below for how to do that}
\State $x \gets a^d \texttt{ mod } n$
\If{$x = 1$ or $x = n - 1$}
\State \Return \texttt{true}
\EndIf
\While{not repeated more than $k - 1$ times} \Comment{Repeat $k - 1$ times}
\State $x \gets x^2 \texttt{ mod } n$
\If{$x = 1$}
\State \Return \texttt{false}
\EndIf
\If{$x = n - 1$}
\State \Return \texttt{true}
\EndIf
\EndWhile
\State \Return \texttt{false}
\EndProcedure
\end{algorithmic}
\end{algorithm}
This algorithm performs \tco{\ln(n)} modular multiplications. If $n$ is prime, the algorithm always returns \texttt{true}. If $n$ is composite, the algorithm returns \texttt{false} with probability at least $\frac{3}{4}$.
\newpage
\fhlc{Cyan}{Notes}
We can determine $k, d \in \Z$ with $n - 1 = d2^k$ and $d$ odd easily using the following algorithm
\begin{algorithm}
\caption{Get $d$ and $k$ easily}\label{alg:get-d-k}
\begin{algorithmic}[1]
\State $k \gets 0$
\State $d \gets n - 1$
\While{$d$ is even}
\State $d \gets \frac{d}{2}$
\State $k \gets k + 1$
\EndWhile
\end{algorithmic}
\end{algorithm}
What we are doing here is dividing out all factors of $2$ from $n - 1$, leaving the odd part $d$; the loop maintains the invariant $n - 1 = d \cdot 2^k$ throughout.
\subsubsection{Target-Shooting}
The Target-Shooting problem is the following: given a set $U$ and a subset thereof, $S \subseteq U$, whose cardinality is unknown, how large is the quotient $\frac{|S|}{|U|}$? We define an indicator variable for $S$ by $I_S : U \rightarrow \{0, 1\}$, where $I_S(u) = 1$ if and only if $u \in S$.

The Target-Shooting algorithm approximates the above quotient:
\begin{algorithm}
\caption{Target-Shooting}\label{alg:target-shooting}
\begin{algorithmic}[1]
\Procedure{TargetShooting}{$N, U$}
\State Choose $u_1, \ldots, u_N \in U$ randomly, uniformly and independently
\State \Return $N^{-1} \cdot \sum_{i = 1}^{N} I_S(u_i)$
\EndProcedure
\end{algorithmic}
\end{algorithm}
For this algorithm, two assumptions have to be made:
\begin{itemize}
\item $I_S$ has to be efficiently computable
\item We need an efficient procedure to choose a uniformly random element from the set $U$.
\end{itemize}
\begin{theorem}[]{Target-Shooting}
Let $\delta, \varepsilon > 0$. If $N \geq 3 \frac{|U|}{|S|} \cdot \varepsilon^{-2} \cdot \ln\left(\frac{2}{\delta}\right)$, the output of \textsc{Target-Shooting} is, with probability \textit{at least} $1 - \delta$, in the interval $\left[ (1 - \varepsilon) \frac{|S|}{|U|}, (1 + \varepsilon) \frac{|S|}{|U|} \right]$
\end{theorem}
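A classic instance of target-shooting (plain Python, not from the script; areas take the place of cardinalities): with $U$ the unit square and $S$ the quarter disc, $\frac{|S|}{|U|} = \frac{\pi}{4}$, so the algorithm estimates $\pi$:
\begin{verbatim}
import random

rng = random.Random(5)

def estimate_pi(N):
    """Target shooting: draw N uniform points from the unit square and
    count how many land in the quarter disc (I_S is a cheap check)."""
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1
               for _ in range(N))
    return 4 * hits / N

print(estimate_pi(1_000_000))   # ~3.14
\end{verbatim}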
\subsubsection{Finding duplicates}
Deterministically, this could be achieved using a \textsc{HashMap} or the like: iterate over all items, hash them, and check whether the hash is already present in the map. This, though, uses significant amounts of extra memory and is also computationally expensive. A cheaper option (in terms of memory and time complexity) is to use a \textit{Bloom filter}. The randomized algorithms also use hash functions, but their details are not relevant for the exam.
\begin{definition}[]{Hash Function}
If we have $\mathcal{S} \subseteq U$, i.e. our dataset $\mathcal{S}$ is a subset of a \textit{universe} $U$, a \textit{hash function} is a map $h: U \rightarrow [m]$, where $[m] = \{1, \ldots, m\}$ and $m$ is the number of available memory cells. It is assumed that we can \textit{efficiently} calculate the hash of an element and that the hash values are uniformly distributed, i.e. for $u \in U$ we have $\Pr[h(u) = i] = \frac{1}{m}$ for all $i \in [m]$
\end{definition}
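A minimal Bloom filter sketch in Python (not from the script; the parameters $m$, $k$ and the salting scheme are arbitrary choices), just to make the idea tangible:
\begin{verbatim}
import hashlib

class BloomFilter:
    """Minimal sketch: m bits, k hash functions derived from SHA-256
    with different salts. May report false positives, never false
    negatives -- suitable for cheap duplicate detection."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        for salt in range(self.k):
            h = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice"); bf.add("bob")
print(bf.might_contain("alice"), bf.might_contain("carol"))
# True False (the second answer is correct with high probability)
\end{verbatim}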