Key Discrete Distributions

\[\newcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\V}{\mathbb{V}}\]

In this set of notes, we introduce the key discrete distributions:

Our focus in this set of notes is defining the formal properties of each distribution. Our discussion in class will focus more on the “when and why?” questions dictating when we should use a particular distribution.

Review: Random Variables, Mean, and Variance

Recall that a random variable, \(X\), is functionally specified by two components:

  • The support of \(X\) is the set of values \(X\) can take, encoded as real numbers. In the discrete random variable context, the support of \(X\) is typically integral: \(\textsf{supp}(X) \subseteq \mathbb{Z}\)
  • The probability mass function (PMF) of \(X\) which is a mapping from \(\textsf{supp}(X)\) to \([0, 1]\) satisfying: \[ \sum_{x \in \textsf{supp}(X)} \P(X = x) = 1\]

Alongside these two defining quantities, we often report the mean (or expected value) and variance of a random variable:

\[\begin{align*} \E[X] &= \sum_{x \in \textsf{supp}(X)} x\, \P(X = x) \\ \V[X] &= \E[(X - \E[X])^2] \\ &= \sum_{x \in \textsf{supp}(X)} (x - \E[X])^, \P(X = x) \\ &= \left(\sum_{x \in \textsf{supp}(X)} x^2\, \P(X = x)\right) - \E[X]^2 \\ &= \E[X^2] - \E[X]^2 \end{align*}\]

Recall that the variance \(\V[X]\) is always non-negative and is strictly positive if \(|\textsf{supp}(X)| \geq 2\) - that is, if \(X\) is “truly” random in the sense of possibly taking more than one value.1

Bernoulli

Our simplest distribution is the Bernoulli distribution, named after the great Swiss mathematician Jacob Bernoulli. The Bernoulli distribution models a single “coin flip”-type event: that is, an event with two possible outcomes conventionally called “success” and “failure”. To make this a random variable, we associate “success” with the value \(1\) and “failure” with the value \(0\).

Note that, even though we called these two outcomes “success” and “failure”, we can use a Bernoulli distribution for anything with two outcomes: left vs. right, up vs. down, right vs. wrong, or happened vs didn’t happen. These last two pairings are incredibly important. Whenever we make a binary prediction, it is either correct or incorrect - a Bernoulli outcome. Because of this, Bernoulli random variables are commonly used to study the predictive accuracy of classification systems, an incredibly important topic in Machine Learning. Bernoulli distributions are also “the most extreme” version of bounded random variables (putting all of the probability on two endpoints) so we can often bound the performance of any predictive model for bounded outcomes using Bernoulli variables.

A Bernoulli distributed \(X\) is then defined by the PMF:

\[ P(X = 1) = p \text{ and } P(X = 0) = 1 - P(X \neq 0) = 1 - P(X = 1) = 1 - p = q \]

Tip

Make sure you can explicitly justify each step connecting \(P(X = 0)\) to \(1 - p\).

If we are clever - and recall that \(p^0 = 1\) for any \(p \neq 0\)-we can write the Bernoulli PMF as

\[ \P(X = x) = p^x(1-p)^{1-x} = p^xq^{1-x} \text{ for } x \in \{0, 1\}\]

Note that, even though we sometimes write the Bernoulli with two parameters, \(p,q\), it is really a one-parameter distribution since \(q = 1 - p\) by construction.

The mean and variance are quite easy to calculate:

\[\begin{align*} \E[X] &= \sum_{x \in \textsf{supp}(X)} x * \P(X = x) \\ &= 0 * \P(X = 0) + 1 * \P(X = 1) \\ &= 0 * (1 - p) + 1 * p \\ &= p \end{align*}\]

Tip

Why are we not surprised to see “expectation = probability” here? Think about the connection between Bernoulli random variables and indicator functions.

While we can compute the variance directly, it’s a bit easier to work from \(\V[X] = \E[X^2] - \E[X]^2\) if we note that \(X^2 = X\) for a Bernoulli random variable.

\[\begin{align*} \V[X] &= \E[X^2] - \E[X]^2 \\ &= \E[X] - \E[X^2] \\ &= p - p^2 \\ &= p(1 - p) \\ &= pq \end{align*}\]

From this expression, it’s not hard to see that the variance of a Bernoulli random variable is never more than \(0.25\) and that the maximum is obtained when \(p = q = 0.5\): that is, the “50/50” coin-flip is the “most random” coin.

Rademacher

A Rademacher variable is a close cousin of the Bernoulli often used to model incremental processes that, at each step, either become “a bit better” or “a bit worse”. Specifically, while a Bernoulli is a 0/1 random variable, Rademachers take values \(\pm 1\).

\[P(X = x) = \begin{cases} 1 & \text{ with probability } p \\ 0 & \text{ with probability } q = 1 - p \end{cases}\]

While Bernoulli variables come with all sorts of weights, Rademachers are normally symmetric, taking \(\pm 1\) with equal 50% probability.

To get the mean and variance of a Rademacher, let’s use the linearity of expectation. Let \(R \sim \text{Rademacher}(p)\); then \(R = 2B - 1\) where \(B \sim \text{Bernoulli}(p)\). Hence,

\[\begin{align*} \E[R] &= \E[2B - 1] \\ &= 2\E[B] - 1 \\ &= 2p - 1 \end{align*}\]

Similarly,

\[\begin{align*} \V[R] &= \V[2B - 1] \\ &= 2^2 \V[B] \\ &= 4 * p * (1-p) \\ &= 4pq \end{align*}\]

As before, this is “most random” when \(p = q = 0.5\). In this case, however we get a variance of 1 for \(R\) instead of \(0.25\) for \(B\). Tricks like this make Rademachers very useful in theoretical analyses.

Binomial

The Binomial distribution arises as the sum of a known, fixed number of \(n\) identically and independently distributed (IID) Bernoulli random variables. This concept of IID is incredibly important and we will see it many times throughout this course.

Because a binomial is a sum if IID elements, its mean and variance are relatively simple to compute. Let \(X \sim \text{Binomial}(n, p)\) - that is, let \(X\) be the sum of \(n\) \(\text{Bernoulli}(p)\) random variables, \(X_1, X_2, \dots, X_n\). Then

\[\begin{align*} \E[X] &= \E\left[\sum_{i=1}^n X_i\right] \\ &= \sum_{i=1}^n \E[X_i] \text{ by linearity of expectation} \\ &= \sum_{i=1}^n p \\ &= np \end{align*}\]

Similarly,

\[\begin{align*} \V[X] &= \V\left[\sum_{i=1}^n X_i\right] \\ &= \sum_{i=1}^n \V[X_i] \text{ (Variances add for independent RVs)}\\ &= \sum_{i=1}^n p(1-p) \\ &= n p(1-p) \end{align*}\]

While the mean and variance are quite easy, it’s a bit trickier to derive the PMF from first principles. This is an important element of “probability thinking” - it is often easier to compute aspects of distributions indirectly instead of computing the distribution in toto and then deriving its properties. In particular, when you can break a problem into a set of IID elements - as we have done here - tools like linearity, expectation, and variance make life quite easy.

Suppose we want to compute \(\P(X = x)\). We know that we must have \(x\) successes and \(n - x\) failures for a sum of \(x\). The probability of the event

\[\P\left[(X_1, X_2, \dots, X_n) = (\underbrace{1, 1, \dots, 1}_{\text{$x$ times}}, \underbrace{0, 0, \dots, 0}_{\text{$n-x$ times}})\right]\]

can be computed by independence of the individual Bernoullis:

\[\begin{align*} \P\left[(X_1, X_2, \dots, X_n) = (\underbrace{1, 1, \dots, 1}_{\text{$x$ times}}, \underbrace{0, 0, \dots, 0}_{\text{$n-x$ times}})\right] &= \prod_{i=1}^n P(X_i = x_i) \\ &= \prod_{i=1}^x \P(X_i = 1) * \prod_{i=x+1}^{n} \P(X_i = 0) \\ &= \prod_{i=1}^x p * \prod_{i=x+1}^{n} (1-p) \\ &= p^x (1-p)^{n-x} \end{align*}\]

But \(\P(X = x)\) is not just this particular ordering of \((X_1, \dots, X_n)\). For purposes of the Binomial random variable, we don’t really care what order these happened, so we have \(\binom{n}{x}\) possible orderings (of \(n\) flips, choosing \(x\) of them to be 1). Because the set of possible orderings is a disjoint partition, we can get the aggregate probability \(\P(X = x)\) by multipling \(\binom{n}{x}\) by the probability of each outcome, which we already showed was \(p^x(1-p)^{n-x}\). Taken together, this gives us:

\[\P(X = x) = \binom{n}{x}p^x(1-p)^x \text{ for } x \in \{0, \dots, n\}\]

We pause here to note that the name binomial distribution comes from the similarity between this PMF and the standard binomial expansion:

\[(a + b)^n = \sum_{x=0}^n \binom{n}{x} a^xb^{n-x} \]

We get the binomial distribution by setting \(a = p, b = 1 - p\). This lets us easily confirm that the sum of the binomial PMF is indeed 1, as we require: \[\begin{align*} \sum_{x = 0}^n \P(X = x) &= \sum_{x=0}^n \binom{n}{x}p^x(1-p)^x \\ &= (p + (1-p))^n \\ &= 1^n \\ &= 1 \end{align*}\]

Poisson

The Binomial distribution occurs with a fixed number of events \(n\) and known probability \(p\). An important ‘limiting’ case is where the number of events is very large and the probability is very small; in this case, the important number is the expected number of successes \(\mu = n * p\). We model this case as a Poisson random variable. Specifically, a Poisson model is a model for count values that are, on average, reasonably small (around \(\mu\)) but potentially unbounded.

The Poisson PMF with mean \(\mu\) is given by \[\P(X = k) \frac{\mu^k e^{-\mu}}{k!} \text{ for } k \in \{0, 1, 2 \dots, \} \]

As we have discussed before, factorials grow even more rapidly than exponentials, so this tends towards zero as \(k\) gets large: that is, very large counts become exceedingly unlikely. You can derive the Poisson PMF from the binomial PMF by setting \(p = \mu / n\) and taking the \(n \to \infty\) limit, but the arithmetic is a bit cumbersome and so we do not pursue it here.

While we have already called \(\mu\), the Poisson mean, we can show this explicitly:

\[\begin{align*} \E[X] &= \sum_{x=0}^{\infty} x \P(X = x) \\ &= \sum_{x=0}^{\infty} x * \frac{\mu^x e^{-\mu}}{x!} \\ &= \sum_{x=1}^{\infty} x * \frac{\mu^x e^{-\mu}}{x!} \\ &= e^{-\mu} \sum_{x=1}^{\infty} x * \frac{\mu^x}{x!} \\ &= e^{-\mu} \mu \sum_{x=1}^{\infty} \frac{\mu^{x-1}}{(x-1)!} \\ &= e^{-\mu} \mu \sum_{y=0}^{\infty} \frac{\mu^{y}}{y!} \\ &= e^{-\mu} \mu e^{\mu} \\ &= \mu \end{align*} \]

Next, we turn to the variance. As similar argument shows \(\E[X^2] = \mu^2 + \mu\), so we get \(\V[X] = \E[X^2] - \E[X]^2 = \mu^2 + \mu - \mu^2 = \mu\).

This is a remarkable property: for a Poisson random variable, a single parameter controls both the mean and the variance. Further more, as the expected number of counts becomes larger, so does the variance.

If we think back to the binomial connection, we can see how this arises: the binomial variance is given by \(n p q = np(1-p) = np - np^2\). We create a Poisson limit by setting \(p = \mu / n\) and letting \(n \to \infty\). Here, this yields:

\[np - np^2 = n * \left(\frac{\mu}{n}\right) - n* \left(\frac{\mu}{n}\right)^2 = \mu - \mu^2 / n\].

As \(n \to \infty\), we simply get the variance \(\mu\) which matches direct calculation. At a high level, for the Poisson mean to get larger, we need \(p\) to get larger, which in turn raises the variance (since we are far below the ‘turning point’ of binomial variance at \(p = 0.5\)).

Geometric

So far, we have considered distributions that count the number of times “success” happens out of a fixed number of trials. We now turn to distributions with a fixed number of successes, but a random number of total trials. In these models, the random variable of interest is the total number of trials.2

Our basic model is the geometric distribution. The total number of coin flips until we get our first heads (inclusive. If we denote this variable as \(X\), we can easily see that the PMF is given by:

\[\P(X = x) = p(1-p)^{x-1} \text{ for } x \in \{1, 2, \dots\}\]

This PMF arises because we don’t need to account for order: we know the last flip is a success with probability \(p\) and the previous \(x-1\) flips are each failures, occurring with probability \(q = 1-p\). By independence, these probabilities can be combined with simple multiplication, giving the resulting PMF.

Mean and variance can be computed in many ways. Here, we’ll show a general approach that uses differentiation and geometric series creatively to compute several useful quantities in the same manner.

Recall that a geometric series satisfies:

\[\sum_{i=0}^{\infty} ar^i = \frac{a}{1-r}\]

We can differentiate both sides of this with respect to \(r\) to find:

\[\begin{align*} \sum_{i=0}^{\infty} ar^i &= \frac{a}{1-r} \\ \frac{\text{d}}{\text{d}r}\sum_{i=0}^{\infty} ar^i &= \frac{\text{d}}{\text{d}r}\frac{a}{1-r} \\ \sum_{i=0}^{\infty} air^{i-1} &= \frac{a}{(1-r)^2} * (-1) * (-1)\\ \sum_{i=0}^{\infty} air^{i-1} &= \frac{a}{(1-r)^2} \end{align*}\]

where the right hand side picks up two \(-1\) terms: one from the exponent on the denominator and one from the minus sign inside the denominator (chain rule).

We can repeat this trick again:

\[\begin{align*} \sum_{i=0}^{\infty} air^{i-1} &= \frac{a}{(1-r)^2} \\ \frac{\text{d}}{\text{d}r}\sum_{i=0}^{\infty} air^{i-1} &= \frac{\text{d}}{\text{d}r}\frac{a}{(1-r)^2} \\ \sum_{i=0}^{\infty} ai(i-1)r^{i-2} &= \frac{2a}{(1-r)^3} \end{align*}\]

With these three formulae in hand, we are ready to show the basic properties of a geometric random variable:

\[\begin{align*} \sum_{x=1}^{\infty} \P(X = x) &= \sum_{x=1}^{\infty} p(1-p)^{x-1} \\ &= \sum_{y=0}^{\infty} p(1-p)^y \\ &= \frac{p}{1-(1-p)} \\ &= 1 \end{align*}\]

where we used only the “basic” geometric series formula here.

Next, for expectation:

\[\begin{align*} \E[X] &= \sum_{x=1}^{\infty} x \P(X = x) \\ &= \sum_{x=1}^{\infty} x p(1-p)^{x-1} \\ &= \frac{p}{(1-(1-p))^2} \\ &= \frac{p}{p^2} \\ &= \frac{1}{p} \end{align*}\]

where we used our first differentiated formula. This fits our intuition: if something happens \(p\) times, we need \(1/p\) tries for it to happen on average.

Similarly, the second differentiated formula can be used to compute \(\E[X(X-1)] = \E[X^2] - \E[X]\) and from there, \(\V[X]\) = $. Again, comparing against intuition, variance is highest for small \(p\) - if something is very rare, it’s very hard to say how long until it happens.

The geometric distribution has a remarkable property called memorylessness: \(P(X > m + n | X > n) = P(X > m)\). This says that, if you have already tried \(n\) times, the probability of taking \(m + n\) tries is the same as \(m\) tries if starting afresh. The coin flip process is “memoryless” in that it doesn’t remember or depend upon what came before. Because of the memorylessness property, we can never say a success is “due up” in a geometric process. This is at stark odds with our intuition about gambling - if something hasn’t happened for a while, it’s bound to happen. If the events are truly IID, this simply isn’t the case.

Memorylessness is a bit magic: the geometric distribution (and its close kin) is actually the only discrete distribution with this property. The other famous distribution with this property is the (continuous) exponential distribution, which has the same negative exponential structure. For this reason, the geometric distribution is sometimes called the discrete exponential distribution, though that name has mainly fallen out of paper.

To show memorylessness, we can use some of our basic principles of conditional PMFs. Before doing so, let’s define some useful alternative formulations of the PMF.

  • The CDF - cumulative distribution function - is \[F(x) = \P(X \leq x) = \sum_{i=1}^{x} \P(X = x)\]
  • The CCDF - complementary CDF - is \[\overline{F}(x) = \P(X > x) = \sum_{x+1}^{\infty} \P(X = x)\]

Clearly, \(F(x) + \overline{F}(x) = 1\) for all \(x\). With these in hand, it’s easy to state the manipulation formulas for “self-conditioned” random variables.

  • \(\P(X = x | X \leq x) = \P(X = x) / F(x)\)
  • \(\P(X = x | X > x) = \P(X = x) / \overline{F}(x)\)

We will use the latter form for showing memorylessness of the geometric.

First, note that

\[\begin{align*} F(x) &= \sum_{i=1}^x p(1-p)^{i-1} \\ &= p \sum_{j=0}^{x-1} (1-p)^j \\ &= p * \frac{1-(1-p)^{x}}{1-(1-p)} \\ &= p * \frac{1 - (1-p)^x}{p}\\ &= 1 - (1-p)^x \end{align*}\]

using the formula for a finite geometric series. From this, we have \(\overline{F}(x) = (1-p)^x\). At this point, you should be realizing that things are likely to work out very nicely when dividing \(\P(X = x)\) and \(\overline{F}(x)\).

Hence, \[\begin{align*} \P(X = x + y | X > y) &= \frac{\P(X = x + y) }{\overline{F}(y)} \\ &= \frac{p(1-p)^{x+y-1}}{(1-p)^x} \\ &= p(1-p)^{y-1} \\ &= P(Y = y) \end{align*}\] where \(Y\) is a “new” (restarted) random variable.

Negative Binomial

TODO

Hypergeometric

TODO

Footnotes

  1. Recall that we sometimes consider constants, \(a\), as “degenerate” random variables satisfying \(P(X = a) = 1\) to make the statements of our theorems easier.↩︎

  2. Unfortunately, there are two conflicting conventions used for some of these distributions: some count the total number of trials (success + failure) while others count only the number of failures. This is not a hard change - it’s just a simple \(+s\) for \(s\) successes - but it makes comparing formulae from different references a bit inconvenient.↩︎