The Multivariate Normal Distribution
\[\newcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\V}{\mathbb{V}} \newcommand{\R}{\mathbb{R}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bA}{\mathbf{A}} \newcommand{\C}{\mathbb{C}}\]
In this set of notes, we begin our study of the most important distribution for random vectors, the multivariate normal distribution. Before we dig into the multivariate normal, however, it’s worth briefly discussing the univariate normal and how the concepts of variance and covariance apply to vectors.
Univariate Normal Distribution
The univariate normal distribution is a continuous distribution with support on \(\R\); that is, a normal random variable can take any real value. The standard normal distribution, almost universally called \(Z\), has PDF:
\[f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\]
The “core” of this distribution is reasonably straightforward, \(e^{-z^2/2}\). This passes the “sniff” test for a distribution - it goes to zero quite quickly as \(|z|\) gets large, so it’s very believable that we have a finite integral. The “normalization constant” of \(1/\sqrt{2\pi}\) is a bit freaky, but there are deep reasons for it.
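If you want to convince yourself numerically that \(1/\sqrt{2\pi}\) really is the right constant, here is a minimal Riemann-sum sketch in Python; the grid bounds and resolution are arbitrary choices, not anything prescribed by these notes.

```python
# Numerical sanity check (illustration only): the standard normal PDF
# should integrate to ~1 once we include the 1/sqrt(2*pi) constant.
import numpy as np

z = np.linspace(-10.0, 10.0, 200_001)        # wide grid; the tails beyond +/-10 are negligible
dz = z[1] - z[0]
pdf = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

print(pdf.sum() * dz)                        # Riemann sum, approximately 1.0
```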
By definition, we can compute the mean and variance of \(Z\):
\[\begin{align*} \E[Z] &= \int z f_Z(z)\,\text{d}z \\ &= \frac{1}{\sqrt{2\pi}} \int z e^{-z^2/2}\,\text{d}z \\ &= \frac{1}{\sqrt{2\pi}}\left(-e^{-z^2/2}\right)_{z=-\infty}^{\infty} \\ &= \frac{1}{\sqrt{2\pi}}\left(-e^{-\infty^2/2} + e^{-(-\infty)^2/2}\right) \\ &= \frac{1}{\sqrt{2\pi}}(0 + 0) \\ &= 0 \end{align*}\]
As we would expect for a distribution whose PDF is symmetric around 0.
The variance is a bit trickier:
\[\begin{align*} \V[Z] &= \E[Z^2] - \E[Z]^2 \\ &= \E[Z^2] - 0 \\ &= \E[Z^2] \\ &= \int z^2 f_Z(z)\,\text{d}z \\ &= \dots \\ &= 1 \end{align*}\]
where the dots capture some cumbersome but standard algebra (integration by parts does the trick).
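Rather than grinding through that algebra here, a quick simulation check is reassuring; the sample size and seed below are arbitrary choices, and the sample mean and variance should land near 0 and 1 respectively.

```python
# Monte Carlo sanity check of E[Z] = 0 and V[Z] = 1 for the standard normal.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)   # a large sample of standard normals

print(z.mean())   # close to E[Z] = 0
print(z.var())    # close to V[Z] = 1
```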
The normal distribution has a somewhat special property: it is a stable distribution, so if \(X\) has a normal distribution, so does \(aX + b\) for any scalars \(a, b \in \R\). We can use this fact to derive the distribution of any normal random variable.
If \(X\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\), it can be shown that \(X \buildrel d\over= \mu + \sigma Z\) for standard normal \(Z\). (The \(\buildrel d\over=\) symbol means these two quantities have the same distribution. For these notes, it’s fine to interpret it as a standard equality.) Specifically, if \(X \buildrel d\over= a + bZ\), we can work out what the values of \(a\) and \(b\) must be:
\[\begin{align*} \mu &= \E[X] \\ &= \E[a+bZ] \\ &= a + b\E[Z] \\ &= a + b * 0 \\ &= a \\ \implies a &= \mu \end{align*}\]
and
\[\begin{align*} \sigma^2 &= \V[X] \\ &= \V[a + bZ] \\ &= b^2 \V[Z] \\ &= b^2 * 1 \\ \implies b &= \sigma \end{align*}\]
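Here is a small simulation sketch of this shift-and-scale relationship; the particular \(\mu\), \(\sigma\), seed, and sample size are arbitrary illustrative choices.

```python
# Illustration that mu + sigma * Z behaves like a N(mu, sigma^2) random variable.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0                      # arbitrary example values

x = mu + sigma * rng.standard_normal(1_000_000)

print(x.mean())   # approximately mu
print(x.var())    # approximately sigma**2
```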
We can use this relationship to find the PDF and CDF of \(X \sim \mathcal{N}(\mu, \sigma^2)\). For the CDF, note that:
\[\begin{align*} F_X(x) &= \P(X \leq x) \\ &= \P(\mu + \sigma Z \leq x) \\ &= \P\left(Z \leq \frac{x - \mu}{\sigma}\right) \\ &= \Phi\left(\frac{x - \mu}{\sigma}\right) \end{align*}\]
where \(\Phi(\cdot)\) is the CDF of the standard normal \(Z\). \(\Phi(\cdot)\) is one of those functions that doesn’t have a “clean” formula in the usual sense, but it is so useful that basically all calculators and computers have it, or an equivalent function, built in: C++, Python, R, and JavaScript all provide one. In that way, it is quite like the trig functions \(\sin(\cdot)\), \(\cos(\cdot)\), and \(\tan(\cdot)\).
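For instance, in Python the standard library exposes the error function, and \(\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)\); SciPy, if installed, provides the normal CDF directly. A minimal sketch:

```python
# Two equivalent ways to evaluate Phi(z) in Python.
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi(z) via the error function identity Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(std_normal_cdf(1.96))    # approximately 0.975

# If SciPy is available, the same value comes from its built-in normal CDF:
# from scipy.stats import norm
# print(norm.cdf(1.96))
```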
Note that, by construction,
\[\Phi(z) = \P(Z \leq z) = \int_{-\infty}^z f_Z(z')\,\text{d}z' = \int_{-\infty}^z \frac{e^{-(z')^2/2}}{\sqrt{2\pi}}\,\text{d}z'\]
With this in hand, we can get the PDF for \(X\) as well:
\[\begin{align*} f_X(x) &= \frac{\text{d}}{\text{d}x}F_X(x) \\ &= \frac{\text{d}}{\text{d}x}\Phi\left(\frac{x - \mu}{\sigma}\right) \\ &= \Phi'\left(\frac{x - \mu}{\sigma}\right) * \frac{\text{d}}{\text{d}x}\left(\frac{x - \mu}{\sigma}\right) \\ &= \phi\left(\frac{x - \mu}{\sigma}\right) * \frac{1}{\sigma} \\ &= \left(\frac{\exp\left\{-\frac{\left(\frac{x-\mu}{\sigma}\right)^2}{2}\right\}}{\sqrt{2\pi}}\right)\frac{1}{\sigma} \\ &= \frac{\exp\left\{-(x-\mu)^2/2\sigma^2\right\}}{\sqrt{2\pi\sigma^2}} \end{align*}\]
Not so bad - and we don’t even have to compute the mean or variance since we started with those.
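As a quick check on the boxed formula, one can compare it numerically against the equivalent “scale the standard PDF” expression \(\phi\left(\frac{x-\mu}{\sigma}\right)/\sigma\); the values of \(\mu\), \(\sigma\), and the grid below are arbitrary.

```python
# Check that the derived density agrees with phi((x - mu)/sigma) / sigma.
import numpy as np

mu, sigma = 2.0, 3.0                          # arbitrary example values
x = np.linspace(-15.0, 20.0, 7)               # a handful of test points

# Direct evaluation of the boxed formula for f_X(x).
f_direct = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Standard normal PDF evaluated at the standardized points, divided by sigma.
z = (x - mu) / sigma
f_scaled = (np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)) / sigma

print(np.allclose(f_direct, f_scaled))        # True: the two expressions agree
```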
We will have a lot more to say about the normal distribution as we proceed in this course.
Multivariate Normal Distribution
To develop the multivariate normal distribution, we will use the following multivariate notion of stability. A random vector \(\bX\) follows a (multivariate) normal distribution if \(\langle \ba, \bX \rangle = \sum_i a_i X_i\) has a normal distribution for all vectors \(\ba\). This is a strong claim, as it requires us to check an infinite number of inner products. In fact, this is essentially too strong for us to ever actually verify. We typically take multivariate normality as a refutable assumption.
Let’s start exploring the ramifications of this definition for different choices of \(\ba\):
First we can take \(\ba\) to be the standard Euclidean basis vectors, \(\mathbf{e}_i\). The basis vectors are vectors that are all zeros except for a 1 in the \(i\)th location: e.g. \(\R^3 \ni \mathbf{e}_2 = (0, 1, 0)\). Clearly, \(\langle \mathbf{e}_i, \bX \rangle = X_i\). This tells us that the marginal distribution of each \(X_i\), that is, each component of \(\bX\), must be normal.
In particular, we can let \(\E[\langle \mathbf{e}_i, \bX\rangle] = \mu_i\). From here, some linear algebra lets us see that
\[\E[\bX] = \vec{\mu} = (\E[X_1], \E[X_2], \dots)\]
That is, the mean vector of \(\bX\) is just the individual means of each \(X_i\) component.
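A quick simulation illustrates both facts: marginals of a multivariate normal are normal, and the mean vector is just the componentwise means. (The mean vector, covariance matrix, seed, and sample size below are arbitrary choices for illustration.)

```python
# Simulate a multivariate normal and inspect its marginals and mean vector.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

X = rng.multivariate_normal(mu, Sigma, size=500_000)   # each row is one draw of the vector

print(X.mean(axis=0))   # approximately mu: componentwise means recover the mean vector
print(X[:, 1].var())    # approximately Sigma[1, 1]: each marginal has its own normal variance
```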
Now consider a general \(\ba\). What is \(\E[\langle \ba, \bX\rangle]\)? We can work this out using the definition of the inner product and linearity of (scalar) expectation: \[\begin{align*} \E[\langle \ba, \bX\rangle] &= \E\left[\sum_i a_iX_i\right] \\ &= \sum_i a_i \E[X_i] \\ &= \sum_i a_i \mu_i \\ &= \langle \ba, \vec{\mu} \rangle \end{align*}\]
Variance is trickier:
\[\begin{align*} \V[\langle \ba, \bX\rangle] &= \V[\sum_i a_iX_i] \\ &= \sum_{i,j=1}^n \C[a_iX_i, a_jX_j] \\ &= \sum_{i,j=1}^n a_ia_j\C[X_i, X_j] \end{align*}\]
Without any more assumptions, we are essentially stuck here statistically, but we can use some linear algebra to clean this up. Define a matrix \(\V[\bX]\) by \[\V[\bX]_{ij} = \C[X_i, X_j]\] This matrix is called the variance (or covariance) matrix of the random vector \(\bX\). (While covariance and variance are quite different in the random variable context, they are used essentially interchangeably in discussions of vectors.) With this variance matrix, we can write \[\V[\langle \ba, \bX\rangle] = \langle \ba, \V[\bX]\ba \rangle = \ba^T\V[\bX]\ba\] which is quite elegant indeed. Note that, like scalar variance, multiplication gets “squared”, but here we have to be a bit more precise about how both multiplications by \(\ba\) are actually applied.
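As before, a simulation can confirm the identity \(\V[\langle \ba, \bX\rangle] = \ba^T\V[\bX]\ba\) numerically; the vector \(\ba\), mean, covariance, seed, and sample size below are arbitrary illustrative choices.

```python
# Numerical check of V[<a, X>] = a^T V[X] a (and E[<a, X>] = <a, mu>) on simulated data.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a = np.array([0.5, -1.0, 2.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
proj = X @ a                     # each row's inner product <a, X>

print(proj.var())                # empirical variance of <a, X>
print(a @ Sigma @ a)             # the quadratic form a^T Sigma a: should match the line above
print(proj.mean(), a @ mu)       # and the empirical mean matches <a, mu>, as derived earlier
```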
Practice Problems
Suppose \(Z_1, Z_2\) are independent standard normal random variables. What is the PDF of \(Z_1^2 + Z_2^2\)? Does this distribution have another name?
Let \(X \sim \mathcal{N}(5, 3^2)\). Show that the covariance of \(X\) with a constant is 0.
Let \(\bX\) be a random 5-vector following a multivariate normal distribution, where each marginal has a \(\mathcal{N}(\mu, \sigma^2)\) distribution. Suppose further that the correlation of \(X_i\) and \(X_j\) is given by \(\rho^{|i-j|}\). What is the variance of the average of the elements of \(\bX\)?
Let \(\bZ_1, \bZ_2\) be independent standard normal 2-vectors. What is \(\E[\|\bZ_1 - \bZ_2\|^2]\), i.e., the average squared distance between them?
Let \(\bZ\) be a standard normal 2-vector and let \(\theta_{\bZ}\) be its angle to the positive part of the real axis. What is the distribution of \(\theta_{\bZ}\)?
Let \(X, Y\) be independent \(\mathcal{N}(\mu, \sigma^2)\) random variables. What is the joint PDF of \(X, Y\)? What is the joint PDF of \(U = X - Y\) and \(V = X + Y\)? Show that \(U, V\) are independent.
Hint: Express \(\binom{U}{V} = \bA \binom{X}{Y}\) for some \(\bA\) and use linear algebra to work out the relevant quantities. This probably shouldn’t require any calculus.
Let \(X_1, \dots, X_k\) be independent \(\mathcal{N}(\mu, \sigma^2)\) random variables. What is \(\E[(\sum_i X_i)^2]\)?
Suppose \(X, Y\) are jointly normal, each having (marginal) mean \(\mu\) and marginal variance \(\sigma^2\). Suppose also that the correlation of \(X, Y\) is \(\rho\). What is \(f_{(X, Y)}(x, y)\)? What is its maximum value?
Show that it is impossible to have three random variables, where each pairwise correlation is \(\rho_{ij} = -2/3\) (\(i \neq j\)).
Hint: What would the variance matrix of these three random variables have to be? Is this a valid variance matrix?
Let \(\bX\) be a multivariate normal random vector with mean 0 and variance \(\Sigma\). What transformation \(\bA\) can we apply to \(\bX\) (what matrix multiplication) so that \(\bA\bX\) has identity covariance? This is known as the whitening transform for \(\bX\) as it turns (correlated) \(\bX\) into “white noise”; it plays an important role in signal processing.
Let \(\bX\) and \(\by\) be the design matrix and response vector for an OLS regression of the form \(y = \hat{\beta}_1x_1 + \hat{\beta}_2x_2\). What is the probability that \(\hat{\beta}_1 > \hat{\beta}_2\) in terms of the true values \((\beta_1^*, \beta_2^*)\), the design matrix \(\bX\), and the noise level \(\sigma^2\)? You may assume each element of \(\by\) satisfies: \(y_i = \beta_1^* x_{1i} + \beta_2^* x_{2i} + \epsilon_i\) where each \(\epsilon_i\) is IID \(\mathcal{N}(0, \sigma^2)\). (Hint: compute the distribution of \(\hat{\beta} = (\bX^T\bX)^{-1}\bX^T\by\) first.)
Footnotes
This is profoundly not obvious, so don’t worry about it.