STA 9890 - Fundamentals of ML: III. Accuracy and Loss in ML

Author

Michael Weylandt

Published

March 18, 2026

\[\newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\V}{\mathbb{V}} \newcommand{\P}{\mathbb{P}} \newcommand{\C}{\mathbb{C}} \newcommand{\K}{\mathbb{K}} \newcommand{\Ycal}{\mathcal{Y}} \newcommand{\Xcal}{\mathcal{X}} \newcommand{\Ccal}{\mathcal{C}} \newcommand{\Hcal}{\mathcal{H}} \newcommand{\Ncal}{\mathcal{N}} \newcommand{\Fcal}{\mathcal{F}} \newcommand{\Ocal}{\mathcal{O}} \newcommand{\Pcal}{\mathcal{P}} \newcommand{\Ucal}{\mathcal{U}} \newcommand{\Dcal}{\mathcal{D}} \newcommand{\bbeta}{\mathbf{\beta}} \newcommand{\bone}{\mathbf{1}} \newcommand{\bzero}{\mathbf{0}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bf}{\mathbf{f}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bA}{\mathbf{A}} \newcommand{\bB}{\mathbf{B}} \newcommand{\bC}{\mathbf{C}} \newcommand{\bD}{\mathbf{D}} \newcommand{\bU}{\mathbf{U}} \newcommand{\bV}{\mathbf{V}} \newcommand{\bI}{\mathbf{I}} \newcommand{\bH}{\mathbf{H}} \newcommand{\bW}{\mathbf{W}} \newcommand{\bY}{\mathbf{Y}} \newcommand{\bK}{\mathbf{K}} \newcommand{\argmin}{\text{arg\,min}} \newcommand{\argmax}{\text{arg\,max}} \newcommand{\MSE}{\text{MSE}} \newcommand{\Tr}{\text{Tr}}\]

Today:

Last week, we showed that OLS can be derived purely as a ‘loss minimization’ problem, without reference to a specific probabilistic model. Specifically, we argued that given:

OLS is the natural and unique algorithm to use. This was a special case of empirical risk minimization: minimizing our loss on the training data. This gives us a general principle for building ML systems: once we define the loss, we can just throw it at a sufficiently interesting set of functions and pick the minimizer as our predictor. (We will return to the set of functions at a later point.)
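As a quick numerical sketch of this principle (illustrative, not from the notes): if we take squared loss and minimize the empirical risk over the set of linear functions, the first-order condition of that minimization is exactly the OLS normal equations. All names below are made up for the example.

```python
# Sketch: empirical risk minimization with squared loss over linear functions.
# The minimizer is the OLS fit; we verify the gradient of the empirical risk
# vanishes there (the ERM first-order condition).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])       # illustrative "true" coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

def empirical_risk(beta):
    """Average squared loss on the training data."""
    return np.mean((y - X @ beta) ** 2)

# Closed-form minimizer over linear functions: the OLS estimate
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the empirical risk at the OLS fit -- should be (numerically) zero
grad = -2 * X.T @ (y - X @ beta_ols) / n
```

The point is that nothing probabilistic was assumed: OLS falls out of pure loss minimization.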

This suggests that it’s worth spending some time thinking about what makes a good loss function.

Bias-Variance Decomposition

Mathematically, we have observations of the form \(Y = f_*(X) + \epsilon\) where \(f_*\) is some unknown function (the “true regression function”). Since we are estimating the best approximant in a set of functions (e.g., the best possible line), let’s call that \(\tilde{f}\), and call our estimate thereof \(\hat{f}\). To summarize:

\[\begin{align*} f_* &= \text{True Regression Function - Fixed} \\ \tilde{f} &= \text{Best Possible Guess - Fixed} \\ \hat{f} &= \text{Actual Guess Given Limited Data - Random} \end{align*}\]

Note that, on average over training sets, \(\E[\hat{f}] = \tilde{f}\), but \(\E[\hat{f}] \neq f_*\). The systematic error between \(f_*\) and \(\tilde{f}\) is not due to data-randomness: it persists no matter how much data we collect.
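A minimal simulation can make the three functions concrete (this example and its setup are illustrative assumptions, not from the notes): take \(f_*(x) = x^2\), fit lines \(\hat{f}\) to many independent training sets, and average the fits. The average recovers the best line \(\tilde{f}\), which for \(x^2\) on \(\text{Uniform}[-1,1]\) is the constant \(1/3\), and it stays systematically far from \(f_*\).

```python
# Sketch: average many fitted lines (hat f) over independent training sets.
# The average approximates the best line (tilde f), not the true f_*.
import numpy as np

rng = np.random.default_rng(1)
f_star = lambda x: x ** 2                    # true regression function (quadratic)
x_grid = np.linspace(-1, 1, 5)               # points at which we compare fits

fits = []
for _ in range(2000):                        # many independent training sets
    x = rng.uniform(-1, 1, size=50)
    y = f_star(x) + rng.normal(scale=0.1, size=50)
    b = np.polyfit(x, y, deg=1)              # hat f: least-squares line
    fits.append(np.polyval(b, x_grid))

avg_fit = np.mean(fits, axis=0)              # approximates tilde f on the grid
# avg_fit is roughly the constant 1/3 (the best line for x^2 on Uniform[-1,1]),
# while f_star(x_grid) ranges from 0 to 1: the gap is the bias.
```

No amount of extra training data removes the gap between `avg_fit` and `f_star(x_grid)`; only changing the function class does.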

Working these all out, we need to look into the expected error taking expectation over the (random) training data \(\Dcal\) and the random noise in the test observation, \(\epsilon\):

\[\begin{align*} \E_{\Dcal, \epsilon}[(Y - \hat{Y})^2] &= \E_{\Dcal, \epsilon}[(f_*(x) + \epsilon - \hat{f}(x))^2] \\ &= \E_{\Dcal, \epsilon}[(f_*(x) - \hat{f}(x) + \epsilon)^2] \\ &= \E_{\Dcal, \epsilon}[(f_*(x) - \hat{f}(x))^2 + 2\epsilon(f_*(x) - \hat{f}(x)) + \epsilon^2] \\ &= \E_{\Dcal, \epsilon}[(f_*(x) - \hat{f}(x))^2] + 2\E_{\Dcal, \epsilon}[\epsilon(f_*(x) - \hat{f}(x))] + \E_{\Dcal, \epsilon}[\epsilon^2] \\ &= \underbrace{\E_{\Dcal}[(f_*(x) - \hat{f}(x))^2]}_{\text{No Dependence on $\epsilon$}} + 2\underbrace{\E_{\Dcal, \epsilon}[\epsilon(f_*(x) - \hat{f}(x))]}_{\text{Product of Two Independent Terms}} + \underbrace{\E_{\epsilon}[\epsilon^2]}_{\text{No Dependence on $\Dcal$}} \\ &= \E_{\Dcal}[(f_*(x) - \hat{f}(x))^2] + 2\underbrace{\E_{\epsilon}[\epsilon]}_{\text{No Dependence on $\Dcal$}} \cdot \underbrace{\E_{\Dcal}[f_*(x) - \hat{f}(x)]}_{\text{No Dependence on $\epsilon$}} + \E_{\epsilon}[\epsilon^2] \\ &= \E_{\Dcal}[(f_*(x) - \hat{f}(x))^2] + 2 \cdot 0 \cdot \E_{\Dcal}[f_*(x) - \hat{f}(x)] + \underbrace{\sigma^2}_{\text{Irreducible Error}} \end{align*}\]

So now let’s just take apart that first term. Note that, per above, the only random thing left is \(\hat{f}\):

\[\begin{align*} \E_{\Dcal}[(f_*(x) - \hat{f}(x))^2] &= \E_{\Dcal}[(f_*(x) - \tilde{f}(x) + \tilde{f}(x) - \hat{f}(x))^2] \\ &= \E_{\Dcal}[(f_*(x) - \tilde{f}(x))^2 + 2(f_*(x) - \tilde{f}(x))(\tilde{f}(x) - \hat{f}(x)) + (\tilde{f}(x) - \hat{f}(x))^2] \\ &= \E_{\Dcal}[(f_*(x) - \tilde{f}(x))^2] + 2\E_{\Dcal}[(f_*(x) - \tilde{f}(x))(\tilde{f}(x) - \hat{f}(x))] + \E_{\Dcal}[(\tilde{f}(x) - \hat{f}(x))^2] \\ &= (f_*(x) - \tilde{f}(x))^2 + 2(f_*(x) - \tilde{f}(x))\E_{\Dcal}[\tilde{f}(x) - \hat{f}(x)] + \E_{\Dcal}[(\tilde{f}(x) - \hat{f}(x))^2] \\ &= (f_*(x) - \tilde{f}(x))^2 + 2(f_*(x) - \tilde{f}(x))(\tilde{f}(x) - \E_{\Dcal}[\hat{f}(x)]) + \E_{\Dcal}[(\tilde{f}(x) - \hat{f}(x))^2] \\ &= (f_*(x) - \tilde{f}(x))^2 + 2(f_*(x) - \tilde{f}(x)) \cdot 0 + \E_{\Dcal}[(\tilde{f}(x) - \hat{f}(x))^2] \\ &= (f_*(x) - \tilde{f}(x))^2 + \E_{\Dcal}[(\hat{f}(x) - \tilde{f}(x))^2] \\ &= (f_*(x) - \tilde{f}(x))^2 + \E_{\Dcal}[(\hat{f}(x) - \E_{\Dcal}[\hat{f}(x)])^2] \\ &= \text{Bias}^2 + \text{Variance} \end{align*}\]

Putting everything together, we have

\[\begin{align*} \E_{\Dcal, \epsilon}[(Y - \hat{Y})^2] = \text{Bias}^2 + \text{Variance} + \sigma^2\end{align*}\]
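The decomposition can be checked numerically by Monte Carlo (the setup below is an illustrative assumption, not from the notes): at a fixed test point \(x_0\), estimate \(\text{Bias}^2\), Variance, and \(\sigma^2\) separately and compare their sum to the directly simulated expected squared prediction error.

```python
# Sketch: Monte Carlo check of E[(Y - hat Y)^2] = Bias^2 + Variance + sigma^2
# at a single test point x0, fitting lines to data from a sinusoidal f_*.
import numpy as np

rng = np.random.default_rng(2)
f_star = lambda x: np.sin(2 * x)             # illustrative true regression function
sigma, n_train, x0 = 0.2, 40, 0.5

preds, sq_errs = [], []
for _ in range(5000):
    x = rng.uniform(-1, 1, size=n_train)
    y = f_star(x) + rng.normal(scale=sigma, size=n_train)
    b = np.polyfit(x, y, deg=1)              # hat f: a fitted line
    yhat = np.polyval(b, x0)
    preds.append(yhat)
    y_test = f_star(x0) + rng.normal(scale=sigma)   # fresh test observation
    sq_errs.append((y_test - yhat) ** 2)

preds = np.array(preds)
bias_sq = (np.mean(preds) - f_star(x0)) ** 2  # (f_*(x0) - tilde f(x0))^2
variance = np.var(preds)                      # E[(hat f - E[hat f])^2]
lhs = np.mean(sq_errs)                        # E[(Y - hat Y)^2], simulated directly
rhs = bias_sq + variance + sigma ** 2
# lhs and rhs agree up to Monte Carlo error
```

Note that `lhs` is always at least \(\sigma^2\): the irreducible error bounds how well any predictor can do.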

So the ‘two-term’ formula (\(\text{Bias}^2 + \text{Variance}\)) is the second derivation above: it decomposes \(\E_{\Dcal}[(f_*(x) - \hat{f}(x))^2]\), the MSE of \(\hat{f}\) as an estimator of \(f_*\). The three-term formula combines the two derivations and gives the MSE for predicting \(Y\), namely \(\E_{\Dcal, \epsilon}[(Y - \hat{Y})^2]\), which additionally pays the irreducible error \(\sigma^2\).