STA 9890 Research Report #01: Bias and Variance in Linear Regression
Key Dates
- Released to Students: 2025-02-04
- Submission: 2025-03-07 11:45pm ET on Brightspace
Estimated Time to Complete: 9 Hours
In this research report, you will dive deeply into the bias-variance trade-off and the BLUE-ness (or lack thereof) of ordinary least squares.
Project Skeleton
Your report should be divided into four sections, covering the following:
- Theoretical background
- Computation - Gradient Descent and Weight Decay
- Bias and Variance Under Linear Data Generating Processes (DGP)
- Bias and Variance Under Non-Linear DGP
At a minimum, these should include the following elements:
- Theoretical Background
  - Rigorous Statement and Proof of the Bias-Variance Decomposition
    \[\mathbb{E}[\text{MSE}] = \text{Bias}^2 + \text{Variance} (+ \text{Irreducible Noise})\]
    Discuss how this statement is to be interpreted in the context of parameter estimation (when the underlying DGP is linear) and in the context of (possibly misspecified) prediction. Be sure to clarify when the “Irreducible Noise” term is needed.
  - Proof of the “BLUE” property of OLS, with a clear statement of the relevant assumptions. Take care to differentiate which assumptions are required for “B” and which are required for “U”.
- Computation
  - Derivation and implementation of the ‘closed-form’ matrix expression for OLS
  - Derivation and implementation of the ‘closed-form’ matrix expression for Ridge Regression (a minimal sketch of both closed forms appears after this skeleton)
  - Derivation and implementation of a gradient descent method for OLS
  - Derivation and implementation of a ‘weight decay’ modification for OLS gradient descent
  - Empirical demonstration of the equivalence of OLS + Weight Decay with Ridge Regression
- Bias and Variance Under Linear DGP
  - Posit a Linear DGP
  - Using Monte Carlo simulations:
    - Show that OLS is unbiased under this DGP
    - Compute the variance of a given regression coefficient
    - Compute the in- and out-of-sample prediction MSE
    - Show how the variance, in-sample, and out-of-sample MSE change with the sample size \(n\)
  - Compare against ridge regression:
    - Show that RR is biased under this DGP
    - Compare the variance of OLS and RR
    - Demonstrate the MSE Existence Theorem for Ridge Regression
    - Compare the MSE of RR and OLS as a function of the sample size \(n\)
- Bias and Variance Under Non-Linear DGP
  - Posit a Non-Linear DGP
  - Using simulation, determine the “best approximate” linear regression coefficients: that is, the linear coefficients that minimize the (population) prediction MSE
  - Show that the estimated OLS coefficients converge to the “best approximate” coefficients as the sample size \(n\) increases
  - Compute the MSE of RR in simulation and then use 5-fold cross-validation to determine the optimal regularization level
  - Compare the MSE of RR and OLS in the non-linear setting
Finally, note that because we are interested in bias, variance, and MSE – each of which is defined with respect to a (known) DGP – we only use statistical simulation for this project. In Report #02 and Report #03, you will apply methods to real (non-simulated) data.
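For concreteness, here is a minimal NumPy sketch of the two closed-form estimators referenced in the Computation section. It is a starting point only, not a required implementation: it assumes a design matrix `X` of shape `(n, p)` and a response `y` of shape `(n,)`, ignores the intercept, and leaves all derivations to you.

```python
import numpy as np

def ols_closed_form(X, y):
    """OLS: solve the normal equations (X'X) beta = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_closed_form(X, y, lam):
    """Ridge regression: beta = (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative use on a made-up linear DGP (not a prescribed simulation design)
rng = np.random.default_rng(0)
n, p = 200, 5
beta_star = np.ones(p)
X = rng.normal(size=(n, p))
y = X @ beta_star + rng.normal(size=n)
beta_ols = ols_closed_form(X, y)
beta_rr = ridge_closed_form(X, y, lam=1.0)
```

Using `np.linalg.solve` rather than explicitly forming a matrix inverse is both faster and numerically more stable.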
Additional Background
Gradient Descent methods fit models by taking small steps in the direction opposite the gradient of the loss function. (Recall the gradient points in the direction of steepest increase, so moving in the opposite direction leads to a decrease in error.)
Abstractly, given a differentiable loss function, \(\mathcal{L}(\beta)\), gradient descent proceeds as follows:
- Select initial guess \(\beta^{(0)}\) and set \(k=0\)
- Repeat until convergence:
- Compute gradient \(\nabla\mathcal{L}|_{\beta = \beta^{(k)}}\)
- Update guess using a gradient step \[\beta^{(k+1)} = \beta^{(k)} - c \nabla\mathcal{L}|_{\beta = \beta^{(k)}}\]
- Increment \(k := k+1\)
- Check convergence:
- Parameter convergence: if \(\beta^{(k+1)} \approx \beta^{(k)}\), we have converged.
- Objective convergence: if \(\mathcal{L}(\beta^{(k+1)}) \approx \mathcal{L}(\beta^{(k)})\), we have converged
- At convergence, return \(\beta^{(k+1)}\)
This scheme is pretty easy to implement, but there are a few points where it behooves you to be careful (a minimal sketch addressing each of these appears after this list):
- How do we check the approximate equality in the convergence check? We will almost never get exact equality, so you need to pick a norm and a tolerance.
- How do we choose the step-size \(c\)? Much has been written about this; for our purposes it suffices to take a very small value of \(c\) and keep it constant.
- What if we never reach convergence? Should we stop after some (large) number of iterations?
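Putting these pieces together, a minimal sketch of gradient descent for the OLS loss \(\mathcal{L}(\beta) = \|\mathbf{y} - \mathbf{X}\beta\|_2^2\) might look like the following; the step size, tolerance, norm, and iteration cap shown here are illustrative defaults, not recommendations.

```python
import numpy as np

def ols_gradient_descent(X, y, step=1e-4, tol=1e-8, max_iter=100_000):
    """Gradient descent for L(beta) = ||y - X beta||^2.

    Stops on parameter convergence (small update in the Euclidean norm)
    or after max_iter steps, whichever comes first.
    """
    beta = np.zeros(X.shape[1])               # initial guess beta^(0)
    for _ in range(max_iter):
        grad = -2 * X.T @ (y - X @ beta)      # gradient of the OLS loss
        beta_new = beta - step * grad         # gradient step with constant step size
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new                   # parameter convergence
        beta = beta_new
    return beta                               # iteration cap reached without converging
```

In your report you may prefer objective convergence, a relative (rather than absolute) tolerance, or a warning when the iteration cap is hit; these are exactly the design decisions flagged above.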
The method of weight-decay modifies the gradient update as follows:
\[\beta^{(k+1)} = \beta^{(k)} - c \nabla\mathcal{L}|_{\beta = \beta^{(k)}} - \omega \beta^{(k)}\]
Here \(\omega\) is another small constant scalar value. By subtracting off a fraction of \(\beta^{(k)}\) at each step, weight decay prevents the estimate from ever getting to be too big (hence the name). In the case of OLS, this can be shown to be equivalent to a form of ridge regression.1
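As a sanity check on this claim (under the convention that the OLS loss is \(\mathcal{L}(\beta) = \|\mathbf{y} - \mathbf{X}\beta\|_2^2\) and the ridge objective adds a penalty \(\lambda\|\beta\|_2^2\); other scalings change the constants), compare the weight-decay update with a plain gradient step on the ridge objective:

\[\beta^{(k+1)} = \beta^{(k)} - c \nabla\left(\mathcal{L}(\beta) + \lambda\|\beta\|_2^2\right)\Big|_{\beta = \beta^{(k)}} = \beta^{(k)} - c \nabla\mathcal{L}|_{\beta = \beta^{(k)}} - 2c\lambda \beta^{(k)}\]

Matching terms with the weight-decay update identifies \(\omega = 2c\lambda\): weight decay with decay rate \(\omega\) is gradient descent on a ridge objective with \(\lambda = \omega/(2c)\), and the two iterations share the same fixed point. Your report should make this argument, and its assumptions, precise.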
The MSE Existence Theorem for Ridge Regression comes in two forms:
-
Estimation Variant. Let \[\text{MSE}_i(\hat{\beta}) = \mathbb{E}_{\mathbf{X}, \mathbf{y}}[(\hat{\beta_i} - \beta_i^*)^2]\] denote the mean squared estimation error in the \(i^\text{th}\) component of \(\hat{\beta}\) as an estimator of the true regression coefficients \(\beta^*\). Here the expectation is taken over random realizations of both the design matrix \(\mathbf{X}\) and the response vector \(\mathbf{y}\), as well as the estimator \(\hat{\beta}\) which is a function of \(\mathbf{X}, \mathbf{y}\).
Then, there exists some positive value \(\lambda > 0\) such that
\[\text{MSE}_i(\hat{\beta}_{\text{OLS}}) > \text{MSE}_i(\hat{\beta}_{\text{RR}, \lambda})\]
where \(\hat{\beta}_{\text{RR}, \lambda}\) is the ridge regression estimate with regularization parameter \(\lambda\).
-
Prediction Variant. Let \[\text{MSE}(\hat{\beta}) = \mathbb{E}_{\mathbf{X}, \mathbf{y}, \mathbf{y}'}[\|\mathbf{X}\hat{\beta} - \mathbf{y}'\|_2^2]\] denote the mean squared prediction error associated with the estimator \(\hat{\beta}\). Here the expectation is taken over random realizations of both the design matrix \(\mathbf{X}\) and the response vector \(\mathbf{y}\) used for training, as well as the test data vector \(\mathbf{y}'\) which must be drawn from the same DGP.
Then, there exists some positive value \(\lambda > 0\) such that
\[\text{MSE}(\hat{\beta}_{\text{OLS}}) > \text{MSE}(\hat{\beta}_{\text{RR}, \lambda})\]
where \(\hat{\beta}_{\text{RR}, \lambda}\) is the ridge regression estimate with regularization parameter \(\lambda\).
It is important to note that both of these are statements in expectation: OLS may perform better than RR on a single realization, but on average - over many realizations - suitably-tuned RR will perform better.
Also, we note that the MSE Existence Theorem does not give any practical advice on selecting \(\lambda\); we have to fall back on our standard approaches, such as cross-validation, for doing so.
Finally, note that the prediction version of the MSE Existence Theorem only holds when we look at test (out-of-sample) prediction accuracy. OLS is the optimal linear method for in-sample MSE always and forever.
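A minimal Monte Carlo sketch of the estimation variant is shown below; the DGP, sample size, \(\lambda\) grid, and replication count are arbitrary illustrations, not required settings, and the theorem only guarantees that some \(\lambda > 0\) helps, not that every value on this grid will.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_reps = 50, 5, 2000
beta_star = np.full(p, 0.5)
sigma = 2.0                                   # noise standard deviation
lambdas = [0.0, 5.0, 10.0, 20.0, 40.0]        # lambda = 0 recovers OLS

sq_err = {lam: np.zeros(n_reps) for lam in lambdas}
for rep in range(n_reps):
    X = rng.normal(size=(n, p))
    y = X @ beta_star + sigma * rng.normal(size=n)
    for lam in lambdas:
        b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        sq_err[lam][rep] = (b[0] - beta_star[0]) ** 2   # squared error in the first coefficient

for lam in lambdas:
    print(f"lambda = {lam:5.1f}: estimated MSE_1 = {sq_err[lam].mean():.4f}")
```

In this particular (low signal-to-noise) setup, moderate values of \(\lambda\) should show a smaller Monte Carlo estimate of \(\text{MSE}_1\) than \(\lambda = 0\); your report should use its own DGP and account for Monte Carlo error when making the comparison.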
Possible Topic(s) for Additional Research
Recent ML research has focused on the case of overparameterized models: that is, models with more parameters than data points used for training. We have already seen ridge regression as one way to deal with this problem. Another approach is to simply use an iterative method, such as gradient descent, and stop it when the training error reaches zero. (This is a non-unique global minimum for the convex OLS problem.) How does this procedure compare? Does it complicate our understanding of the bias-variance trade-off?
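If you pursue this direction, one possible starting experiment (purely illustrative, not required) is to check numerically that, for the OLS loss with \(p > n\), gradient descent initialized at zero converges to the minimum-\(\ell_2\)-norm interpolating solution, which can be computed directly via the pseudoinverse; the settings below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                                  # overparameterized: more parameters than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                         # arbitrary training responses

beta = np.zeros(p)                             # gradient descent from the zero initialization
step = 1e-3
for _ in range(200_000):
    beta += 2 * step * X.T @ (y - X @ beta)    # gradient step on ||y - X beta||^2

beta_min_norm = np.linalg.pinv(X) @ y          # minimum-norm interpolator
print(np.max(np.abs(X @ beta - y)))            # training error: essentially zero
print(np.max(np.abs(beta - beta_min_norm)))    # matches the minimum-norm solution
```

Comparing the bias and variance of this zero-training-error estimator with ridge regression is one way to begin engaging with the questions above.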
Another recent line of research generalizes the “BLUE” property, showing that OLS is in an appropriate sense “BUE” - it is still optimal even if we allow for (certain types of) non-linear estimators as well. Can you show this?
Submission Instructions
Submit your research report as a PDF of 6-8 pages on Brightspace.2 Your submission should include all essential code, e.g., code used to fit models or to perform simulations, but may omit incidental code, e.g. code used to create figures or import data.
You are required to implement ML methods by hand; use of ‘built-in’ ML functionality is disallowed unless authorized by the instructor. You may use built-in linear algebra subroutines and are not required to, e.g., implement matrix multiplication from scratch. While you must use your own implementation of ML methods, it is smart to compare against existing ‘reference’ implementations.
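For example, assuming scikit-learn is available (any established implementation would do equally well), a quick check of a hand-rolled ridge solver against a reference might look like this:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=100)
lam = 3.0

mine = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)    # hand-rolled closed-form ridge
ref = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_   # reference implementation
print(np.allclose(mine, ref, atol=1e-6))                      # expect True
```

Be aware that penalty scaling conventions differ across libraries (some divide the squared-error loss by \(n\), for instance), so an apparent disagreement may reflect a convention rather than a bug.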
Your report should be in the style of a technical “blog post”, accessible to someone who is just getting into Machine Learning but has solid background in the prerequisite subjects (e.g., you on the day before this class started). Make sure to (rigorously) define and prove all results used. You should also cite relevant authoritative resources, e.g., the recommended textbooks, where appropriate.
Each student must submit an individual and independently written report, but you are encouraged to work together with your peers to understand methods, design simulations, review relevant literature, etc. I strongly recommend having a classmate “red-team” your report before submission to find unclear writing, sub-standard figures and tables, unconvincing simulations, incorrect code, etc. Per the course’s Academic Integrity Policy, you must explicitly acknowledge all collaborators in your submission.
Grading
Your submission will be evaluated holistically and graded out of 100 points. The following rubric will guide this holistic evaluation, but the instructor may deviate as necessary to accurately grade the final submission.
| Report Element | Excellent. “A-” to “A” (90% to 100%) | Great. “B-” to “B+” (80% to 89%) | Average. “C” to “C+” (73% to 79%) | Poor. “D” to “C-” (60% to 72%) | Failure. “F” (0% to 59%) |
|---|---|---|---|---|---|
| Presentation (15%) | Report has excellent formatting, with particularly effective tables and figures. Tables and figures are “publication-quality” and clearly and succinctly support claims made in the body text. | Report has strong formatting; tables and figures make their intended points, but do not do so optimally. | Formatting is average; tables and figures do not clearly support arguments made in the text and/or are not “publication quality”. | Poor formatting distracts from substance of report. Tables and figures exhibit significant deficiencies in formatting. | Formatting prohibits or significantly impairs reader understanding. |
| Project Skeleton (15%) | Report includes all required elements and goes significantly deeper than required by the project skeleton in a manner that provides additional insights. Mathematical definitions and proofs are clearly stated. | Report includes all required elements and dives deeper into the topic, but does not generate additional insights (e.g., additional simulations showing the same phenomena). Mathematical definitions and proofs are correctly stated, but clarity could be improved. | Report includes all required elements. Mathematical definitions and proofs are essentially correct, but difficult to understand and/or contain trivial errors. | Report does not include all required elements, but is still able to capture key points. Mathematical definitions and proofs contain significant, but non-fatal, errors. | Report fails to adequately address the topic. Mathematical definitions and proofs are fundamentally incorrect. |
| Algorithms and Computational Efficiency (20%) | Report implements all methods efficiently with high-quality, well-formatted, and performant code. | Report implements all methods efficiently with acceptable code quality. | Report implements all methods, but does not do so efficiently and/or with substandard code. | Report uses built-in methods instead of implementing methods from scratch. | Code does not appear to run properly / insufficient code submitted. |
| Methodological Aspects and Simulation Design (20%) | Methods are accurately and correctly implemented in a robust / bullet-proof manner, designed to responsibly check for and address possible modes of failure. Simulations clearly and efficiently support all claims. | Methods are accurately and correctly implemented, but are not robust to failure. Simulations clearly and efficiently support all claims. | Methods are implemented with minor harmless errors and/or poor validation. Simulations do not fully support claims. | Methods are implemented with significant errors leading to incorrect results. Simulations do not give sufficient support for key claims. | Methods are not correctly implemented. Simulations do not support claims. |
| Communication and Writing (20%) | Report exhibits excellent written communication, making all points exceptionally clearly and at the level of a quality academic publication. Code, results, and text are skillfully woven together. | Report exhibits great written communication; all points are made clearly without notable grammatical errors. Code, results, and text are neatly tied together. | Report exhibits solid written communication; key points are made understandably and any grammatical errors do not impair understanding. Code, results, and text could be better integrated, but it is clear which elements relate. | Written communication is below standard: points are not always understandable and/or grammatical errors actively distract from content. Code, results, and text are not actively integrated, but are generally located ‘near’ each other in a semi-systematic fashion. | Written communication is far below standard, possibly bordering on unintelligible. Large blocks of code are shown without any relevant results or text. |
| Contextualization (10%) | Report draws on relevant, modern research literature to give context to its findings and to broaden its scope. | Report draws on standard textbooks and/or classical research literature to give context to its findings and to broaden its scope. | Report cites textbooks to give context, but does not take a ‘big picture’ view. | Report gives appropriate context, but lacks citations. | Report gives limited, insufficient context. |
This work ©2025 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.
Footnotes
A somewhat remarkable finding of much recent research is that many ‘new’ regularization methods pioneered for fitting neural networks have essentially the same effect as ridge (\(\ell_2^2\) or Tikhonov) regularization: weight decay, drop out, early stopping, mini-batching and mini-patching. Sometimes it’s hard to beat the classics.↩︎
If you wish to use an alternate submission format to allow interactive and/or animated elements, please seek pre-approval from the instructor.↩︎