STA 9890 - Research Reports

In lieu of traditional homework, STA 9890 has three “research reports” (one for each unit of the course). These research reports are intended to help you develop your skills in the computational and methodological aspects of Statistical Machine Learning.

Each Research Report must be submitted as a fully typed PDF via the course Brightspace page. (No handwritten work will be graded.) Each report must include all code used and should include several figures. Reports should be 6-8 pages, double-spaced.

Research Reports

Research Report #01: Bias and Variance in Linear Regression

Due Dates:

  • Released to Students: 2025-02-04
  • Submission Deadline: 2025-03-07 11:45pm ET

In Research Report #01, you will dig into the oft-cited claim that Ordinary Least Squares is the Best Linear Unbiased Estimator (BLUE). In classical statistics, the BLUE property is often used as an argument for optimality, implying that we can’t beat OLS, so we shouldn’t even try. As you will see, this optimality is quite overstated: OLS can be beaten easily whenever its assumptions are violated, whenever non-linear estimators are allowed, or whenever bias is permitted (taking “Best” to mean “minimum MSE” rather than “minimum variance”).
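To make the last point concrete, a short simulation (an illustration only, not part of the assignment specification) can show a biased estimator beating OLS in MSE. Below, ridge regression serves as the biased competitor, and the design matrix is deliberately made collinear to inflate OLS’s variance; the dimensions, correlation level, and penalty value are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 10, 2.0
beta = np.ones(p)

# Strongly correlated design: makes X'X ill-conditioned, inflating OLS variance
Sigma = 0.9 * np.ones((p, p)) + 0.1 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def estimate(lam):
    """Ridge estimator for one simulated response (lam=0 gives OLS)."""
    y = X @ beta + sigma * rng.standard_normal(n)
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def mse(lam, reps=2000):
    """Monte Carlo estimates of squared bias, variance, and their sum (MSE)."""
    ests = np.array([estimate(lam) for _ in range(reps)])
    bias2 = np.sum((ests.mean(axis=0) - beta) ** 2)
    var = np.sum(ests.var(axis=0))
    return bias2, var, bias2 + var

for lam in [0.0, 5.0]:
    b2, v, m = mse(lam)
    print(f"lambda={lam}: bias^2={b2:.3f}  var={v:.3f}  MSE={m:.3f}")
```

With strongly correlated predictors, the variance reduction from shrinkage far outweighs the small squared bias it introduces, so the ridge estimator attains lower total MSE than the unbiased OLS estimator.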

These findings may seem a bit abstract, but they get at the heart of almost every method and principle we will cover in this course. In this project, in addition to getting a better understanding of what BLUE does and does not mean, you will learn to:

  1. implement gradient descent methods
  2. design Monte Carlo simulations to assess bias and variance
  3. find optimal values of tuning parameters using cross-validation
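As a preview of item 1, a minimal gradient descent for the least-squares objective might look like the sketch below. The fixed step size (the inverse Lipschitz constant of the gradient) and iteration count are illustrative choices, not assignment requirements:

```python
import numpy as np

def ols_gradient_descent(X, y, step=None, n_iter=5000):
    """Minimize (1/2n) * ||y - X b||^2 by gradient descent."""
    n, p = X.shape
    if step is None:
        # step = 1/L, where L = lambda_max(X'X)/n is the gradient's
        # Lipschitz constant; this guarantees convergence
        step = n / np.linalg.eigvalsh(X.T @ X).max()
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b -= step * grad
    return b

# Sanity check against the closed-form OLS solution on toy data
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(100)
b_gd = ols_gradient_descent(X, y)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(b_gd - b_ols)))  # should be very small
```

Comparing the iterative solution to the closed-form one, as above, is a useful habit whenever both are available.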

Research Report #02: Ensemble Learning Techniques for Fair Classification

Due Dates:

  • Released to Students: 2025-03-11
  • Submission Deadline: 2025-04-18 11:45pm ET

In Research Report #02, you will apply some of the tools we have developed to the problem of fairness in machine learning (FairML). While not a core topic for this course, this exercise is a useful way to see how its core ideas (regularization, optimization, etc.) can be applied to interesting and novel questions. In this project, you will also engage critically with a newly proposed ML method and investigate i) whether it truly does what it claims to do; ii) whether it can be efficiently and reliably implemented; and iii) the degree to which (if any!) it solves your problem of interest. As working Data Scientists and Business Analysts, you may not think of yourselves as researchers, but knowing how to read and critically evaluate cutting-edge work will let you maintain and enhance your skills throughout your career.

Research Report #03: Sparse Principal Components Analysis

Due Dates:

  • Released to Students: 2025-04-22
  • Submission Deadline: 2025-05-09 11:45pm ET

In Research Report #03, you will explore sparse PCA and apply it to a data set of interest. As you do so, you will see how a modern machine learning principle (sparsity) can be used to improve a classical statistical technique like PCA, getting ‘the best of both worlds.’ Because our focus here is on an unsupervised method, your report should carefully consider the interpretation and validation of the resulting PCs, since standard validation techniques for supervised methods do not apply.
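As a rough preview, one simple formulation of sparse PCA is a power method with soft-thresholding, which drives small loadings exactly to zero. The sketch below is illustrative only: the penalty level and toy data are arbitrary choices, and full sparse PCA methods also handle multiple components via deflation:

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink entries of v toward zero by t, setting small ones exactly to 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_pc1(X, penalty=0.1, n_iter=200):
    """First sparse principal component via a soft-thresholded power method
    (one of several formulations of sparse PCA)."""
    S = X.T @ X / len(X)             # sample covariance (X assumed centered)
    v = np.linalg.eigh(S)[1][:, -1]  # initialize at the ordinary first PC
    for _ in range(n_iter):
        v = soft_threshold(S @ v, penalty)
        norm = np.linalg.norm(v)
        if norm == 0:
            break
        v /= norm
    return v

# Toy data: only the first 3 of 10 variables carry the shared signal
rng = np.random.default_rng(2)
z = rng.standard_normal((500, 1))
X = 0.1 * rng.standard_normal((500, 10))
X[:, :3] += z * np.array([1.0, 0.9, 0.8])
X -= X.mean(axis=0)

v = sparse_pc1(X, penalty=0.1)
print(np.round(v, 2))  # loadings on the 7 noise variables shrink to ~0
```

Unlike ordinary PCA, whose loadings are generically all nonzero, the sparse component has exact zeros on the noise variables, which is what makes it easier to interpret.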