STA 9890 - Non-Linear Regression

This week, we continue our discussion of ‘fancy’ regression from last week to consider non-linear models. While we introduce several methods for non-linear regression, we also pause to consider when non-linear methods can actually be expected to outperform linear methods.

Also note that we are arguably being a bit ‘sloppy’ about what we mean by ‘linear’ methods. For purposes of these notes, we define a ‘linear’ method as one where the fitted model has a linear response to changes in the inputs; formally, one where

$$\frac{\partial y}{\partial x_i} = \beta_i \quad \text{(a constant) for all } i \in \{1, \ldots, p\}$$

This definition includes things like best-subsets and lasso regression, which are not linear functions of $y$, but excludes things like polynomial regression. In the above definition, each $\beta_i$ is essentially a regression coefficient, and values of $i$ where $\beta_i = 0$ are non-selected variables if we are using a sparse method.

Non-Linearity via Linear Combination of Non-Linear Parts

We have spent the previous few weeks developing a set of tools for building linear models of $y$. Specifically, we have sought ways to approximate $\mathbb{E}[y|x]$ as a weighted sum of each $x_i$:

$$\mathbb{E}[y|x] \approx \langle x, \hat{\beta} \rangle = \sum_{i=1}^p x_i \hat{\beta}_i$$

For situations where the relationship is nearly (or even truly) linear, this approximation can be quite useful, and good estimates of the $\beta_i$ coefficients can lead to accurate estimates of $\mathbb{E}[y|x]$ and, ultimately, accurate prediction of the test-set response $y$. Using the bias-variance decomposition, we have seen that it is often worthwhile to accept a bit of bias in the estimation of each $\beta_i$ if it is compensated by a worthwhile reduction in the variance of $\hat{\beta}_i$.

We can further refine this decomposition by splitting the bias term a bit further:

$$\text{MSE} = \text{Irreducible Error} + \text{Variance} + \text{Model Bias}^2 + \text{Estimation Bias}^2$$

Here, we have decomposed $\text{Bias}^2$ into two terms:

  • Model Bias$^2$: a measure of the systematic error arising from use of a linear model to predict a non-linear DGP
  • Estimation Bias$^2$: a measure of the systematic error arising from use of a regularized estimation procedure which exhibits shrinkage

Put another way, “Model Bias” results from use of a linear model when something non-linear should be used, while “Estimation Bias” arises from the use of a biased method (like ridge regression) instead of something nominally unbiased like OLS. For the previous two weeks, we have mainly focused on linear models for linear DGPs, so model bias has been zero. If we expand our gaze to non-linear DGPs, we have to deal with model bias a bit more directly. As with estimation bias, it is frequently worthwhile to accept model bias to reduce variance; see more discussion below.

Polynomial Expansion

You have likely already seen polynomial expansion (or polynomial regression) in previous courses. Essentially, PE fits a low(-ish)-order polynomial instead of a line to data. More formally, let $f(x) \triangleq \mathbb{E}[y|x]$ be the regression function (best possible predictor) that we seek to approximate from noisy observations. Standard linear regression essentially seeks to fit a first-order Taylor approximation of $f$ around some point $x_0$:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) = f(x_0) + \sum_{i=1}^p \frac{\partial f}{\partial x_i}\bigg|_{x = x_0} (x_i - x_{0,i})$$

If we rearrange this a bit, we get

$$\begin{aligned}
f(x) &\approx f(x_0) + \sum_{i=1}^p \frac{\partial f}{\partial x_i}\bigg|_{x = x_0} (x_i - x_{0,i}) \\
&= \underbrace{\left(f(x_0) - \sum_{i=1}^p \frac{\partial f}{\partial x_i}\bigg|_{x = x_0} x_{0,i}\right)}_{=\,\beta_0} + \sum_{i=1}^p \underbrace{\frac{\partial f}{\partial x_i}\bigg|_{x = x_0}}_{=\,\beta_i} x_i \\
&= \beta_0 + \sum_{i=1}^p x_i \beta_i
\end{aligned}$$

so we see that the regression coefficients are more-or-less the (suitably-averaged) partial derivatives of $f$, while the intercept is the value of $f$ at the center of our expansion ($f(x_0)$) plus some differential adjustments. Note that, if the variables $X$ are centered so that $\mathbb{E}[X] = 0$, the intercept is exactly the average value of $f$, as we would expect.
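To make this concrete, here is a tiny numerical sketch (with $f(x) = \sin(x)$ and expansion point $x_0 = 1$ chosen arbitrarily): for data concentrated near $x_0$, the fitted OLS slope is close to $f'(x_0)$.

x0 <- 1
xs <- x0 + runif(500, -0.05, 0.05)  # inputs tightly clustered around x0
ys <- sin(xs)                       # noiseless responses from f(x) = sin(x)
c(slope = unname(coef(lm(ys ~ xs))[2]), derivative = cos(x0))  # should nearly agree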

Clearly, this linear approximation will do better in the situations where Taylor expansions generally perform better: when i) the derivative of $f$ is roughly constant and ii) the higher order terms are small. This makes sense: if $f$ is very-close-to-linear, there is minimal loss from fitting a linear approximation to $f$; if $f$ is very non-linear, however, a linear approximation can only be so good, and we are saddled with significant model bias.

If we want to mitigate some of that model bias, we can choose to use a higher-order Taylor series. If we examine a second-order Taylor series, we get a similar approximation as before, where $H_f$ denotes the Hessian matrix of second derivatives of $f$:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \frac{1}{2}(x - x_0)^\top H_f(x_0) (x - x_0)$$

After some rearrangement this becomes:

$$f(x) \approx \beta_0 + \sum_{i=1}^p \frac{\partial f}{\partial x_i} x_i + \sum_{i=1}^p \frac{\partial^2 f}{\partial x_i^2} x_i^2 + \sum_{\substack{i, j = 1 \\ i \neq j}}^p \frac{\partial^2 f}{\partial x_i \partial x_j} x_i x_j$$

We recognize the constant term (intercept) and first order terms (linear slopes) from before, but now we have added two types:

  • Second order terms ($x_i^2$): these capture higher order non-linearities in a single variable. For example, if $y = x^2$, then the ‘true fit’ really lies in the second order term $\partial^2 f/\partial x^2 = 2$ instead of any linear approximation
  • Cross terms ($x_i x_j$): these capture non-linear (non-additive) multi-variate relationships between features. You may have seen these before as interaction terms.

Rules of thumb vary as to which set of terms is more important. Historically, statisticians have tended to put in the interaction (cross) terms first, though this is far from universal practice. In particular, when dealing with ‘binary’ variables, the higher order term is unhelpful: if a feature is 0/1 (did the patient receive the drug or not?), adding a squared term has no effect since $0^2 = 0$ and $1^2 = 1$.
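In R’s formula language, these terms are easy to add by hand: I(x1^2) adds a squared term and x1:x2 adds a cross (interaction) term (x1*x2 is shorthand for x1 + x2 + x1:x2). A small sketch with made-up data:

df_demo <- data.frame(x1 = rnorm(50), x2 = rbinom(50, 1, 0.5))
df_demo$y <- 1 + df_demo$x1 + 2 * df_demo$x1 * df_demo$x2 + rnorm(50, sd=0.1)

# Second order terms via I() and cross (interaction) terms via `:`
coef(lm(y ~ x1 + x2 + I(x1^2) + x1:x2, data=df_demo))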

Regardless, this is why we sometimes use ‘polynomial expansion’ of our original variables. We are trying to capture the higher-order (non-linear) terms of the Taylor approximation.

It is worth thinking about what happens when we add higher-order terms to our model. In particular, note that we i) pick up some correlation among features; and ii) will generally have more variance since we have more features. In practice, the correlation isn’t too much of a problem, and the actual polynomials fit by, e.g., poly in R are modified to remove correlation. The variance, however, can be a more significant problem.
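A quick check of that correlation claim (a sketch; x_demo is just a stand-in sample):

x_demo <- runif(100, 2, 10)

round(cor(cbind(x_demo, x_demo^2, x_demo^3)), 2)  # raw monomials: strongly correlated
round(cor(poly(x_demo, 3)), 2)                    # poly()'s orthogonal basis: essentially uncorrelated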

Let’s see the variance issue in action. Let’s first consider fitting polynomial regression to a simple (univariate) non-linear function: $f(x) = \sqrt{|x|^3 + 5}\,\cos(x)$

curve(sqrt(abs(x)^3 + 5) * cos(x), from=1, to=10)

This is not too non-linear. It’s still very smooth (analytic in the jargon) and should be relatively easy to fit. Let’s also generate some noisy observations of this function.

n <- 25
# Linear regression doesn't care about the order of our data, but plotting does
# so sorting gives better visuals
x <- sort(runif(n, 2, 10)) 
Ey <- sqrt(abs(x)^3 + 5) * cos(x)
y <- Ey + rnorm(n, sd=3)


x_grid <- seq(min(x), max(x), length.out=501)
Ey_grid <- sqrt(abs(x_grid)^3 + 5) * cos(x_grid)

plot(x, y)
lines(x_grid, Ey_grid)

We can now fit some linear models to this data (shown in color): we use R’s built-in poly function to automatically create the polynomials we seek. You can think of poly(x, k) as creating $k$ features $x^1, x^2, \ldots, x^k$, but it actually does something a bit more subtle under the hood to make model fitting a bit more numerically stable.

plot(x, y)
lines(x_grid, Ey_grid)

for(k in 1:5){
    m <- lm(y ~ poly(x, k))
    
    yhat <- predict(m, data.frame(x = x_grid))
    lines(x_grid, yhat, col=k+1)
}

legend("bottomleft",
    c("True Regression Function", 
         "First Order (Linear) Fit",
         "Second Order (Quadratic) Fit", 
         "Third Order (Cubic) Fit", 
         "Fourth Order (Quartic) Fit",
         "Fifth Order (Quintic) Fit"), 
       col=1:6, lty=1)

Clearly, as we increase the order of the polynomial, we get a better (in-sample) fit. This makes sense: we know that more features give us a better fit, all else being equal, so what’s the issue?
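Before answering, a quick check that the training error does indeed keep falling as the degree increases:

# In-sample MSE for polynomial fits of degree 1 through 5
sapply(1:5, function(k) mean(resid(lm(y ~ poly(x, k)))^2))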

As usual, the issue occurs for extrapolation. You can actually see some issues already beginning to manifest at the edges of our data range: the higher-order polynomials go in radically different directions as we extrapolate, and even the ‘best fit’ (quintic) looks like it’s going to be too steep as we go further.

Let’s expand the prediction region for these fits:

x_pred <- seq(-5, 15, length.out=501)
Ey_pred <- sqrt(abs(x_pred)^3 + 5) * cos(x_pred)
plot(x_pred, Ey_pred, type="l")
points(x, y)

for(k in 1:5){
    m <- lm(y ~ poly(x, k))
    yhat <- predict(m, data.frame(x=x_pred))
    lines(x_pred, yhat, col=k+1)
}

legend("bottomleft",
    c("True Regression Function", 
         "First Order (Linear) Fit",
         "Second Order (Quadratic) Fit", 
         "Third Order (Cubic) Fit", 
         "Fourth Order (Quartic) Fit",
         "Fifth Order (Quintic) Fit"), 
       col=1:6, lty=1)

There are some problems!

To be fair, it is a bit unreasonable to expect these models to perform too well outside of our original sampling area. What’s more worrying is that the extrapolations don’t just miss the curves of the true regression function, they actually create their own even more intense wiggles.

We can also manifest this without extrapolation by fitting very high-order polynomials:

plot(x, y)
lines(x_grid, Ey_grid)

for(k in 1:5){
    m <- lm(y ~ poly(x, degree=14 + k, raw=TRUE))
    yhat <- predict(m, data.frame(x = x_grid))
    lines(x_grid, yhat, col=k+1)
}

Note that we have to use raw=TRUE here to avoid R complaining about having too high a degree in the polynomial. (R gives an error about “unique points” but the real issue is one about matrix rank, not ties in x.) I’m also suppressing some warnings here about a sub-standard fit.

So how can we address this ‘over-wiggliness’?

As we discussed last time, we need to use a different set of functions: one that is smooth (like polynomials), but not too high-order. It would also be really nice if we could get something ‘adaptive’ - allowing for more wiggliness where we have more data (and we need more wiggliness) and less wiggliness where we don’t have enough data (falling back to linearity).

Local Linear Models

One way to do this is the idea of “local linear (polynomial) models.” Instead of fitting a single (global) line, we can fit different lines in different parts of our data: that way, we can get an upward line when the true (non-linear) relationship is increasing and a downward line when the true relationship is decreasing. Or at least that’s the hope!

plot(x, y)
lines(x_grid, Ey_grid)

# Divide our data into five 'buckets' and fit sub-models
# This works because we've already sorted our data
# (What would happen if we hadn't?)
for(k in 1:5){
    xk <- x[(5 * k - 4):(5*k)]
    yk <- y[(5 * k - 4):(5*k)]
    mk <- lm(yk ~ xk)
    y_hat <- predict(mk)
    lines(xk, y_hat, col="red4")
}

Definitely some rough edges here - particularly in the areas between our buckets, but we’re on a good path. The locfit package will help us with the details:

library(locfit)
locfit 1.5-9.11      2025-01-27
m <- (locfit(y ~ lp(x)))
summary(m)
Estimation type: Local Regression 

Call:
locfit(formula = y ~ lp(x))

Number of data points:  25 
Independent variables:  x 
Evaluation structure: Rectangular Tree 
Number of evaluation points:  7 
Degree of fit:  2 
Fitted Degrees of Freedom:  5.194 
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")

Not perfect, but pretty nice compared to what we had before.

You can also see a discussion of the “degrees of freedom” in the model output. This is not exactly the DoF you learned in earlier classes, but it’s sort of “morally equivalent.” This local fit is about as flexible as a 4th degree polynomial would be for this problem. Even though this model is made out of quadratics, it’s more flexible than a single quadratic. But we avoid the extreme variability associated with a polynomial of that high order. Win-win!

We can tweak some parameters of the local fit to get different responses: the big ones are the degree (deg) and the number of neighbors to use.

?lp
m <- (locfit(y ~ lp(x, deg=4)))
summary(m)
Estimation type: Local Regression 

Call:
locfit(formula = y ~ lp(x, deg = 4))

Number of data points:  25 
Independent variables:  x 
Evaluation structure: Rectangular Tree 
Number of evaluation points:  7 
Degree of fit:  4 
Fitted Degrees of Freedom:  6.986 
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")

Pretty nice. If we ‘turn down’ the number of neighbors used:

m <- (locfit(y ~ lp(x, nn=0.2, deg=4)))
Warning in lfproc(x, y, weights = weights, cens = cens, base = base, geth =
geth, : Estimated rdf < 1.0; not estimating variance
summary(m)
Estimation type: Local Regression 

Call:
locfit(formula = y ~ lp(x, nn = 0.2, deg = 4))

Number of data points:  25 
Independent variables:  x 
Evaluation structure: Rectangular Tree 
Number of evaluation points:  27 
Degree of fit:  4 
Fitted Degrees of Freedom:  24.257 
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")

Too far - super crazy! But not entirely unexpected - we know that K-Nearest Neighbors for small K has a huge variance.


The loess function in base R does this very nicely as well without requiring additional packages. For practical work, it’s a very nice tool for univariate modeling.

m <- loess(y ~ x)
summary(m)
Call:
loess(formula = y ~ x)

Number of Observations: 25 
Equivalent Number of Parameters: 4.64 
Residual Standard Error: 4.098 
Trace of smoother matrix: 5.1  (exact)

Control settings:
  span     :  0.75 
  degree   :  2 
  family   :  gaussian
  surface  :  interpolate     cell = 0.2
  normalize:  TRUE
 parametric:  FALSE
drop.square:  FALSE 
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")

Spline Regression (Additive Models)

Splines

We can generalize this idea a bit using splines. Splines are functions that solve a penalized approximation problem:

$$\hat{f} = \operatorname*{arg\,min}_f \frac{1}{2n}\left\|y - f(x)\right\|_2^2 + \lambda \int \left|f''(x)\right|^2\,dx$$

Here we are saying we’ll take any function (not just a polynomial) that achieves the optimal trade-off between data fit (first / loss term) and not being too rough (second / penalty term).

Why is this the right penalty to use? In essence, we are trying to penalize ‘deviation from linearity’ and, since linear functions have second derivative 0 (by definition), the integral of the squared second derivative gives us a measure of non-linearity.

In some remarkable work, Grace Wahba and co-authors showed that the solutions to that optimization problem are piecewise polynomials with a few additional constraints - these functions are splines. In addition to being piecewise polynomials, splines also guarantee:

  • continuity of the function and its derivatives
  • linearity outside the range of data fit (‘natural splines’)

Because splines not only match the value of the function at the knots (places where the ‘pieces’ match up) but also the derivatives, they are very smooth indeed.
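Base R’s smooth.spline solves (a version of) exactly this penalized problem, with the penalty weight chosen automatically by generalized cross-validation. A quick sketch on our running example:

# Fit a smoothing spline to the noisy sample; the penalty is chosen by GCV by default
ss <- smooth.spline(x, y)

plot(x, y)
lines(x_grid, Ey_grid)
lines(predict(ss, x_grid), col="red4")  # predict() returns a list with x and y components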

Stepping back, splines are just a different sort of feature engineering. Instead of using polynomial basis functions ($x^1, x^2, \ldots, x^k$), splines use a much smoother basis and hence give smoother results, avoiding the ‘wiggles’ problem we saw above.

R provides the basic tools for spline fitting in the splines package. For real work, you almost surely want to use the advanced functionality of the mgcv package. For more on splines and models using them (‘additive models’), see this online course.

We can see that individual splines are quite nice little functions and usually are only non-zero on small regions:

library(splines)

nsx <- ns(x_grid, df=5) # The natural spline basis
plot(x_grid, apply(nsx, 1, max), ylim=range(nsx), type="n")

for(i in 1:5){
    lines(x_grid, nsx[,i], col=i)
}

We can use linear combinations of these ‘bumps’ to fit non-linear functions, including our example from above:

m <- lm(y ~ ns(x, df=5))
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")

(This basically matches what we had before with DoF $\approx 4.7$, but you can get different answers by messing with the df parameter here.)

Our predicted response is actually the sum of the individual spline effects:

m <- lm(y ~ ns(x, df=5))
y_hat <- predict(m, data.frame(x = x_grid))

plot(x, y, lwd=2)
lines(x_grid, Ey_grid)
for(i in 1:5){
    lines(x_grid, ns(x_grid, df=5)[,i] * coef(m)[-1][i], col=i+1)
}

If we were to sum up the colored lines (and add back the intercept), we would recover the fitted curve from before, which closely tracks the black line.
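We can verify this numerically (a sketch): build the basis on the training x so the knots match what lm() used, evaluate it on the grid, and add up the pieces.

basis <- ns(x, df=5)                    # training basis (this fixes the knots)
basis_grid <- predict(basis, x_grid)    # the same basis evaluated on the plotting grid

manual_fit <- coef(m)[1] + basis_grid %*% coef(m)[-1]
max(abs(manual_fit - predict(m, data.frame(x = x_grid))))  # essentially zero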

We can also see that natural splines give smooth predictions outside the range of the original data:

y_pred <- predict(m, data.frame(x = x_pred))

plot(x_pred, Ey_pred, type="l")
lines(x_pred, y_pred, col="red4")

Admittedly, not ideal, but it will do. In particular, note that outside of the ‘data region’, we fall back on a nice comfortable linear extrapolation.

(You can repeat this process with the slightly more common b-splines, but the differences aren’t huge.)

m <- lm(y ~ bs(x, df=5))
y_hat <- predict(m, data.frame(x = x_grid))

plot(x, y, xlim=range(x_grid))
lines(x_grid, Ey_grid)
for(i in 1:5){
    lines(x_grid, bs(x_grid, df=5)[,i] * coef(m)[-1][i], col=i+1)
}

y_pred <- predict(m, data.frame(x = x_grid))

lines(x_grid, y_pred, col="red4", lwd=2)

For a ‘full-strength’ version of spline fitting, you can use the mgcv package:

library(mgcv)
Loading required package: nlme
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
m <- gam(y ~ s(x))

summary(m)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -4.2852     0.5974  -7.173  1.1e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
       edf Ref.df     F p-value    
s(x) 5.954  7.038 70.49  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.954   Deviance explained = 96.5%
GCV = 12.359  Scale est. = 8.9212    n = 25
m <- gam(y ~ s(x))
y_hat <- predict(m, data.frame(x = x_grid))

plot(x, y, xlim=range(x_grid))
lines(x_grid, Ey_grid)
lines(x_grid, y_hat, col="red4", lwd=2)

Run ?s to see the many features of the mgcv package. mgcv also includes several specialized splines (e.g. periodic) which you may find useful in some problems.

Splines are not normally considered a ‘machine learning’ tool, but they are incredibly useful in applied statistics work. (The two methods you will rely on most from this class are spline regression and random forests.)

$\ell_1$-Filtering

Kernel Methods

Manual feature expansion

We can think of feature expansion as mapping our data to a higher-dimensional space, $\Phi: \mathbb{R}^p \to \mathbb{R}^P$, and fitting a linear model there. As we have seen above, splines and their kin provide a useful way of constructing the mapping $\Phi$, but we are still constrained to work with a fairly restricted type of problem.

Can we generalize this formally to more interesting maps $\Phi(\cdot)$? Yes - via kernel methods!

Ridge without coefficients

Before introducing kernel methods formally, let’s look back to ridge regression. We showed that the ridge solution is given by

$$\hat{\beta}_{\text{Ridge}} = (X^\top X + \lambda I)^{-1}(X^\top y)$$

Some nifty linear algebra will let us rewrite this as

$$\hat{\beta}_{\text{Ridge}} = \lambda^{-1} X^\top \left(X X^\top / \lambda + I\right)^{-1} y = X^\top \left(X X^\top + \lambda I\right)^{-1} y$$

You can prove this using the Woodbury Matrix Identity or we can just check it for a single data set and trust that it generalizes:

n <- 100
p <- 5 

X <- matrix(rnorm(n * p), nrow=n, ncol=p) 

Ey <- X[,1] + sqrt(X[,2]^2 + 5) + 0.1 * X[,3]^3 + cos(abs(X[,4])) + 1/(abs(X[,5]) + 3)
y <- Ey + rnorm(n, sd=sqrt(0.25))

lambda <- 2
eye <- function(p) diag(1, p, p)
beta_hat_ridge <- solve(crossprod(X) + lambda * eye(p), crossprod(X, y))

beta_hat_ridge_alt <- t(X) %*% solve(tcrossprod(X) + lambda * eye(n)) %*% y

cbind(beta_hat_ridge, beta_hat_ridge_alt)
           [,1]       [,2]
[1,] 1.42746826 1.42746826
[2,] 0.15151370 0.15151370
[3,] 0.57121814 0.57121814
[4,] 0.04425193 0.04425193
[5,] 0.58742212 0.58742212

So they are the same! But why did we do this?

This form is sometimes a bit more computationally efficient, since we invert an $n \times n$ matrix instead of a $p \times p$ one (a savings when $n < p$), but there are faster methods if we’re really interested in speed.

Note that, if we want to make a ridge regression prediction now for a new $\tilde{x}$, we just need to compute:

$$\hat{\beta}_{\text{Ridge}}^\top \tilde{x} = \left[X^\top (X X^\top + \lambda I)^{-1} y\right]^\top \tilde{x} = \left[y^\top (X X^\top + \lambda I)^{-\top} X\right] \tilde{x} = y^\top (X X^\top + \lambda I)^{-1} X \tilde{x}$$

Intuitively, recall that the inner product of two vectors measures the angle between them and can, up to some scaling, be used as a ‘similarity’ measure. This result essentially says that our prediction on a new data set is a weighted average of the training data, with weights based on similarity of the new point with the training data. This is not a crazy structure…

If we look closer, we see that we only need to compute inner products of the form $x_1^\top x_2$ to make this work. This is where the magic of kernels lies.
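We can check this prediction identity numerically (a sketch; x_new is an arbitrary new input):

x_new <- rnorm(p)

pred_from_beta  <- sum(beta_hat_ridge * x_new)
pred_from_inner <- drop(crossprod(y, solve(tcrossprod(X) + lambda * eye(n), X %*% x_new)))

c(pred_from_beta, pred_from_inner)  # identical predictions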

Kernel Trick

Earlier, we considered non-linear regression by feature expansion, where we replaced $X$ by $\Phi(X)$, where $\Phi: \mathbb{R}^p \to \mathbb{R}^P$ maps to a higher-dimensional space (applied row-wise). We can use this mapping in our new ridge formula to get our non-linear predictions as

$$\hat{y}(\tilde{x}) = y^\top \left(\Phi(X)\Phi(X)^\top + \lambda I\right)^{-1} \Phi(X) \Phi(\tilde{x})$$

Longer, but not necessarily better. If $P$ is very large, computing with $\Phi(X)$ can be incredibly expensive.

It turns out, however, that we never need to actually form $\Phi(X)$; we only need $\kappa(x_1, x_2) = \Phi(x_1)^\top \Phi(x_2)$. If we can compute $\kappa$ directly, we never need to work with $\Phi$.

Functions that allow this are called (Mercer) kernels and they are just a little bit magical.

If we let $K = \Phi(X)\Phi(X)^\top$ be defined by $K_{ij} = \kappa(x_i, x_j)$, then we can write our feature-augmented ridge regression as:

$$\hat{y}(\tilde{x}) = y^\top (K + \lambda I)^{-1} \kappa(X, \tilde{x})$$

If we break this apart, we see that our predictions at the new point $\tilde{x}$ are essentially just weighted averages of our original observations $y$, weighted by the similarity between the new point and our training points, $\kappa(X, \cdot)$. This intuition is super important and we’ll dig into it further below.

Let’s try out Kernel(ized) Ridge Regression:

library(kernlab)

## Define a kernel function - we'll spell this out below.
rbf <- rbfdot(sigma = 0.05)

## calculate kernel matrix
K <- kernelMatrix(rbf, X)
dim(K)
[1] 100 100

Note that this is an $n \times n$ matrix - not a $p \times p$ matrix!

We can now use this to make predictions:

krr <- function(x, lambda=1) {
    crossprod(y, solve(K + lambda * eye(n), kernelMatrix(rbf, X, x)))
}

krr_MSE <- mean((y - krr(X))^2)
print(krr_MSE)
[1] 0.4916979

This is better than the OLS MSE:

mean(resid(lm(y ~ X))^2)
[1] 0.5484549

Why? How is this not a contradiction of everything we said before?

But how do we actually compute these kernel products?

There are many kernel functions in the world. The most common are defined by:

  • $\kappa(x_1, x_2) = x_1^\top x_2$. This is the linear kernel, equivalent to non-kernel methods
  • $\kappa(x_1, x_2) = (x_1^\top x_2 + c)^d$. This is the polynomial kernel, equivalent to fitting polynomial regression up to degree $d$ with all cross products. (The role of $c$ is tricky, but it essentially controls the relative weight of the higher and lower order terms)
  • $\kappa(x_1, x_2) = e^{-\sigma^2 \|x_1 - x_2\|^2}$. This is the squared exponential or radial basis function (RBF) kernel; it is very popular in spatial statistics, where it is closely related to a method known as kriging.

While the first two kernels give us a natural $\Phi(\cdot)$ mapping, the $\Phi(\cdot)$ for the RBF kernel is actually infinite-dimensional: it lets us fit function classes we could not fit without the kernel trick.
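As a quick sanity check (a sketch): kernlab’s rbfdot parametrizes the RBF kernel as $e^{-\sigma\|x_1 - x_2\|^2}$ (note $\sigma$ rather than $\sigma^2$ in its exponent), and we can reproduce it by hand:

sigma <- 0.05
rbf   <- rbfdot(sigma = sigma)
x1 <- X[1, ]; x2 <- X[2, ]

c(kernlab = rbf(x1, x2), by_hand = exp(-sigma * sum((x1 - x2)^2)))  # should match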

For even more kernels, see:

Much of our current understanding of deep learning is based on specialized kernel methods:

This blog is a particularly nice (but still somewhat technical) introduction to the Neural Tangent Kernel and its relationship to neural networks.

Spline theory has also found use to explain ReLU (piecewise linear activation) neural networks:

We can also kernelize some of the other methods we’ve studied in this course.

Notably, we can kernelize K-Nearest Neighbors to get “KKNN” if we modify the definition of distance used to be “kernel-ish.” We first recall that we can write distances solely in terms of inner products:

$$\begin{aligned}
\|x_1 - x_2\|_2^2 &= (x_1 - x_2)^\top (x_1 - x_2) \\
&= x_1^\top x_1 - x_2^\top x_1 - x_1^\top x_2 + x_2^\top x_2 \\
&= x_1^\top x_1 - 2 x_1^\top x_2 + x_2^\top x_2 \\
\textit{Kernel Distance:}\quad &\mapsto \kappa(x_1, x_1) - 2\kappa(x_1, x_2) + \kappa(x_2, x_2)
\end{aligned}$$

If we compute distances this way for a given κ, we get kernel KNN. This is particularly nice when we have kernels that capture specific behaviors (e.g., periodic or embedded categorical) that we can’t really treat as Euclidean.
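A minimal sketch of this idea (kernel_dist2 and the 1-nearest-neighbor step below are illustrative helpers, not package functions), reusing the RBF kernel matrix K from above:

# Squared kernel distances: d(i,j)^2 = K_ii - 2 K_ij + K_jj
kernel_dist2 <- function(K) {
    d <- diag(K)
    pmax(outer(d, d, "+") - 2 * K, 0)  # clip tiny negative values from round-off
}

D2 <- kernel_dist2(K)
nn_idx <- apply(D2 + diag(Inf, n), 2, which.min)  # nearest non-self neighbor of each point
mean((y - y[nn_idx])^2)                           # in-sample 1-NN 'error' using kernel distances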

Trade-Offs between Linear and Non-Linear Methods

Having developed some tools for non-linear regression, let’s step back and ask whether they are worthwhile. Recall our error decomposition:

$$\text{MSE} = \text{Irreducible Error} + \text{Variance} + \text{Model Bias}^2 + \text{Estimation Bias}^2$$

We have already discussed how tools like ridge and lasso let us remove variance at the cost of (estimation) bias.

We argued that this was a worthwhile trade when the problem had a lot of ‘innate’ variance, either from having large noise variance ($\mathbb{V}[\epsilon] \gg 0$) or from having many features. In particular, use of some sort of shrinkage (ridge or lasso penalization) was essential in the ‘high-dimensional’ case ($p > n$) where OLS had so much variance that it was not even uniquely defined. We also argued that, as we got more and more data ($n \to \infty$), the variance took care of itself in the usual statistical way.

These lessons generalize to the non-linear vs linear debate as well. Choosing to use a linear model is itself a variance reducing choice - there are ‘more’ curves than lines in some sense. If we restrict our attention to linear models only, we are potentially accepting some Model Bias, again with a hope of reducing variance to get overall better performance. As such, the same intuition as above applies: linear models are preferred when variance is the primary concern, either from noisy data or from small data; as we get more data, variance decreases naturally, so going to non-linear models reduces bias, giving overall smaller error.
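A small simulation sketch of that intuition, reusing the univariate DGP from earlier (the sample sizes and noise level are arbitrary): compare the test error of a plain linear fit against an mgcv spline fit as $n$ grows.

compare_fits <- function(n_train, n_test = 1000) {
    x_tr <- runif(n_train, 2, 10)
    y_tr <- sqrt(abs(x_tr)^3 + 5) * cos(x_tr) + rnorm(n_train, sd=3)
    x_te <- runif(n_test, 2, 10)
    Ey_te <- sqrt(abs(x_te)^3 + 5) * cos(x_te)

    m_lin <- lm(y_tr ~ x_tr)
    m_gam <- mgcv::gam(y_tr ~ s(x_tr))

    c(linear = mean((Ey_te - predict(m_lin, data.frame(x_tr = x_te)))^2),
      spline = mean((Ey_te - predict(m_gam, data.frame(x_tr = x_te)))^2))
}

# As n grows, the spline's lower model bias should start to win out
sapply(c(25, 100, 400), compare_fits)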

In the non-linear context, the ‘estimation bias / variance’ trade-off remains, but modern tools like mgcv essentially handle this automatically for us. It is of course still there, but mgcv has some nifty auto-tuning built-in.

Other Topics in Regression

Alternative Loss Functions

For most of this class, we have used MSE as our loss function, building on a foundation of least squares.

Squared ($\ell_2^2$) error is a bit interesting: as an error gets larger, it counts even more. This makes OLS/MSE methods sensitive to outliers (in the same way the mean is), but it also implies that OLS won’t work too hard to “over-minimize” small errors.

We can consider some other loss functions (sketched in code after this list):

  • $\ell_1$ error (absolute difference): robust to outliers, fits the conditional median instead of the conditional mean.

    The resulting Least Absolute Deviations (LAD, or ‘median’) regression no longer has a closed form, but can be solved quickly using tools like CVX.

  • Huber loss: a blend of $\ell_2$ and $\ell_1$ error, which uses $\ell_2$ for small errors and $\ell_1$ for big errors. $$HL_\delta(x) = \begin{cases} \tfrac{1}{2}x^2 & |x| < \delta \\ \delta\left(|x| - \delta/2\right) & |x| \geq \delta \end{cases}$$

    This is still convex, so we can solve it with CVX.

  • $\epsilon$-insensitive: $$L_\epsilon(x) = (|x| - \epsilon)_+ = \begin{cases} 0 & |x| < \epsilon \\ |x| - \epsilon & |x| \geq \epsilon \end{cases}$$ This captures the idea of “close enough is good enough” with “close enough” being defined by $\epsilon$. It’s a bit unnatural statistically, but relates to an important classification method we’ll cover in a few weeks.
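Here is a quick sketch plotting these losses side-by-side (the huber and eps_insensitive helpers are just for illustration):

huber <- function(r, delta=1) ifelse(abs(r) < delta, 0.5 * r^2, delta * (abs(r) - delta / 2))
eps_insensitive <- function(r, eps=1) pmax(abs(r) - eps, 0)

r <- seq(-4, 4, length.out=401)
plot(r, 0.5 * r^2, type="l", ylab="Loss")        # (scaled) squared error
lines(r, abs(r), col=2)                          # absolute error
lines(r, huber(r), col=3)                        # Huber
lines(r, eps_insensitive(r), col=4)              # epsilon-insensitive

legend("top", c("Squared", "Absolute", "Huber", "eps-insensitive"), col=1:4, lty=1)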

Multi-Task Regression

In some contexts, we may want to perform multiple regressions at the same time. That is, we have a vector of responses for each observation. The OLS generalization is straightforward:

$$\hat{B} = \operatorname*{arg\,min}_B \frac{1}{2}\|Y - XB\|_F^2 = (X^\top X)^{-1} X^\top Y$$

so we essentially just fit a separate OLS model for each element of the response. This is not super interesting.

Multi-task regression becomes interesting if we want to add structure to B: a particularly common requirement is to select the same set of features for each of the regression targets. We do this using a group lasso on each row of B: under this structure, if we select a feature for one part of the response, we’ll use it for every part. (Can you see why?)

$$\operatorname*{arg\,min}_B \frac{1}{2}\|Y - XB\|_F^2 + \lambda \sum_{j=1}^p \|B_{j\cdot}\|_2$$
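glmnet implements exactly this row-wise group penalty via its "mgaussian" family. A sketch, reusing the X from the ridge example above (B_true and q are made up for illustration):

library(glmnet)

q <- 3                                    # number of response columns
B_true <- matrix(0, nrow=p, ncol=q)
B_true[1:2, ] <- rnorm(2 * q)             # only the first two features matter
Y <- X %*% B_true + matrix(rnorm(n * q, sd=0.5), n, q)

m_mt <- glmnet(X, Y, family="mgaussian")
coef(m_mt, s=0.1)  # rows are zeroed out (or kept) jointly across all responses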

Another common requirement is for B to be “low-rank” but we won’t consider that here. If you want to look into it, keywords are “nuclear-norm regularized.”

Weights and generalization

  • WLS loss: $(y - X\beta)^\top W (y - X\beta)$ for known diagonal $W$. Weight samples based on their variance (higher variance gets smaller weight); see the sketch after this list.

  • Weighted lasso: $+\lambda \sum_{j=1}^p w_j |\beta_j|$. Features with larger weights are penalized more heavily and so are less likely to be selected.

    Basis of ‘adaptive’ lasso methods (not discussed)

  • Generalized ridge: $+\frac{\lambda}{2}\beta^\top \Omega \beta$

    Used to ‘smooth’ $\hat{\beta}$ - useful for a ridge analogue of the fused lasso
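As a quick sketch of the first item (the data and weights below are made up for illustration): base R’s lm() accepts observation weights directly and minimizes the weighted sum of squares; glmnet’s penalty.factor argument plays the analogous role of $w_j$ for the weighted lasso.

x_w <- runif(100)
y_w <- 2 * x_w + rnorm(100, sd=0.5 + x_w)   # noise variance grows with x_w
w   <- 1 / (0.5 + x_w)^2                    # inverse-variance weights
coef(lm(y_w ~ x_w, weights=w))              # weighted least squares via lm()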

Footnotes

  1. In other classes, you may have been taught that polynomial regression is a linear model. It is definitely linear in the sense that that class would have used (linear in $y$) but we’re using a different sense here. The core insight of that analysis - that a non-linear function can be expressed as a linear combination of non-linear parts - will appear several times in these notes.

  2. In Python, the statsmodels package also implements LOESS fits.

  3. If you want to go much deeper into the practical use of additive (spline) models, Simon Wood’s book Generalized Additive Models is fantastic.