STA 9890 - Non-Linear Regression
This week, we continue our discussion of ‘fancy’ regression from last week to consider non-linear models. While we introduce several methods for non-linear regression, we also pause to consider when non-linear methods can actually be expected to outperform linear methods.
Also note that we are arguably being a bit ‘sloppy’ about what we mean by ‘linear’ methods. For purposes of these notes, we define a ‘linear’ method as one where the fitted model has a linear response to changes in the inputs; formally, one where the fitted function takes the form $\hat{f}(x) = \hat{\beta}_0 + x^{\top}\hat{\beta}$.
This definition includes things like best-subsets and lasso regression, which are not linear functions of the response $y$, but whose fitted models still respond linearly to the inputs.
Non-Linearity via Linear Combination of Non-Linear Parts
We have spent the previous few weeks developing a set of tools for building linear models of the form $\hat{y} = \hat{\beta}_0 + x^{\top}\hat{\beta}$.
For situations where the relationship is nearly (or even truly) linear, this approximation can be quite useful, and good estimates of the underlying regression function can be obtained.
We can further refine this equality by decomposing the bias a bit further:

$$\text{Bias} = \text{Model Bias} + \text{Estimation Bias}$$

Here, we have decomposed the bias into two pieces:

- Model Bias: a measure of the systematic error arising from use of a linear model to predict a non-linear DGP
- Estimation Bias: a measure of the systematic error arising from use of a regularized estimation procedure which exhibits shrinkage
Put another way, “Model Bias” results from use of a linear model when something non-linear should be used, while “Estimation Bias” arises from the use of a biased method (like ridge regression) instead of something nominally unbiased like OLS. For the previous two weeks, we have mainly focused on linear models for linear DGPs, so model bias has been zero. If we expand our gaze to non-linear DGPs, we have to deal with model bias a bit more directly. As with estimation bias, it is frequently worthwhile to accept model bias to reduce variance; see more discussion below.
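A tiny illustration of model bias may help; this example (and the variable names `x_mb`, `y_mb`) is mine, not part of the original notes. The DGP is quadratic, but we insist on fitting a line, so the error does not shrink to the noise level no matter how much data we have:

set.seed(1)
x_mb <- runif(1000, -2, 2)
y_mb <- x_mb^2 + rnorm(1000, sd = 0.1)
coef(lm(y_mb ~ x_mb))           # slope near zero: the line cannot capture the curve
mean(resid(lm(y_mb ~ x_mb))^2)  # stays far above the noise variance (0.01): that gap is model bias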
Polynomial Expansion
You have likely already seen polynomial expansion (or polynomial regression) in previous courses. Essentially, PE fits a low(-ish)-order polynomial instead of a line to data. More formally, let $f$ be the true regression function and consider its first-order Taylor expansion around a point $x_0$:

$$f(x) \approx f(x_0) + \sum_{j=1}^p \frac{\partial f}{\partial x_j}(x_0)\,(x_j - x_{0j})$$

If we rearrange this a bit, we get

$$f(x) \approx \underbrace{\left(f(x_0) - \sum_{j=1}^p \frac{\partial f}{\partial x_j}(x_0)\,x_{0j}\right)}_{\beta_0} + \sum_{j=1}^p \underbrace{\frac{\partial f}{\partial x_j}(x_0)}_{\beta_j}\,x_j$$

so we see that the regression coefficients are more-or-less the (suitably-averaged) partial derivatives of $f$.
Clearly, this linear approximation will do better in the situations where Taylor expansions generally perform better: when i) the higher-order derivatives of $f$ are small (the function is not too ‘wiggly’); and ii) we do not stray too far from the center of our data.
If we want to mitigate some of that model bias, we can choose to use a higher-order Taylor series. If we examine a second order Taylor series, we get a similar approximation as before, where

$$f(x) \approx f(x_0) + \sum_{j} \frac{\partial f}{\partial x_j}(x_0)\,(x_j - x_{0j}) + \frac{1}{2}\sum_{j, k} \frac{\partial^2 f}{\partial x_j \partial x_k}(x_0)\,(x_j - x_{0j})(x_k - x_{0k})$$

After some rearrangement this becomes:

$$f(x) \approx \beta_0 + \sum_j \beta_j x_j + \sum_j \beta_{jj} x_j^2 + \sum_{j < k} \beta_{jk} x_j x_k$$
We recognize the constant term (intercept) and first order terms (linear slopes) from before, but now we have added two new types of terms:

- Second order terms ($x_j^2$): these capture higher order non-linearities in a single variable. For example, if $f(x) = x^2$, then the ‘true fit’ really lies in the second order term instead of any linear approximation.
- Cross terms ($x_j x_k$): these capture non-linear (non-additive) multi-variate relationships between features. You may have seen these before as interaction terms.
Rules of thumb vary as to which set of terms is more important. Historically, statisticians have tended to put in the interaction (cross) terms first, though this is far from universal practice. In particular, when dealing with ‘binary’ variables, the higher order term is unhelpful: if a feature $x_j$ only takes the values 0 and 1, then $x_j^2 = x_j$, so the squared term adds nothing new.
Regardless, this is why we sometimes use ‘polynomial expansion’ of our original variables. We are trying to capture the higher-order (non-linear) terms of the Taylor approximation.
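In R’s formula syntax, these higher-order terms are easy to request directly; here is a small sketch with two hypothetical features `x1` and `x2` (this example is mine, not from the notes):

# x1*x2 expands to x1 + x2 + x1:x2 (the cross / interaction term);
# I(x1^2) and I(x2^2) add the pure second-order terms
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- with(dat, 1 + x1 - 2 * x2 + 0.5 * x1 * x2 + rnorm(50, sd = 0.1))
coef(lm(y ~ x1 * x2 + I(x1^2) + I(x2^2), data = dat))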
It is worth thinking about what happens when we add higher-order terms to our model. In particular, note that we i) pick up some correlation among features; and ii) will generally have more variance since we have more features. In practice, the correlation isn’t too much of a problem and the actual polynomials fit by, e.g., `poly()` in R are modified to remove correlation. The variance, however, can be a more significant problem.
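We can verify the correlation point directly; a quick sketch with a freshly simulated `x_demo` (the name and example are mine):

x_demo <- runif(200, 2, 10)
cor(x_demo, x_demo^2)                            # raw linear and quadratic terms: highly correlated
cor(poly(x_demo, 2)[, 1], poly(x_demo, 2)[, 2])  # orthogonalized poly() columns: essentially zero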
Let’s see this in action. Let’s first consider fitting polynomial regression to a simple (univariate) non-linear function:

$$f(x) = \sqrt{|x|^3 + 5}\,\cos(x)$$
This is not too non-linear. It’s still very smooth (analytic in the jargon) and should be relatively easy to fit. Let’s also generate some noisy observations of this function.
n <- 25
# Linear regression doesn't care about the order of our data, but plotting does
# so sorting gives better visuals
x <- sort(runif(n, 2, 10))
Ey <- sqrt(abs(x)^3 + 5) * cos(x)
y <- Ey + rnorm(n, sd=3)

x_grid <- seq(min(x), max(x), length.out=501)
Ey_grid <- sqrt(abs(x_grid)^3 + 5) * cos(x_grid)

plot(x, y)
lines(x_grid, Ey_grid)
We can now fit some linear models to this data (shown in color): we use R’s built-in `poly()` function to automatically create the polynomials we seek. You can think of `poly(x, k)` as creating the k features $x, x^2, \ldots, x^k$ (suitably orthogonalized).
plot(x, y)
lines(x_grid, Ey_grid)
for(k in 1:5){
    m <- lm(y ~ poly(x, k))
    yhat <- predict(m, data.frame(x = x_grid))
    lines(x_grid, yhat, col=k+1)
}
legend("bottomleft",
c("True Regression Function",
"First Order (Linear) Fit",
"Second Order (Quadratic) Fit",
"Third Order (Cubic) Fit",
"Fourth Order (Quartic) Fit",
"Fifth Order (Quintic) Fit"),
col=1:6, lty=1)
Clearly, as we increase the order of the polynomial, we get a better (in-sample) fit. This makes sense: we know that more features give us a better fit, all else being equal, so what’s the issue?
As usual, the issue occurs for extrapolation. You can actually see some issues already beginning to manifest at the ends of our prediction range: the higher-order polynomials go in radically different directions as we extrapolate, and even the ‘best fit’ (quintic) looks like it’s going to be too steep as we go further.
Let’s expand the prediction region for these fits:
x_pred <- seq(-5, 15, length.out=501)
Ey_pred <- sqrt(abs(x_pred)^3 + 5) * cos(x_pred)

plot(x_pred, Ey_pred, type="l")
points(x, y)
for(k in 1:5){
    m <- lm(y ~ poly(x, k))
    yhat <- predict(m, data.frame(x=x_pred))
    lines(x_pred, yhat, col=k+1)
}
legend("bottomleft",
c("True Regression Function",
"First Order (Linear) Fit",
"Second Order (Quadratic) Fit",
"Third Order (Cubic) Fit",
"Fourth Order (Quartic) Fit",
"Fifth Order (Quintic) Fit"),
col=1:6, lty=1)
There are some problems!
To be fair, it is a bit unreasonable to expect these models to perform too well outside of our original sampling area. What’s more worrying is that the extrapolations don’t just miss the curves of the true regression function, they actually create their own even more intense wiggles.
We can also manifest this without extrapolation by fitting very high-order polynomials:
k <- 9
m <- lm(y ~ poly(x, degree=10 + k, raw=TRUE))
yhat <- predict(m, data.frame(x = x_grid))

plot(x, y)
lines(x_grid, Ey_grid)
for(k in 1:5){
    m <- lm(y ~ poly(x, degree=14 + k, raw=TRUE))
    yhat <- predict(m, data.frame(x = x_grid))
    lines(x_grid, yhat, col=k)
}
Note that we have to use `raw=TRUE` here to avoid R complaining about having too high a degree in the polynomial. (R gives an error about “unique points” but the real issue is one about matrix rank, not ties in `x`.) I’m also suppressing some warnings here about a sub-standard fit.
So how can we address this ‘over-wiggliness’?
As we discussed last time, we need to use a different set of functions: ones that are smooth (like polynomials), but not of too high an order. It would also be really nice if we could get something ‘adaptive’ - allowing for more wiggliness where we have more data (and we need more wiggliness) and less wiggliness where we don’t have enough data (falling back to linearity).
Local Linear Models
One way to do this is the idea of “local linear (polynomial) models.” Instead of fitting a single (global) line, we can fit different lines in different parts of our data: that way, we can get an upward line when the true (non-linear) relationship is increasing and a downward line when the true relationship is decreasing. Or at least that’s the hope!
plot(x, y)
lines(x_grid, Ey_grid)
# Divide our data into five 'buckets' and fit sub-models
# This works because we've already sorted our data
# (What would happen if we hadn't?)
for(k in 1:5){
    xk <- x[(5 * k - 4):(5*k)]
    yk <- y[(5 * k - 4):(5*k)]
    mk <- lm(yk ~ xk)
    y_hat <- predict(mk)
    lines(xk, y_hat, col="red4")
}
Definitely some rough edges here - particularly in the areas between our buckets, but we’re on a good path. The `locfit` package will help us with the details:
library(locfit)
locfit 1.5-9.11 2025-01-27
m <- (locfit(y ~ lp(x)))
summary(m)
Estimation type: Local Regression
Call:
locfit(formula = y ~ lp(x))
Number of data points: 25
Independent variables: x
Evaluation structure: Rectangular Tree
Number of evaluation points: 7
Degree of fit: 2
Fitted Degrees of Freedom: 5.194
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")
Not perfect, but pretty nice compared to what we had before.
You can also see a discussion of the “degrees of freedom” in the model output. This is not exactly the DoF you learned in earlier classes, but it’s sort of “morally equivalent.” This local fit is about as flexible as a 4th degree polynomial would be for this problem. Even though this model is made out of quadratics, it’s more flexible than a single quadratic. But we avoid the extreme variability associated with a polynomial of that high order. Win-win!
We can tweak some parameters of the local fit to get different responses: the big ones are the degree of the local polynomial (`deg`) and the fraction of nearest neighbors used for each local fit (`nn`).
?lp
m <- (locfit(y ~ lp(x, deg=4)))
summary(m)
Estimation type: Local Regression
Call:
locfit(formula = y ~ lp(x, deg = 4))
Number of data points: 25
Independent variables: x
Evaluation structure: Rectangular Tree
Number of evaluation points: 7
Degree of fit: 4
Fitted Degrees of Freedom: 6.986
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")
Pretty nice. If we ‘turn down’ the number of neighbors used:
m <- (locfit(y ~ lp(x, nn=0.2, deg=4)))
Warning in lfproc(x, y, weights = weights, cens = cens, base = base, geth =
geth, : Estimated rdf < 1.0; not estimating variance
summary(m)
Estimation type: Local Regression
Call:
locfit(formula = y ~ lp(x, nn = 0.2, deg = 4))
Number of data points: 25
Independent variables: x
Evaluation structure: Rectangular Tree
Number of evaluation points: 27
Degree of fit: 4
Fitted Degrees of Freedom: 24.257
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")
Too far - super crazy! But not entirely unexpected - we know that a fit using roughly 24 degrees of freedom on only 25 data points is essentially interpolating the noise.
The `loess` function in base R does this very nicely as well, without requiring additional packages. For practical work, it’s a very nice tool for univariate modeling.2
m <- loess(y ~ x)
summary(m)
Call:
loess(formula = y ~ x)
Number of Observations: 25
Equivalent Number of Parameters: 4.64
Residual Standard Error: 4.098
Trace of smoother matrix: 5.1 (exact)
Control settings:
span : 0.75
degree : 2
family : gaussian
surface : interpolate cell = 0.2
normalize: TRUE
parametric: FALSE
drop.square: FALSE
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")
Spline Regression (Additive Models)
Splines
We can generalize this idea a bit using splines. Splines are functions that solve a penalized approximation problem:

$$\hat{f} = \operatorname*{arg\,min}_{f} \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \int f''(t)^2\, dt$$
Here we are saying we’ll take any function (not just a polynomial) that achieves the optimal trade-off between data fit (first / loss term) and not being too rough (second / penalty term).
Why is this the right penalty to use? In essence, we are trying to penalize ‘deviation from linearity’ and, since linear functions have second derivative 0 (by definition), the integral of the squared second derivative gives us a measure of non-linearity.
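Base R’s `smooth.spline()` fits exactly this kind of penalized (cubic smoothing spline) problem, choosing the penalty weight by generalized cross-validation by default. A quick sketch on our simulated data (the blue curve is my addition, not one of the original figures):

ss <- smooth.spline(x, y)      # minimizes RSS + lambda * integral of (f'')^2
plot(x, y)
lines(x_grid, Ey_grid)
lines(predict(ss, x_grid), col="blue")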
In some remarkable work, Grace Wahba and co-authors showed that the solutions to that optimization problem are piecewise polynomials with a few additional constraints - these functions are splines. In addition to piecewise polynomialness, splines also guarantee:
- continuity of the function and its derivatives
- linearity outside the range of data fit (‘natural splines’)
Because splines not only match the value of the function at the knots (places where the ‘pieces’ match up) but also the derivatives, they are very smooth indeed.
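You can inspect where those knots are placed; by default `ns()` (from the `splines` package introduced just below) puts the interior knots at quantiles of the data. A small sketch:

library(splines)
attr(ns(x, df=5), "knots")           # interior knots (quantiles of x)
attr(ns(x, df=5), "Boundary.knots")  # boundary knots (range of x)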
Stepping back, splines are just a different sort of feature engineering. Instead of using polynomial basis functions ($1, x, x^2, \ldots$), we use spline basis functions as our derived features.
R provides the basic tools for spline fitting in the `splines` package. For real work, you almost surely want to use the advanced functionality of the `mgcv` package. For more on splines and models using them (‘additive models’), see this online course.
We can see that individual splines are quite nice little functions and usually are only non-zero on small regions:
library(splines)
nsx <- ns(x_grid, df=5) # The natural spline basis
plot(x_grid, apply(nsx, 1, max), ylim=range(nsx), type="n")
for(i in 1:5){
lines(x_grid, nsx[,i], col=i)
}
We can use linear combinations of these ‘bumps’ to fit non-linear functions, including our example from above:
m <- lm(y ~ ns(x, df=5))
plot(x, y)
lines(x_grid, Ey_grid)
y_hat <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_hat, col="red4")
(This basically matches what we had before, with the `df` parameter here controlling the flexibility of the fit.)
Our predicted response is actually the sum of the individual spline effects:
m <- lm(y ~ ns(x, df=5))
y_hat <- predict(m, data.frame(x = x_grid))
plot(x, y, lwd=2)
lines(x_grid, Ey_grid)
for(i in 1:5){
lines(x_grid, ns(x_grid, df=5)[,i] * coef(m)[-1][i], col=i+1)
}
If we were able to sum up the colored lines (and add back the intercept), we would recover our fitted curve - which closely tracks the black line.
We can also see that natural splines give smooth predictions outside the range of the original data:
y_pred <- predict(m, data.frame(x = x_pred))
plot(x_pred, Ey_pred, type="l")
lines(x_pred, y_pred, col="red4")
Admittedly, not ideal, but it will do. In particular, note that outside of the ‘data region’, we fall back on a nice comfortable linear extrapolation.
(You can repeat this process with the slightly more common B-spline basis, `bs()`, which does not impose linearity beyond the boundary knots:)
m <- lm(y ~ bs(x, df=5))
y_hat <- predict(m, data.frame(x = x_grid))
plot(x, y, xlim=range(x_grid))
lines(x_grid, Ey_grid)
for(i in 1:5){
lines(x_grid, bs(x_grid, df=5)[,i] * coef(m)[-1][i], col=i+1)
}
y_pred <- predict(m, data.frame(x = x_grid))
lines(x_grid, y_pred, col="red4", lwd=2)
For a ‘full-strength’ version of spline fitting, you can use the `mgcv` package:
library(mgcv)
Loading required package: nlme
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
m <- gam(y ~ s(x))
summary(m)
Family: gaussian
Link function: identity
Formula:
y ~ s(x)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.2852 0.5974 -7.173 1.1e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 5.954 7.038 70.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.954 Deviance explained = 96.5%
GCV = 12.359 Scale est. = 8.9212 n = 25
m <- gam(y ~ s(x))
y_hat <- predict(m, data.frame(x = x_grid))
plot(x, y, xlim=range(x_grid))
lines(x_grid, Ey_grid)
lines(x_grid, y_hat, col="red4", lwd=2)
Run `?s` to see the many features of the `mgcv` package. `mgcv` also includes several specialized splines (e.g. periodic) which you may find useful in some problems.
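For instance, a cyclic (periodic) smooth only requires changing the basis argument. A sketch on a made-up `hour` variable (the data and settings here are hypothetical, not from the notes):

hour <- runif(200, 0, 24)
resp <- sin(2 * pi * hour / 24) + rnorm(200, sd = 0.3)
m_cyc <- gam(resp ~ s(hour, bs = "cc"))  # cyclic cubic spline: the fit joins up smoothly at the cycle ends
plot(m_cyc)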
Splines are not normally considered a ‘machine learning’ tool, but they are incredibly useful in applied statistics work. (The two methods you will rely on most from this class are spline regression and random forests.)
$\ell_1$-Filtering
Kernel Methods
Manual feature expansion
We can think of feature expansion as mapping our data to a higher-dimensional space via a feature map $\phi: \mathbb{R}^p \to \mathbb{R}^d$ (with $d > p$) and then fitting a linear model in the expanded space.
Can we generalize this formally to more interesting maps $\phi$?
Ridge without coefficients
Before introducing kernel methods formally, let’s look back to ridge regression. We showed that the ridge solution is given by

$$\hat{\beta}_{\text{ridge}} = (X^{\top}X + \lambda I_p)^{-1} X^{\top} y$$

Some nifty linear algebra will let us rewrite this as

$$\hat{\beta}_{\text{ridge}} = X^{\top}(X X^{\top} + \lambda I_n)^{-1} y$$
You can prove this using the Woodbury Matrix Identity or we can just check it for a single data set and trust that it generalizes:
n <- 100
p <- 5

X <- matrix(rnorm(n * p), nrow=n, ncol=p)

Ey <- X[,1] + sqrt(X[,2]^2 + 5) + 0.1 * X[,3]^3 + cos(abs(X[,4])) + 1/(abs(X[,5]) + 3)
y <- Ey + rnorm(n, sd=sqrt(0.25))

lambda <- 2
eye <- function(p) diag(1, p, p)
beta_hat_ridge <- solve(crossprod(X) + lambda * eye(p), crossprod(X, y))

beta_hat_ridge_alt <- t(X) %*% solve(tcrossprod(X) + lambda * eye(n)) %*% y

cbind(beta_hat_ridge, beta_hat_ridge_alt)
[,1] [,2]
[1,] 1.42746826 1.42746826
[2,] 0.15151370 0.15151370
[3,] 0.57121814 0.57121814
[4,] 0.04425193 0.04425193
[5,] 0.58742212 0.58742212
So they are the same! But why did we do this?
It is sometimes a bit more computationally efficient if we have to invert an $n \times n$ matrix rather than a $p \times p$ matrix (that is, when $n < p$), but that is not the main payoff.
Note that, if we want to make a ridge regression prediction now for a new point $x_{\text{new}}$, we compute

$$\hat{y}_{\text{new}} = x_{\text{new}}^{\top}\hat{\beta}_{\text{ridge}} = x_{\text{new}}^{\top} X^{\top} (X X^{\top} + \lambda I_n)^{-1} y,$$

which depends on the features only through inner products between data points.
Intuitively, recall that the inner product of two vectors measures the angle between them and can, up to some scaling, be used as a ‘similarity’ measure. This result essentially says that our prediction on a new data set is a weighted average of the training data, with weights based on similarity of the new point with the training data. This is not a crazy structure…
If we look closer, we see that we only need to compute products of the form $\langle x_i, x_j \rangle$ and $\langle x_i, x_{\text{new}} \rangle$ - inner products between data points - never the raw feature vectors themselves.
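We can check this numerically, reusing `X`, `y`, `lambda`, `eye()`, and `beta_hat_ridge` from the ridge example above (the new point `x_new` is made up for illustration):

x_new <- rnorm(p)
G     <- tcrossprod(X)    # n x n matrix of inner products <x_i, x_j>
k_new <- X %*% x_new      # n-vector of inner products <x_i, x_new>
pred_gram   <- crossprod(y, solve(G + lambda * eye(n), k_new))
pred_direct <- sum(beta_hat_ridge * x_new)
c(pred_gram, pred_direct) # the two agree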
Kernel Trick
Earlier, we considered non-linear regression by feature expansion, where we replaced the original features $x$ with a longer vector of derived features $\phi(x)$.
Longer, but not necessarily better. If the expanded feature vector $\phi(x)$ is very high-dimensional (or even infinite-dimensional), forming it and computing with it directly becomes expensive or impossible.
It turns out, however, that we never need to actually form $\phi(x)$: we only ever need the inner products $\langle \phi(x_i), \phi(x_j) \rangle$, and for certain maps these can be computed directly from $x_i$ and $x_j$.
Functions that allow this are called (Mercer) kernels and they are just a little bit magical.
If we let $K$ denote the $n \times n$ matrix of kernel evaluations on the training data, $K_{ij} = k(x_i, x_j)$, the kernel(ized) ridge prediction at a new point $x$ is

$$\hat{f}(x) = y^{\top}(K + \lambda I_n)^{-1} k_x, \qquad k_x = \big(k(x_1, x), \ldots, k(x_n, x)\big)^{\top}.$$

If we break this apart, we see that our predictions at the new point are a weighted combination of the training responses, with weights determined by the kernel ‘similarity’ between the new point and each training point.
Let’s try out Kernel(ized) Ridge Regression:
library(kernlab)
## Define a kernel function - we'll spell this out below.
rbf <- rbfdot(sigma = 0.05)

## calculate kernel matrix
K <- kernelMatrix(rbf, X)
dim(K)
[1] 100 100
Note that this is an $n \times n$ (here $100 \times 100$) matrix of pairwise kernel ‘similarities’ between observations, not an $n \times p$ matrix of features.
We can now use this to make predictions:
krr <- function(x, lambda=1) {
    crossprod(y, solve(K + lambda * eye(n), kernelMatrix(rbf, X, x)))
}

krr_MSE <- mean((y - krr(X))^2)
print(krr_MSE)
[1] 0.4916979
This is better than the OLS MSE:
mean(resid(lm(y ~ X))^2)
[1] 0.5484549
Why? How is this not a contradiction of everything said before?
But how do we actually do kernel multiplication?
There are many kernel functions in the world. The most common are defined by:
- $k(x, x') = \langle x, x' \rangle$. This is the linear kernel, equivalent to non-kernel methods.
- $k(x, x') = (\langle x, x' \rangle + c)^d$. This is the polynomial kernel, equivalent to fitting polynomial regression up to degree $d$ with all cross products. (The role of $c$ is tricky, but it essentially controls the relative weight of the higher and lower order terms.)
- $k(x, x') = \exp\big(-\sigma \|x - x'\|^2\big)$. This is the squared exponential or radial basis function (RBF) kernel; it is very popular in spatial statistics, where it is closely related to a method known as kriging.
While the first two kernels give us a natural (finite-dimensional) feature map $\phi$, the RBF kernel corresponds to an infinite-dimensional feature expansion - one we could never compute explicitly.
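To connect the RBF formula to the `kernlab` code above, here is a quick check (reusing `X`; the explicit formula is the one `rbfdot()` implements):

sigma <- 0.05
K_manual  <- exp(-sigma * as.matrix(dist(X))^2)      # k(x, x') = exp(-sigma * ||x - x'||^2)
K_kernlab <- kernelMatrix(rbfdot(sigma = sigma), X)
max(abs(K_manual - K_kernlab))                       # numerically zero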
For even more kernels, see:
Much of our current understanding of deep learning is based on specialized kernel methods:
- The Neural Tangent Kernel
- Kernel Methods for Deep Learning
- Gradient Descent as a Kernel Machine (This paper is a bit controversial)
This blog is a particularly nice (but still somewhat technical) introduction to the Neural Tangent Kernel and its relationship to neural networks.
Spline theory has also found use to explain ReLU (piecewise linear activation) neural networks:
- Mad Max: Spline Insights into Deep Learning
- Representer Theorems for Neural Networks and Ridge Splines
We can also kernelize some of the other methods we’ve studied in this course.
Notably, we can kernelize any method that depends on the data only through inner products or distances, since $\|\phi(x) - \phi(x')\|^2 = k(x, x) - 2k(x, x') + k(x', x')$.
If we compute distances this way for a given kernel, we are effectively measuring distance in the (implicit) expanded feature space rather than in the original feature space.
Trade-Offs between Linear and Non-Linear Methods
Having developed some tools for non-linear regression, let’s step back and ask whether they are worthwhile. Recall our error decomposition:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}, \qquad \text{Bias} = \text{Model Bias} + \text{Estimation Bias}$$
We have already discussed how tools like ridge and lasso let us remove variance at the cost of (estimation) bias.
We argued that this was a worthwhile trade when the problem had a lot of ‘innate’ variance, either from having large noise variance ($\sigma^2$) or from having relatively few observations compared to the number of features.
These lessons generalize to the non-linear vs linear debate as well. Choosing to use a linear model is itself a variance reducing choice - there are ‘more’ curves than lines in some sense. If we restrict our attention to linear models only, we are potentially accepting some Model Bias, again with a hope of reducing variance to get overall better performance. As such, the same intuition as above applies: linear models are preferred when variance is the primary concern, either from noisy data or from small data; as we get more data, variance decreases naturally, so going to non-linear models reduces bias, giving overall smaller error.
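Here is a rough simulation sketch of that sample-size story, reusing the DGP from earlier; the particular spline `df`, sample sizes, and seed are arbitrary choices of mine:

test_mse <- function(n_train) {
  # train on n_train points, evaluate both fits on a large held-out sample
  x_tr <- sort(runif(n_train, 2, 10)); y_tr <- sqrt(abs(x_tr)^3 + 5) * cos(x_tr) + rnorm(n_train, sd=3)
  x_te <- runif(1000, 2, 10);          y_te <- sqrt(abs(x_te)^3 + 5) * cos(x_te) + rnorm(1000, sd=3)
  fit_lin <- lm(y_tr ~ x_tr)
  fit_spl <- lm(y_tr ~ ns(x_tr, df=6))
  c(linear = mean((y_te - predict(fit_lin, data.frame(x_tr = x_te)))^2),
    spline = mean((y_te - predict(fit_spl, data.frame(x_tr = x_te)))^2))
}
set.seed(42)
rbind(`n = 10` = test_mse(10), `n = 200` = test_mse(200))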
In the non-linear context, the ‘estimation bias / variance’ trade-off remains, but modern tools like `mgcv` essentially handle this automatically for us. It is of course still there, but `mgcv` has some nifty auto-tuning built-in.3
Other Topics in Regression
Alternative Loss Functions
For most of this class, we have used MSE as our loss function, building on a foundation of least squares.
Squared ($\ell_2$) error is mathematically convenient, but it is not the only option. We can consider some other loss functions:

- Absolute ($\ell_1$) error: the resulting Median Absolute Deviation (MAD) regression no longer has a closed form, but can be solved quickly using tools like CVX (see the sketch after this list).
- Huber loss: a blend of $\ell_1$ and $\ell_2$ error, which uses $\ell_2$ for small errors and $\ell_1$ for big errors. This is still convex, so we can solve it with CVX.
- $\epsilon$-insensitive loss: this captures the idea of “close enough is good enough”, with “close enough” being defined by $\epsilon$. It’s a bit unnatural statistically, but relates to an important classification method we’ll cover in a few weeks.
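The notes mention CVX; in R, the `CVXR` package plays the same role. A hedged sketch of absolute-error (MAD) and Huber regression on the simulated `X` and `y` from the ridge/kernel section (the package choice and the threshold `M = 1` are mine):

library(CVXR)
beta_v  <- Variable(p)
fit_lad <- solve(Problem(Minimize(p_norm(y - X %*% beta_v, 1))))           # absolute (L1) loss
fit_hub <- solve(Problem(Minimize(sum(huber(y - X %*% beta_v, M = 1)))))   # Huber loss
cbind(LAD = fit_lad$getValue(beta_v), Huber = fit_hub$getValue(beta_v))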
Multi-Task Regression
In some contexts, we may want to perform multiple regressions at the same time. That is, we have a vector of responses for each observation. The OLS generalization is straightforward:

$$\hat{B} = \operatorname*{arg\,min}_{B \in \mathbb{R}^{p \times q}} \|Y - XB\|_F^2$$

where $Y$ is the $n \times q$ matrix of responses and each column of $\hat{B}$ is just the OLS fit for the corresponding response.
Multi-task regression becomes interesting if we want to add structure to the coefficient matrix $B$ - for example, encouraging the different responses to share information rather than being fit completely separately.
Another common requirement is for $B$ to have shared structure across tasks, such as a common set of selected features or low rank.
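In base R, `lm()` already handles a matrix response, fitting one OLS per column; a tiny sketch using a hypothetical second response built from the simulated data above:

Y <- cbind(y1 = y, y2 = Ey + rnorm(n, sd = 1))  # hypothetical second task
coef(lm(Y ~ X))                                 # one coefficient column per response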
Weights and generalization
- WLS loss: $(y - X\beta)^{\top} W (y - X\beta)$ for known diagonal $W$. Weight samples based on their variance (higher variance gets smaller weights); see the sketch after this list.
- Weighted lasso: penalty $\sum_j w_j |\beta_j|$. Features with higher weights are less likely to be selected. Basis of ‘adaptive’ lasso methods (not discussed).
- Generalized ridge: penalty $\beta^{\top} \Omega \beta$ for a positive semi-definite matrix $\Omega$. Used to ‘smooth’ $\hat{\beta}$ - useful for a ridge analogue of the fused lasso.
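Weighted least squares is built directly into `lm()`; a sketch with made-up weights (in practice they would come from known or estimated noise variances):

w <- runif(n, 0.5, 2)           # hypothetical weights; ideally w_i is proportional to 1 / Var(noise_i)
coef(lm(y ~ X, weights = w))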
Footnotes
1. In other classes, you may have been taught that polynomial regression is a linear model. It is definitely linear in the sense that that class would have used (linear in the coefficients $\beta$) but we’re using a different sense here. The core insight of that analysis - that a non-linear function can be expressed as a linear combination of non-linear parts - will appear several times in these notes.↩︎
2. In Python, the `statsmodels` package also implements LOESS fits.↩︎
3. If you want to go much deeper into the practical use of additive (spline) models, Simon Wood’s book Generalized Additive Models is fantastic.↩︎