Session08 Estimation Theory - Least Squares
Goran S. Milovanovic, PhD
The logic of statistical modeling explained: optimize the Simple Linear Regression model from scratch. Understanding why statistics = learning: the concept of error minimization.
What do we want to do today?
We will dive deep into theory today in an effort to understand the logic of statistical modeling by error minimization. We will talk a bit about the different types of mathematical models developed and used in contemporary Data Science, Machine Learning, and AI, and distinguish between supervised and unsupervised learning on the one hand and reinforcement learning on the other. Then we will focus on supervised learning and the simple linear regression model again, in order to understand how mathematical models are estimated from empirical data. We will demonstrate several different approaches to estimating a given model’s parameters from data and discuss why they all converge on one single, important insight: statistics is learning.
0. Prerequisites
Setup:
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(data.table)
library(viridisLite) # colour scale
library(RANN)
1. Understanding Linear Regression
1.1 Sums of Squares in Simple Linear Regression
Once again: Sepal.Length predicts Petal.Length in iris:
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "black", size = 2) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "white", size = 1.5) +
geom_smooth(method = 'lm', linewidth = .25, color = "red") +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
Let’s recall the form of the simple linear regression model:
\[Y = \beta_0 + \beta_1X_1 + \epsilon\]
or
\[outcome_i = model + error_i\]
where \(i\) is an index across the whole dataset (i.e. each row, each pair of Sepal.Length and Petal.Length values, each observation). So statistical models make errors in their predictions, of course. Then which model works best for a given dataset? The one that minimizes the error, of course.
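Before moving on, here is a minimal sketch (not part of the original code) verifying the decomposition above: the model’s predictions plus its errors reconstruct the observed outcome exactly. The object name check_fit is illustrative.
# - outcome_i = prediction_i + error_i, checked with lm() fitted values and residuals:
check_fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
reconstructed <- fitted(check_fit) + residuals(check_fit)
all.equal(as.numeric(reconstructed), iris$Petal.Length)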
Take a look at the following chart:
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "black", size = 2) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "white", size = 1.5) +
geom_hline(yintercept = mean(iris$`Petal.Length`),
linetype = "dashed",
color = "red") +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
The horizontal line in this chart has an intercept of 3.758, which is the mean of iris$Petal.Length, the outcome variable in our simple linear regression model. If there were no predictor iris$Sepal.Length, the mean of \(Y\) would be our best possible guess about any new value observed on that variable. Let’s take a look at the residuals of the outcome variable from its own mean:
linFit <- iris
linFit$Petal.Length.AtMean <- linFit$Petal.Length - mean(linFit$Petal.Length)
ggplot(data = linFit,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_hline(yintercept = mean(iris$`Petal.Length`),
linetype = "dashed",
color = "red") +
geom_segment(aes(x = Sepal.Length,
y = Petal.Length,
xend = Sepal.Length,
yend = Petal.Length - Petal.Length.AtMean),
color = "black", size = .2, linetype = "dashed") +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "black", size = 2) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "white", size = 1.5) +
geom_point(aes(x = Sepal.Length, y = mean(iris$Petal.Length)), color = "red", size = 1) +
xlab("Sepal.Length") + ylab("Petal.Length") +
theme_classic() +
theme_bw() +
theme(panel.border = element_blank()) +
theme(strip.background = element_blank()) +
theme(axis.text.x = element_text(size = 6)) +
theme(axis.text.y = element_text(size = 6))
What is our error, in total, if we start predicting the values of a variable from its mean alone, without any other predictor at hand? Enter the Total Sum of Squares, \(SS_T\):
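In symbols, with \(y_i\) denoting the observed values of the outcome and \(\bar{y}\) their mean:
\[SS_T = \sum_{i=1}^{n}(y_i - \bar{y})^2\]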
ss_total <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)
print(ss_total)
## [1] 464.3254
Ok, now back to the simple linear regression model Petal.Length ~ Sepal.Length:
linFit <- lm(data = iris,
Petal.Length ~ Sepal.Length)
linFit <- data.frame(
x = iris$Sepal.Length,
y = iris$Petal.Length,
predicted = linFit$fitted.values,
residuals = linFit$residuals
)
ggplot(data = linFit,
aes(x = x, y = y)) +
geom_smooth(method = lm, se = F, color = "red", linewidth = .25) +
geom_segment(aes(x = x,
y = predicted,
xend = x,
yend = predicted + residuals),
color = "black", size = .2, linetype = "dashed") +
geom_point(aes(x = x, y = y), color = "black", size = 2) +
geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
xlab("Sepal.Length") + ylab("Petal.Length") +
theme_classic() +
theme_bw() +
theme(panel.border = element_blank()) +
theme(strip.background = element_blank()) +
theme(axis.text.x = element_text(size = 6)) +
theme(axis.text.y = element_text(size = 6))
The Residual Sum of Squares, \(SS_R\), is the sum of squared distances from the observed data points to the model’s respective predictions:
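In symbols, with \(\hat{y}_i\) denoting the model’s prediction for observation \(i\):
\[SS_R = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]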
slr_model <- lm(data = iris,
Petal.Length ~ Sepal.Length)
ss_residual <- sum(residuals(slr_model)^2)
print(ss_residual)
## [1] 111.4592
Finally, the Model Sum of Squares, \(SS_M\).
linFit <- lm(data = iris,
Petal.Length ~ Sepal.Length)
linFit <- data.frame(
x = iris$Sepal.Length,
y = iris$Petal.Length,
predicted = linFit$fitted.values,
meanY = mean(iris$Petal.Length)
)
ggplot(data = linFit,
aes(x = x, y = y)) +
geom_smooth(method = lm, se = F, color = "red", linewidth = .25) +
geom_hline(yintercept = mean(iris$`Petal.Length`),
linetype = "dashed",
color = "red") +
geom_segment(aes(x = x,
y = predicted,
xend = x,
yend = meanY),
color = "red", size = .2, linetype = "dashed") +
geom_point(aes(x = x, y = y), color = "black", size = 2) +
geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
xlab("Sepal.Length") + ylab("Petal.Length") +
theme_classic() +
theme_bw() +
theme(panel.border = element_blank()) +
theme(strip.background = element_blank()) +
theme(axis.text.x = element_text(size = 6)) +
theme(axis.text.y = element_text(size = 6))
The Model Sum of Squares, \(SS_M\), is based on the distances between the mean of the outcome and the model predictions:
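In symbols:
\[SS_M = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2\]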
ss_model <- sum((linFit$predicted - linFit$meanY)^2)
print(ss_model)
## [1] 352.8662
Now:
print(ss_total - ss_residual)
## [1] 352.8662
!!! The complete error present in an attempt to predict Petal.Length from its mean alone, \(SS_T\), can thus be decomposed into \(SS_R\) and \(SS_M\):
\[SS_{total} = SS_{model} + SS_{residual} = SS_T = SS_M + SS_R\]
1.2 \(F-test\), \(t-test\), and \(R^2\) in Simple Linear Regression
But there are more tricks waiting to be pulled, see:
slr_model <- lm(data = iris,
Petal.Length ~ Sepal.Length)
summary(slr_model)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47747 -0.59072 -0.00668 0.60484 2.49512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
## Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
What is \(SS_M/SS_T\)?
print(ss_model/ss_total)
## [1] 0.7599546
So we have: \(R^2 = SS_M/SS_T\)!
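As a quick check (a small sketch; slr_model is the fit obtained above), the same number can be read off the model summary:
print(summary(slr_model)$r.squared)
# - this should match ss_model/ss_total printed above (approximately 0.76)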
Now, remember the \(F-test\) in the output, the one that has been mentioned before?
Let’s divide \(SS_M\) and \(SS_R\) by their respective degrees of freedom. For \(df_R\) in simple linear regression (as in other models as well) we take the number of observations minus the number of parameters in the model (two, in our case: the intercept and the slope), while for \(df_M\) we take only the number of predictors in the model (one, in our case):
df_residual <- dim(iris)[1] - 2 # - num.obs - num.parameters
print(paste0("df_residual is: ", df_residual))
## [1] "df_residual is: 148"
df_model <- 1 # - how many variables?
print(paste0("df_model is: ", df_model))
## [1] "df_model is: 1"
ms_model <- ss_model/df_model
ms_residual <- ss_residual/df_residual
print(paste0("MS_model is: ", ms_model))
## [1] "MS_model is: 352.866244880182"
print(paste0("MS_residual is: ", ms_residual))
## [1] "MS_residual is: 0.753102399458234"
They are now called mean squares, our \(MS_M\) and \(MS_R\). Ok, now:
f_test <- ms_model/ms_residual
print(f_test)
## [1] 468.5502
And now we know that \(F = MS_{Model}/MS_{Residual} = MS_M/MS_R\). The \(F-test\) follows the \(F-distribution\) with \(df_M\) and \(df_R\) degrees of freedom:
x <- rf(100000, df1 = df_model, df2 = df_residual)
hist(x,
     freq = FALSE,
     xlim = c(0, 3),
     ylim = c(0, 1),
     xlab = '',
     main = 'F-distribution with df_1 = df_M and df_2 = df_R degrees of freedom',
     cex.main = 0.9)
curve(df(x, df1 = df_model, df2 = df_residual),
      from = 0, to = 4, n = 5000, col = 'pink', lwd = 2, add = TRUE)
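We can also recover the p-value of the \(F-test\) directly from the F statistic and its degrees of freedom; a minimal sketch using the objects computed above:
# - upper-tail probability of F(df_M, df_R) at the observed F statistic:
pf(f_test, df1 = df_model, df2 = df_residual, lower.tail = FALSE)
# - consistent with the 'p-value: < 2.2e-16' line reported by summary(slr_model)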
And another thing to observe:
print(f_test)
## [1] 468.5502
slr_summary <- summary(slr_model)
slr_coeffs <- data.frame(slr_summary$coefficients)
print(slr_coeffs)
## Estimate Std..Error t.value Pr...t..
## (Intercept) -7.101443 0.50666229 -14.01613 6.133586e-29
## Sepal.Length 1.858433 0.08585565 21.64602 1.038667e-47
slr_ttest <- slr_coeffs$t.value[2]
print(slr_ttest^2)
## [1] 468.5502
print(f_test)
## [1] 468.5502
N.B. Complicated things have many correct interpretations, and many connections exist among them. For example:
- the best simple linear statistical model is the one that maximizes the effect of the model (\(MS_M\)) while minimizing the effect of the error (\(MS_R\)), which means
- that the best simple linear statistical model is simply the one that minimizes the error, since \(SS_T = SS_M + SS_R\), and \(SS_T\) is fixed because it only reflects the dispersion of the outcome variable around its mean; then,
- the overall effect of the model can be measured by the size of the \(F-test\), which is essentially a ratio of two variances, \(MS_M\) and \(MS_R\), telling us how much the model “works” beyond the error it produces, and which is a simple mathematical transformation of the
- \(t-test\) of whether the model’s slope (\(\beta_1\)) is different from zero, in which case we know that the regression line “works” because it is statistically different from a horizontal line of no correlation (see the small check below).
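As a small check of the last point, the t value reported for the slope is just the estimate divided by its standard error, and its square equals the F statistic; a sketch using the slr_coeffs data frame from above:
t_slope <- slr_coeffs$Estimate[2] / slr_coeffs$Std..Error[2]
print(t_slope)
print(t_slope^2) # - should equal the F statistic (approximately 468.55)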
2. On Minimizing Errors in Statistical Modeling
2.1 Grid Search in the Parameter Space
I want to write a function now. The function takes the following as its inputs:
- a dataset with one predictor and one outcome variable
- a pair of parameters, \(\beta_0\) and \(\beta_1\),
and it returns the sum of squared errors (i.e. residuals) from the simple linear regression model of the well-known \(Y = \beta_0 + \beta_1X + \epsilon\) form.
Here is what I want essentially:
d <- data.frame(x = iris$Sepal.Length,
y = iris$Petal.Length)
residuals <- summary(lm(y ~ x, data = d))$residuals
print(sum(residuals^2))
## [1] 111.4592
But I do not want to use lm(): instead, I want to be able to specify \(\beta_0\) and \(\beta_1\) myself:
sse <- function(d, beta_0, beta_1) {
predictions <- beta_0 + beta_1 * d$x
residuals <- d$y - predictions
return(sum(residuals^2))
}
Ok. Now, the lm() function in R can find the optimal values of the parameters \(\beta_0\) and \(\beta_1\), right?
slr_model <- lm(y ~ x, data = d)
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Let’s test our sse() function:
beta_0_test <- coefficients(slr_model)[1]
beta_1_test <- coefficients(slr_model)[2]
sse(d = d,
beta_0 = beta_0_test,
beta_1 = beta_1_test)
## [1] 111.4592
And from lm() again:
d <- data.frame(x = iris$Sepal.Length,
y = iris$Petal.Length)
residuals <- summary(lm(y ~ x, data = d))$residuals
print(sum(residuals^2))
## [1] 111.4592
So we know that our sse() works just fine.
Now: how could we have determined the optimal values (the error-minimizing values) of \(\beta_0\) and \(\beta_1\) without relying on lm()?
One idea is to move across the space of the parameter values in small steps and compute the model error at each point in that space, for example:
test_beta_0 <- seq(-10, 10, by = .1)
test_beta_1 <- seq(-10, 10, by = .1)
model_errors <- lapply(test_beta_0, function(x) {
return(
rbindlist(
lapply(test_beta_1, function(y) {
s <- sse(d = d, beta_0 = x, beta_1 = y)
return(data.frame(sse = s,
beta_0 = x,
beta_1 = y))
}))
)
})
model_errors <- rbindlist(model_errors)
head(model_errors)
## sse beta_0 beta_1
## <num> <num> <num>
## 1: 796216.9 -10 -10.0
## 2: 783371.7 -10 -9.9
## 3: 770631.0 -10 -9.8
## 4: 757994.7 -10 -9.7
## 5: 745462.9 -10 -9.6
## 6: 733035.6 -10 -9.5
What would the optimal estimates of \(\beta_0\) and \(\beta_1\) be, then?
model_errors[which.min(model_errors$sse), ]
## sse beta_0 beta_1
## <num> <num> <num>
## 1: 111.9305 -7.3 1.9
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Hm, not bad?
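The precision of a grid search is limited by its step size. Here is a hedged sketch (not part of the original procedure) of one possible refinement: repeat the search on a finer grid centered around the coarse minimum found above.
# - zoom in around the coarse solution (beta_0 = -7.3, beta_1 = 1.9) with a step of .01:
fine_beta_0 <- seq(-7.5, -6.9, by = .01)
fine_beta_1 <- seq(1.8, 2.0, by = .01)
fine_errors <- rbindlist(lapply(fine_beta_0, function(x) {
  rbindlist(lapply(fine_beta_1, function(y) {
    data.frame(sse = sse(d = d, beta_0 = x, beta_1 = y),
               beta_0 = x,
               beta_1 = y)
  }))
}))
# - the minimum should now land much closer to coefficients(slr_model):
fine_errors[which.min(fine_errors$sse), ]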
2.2 Sample the Parameter Space
Another idea comes to mind: why not take a uniform random sample from the Parameter Space and check out sse() at various points, defined by their \(\beta_0\) and \(\beta_1\) coordinates?
sample_parameters <- data.frame(beta_0 = runif(100000, -10, 10),
beta_1 = runif(100000, -10, 10))
head(sample_parameters)
## beta_0 beta_1
## 1 -5.6901721 7.273736
## 2 4.9740709 -8.643107
## 3 0.9928063 -8.036839
## 4 5.6476493 -8.047716
## 5 -7.0712216 -4.245558
## 6 2.4955622 -5.882239
Now let’s find the model errors at each randomly sampled combination of parameters:
sample_parameters$sse <- apply(sample_parameters, 1, function(x) {
sse(d, x[1], x[2])
})
head(sample_parameters)
## beta_0 beta_1 sse
## 1 -5.6901721 7.273736 166999.5
## 2 4.9740709 -8.643107 375782.0
## 3 0.9928063 -8.036839 381033.4
## 4 5.6476493 -8.047716 315724.0
## 5 -7.0712216 -4.245558 194422.1
## 6 2.4955622 -5.882239 196703.9
And what would the optimal estimates of \(\beta_0\) and \(\beta_1\) be in this case?
sample_parameters[which.min(sample_parameters$sse), ]
## beta_0 beta_1 sse
## 32621 -7.270532 1.885291 111.555
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Hm?
N.B. Finding the optimal values of the model’s parameters implies some sort of search through the Parameter Space, picking the values that minimize the model error as much as possible.
But there are better ways to do it than Grid Search or Random Sampling. Whenever that is possible, this is exactly what we do with our statistical learning models: we optimize them.
Please pay close attention to what exactly is happening in these procedures:
- the dataset d is a constant: it does not change in any of the sse() function’s runs;
- the parameters \(\beta_0\) and \(\beta_1\) vary in some way (so far: grid search or random uniform sampling), and
- the sse() function does not estimate anything (it is not lm()!) but computes the model error instead.
So what are we really looking for? We are looking for a way to find the minimum of our sse() function (a one-dimensional illustration follows below).
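To build intuition about what “finding the minimum of sse()” means, here is a small one-dimensional illustration (a sketch, not part of the original code): fix \(\beta_0\) at the lm() intercept and look at sse() as a function of \(\beta_1\) alone. The error curve is a parabola with a single minimum, and that minimum sits exactly at the lm() slope.
# - sse() as a function of beta_1 only, with beta_0 held at the lm() intercept:
beta_0_fixed <- coefficients(slr_model)[1]
beta_1_grid <- seq(0, 4, by = .01)
sse_curve <- sapply(beta_1_grid, function(b1) {
  sse(d = d, beta_0 = beta_0_fixed, beta_1 = b1)
})
plot(beta_1_grid, sse_curve, type = "l",
     xlab = "beta_1", ylab = "SSE",
     main = "SSE as a function of beta_1 (beta_0 fixed)")
abline(v = coefficients(slr_model)[2], lty = 2, col = "red")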
2.3 Optimize the Simple Linear Regression model with base R optim()
First we need a slight rewrite of sse() only:
sse <- function(params) {
beta_0 <- params[1]
beta_1 <- params[2]
# - MODEL IS HERE:
predictions <- beta_0 + beta_1 * d$x
# - MODEL ENDS HERE ^^
residuals <- d$y - predictions
return(sum(residuals^2))
}
N.B. As the dataset d is a constant, it no longer plays the role of an sse() function argument.
Pick some random initial values for \(\beta_0\) and \(\beta_1\):
beta_0_start <- runif(1, -10, 10)
beta_1_start <- runif(1, -10, 10)
print(beta_0_start)
## [1] -1.297957
print(beta_1_start)
## [1] 1.298156
Now optimize sse():
solution <- optim(par = c(beta_0_start, beta_1_start),
                  fn = sse,
                  method = "Nelder-Mead")
print(solution$par)
## [1] -7.099650 1.858104
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Now that looks great!
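optim() also returns the value of the objective function at the solution, which should (essentially) reproduce the residual sum of squares we obtained from the lm() fit earlier:
print(solution$value)
# - compare with ss_residual, the SSE of the lm() fit (approximately 111.46)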
What we are really looking at here is the error surface over the Parameter Space (first re-sampling the sample_parameters dataframe over a somewhat different range, for plotting purposes… :-)
# - back to the old version of sse():
sse <- function(d, beta_0, beta_1) {
predictions <- beta_0 + beta_1 * d$x
residuals <- d$y - predictions
return(sum(residuals^2))
}
# - sample parameters on a different range for plotting purposes
sample_parameters <- data.frame(beta_0 = runif(1e6, -30, 30),
beta_1 = runif(1e6, -10, 10))
sample_parameters$sse <- apply(sample_parameters, 1, function(x) {
sse(d, x[1], x[2])
})
head(sample_parameters)
## beta_0 beta_1 sse
## 1 -10.791984 1.637388 3839.764
## 2 24.493326 9.676575 902157.853
## 3 6.908973 -9.681987 441838.578
## 4 -10.175620 -2.559840 127314.734
## 5 -18.576823 -3.927809 311159.732
## 6 -19.352316 5.397247 12044.594
# - Wire-frame 3-D error surface with base R persp()
# - The CONFIG block below collects the settings worth tweaking:
# -   `res` sets the grid resolution (higher = denser mesh),
# -   `border_col` and `lwd_grid` set the mesh line colour and width.
# - CONFIG -----------------------------------------------------------
phi_angle <- 35 # - altitude camera angle
theta_angle <- 40 # - azimuth camera angle
expand_fac <- 0.8 # - zoom (<1 zooms out, >1 zooms in)
res <- 50 # - grid resolution (higher = denser mesh)
border_col <- "grey30" # - mesh line colour (single value)
lwd_grid <- 0.4 # - mesh line width (~0.2 hairline, 1 thick)
# - font scaling (1 = default):
cex_axis <- 0.7 # - tick-label size
cex_lab <- 0.8 # - axis-title size
cex_main <- 0.85 # - main-title size
# - 1. Regularise (beta_0, beta_1, sse) onto a regular grid
# - (assumes sample_parameters with columns: beta_0, beta_1, sse)
x_vals <- seq(min(sample_parameters$beta_0),
max(sample_parameters$beta_0), length.out = res)
y_vals <- seq(min(sample_parameters$beta_1),
max(sample_parameters$beta_1), length.out = res)
grid_df <- expand.grid(beta_0 = x_vals, beta_1 = y_vals)
nearest_id <- RANN::nn2(sample_parameters[, c("beta_0", "beta_1")],
grid_df, k = 1)$nn.idx
grid_df$sse <- sample_parameters$sse[nearest_id]
# - expand.grid() varies beta_0 fastest, so filling the matrix column-wise
# - gives z_mat[i, j] = SSE(x_vals[i], y_vals[j]), which is what persp() expects:
z_mat <- matrix(grid_df$sse,
                nrow = length(x_vals), ncol = length(y_vals))
# - 2. Plot with persp()
# - optional device for saving:
# png("error_landscape_wireframe.png", width = 7, height = 6, units = "in", res = 300)
old_par <- par(no.readonly = TRUE) # save graphics settings
par(cex.axis = cex_axis,
cex.lab = cex_lab,
cex.main = cex_main)
persp(x = x_vals,
y = y_vals,
z = z_mat,
theta = theta_angle,
phi = phi_angle,
expand = expand_fac,
col = NA, # facets transparent
border = border_col, # mesh line colour
lwd = lwd_grid, # mesh line width
ticktype = "detailed",
xlab = "Beta 0",
ylab = "Beta 1",
zlab = "SSE",
main = "Simple Linear Regression Error Landscape")
par(old_par) # restore previous settings
# dev.off()
# - Tips:
# - increase `res` (e.g. 100-150 for a coarse grid, 300-400 for a fine one) for a denser mesh;
# - `border_col` accepts any single R colour; per-facet colouring needs rgl or lattice;
# - `lwd_grid` controls the weight of the mesh lines;
# - very high `res` (say, above 500) slows plotting and uses a lot of memory.
sample_parameters[which.min(sample_parameters$sse), ]
## beta_0 beta_1 sse
## 834649 -7.013697 1.840932 111.5221
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
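Finally, as a sanity check on everything above: for simple linear regression, the error-minimizing parameter values also have a well-known closed form, \(\hat{\beta}_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). A minimal sketch:
beta_1_hat <- cov(d$x, d$y) / var(d$x)
beta_0_hat <- mean(d$y) - beta_1_hat * mean(d$x)
print(c(beta_0_hat, beta_1_hat))
# - these should reproduce coefficients(slr_model) exactly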
Further Readings
R Markdown
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.