Session08 Estimation Theory - Least Squares
Goran S. Milovanovic, PhD
The logic of statistical modeling explained: optimize the Simple Linear Regression model from scratch. Understanding why statistics = learning: the concept of error minimization.
What do we want to do today?
We will dive deep into theory today in an effort to understand the logic of statistical modeling by error minimization. We will talk a bit about the different types of mathematical models developed and used in contemporary Data Science, Machine Learning, and AI, and distinguish between supervised and unsupervised learning on the one hand and reinforcement learning on the other. Then we will focus on supervised learning and the simple linear regression model again, in order to understand how mathematical models are estimated from empirical data. We will demonstrate several different approaches to estimating a given model’s parameters from data and discuss why they all converge on one single, important insight: statistics is learning.
0. Prerequisites
Setup:
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(data.table)
library(viridisLite) # colour scale
library(RANN)
1. Understanding Linear Regression
1.1 Sums of Squares in Simple Linear Regression
Once again: Sepal.Length predicts Petal.Length in iris:
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "black", size = 2) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "white", size = 1.5) +
geom_smooth(method = 'lm', linewidth = .25, color = "red") +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
Let’s recall the form of the simple linear regression model:
\[Y = \beta_0 + \beta_1X_1 + \epsilon\]
or
\[outcome_i = model + error_i\]
where \(i\) is an index across the whole dataset (i.e. each row, each pair of Sepal.Length and Petal.Length values, each observation). So statistical models make errors in their predictions, of course. Then which model works best for a given dataset? The one that minimizes the error, of course.
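Before moving on, here is a minimal sketch (not part of the original code) verifying the decomposition above: the model’s predictions plus its errors reconstruct the observed outcome exactly. The object name check_fit is illustrative.
# - outcome_i = prediction_i + error_i, checked with lm() fitted values and residuals:
check_fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
reconstructed <- fitted(check_fit) + residuals(check_fit)
all.equal(as.numeric(reconstructed), iris$Petal.Length)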
Take a look at the following chart:
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "black", size = 2) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "white", size = 1.5) +
geom_hline(yintercept = mean(iris$`Petal.Length`),
linetype = "dashed",
color = "red") +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
The horizontal line in this chart has an intercept of 3.758, which is the mean of iris$Petal.Length, the outcome variable in our simple linear regression model. If there were no predictor iris$Sepal.Length, the mean of \(Y\) would be our best possible guess about any new value observed on that variable. Let’s take a look at the residuals of the outcome variable from its own mean:
linFit <- iris
linFit$Petal.Length.AtMean <- linFit$Petal.Length - mean(linFit$Petal.Length)
ggplot(data = linFit,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_hline(yintercept = mean(iris$`Petal.Length`),
linetype = "dashed",
color = "red") +
geom_segment(aes(x = Sepal.Length,
y = Petal.Length,
xend = Sepal.Length,
yend = Petal.Length - Petal.Length.AtMean),
color = "black", size = .2, linetype = "dashed") +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "black", size = 2) +
geom_point(aes(x = Sepal.Length, y = Petal.Length), color = "white", size = 1.5) +
geom_point(aes(x = Sepal.Length, y = mean(iris$Petal.Length)), color = "red", size = 1) +
xlab("Sepal.Length") + ylab("Petal.Length") +
theme_classic() +
theme_bw() +
theme(panel.border = element_blank()) +
theme(strip.background = element_blank()) +
theme(axis.text.x = element_text(size = 6)) +
theme(axis.text.y = element_text(size = 6))
What is our error, in total, if we start predicting the values of a variable from its mean alone, without any other predictor at hand? Enter the Total Sum of Squares, \(SS_T\):
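In symbols, with \(y_i\) denoting the observed values of the outcome and \(\bar{y}\) their mean:
\[SS_T = \sum_{i=1}^{n}(y_i - \bar{y})^2\]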
ss_total <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)
print(ss_total)
## [1] 464.3254
Ok, now back to the simple linear regression model Petal.Length ~ Sepal.Length:
linFit <- lm(data = iris,
Petal.Length ~ Sepal.Length)
linFit <- data.frame(
x = iris$Sepal.Length,
y = iris$Petal.Length,
predicted = linFit$fitted.values,
residuals = linFit$residuals
)
ggplot(data = linFit,
aes(x = x, y = y)) +
geom_smooth(method = lm, se = F, color = "red", linewidth = .25) +
geom_segment(aes(x = x,
y = predicted,
xend = x,
yend = predicted + residuals),
color = "black", size = .2, linetype = "dashed") +
geom_point(aes(x = x, y = y), color = "black", size = 2) +
geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
xlab("Sepal.Length") + ylab("Petal.Length") +
theme_classic() +
theme_bw() +
theme(panel.border = element_blank()) +
theme(strip.background = element_blank()) +
theme(axis.text.x = element_text(size = 6)) +
theme(axis.text.y = element_text(size = 6))
The Residual Sum of Squares, \(SS_R\), is the sum of squared distances from the observed data points to the model’s respective predictions:
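In symbols, with \(\hat{y}_i\) denoting the model’s prediction for observation \(i\):
\[SS_R = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]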
slr_model <- lm(data = iris,
Petal.Length ~ Sepal.Length)
ss_residual <- sum(residuals(slr_model)^2)
print(ss_residual)
## [1] 111.4592
Finally, the Model Sum of Squares, \(SS_M\).
linFit <- lm(data = iris,
Petal.Length ~ Sepal.Length)
linFit <- data.frame(
x = iris$Sepal.Length,
y = iris$Petal.Length,
predicted = linFit$fitted.values,
meanY = mean(iris$Petal.Length)
)
ggplot(data = linFit,
aes(x = x, y = y)) +
geom_smooth(method = lm, se = F, color = "red", linewidth = .25) +
geom_hline(yintercept = mean(iris$`Petal.Length`),
linetype = "dashed",
color = "red") +
geom_segment(aes(x = x,
y = predicted,
xend = x,
yend = meanY),
color = "red", size = .2, linetype = "dashed") +
geom_point(aes(x = x, y = y), color = "black", size = 2) +
geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
xlab("Sepal.Length") + ylab("Petal.Length") +
theme_classic() +
theme_bw() +
theme(panel.border = element_blank()) +
theme(strip.background = element_blank()) +
theme(axis.text.x = element_text(size = 6)) +
theme(axis.text.y = element_text(size = 6))
The Model Sum of Squares, \(SS_M\), is based on the distances between the mean of the outcome and the model predictions:
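In symbols:
\[SS_M = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2\]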
ss_model <- sum((linFit$predicted - linFit$meanY)^2)
print(ss_model)
## [1] 352.8662
Now:
print(ss_total - ss_residual)
## [1] 352.8662
!!! The complete error present in an attempt to predict Petal.Length from its mean alone, \(SS_T\), can thus be decomposed into \(SS_R\) and \(SS_M\):
\[SS_{total} = SS_{model} + SS_{residual} = SS_T = SS_M + SS_R\]
1.2 \(F-test\), \(t-test\), and \(R^2\) in Simple Linear Regression
But there are more tricks waiting to be pulled, see:
slr_model <- lm(data = iris,
Petal.Length ~ Sepal.Length)
summary(slr_model)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47747 -0.59072 -0.00668 0.60484 2.49512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
## Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
What is \(SS_M/SS_T\)?
print(ss_model/ss_total)
## [1] 0.7599546
So we have: \(R^2 = SS_M/SS_T\)!
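As a quick check (a small sketch; slr_model is the fit obtained above), the same number can be read off the model summary:
print(summary(slr_model)$r.squared)
# - this should match ss_model/ss_total printed above (approximately 0.76)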
Now, remember the \(F-test\) in the output, the one that has been mentioned before?
Let’s divide \(SS_M\) and \(SS_R\) by their respective degrees of freedom. For \(df_R\) in simple linear regression (as in other models as well) we take the number of observations minus the number of parameters in the model (two, in our case: the intercept and the slope), while for \(df_M\) we take only the number of predictors in the model (one, in our case):
df_residual <- dim(iris)[1] - 2 # - num.obs - num.parameters
print(paste0("df_residual is: ", df_residual))
## [1] "df_residual is: 148"
df_model <- 1 # - how many variables?
print(paste0("df_model is: ", df_model))
## [1] "df_model is: 1"
ms_model <- ss_model/df_model
ms_residual <- ss_residual/df_residual
print(paste0("MS_model is: ", ms_model))
## [1] "MS_model is: 352.866244880182"
print(paste0("MS_residual is: ", ms_residual))
## [1] "MS_residual is: 0.753102399458234"
They are now called mean squares, our \(MS_M\) and \(MS_R\). Ok, now:
f_test <- ms_model/ms_residual
print(f_test)
## [1] 468.5502
And now we know that \(F = MS_{Model}/MS_{Residual} = MS_M/MS_R\). The \(F-test\) follows the \(F-distribution\) with \(df_M\) and \(df_R\) degrees of freedom:
x <- rf(100000, df1 = df_model, df2 = df_residual)
hist(x,
     freq = FALSE,
     xlim = c(0, 3),
     ylim = c(0, 1),
     xlab = '',
     main = 'F-distribution with df_1 = df_M and df_2 = df_R degrees of freedom',
     cex.main = 0.9)
curve(df(x, df1 = df_model, df2 = df_residual),
      from = 0, to = 4, n = 5000, col = 'pink', lwd = 2, add = TRUE)
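We can also recover the p-value of the \(F-test\) directly from the F statistic and its degrees of freedom; a minimal sketch using the objects computed above:
# - upper-tail probability of F(df_M, df_R) at the observed F statistic:
pf(f_test, df1 = df_model, df2 = df_residual, lower.tail = FALSE)
# - consistent with the 'p-value: < 2.2e-16' line reported by summary(slr_model)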
And another thing to observe:
print(f_test)
## [1] 468.5502
slr_summary <- summary(slr_model)
slr_coeffs <- data.frame(slr_summary$coefficients)
print(slr_coeffs)
## Estimate Std..Error t.value Pr...t..
## (Intercept) -7.101443 0.50666229 -14.01613 6.133586e-29
## Sepal.Length 1.858433 0.08585565 21.64602 1.038667e-47
slr_ttest <- slr_coeffs$t.value[2]
print(slr_ttest^2)
## [1] 468.5502
print(f_test)
## [1] 468.5502
N.B. Complicated things have many correct interpretations, and many connections exist among them. For example:
- the best simple linear statistical model is the one that maximizes the effect of the model (\(MS_M\)) while minimizing the effect of the error (\(MS_R\)), which means
- that the best simple linear statistical model is simply the one that minimizes the error, since \(SS_T = SS_M + SS_R\), and \(SS_T\) is fixed because it only reflects the dispersion of the outcome variable around its mean; then,
- the overall effect of the model can be measured by the size of the \(F-test\), which is essentially a ratio of two variances, \(MS_M\) and \(MS_R\), telling us how much the model “works” beyond the error it produces, and which is a simple mathematical transformation of the
- \(t-test\) of whether the model’s slope (\(\beta_1\)) is different from zero, in which case we know that the regression line “works” because it is statistically different from a horizontal line of no correlation (see the small check below).
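As a small check of the last point, the t value reported for the slope is just the estimate divided by its standard error, and its square equals the F statistic; a sketch using the slr_coeffs data frame from above:
t_slope <- slr_coeffs$Estimate[2] / slr_coeffs$Std..Error[2]
print(t_slope)
print(t_slope^2) # - should equal the F statistic (approximately 468.55)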
2. On Minimizing Errors in Statistical Modeling
2.1 Grid Search in the Parameter Space
I want to write a function now. The function takes the following as its inputs:
- a dataset with one predictor and one outcome variable
- a pair of parameters, \(\beta_0\) and \(\beta_1\),
and it returns the sum of squared errors (i.e. residuals) from the simple linear regression model of the well-known \(Y = \beta_0 + \beta_1X + \epsilon\) form.
Here is what I want essentially:
d <- data.frame(x = iris$Sepal.Length,
y = iris$Petal.Length)
residuals <- summary(lm(y ~ x, data = d))$residuals
print(sum(residuals^2))
## [1] 111.4592
But I do not want to use lm(): instead, I want to be able to specify \(\beta_0\) and \(\beta_1\) myself:
sse <- function(d, beta_0, beta_1) {
predictions <- beta_0 + beta_1 * d$x
residuals <- d$y - predictions
return(sum(residuals^2))
}
Ok. Now, the lm() function in R can find the optimal values of the parameters \(\beta_0\) and \(\beta_1\), right?
slr_model <- lm(y ~ x, data = d)
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Let’s test our sse() function:
beta_0_test <- coefficients(slr_model)[1]
beta_1_test <- coefficients(slr_model)[2]
sse(d = d,
beta_0 = beta_0_test,
beta_1 = beta_1_test)
## [1] 111.4592
And from lm() again:
d <- data.frame(x = iris$Sepal.Length,
y = iris$Petal.Length)
residuals <- summary(lm(y ~ x, data = d))$residuals
print(sum(residuals^2))
## [1] 111.4592
So we know that our sse() works just fine.
Now: how could we have determined the optimal values (the error-minimizing values) of \(\beta_0\) and \(\beta_1\) without relying on lm()?
One idea is to move across the space of the parameter values in small steps and compute the model error at each point in that space, for example:
test_beta_0 <- seq(-10, 10, by = .1)
test_beta_1 <- seq(-10, 10, by = .1)
model_errors <- lapply(test_beta_0, function(x) {
return(
rbindlist(
lapply(test_beta_1, function(y) {
s <- sse(d = d, beta_0 = x, beta_1 = y)
return(data.frame(sse = s,
beta_0 = x,
beta_1 = y))
}))
)
})
model_errors <- rbindlist(model_errors)
head(model_errors)
## sse beta_0 beta_1
## <num> <num> <num>
## 1: 796216.9 -10 -10.0
## 2: 783371.7 -10 -9.9
## 3: 770631.0 -10 -9.8
## 4: 757994.7 -10 -9.7
## 5: 745462.9 -10 -9.6
## 6: 733035.6 -10 -9.5
What would the optimal estimates of \(\beta_0\) and \(\beta_1\) be, then?
model_errors[which.min(model_errors$sse), ]
## sse beta_0 beta_1
## <num> <num> <num>
## 1: 111.9305 -7.3 1.9
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Hm, not bad?
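The precision of a grid search is limited by its step size. Here is a hedged sketch (not part of the original procedure) of one possible refinement: repeat the search on a finer grid centered around the coarse minimum found above.
# - zoom in around the coarse solution (beta_0 = -7.3, beta_1 = 1.9) with a step of .01:
fine_beta_0 <- seq(-7.5, -6.9, by = .01)
fine_beta_1 <- seq(1.8, 2.0, by = .01)
fine_errors <- rbindlist(lapply(fine_beta_0, function(x) {
  rbindlist(lapply(fine_beta_1, function(y) {
    data.frame(sse = sse(d = d, beta_0 = x, beta_1 = y),
               beta_0 = x,
               beta_1 = y)
  }))
}))
# - the minimum should now land much closer to coefficients(slr_model):
fine_errors[which.min(fine_errors$sse), ]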
2.2 Sample the Parameter Space
Another idea comes to mind: why not take a uniform random sample from the Parameter Space and check out sse() at various points, defined by their \(\beta_0\) and \(\beta_1\) coordinates?
sample_parameters <- data.frame(beta_0 = runif(100000, -10, 10),
beta_1 = runif(100000, -10, 10))
head(sample_parameters)
## beta_0 beta_1
## 1 -5.6901721 7.273736
## 2 4.9740709 -8.643107
## 3 0.9928063 -8.036839
## 4 5.6476493 -8.047716
## 5 -7.0712216 -4.245558
## 6 2.4955622 -5.882239
Now let’s find the model errors at each randomly sampled combination of parameters:
sample_parameters$sse <- apply(sample_parameters, 1, function(x) {
sse(d, x[1], x[2])
})
head(sample_parameters)
## beta_0 beta_1 sse
## 1 -5.6901721 7.273736 166999.5
## 2 4.9740709 -8.643107 375782.0
## 3 0.9928063 -8.036839 381033.4
## 4 5.6476493 -8.047716 315724.0
## 5 -7.0712216 -4.245558 194422.1
## 6 2.4955622 -5.882239 196703.9
And what would the optimal estimates of \(\beta_0\) and \(\beta_1\) be in this case?
sample_parameters[which.min(sample_parameters$sse), ]
## beta_0 beta_1 sse
## 32621 -7.270532 1.885291 111.555
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Hm?
N.B. Finding the optimal values of the model’s parameters implies some sort of search through the Parameter Space, picking the values that minimize the model error as much as possible.
But there are better ways to do it than Grid Search or Random Sampling. Whenever that is possible, this is exactly what we do with our statistical learning models: we optimize them.
Please pay close attention to what exactly is happening in these procedures:
- the dataset d is a constant: it does not change in any of the sse() function’s runs;
- the parameters \(\beta_0\) and \(\beta_1\) vary in some way (so far: grid search or random uniform sampling), and
- the sse() function does not estimate anything (it is not lm()!) but computes the model error instead.
So what are we really looking for? We are looking for a way to find the minimum of our sse() function (a one-dimensional illustration follows below).
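To build intuition about what “finding the minimum of sse()” means, here is a small one-dimensional illustration (a sketch, not part of the original code): fix \(\beta_0\) at the lm() intercept and look at sse() as a function of \(\beta_1\) alone. The error curve is a parabola with a single minimum, and that minimum sits exactly at the lm() slope.
# - sse() as a function of beta_1 only, with beta_0 held at the lm() intercept:
beta_0_fixed <- coefficients(slr_model)[1]
beta_1_grid <- seq(0, 4, by = .01)
sse_curve <- sapply(beta_1_grid, function(b1) {
  sse(d = d, beta_0 = beta_0_fixed, beta_1 = b1)
})
plot(beta_1_grid, sse_curve, type = "l",
     xlab = "beta_1", ylab = "SSE",
     main = "SSE as a function of beta_1 (beta_0 fixed)")
abline(v = coefficients(slr_model)[2], lty = 2, col = "red")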
2.3 Optimize the Simple Linear Regression model with base R optim()
First we need a slight rewrite of sse() only:
sse <- function(params) {
beta_0 <- params[1]
beta_1 <- params[2]
# - MODEL IS HERE:
predictions <- beta_0 + beta_1 * d$x
# - MODEL ENDS HERE ^^
residuals <- d$y - predictions
return(sum(residuals^2))
}
N.B. As the dataset d is a constant, it no longer plays the role of an sse() function argument.
Pick some random initial values for \(\beta_0\) and \(\beta_1\):
beta_0_start <- runif(1, -10, 10)
beta_1_start <- runif(1, -10, 10)
print(beta_0_start)
## [1] -1.297957
print(beta_1_start)
## [1] 1.298156
Now optimize sse():
solution <- optim(par = c(beta_0_start, beta_1_start),
                  fn = sse,
                  method = "Nelder-Mead")
print(solution$par)
## [1] -7.099650 1.858104
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
Now that looks great!
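optim() also returns the value of the objective function at the solution, which should (essentially) reproduce the residual sum of squares we obtained from the lm() fit earlier:
print(solution$value)
# - compare with ss_residual, the SSE of the lm() fit (approximately 111.46)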
What we are really looking at here is the error surface over the Parameter Space (first re-sampling the sample_parameters dataframe over a somewhat different range, for plotting purposes… :-)
# - back to the old version of sse():
sse <- function(d, beta_0, beta_1) {
predictions <- beta_0 + beta_1 * d$x
residuals <- d$y - predictions
return(sum(residuals^2))
}
# - sample parameters on a different range for plotting purposes
sample_parameters <- data.frame(beta_0 = runif(1e6, -30, 30),
beta_1 = runif(1e6, -10, 10))
sample_parameters$sse <- apply(sample_parameters, 1, function(x) {
sse(d, x[1], x[2])
})
head(sample_parameters)
## beta_0 beta_1 sse
## 1 -10.791984 1.637388 3839.764
## 2 24.493326 9.676575 902157.853
## 3 6.908973 -9.681987 441838.578
## 4 -10.175620 -2.559840 127314.734
## 5 -18.576823 -3.927809 311159.732
## 6 -19.352316 5.397247 12044.594
# - Wire-frame 3-D error surface with base R persp()
# - The CONFIG block below collects the settings worth tweaking:
# -   `res` sets the grid resolution (higher = denser mesh),
# -   `border_col` and `lwd_grid` set the mesh line colour and width.
# - CONFIG -----------------------------------------------------------
phi_angle <- 35 # - altitude camera angle
theta_angle <- 40 # - azimuth camera angle
expand_fac <- 0.8 # - zoom (<1 zooms out, >1 zooms in)
res <- 50 # - grid resolution (higher = denser mesh)
border_col <- "grey30" # - mesh line colour (single value)
lwd_grid <- 0.4 # - mesh line width (~0.2 hairline, 1 thick)
# - font scaling (1 = default):
cex_axis <- 0.7 # - tick-label size
cex_lab <- 0.8 # - axis-title size
cex_main <- 0.85 # - main-title size
# - 1. Regularise (beta_0, beta_1, sse) onto a regular grid
# - (assumes sample_parameters with columns: beta_0, beta_1, sse)
x_vals <- seq(min(sample_parameters$beta_0),
max(sample_parameters$beta_0), length.out = res)
y_vals <- seq(min(sample_parameters$beta_1),
max(sample_parameters$beta_1), length.out = res)
grid_df <- expand.grid(beta_0 = x_vals, beta_1 = y_vals)
nearest_id <- RANN::nn2(sample_parameters[, c("beta_0", "beta_1")],
grid_df, k = 1)$nn.idx
grid_df$sse <- sample_parameters$sse[nearest_id]
# - expand.grid() varies beta_0 fastest, so filling the matrix column-wise
# - gives z_mat[i, j] = SSE(x_vals[i], y_vals[j]), which is what persp() expects:
z_mat <- matrix(grid_df$sse,
                nrow = length(x_vals), ncol = length(y_vals))
# - 2. Plot with persp()
# - optional device for saving:
# png("error_landscape_wireframe.png", width = 7, height = 6, units = "in", res = 300)
old_par <- par(no.readonly = TRUE) # save graphics settings
par(cex.axis = cex_axis,
cex.lab = cex_lab,
cex.main = cex_main)
persp(x = x_vals,
y = y_vals,
z = z_mat,
theta = theta_angle,
phi = phi_angle,
expand = expand_fac,
col = NA, # facets transparent
border = border_col, # mesh line colour
lwd = lwd_grid, # mesh line width
ticktype = "detailed",
xlab = "Beta 0",
ylab = "Beta 1",
zlab = "SSE",
main = "Simple Linear Regression Error Landscape")
par(old_par) # restore previous settings
# dev.off()
# - Tips:
# - increase `res` (e.g. 100-150 for a coarse grid, 300-400 for a fine one) for a denser mesh;
# - `border_col` accepts any single R colour; per-facet colouring needs rgl or lattice;
# - `lwd_grid` controls the weight of the mesh lines;
# - very high `res` (say, above 500) slows plotting and uses a lot of memory.
sample_parameters[which.min(sample_parameters$sse), ]
## beta_0 beta_1 sse
## 834649 -7.013697 1.840932 111.5221
Compare:
coefficients(slr_model)
## (Intercept) x
## -7.101443 1.858433
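Finally, as a sanity check on everything above: for simple linear regression, the error-minimizing parameter values also have a well-known closed form, \(\hat{\beta}_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). A minimal sketch:
beta_1_hat <- cov(d$x, d$y) / var(d$x)
beta_0_hat <- mean(d$y) - beta_1_hat * mean(d$x)
print(c(beta_0_hat, beta_1_hat))
# - these should reproduce coefficients(slr_model) exactly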
Further Readings
R Markdown
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.