Session 10 Multiple Linear Regression
Goran S. Milovanovic, PhD
Session 10: Introduction to Estimation Theory. Multiple Linear Regression. Model diagnostics. The role of part correlation in multiple regression. Dummy coding of categorical variables in R. Nested models.
Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course, 2020/21.
What do we want to do today?
In today’s session we go beyond the discussion of a simple relationship between two variables: one predictor and one criterion. While Simple Linear Regression is extremely useful from a didactic perspective - to introduce statistical learning methods as such - it is only seldom used in practice. The real world is far more complex than simple relationships can capture, and mathematical models in practice almost necessarily deal with many predictors, and with the mutual information that exists among them, in an attempt to predict the state of the outcome variable. We will introduce the Multiple Linear Regression model in this session and discuss a more complex scenario involving several predictors. We will also introduce the method of dummy coding, used when categorical predictors are present in the Multiple Linear Regression scenario. We are also laying the groundwork for even more complex Generalized Linear Models that can handle categorization problems beyond regression. Finally: comparing nested regression models.
0. Prerequisites
Install:
install.packages('lattice')
Setup:
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(Hmisc)
library(ppcor)
library(car)
library(datasets)
library(broom)
library(lattice)
1. Multiple Linear Regression: the exposition of the problem
1.1 Iris, again
The `iris` dataset again offers everything that we need to introduce a new statistical model. One might wonder how, but this small and (manifestly!) simple dataset can also be used to introduce statistical models even more complex than Multiple Linear Regression!
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
And back to the Petal.Length ~ Sepal.Length regression problem that we have already discussed:
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(size = 2, colour = "blue") +
geom_smooth(method = 'lm', linewidth = .25) +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
… where everything looks nice until…
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length,
color = Species,
group = Species)) +
geom_point(size = 2) +
geom_smooth(method = 'lm', linewidth = .25) +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
Not to mention the model assumptions that we have discussed in the previous Session.
Let’s study the problem a bit.
plot(iris[ , c(1,3,5)],
main =
"Inspect: Sepal vs. Petal Length \nfollowing the discovery of the Species...",
cex.main = .75,
cex = .6)
Did we ever mention {lattice}, the hero of data visualization in R before {ggplot2}?
# - {lattice} xyplot
xyplot(Petal.Length ~ Sepal.Length | Species,
data = iris,
xlab = "Sepal Length",
ylab = "Petal Length"
)
Consider the conditional densities of Petal.Length given Species (using {lattice} again):
# - {lattice} densityplot
densityplot(~ Petal.Length | Species,
data = iris,
plot.points = FALSE,
xlab = "Petal Length",
ylab = "Density",
main = "P(Petal Length|Species)",
col.line = 'red'
)
And now consider the conditional densities of Sepal.Length given Species:
# - {lattice} densityplot
densityplot(~ Sepal.Length | Species,
data = iris,
plot.points=FALSE,
xlab = "Sepal Length", ylab = "Density",
main = "P(Sepal Length|Species)",
col.line = 'blue'
)
I have an idea: why not run a series of separate Simple Linear Regressions in the subgroups defined by Species and inspect the results? Let’s do it:
# - setosa
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1])
reg <- lm(Petal.Length ~ Sepal.Length,
data = iris[w1, ])
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.803 0.344 2.34 0.0238
## 2 Sepal.Length 0.132 0.0685 1.92 0.0607
# - versicolor
w2 <- which(iris$Species == species[2])
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w2,])
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.185 0.514 0.360 7.20e- 1
## 2 Sepal.Length 0.686 0.0863 7.95 2.59e-10
# - virginica
w3 <- which(iris$Species == species[3])
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w3,])
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.610 0.417 1.46 1.50e- 1
## 2 Sepal.Length 0.750 0.0630 11.9 6.30e-16
I have used broom::tidy() to tidy up the model summaries. The {broom} package offers many useful functions to deal with the potentially messy outputs of R’s modeling functions such as lm().
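As an aside, the three per-group fits above can be obtained in a single pipeline; a minimal sketch, assuming a {dplyr} version that provides group_modify():
# - all three per-Species regressions in one pipeline
iris %>%
  group_by(Species) %>%
  group_modify(~ tidy(lm(Petal.Length ~ Sepal.Length, data = .x))) %>%
  ungroup()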
So, Species obviously has some effect on Petal.Length, and that effect possibly goes even beyond the effect of Sepal.Length. How do we incorporate another predictor into the regression model?
The broom::glance() function is similar to summary() but gives us the model overview in a tidy format:
broom::glance(reg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.747 0.742 0.281 142. 6.30e-16 1 -6.37 18.7 24.5 3.78 48 50
1.2 The predictor is categorical: Dummy Coding
We will now try to predict Petal.Length
from
Species
alone in a Simple Linear Regression model.
First:
is.factor(iris$Species)
## [1] TRUE
Ok, and the levels?
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
Regression with one categorical predictor:
reg <- lm(Petal.Length ~ Species,
data = iris)
tidy(reg)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.46 0.0609 24.0 9.30e-53
## 2 Speciesversicolor 2.80 0.0861 32.5 5.25e-69
## 3 Speciesvirginica 4.09 0.0861 47.5 4.11e-91
What effects are present? Let’s see: Speciesversicolor, Speciesvirginica, ok, but what happened to Speciessetosa..? It is our baseline, see:
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
Never forget what the regression coefficient of a dummy variable means: it tells us about the effect of moving from the baseline (reference) level to the respective level. Here: baseline = setosa (compare levels(iris$Species) vs. the output of tidy(reg)). Hence: always pay attention to the order of levels in linear models!
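A quick way to see this: with dummy (treatment) coding, the intercept equals the mean of the baseline group, and each dummy coefficient equals the difference between the respective group mean and the baseline mean. A small check against the output above:
# - group means of Petal.Length per Species
with(iris, tapply(Petal.Length, Species, mean))
# - setosa (~1.46) is the Intercept; versicolor - setosa (~2.80) and
# - virginica - setosa (~4.09) are the two dummy coefficients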
For example, we can change the baseline in Species to versicolor:
# - Levels: setosa versicolor virginica
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
iris$Species <- factor(iris$Species,
levels = c("versicolor",
"virginica",
"setosa")
)
levels(iris$Species)
## [1] "versicolor" "virginica" "setosa"
Regression again:
# - baseline is now: versicolor
reg <- lm(Petal.Length ~ Species,
data = iris)
tidy(reg)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.26 0.0609 70.0 8.91e-115
## 2 Speciesvirginica 1.29 0.0861 15.0 1.81e- 31
## 3 Speciessetosa -2.80 0.0861 -32.5 5.25e- 69
1.3 Understanding dummy coding
Here is another way to perform dummy coding of categorical variables in R:
# - ...just to fix the order of Species back to default
rm(iris); data(iris)
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
In order to understand what dummy coding really is:
contr.treatment(3, base = 1)
## 2 3
## 1 0 0
## 2 1 0
## 3 0 1
And then specifically applied to iris$Species:
contrasts(iris$Species) <- contr.treatment(3, base = 1)
contrasts(iris$Species)
## 2 3
## setosa 0 0
## versicolor 1 0
## virginica 0 1
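To see the design matrix that lm() actually builds from a factor predictor, we can inspect model.matrix() directly; a small sketch (the first column is the intercept, the remaining columns are the dummies defined by the contrasts above):
# - the design matrix used by lm() for a factor predictor
head(model.matrix(~ Species, data = iris))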
Do not forget that:
class(iris$Species)
## [1] "factor"
Now let’s play with level ordering:
iris$Species <- factor(iris$Species,
levels = c ("virginica",
"versicolor",
"setosa"))
levels(iris$Species)
## [1] "virginica" "versicolor" "setosa"
contrasts(iris$Species) <- contr.treatment(3, base = 1)
# - baseline is now: virginica
contrasts(iris$Species)
## 2 3
## virginica 0 0
## versicolor 1 0
## setosa 0 1
# - consider carefully what you need to do
levels(iris$Species)
## [1] "virginica" "versicolor" "setosa"
2. Multiple Linear Regression: the problem solved
2.1 One categorical + one continuous predictor
Now we run a multiple linear regression model with Sepal.Length and Species (dummy coded) as predictors of Petal.Length:
# - Petal.Length ~ Species (Dummy Coding) + Sepal.Length
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length,
data = iris)
tidy(reg)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.70 0.230 -7.40 1.01e-11
## 2 Speciesversicolor 2.21 0.0705 31.4 9.65e-67
## 3 Speciesvirginica 3.09 0.0912 33.9 4.92e-71
## 4 Sepal.Length 0.632 0.0453 14.0 1.12e-28
glance(reg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.975 0.974 0.283 1890. 1.46e-116 3 -21.2 52.5 67.5 11.7 146 150
N.B. Since is.factor(iris$Species) == TRUE, R does the dummy coding in lm() internally for us!
Let’s now compare these results with the simple linear regression model:
reg <- lm(Petal.Length ~ Sepal.Length, data=iris)
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -7.10 0.507 -14.0 6.13e-29
## 2 Sepal.Length 1.86 0.0859 21.6 1.04e-47
glance(reg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.760 0.758 0.868 469. 1.04e-47 1 -191. 387. 396. 111. 148 150
2.2 Nested models
We will now specify two regression models: reg1
defined
as Petal.Length ~ Sepal.Length
and reg2
defined as Petal.Length ~ Species + Sepal.Length
.
Obviously, reg2
encompasses reg1
in some way,
right? Of course: the predictors in one model are a subset of predictors
in another. Such models are called nested models. In this
terminological game, reg2
would also be called a
full model: a terminology will be used quite often in
Binary Logistic Regression, the first Generalized Linear Model that we
will meet in our next session.
Note on nested models: there is always a choice of coefficients under which the full model (reg2) reduces to the nested model (e.g. reg1), so the nested model can be expressed in terms of the full model. Can you figure it out?
# - reg1 is nested under reg2
reg1 <- lm(Petal.Length ~ Sepal.Length,
data = iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length,
data = iris)
We can use the partial F-test to compare nested models:
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
## Analysis of Variance Table
##
## Model 1: Petal.Length ~ Sepal.Length
## Model 2: Petal.Length ~ Species + Sepal.Length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 111.459
## 2 146 11.657 2 99.802 624.99 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
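The partial F statistic can also be reconstructed by hand from the residual sums of squares of the two models; a small check that should match the anova() output above:
# - partial F-test by hand
rss1 <- deviance(reg1)                            # RSS of the nested model (~111.46)
rss2 <- deviance(reg2)                            # RSS of the full model (~11.66)
df_num <- df.residual(reg1) - df.residual(reg2)   # 2 additional parameters
df_den <- df.residual(reg2)                       # 146 residual degrees of freedom
((rss1 - rss2) / df_num) / (rss2 / df_den)        # ~625, as reported by anova()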
2.3 Model diagnostics
We can use the same kind of Influence Plot to search for influential cases in Multiple Linear Regression as we did in the case of Simple Linear Regression (except that this time the computation of the relevant indicators is considerably more involved):
infReg <- as.data.frame(influence.measures(reg2)$infmat)
head(infReg)
## dfb.1_ dfb.Spcsvrs dfb.Spcsvrg dfb.Sp.L dffit cov.r cook.d hat
## 1 -0.0042534291 0.039006718 0.0322621358 -0.0065866808 -0.062200067 1.043687 9.726254e-04 0.02022682
## 2 0.0007436033 -0.001261697 -0.0008756326 -0.0003059107 0.002565673 1.049149 1.657015e-06 0.02028843
## 3 0.0082584030 -0.005771089 -0.0026594814 -0.0055585852 0.016970329 1.051064 7.248827e-05 0.02240362
## 4 0.0947351624 -0.044182086 -0.0116298931 -0.0695391019 0.166410161 1.021607 6.917630e-03 0.02423132
## 5 -0.0053433214 0.016680580 0.0128198872 0.0002014036 -0.029629847 1.047607 2.209301e-04 0.02000092
## 6 0.0015019224 0.004708480 0.0044527232 -0.0025208970 -0.006184663 1.053087 9.628358e-06 0.02398489
This time we will use broom::augment() to obtain the influence measures:
regFrame <- broom::augment(reg2)
head(regFrame)
## # A tibble: 6 × 9
## Petal.Length Species Sepal.Length .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.4 setosa 5.1 1.52 -0.121 0.0202 0.283 0.000973 -0.434
## 2 1.4 setosa 4.9 1.39 0.00500 0.0203 0.284 0.00000166 0.0179
## 3 1.3 setosa 4.7 1.27 0.0314 0.0224 0.284 0.0000725 0.112
## 4 1.5 setosa 4.6 1.21 0.295 0.0242 0.282 0.00692 1.06
## 5 1.4 setosa 5 1.46 -0.0582 0.0200 0.283 0.000221 -0.208
## 6 1.7 setosa 5.4 1.71 -0.0111 0.0240 0.284 0.00000963 -0.0396
Produce the Influence Plot:
plotFrame <- data.frame(residual = regFrame$.std.resid,
leverage = regFrame$.hat,
cookD = regFrame$.cooksd)
ggplot(plotFrame,
aes(y = residual,
x = leverage)) +
geom_point(size = plotFrame$cookD * 100,
shape = 1, color = "blue") +
ylab("Standardized Residual") +
xlab("Leverage") +
ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(size = 8,
face = "bold",
hjust = .5))
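For comparison, base R produces a similar diagnostic view out of the box: calling plot() on an lm object with which = 5 gives the standardized residuals vs. leverage panel, with Cook’s distance contours overlaid.
# - base R diagnostic: Residuals vs Leverage (with Cook's distance contours)
plot(reg2, which = 5)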
3. Several continuous predictors
3.1 The stackloss problem
The following example is a modification of the multiple-linear-regression section from R Tutorial.
data(stackloss)
str(stackloss)
## 'data.frame': 21 obs. of 4 variables:
## $ Air.Flow : num 80 80 75 62 62 62 62 62 58 58 ...
## $ Water.Temp: num 27 27 25 24 22 23 24 24 23 18 ...
## $ Acid.Conc.: num 89 88 90 87 87 87 93 93 87 80 ...
## $ stack.loss: num 42 37 37 28 18 18 19 20 15 14 ...
The description of the stackloss dataset is found in the documentation:
- Water.Temp is the temperature of cooling water circulated through coils in the absorption tower;
- Air.Flow is the flow of cooling air;
- Acid.Conc. is the concentration of the acid circulating;
- stack.loss (the outcome variable) is 10 times the percentage of the ingoing ammonia to the plant that escapes from the absorption column unabsorbed; that is, an (inverse) measure of the overall efficiency of the plant.
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
data = stackloss)
summary(stacklossModel)
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
glance(stacklossModel)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.914 0.898 3.24 59.9 0.00000000302 3 -52.3 115. 120. 179. 17 21
tidy(stacklossModel)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -39.9 11.9 -3.36 0.00375
## 2 Air.Flow 0.716 0.135 5.31 0.0000580
## 3 Water.Temp 1.30 0.368 3.52 0.00263
## 4 Acid.Conc. -0.152 0.156 -0.973 0.344
Prediction for a single new data point:
# predict new data
obs <- data.frame(Air.Flow = 72,
Water.Temp = 20,
Acid.Conc. = 85)
predict(stacklossModel,
obs)
## 1
## 24.58173
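If we also need a measure of uncertainty around this prediction, predict() accepts an interval argument: interval = "prediction" returns a prediction interval for a single new observation, while interval = "confidence" returns a confidence interval for the mean response (95% by default):
# - point prediction with a 95% prediction interval
predict(stacklossModel, obs, interval = "prediction")
# - ... and with a 95% confidence interval for the mean response
predict(stacklossModel, obs, interval = "confidence")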
The confint() function works as usual, for a 95% CI…
confint(stacklossModel, level = .95) # 95% CI
## 2.5 % 97.5 %
## (Intercept) -65.0180339 -14.8213150
## Air.Flow 0.4311143 1.0001661
## Water.Temp 0.5188228 2.0717495
## Acid.Conc. -0.4818741 0.1776291
… as well as for a 99% CI:
confint(stacklossModel, level = .99) # 99% CI
## 0.5 % 99.5 %
## (Intercept) -74.3970156 -5.4423333
## Air.Flow 0.3247901 1.1064903
## Water.Temp 0.2286670 2.3619053
## Acid.Conc. -0.6050987 0.3008536
3.2 Multicollinearity in Multiple Regression
That is the crazy thing about multiple regression: if the predictors are not correlated at all, why not just run a series of simple linear regressions? On the other hand, if the predictors are highly correlated, problems with the estimates arise… John Fox’s {car} package allows us to compute the Variance Inflation Factor quite easily:
VIF <- vif(stacklossModel)
VIF
## Air.Flow Water.Temp Acid.Conc.
## 2.906484 2.572632 1.333587
The Variance Inflation Factor (VIF) measures the increase in the variance of a regression coefficient due to collinearity. Its square root (sqrt(VIF)) tells us how much larger the standard error of a regression coefficient is compared to a hypothetical situation in which that predictor were uncorrelated with all other predictors in the model. NOTE: the lower bound of VIF is 1; there is no upper bound, but VIF > 2 typically indicates that one should be concerned.
sqrt(VIF)
## Air.Flow Water.Temp Acid.Conc.
## 1.704841 1.603943 1.154811
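Where does the VIF come from? For a given predictor it equals 1 / (1 - R²), where R² is obtained by regressing that predictor on all of the remaining predictors. A small check for Air.Flow, which should match the vif() output above:
# - VIF of Air.Flow by hand: regress it on the other two predictors
r2_air <- summary(lm(Air.Flow ~ Water.Temp + Acid.Conc.,
                     data = stackloss))$r.squared
1 / (1 - r2_air) # ~2.906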
3.3 Part correlation in multiple regression
In multiple regression, it is the semi-partial (or part) correlation that you need to inspect:
- assume a model with X1, X2, X3 as predictors, and Y as a criterion;
- you need the semi-partial correlation of X1 and Y following the removal of X2 and X3 from Y;
- it goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
- in Step 2, you take the residuals of Y, call them RY;
- in Step 3, you regress (correlate) RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation that you are looking for (see the hand-made check at the end of this section)!
Recall our model…
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
data = stackloss)
summary(stacklossModel)
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
What is the semi-partial (part) correlation of stack.loss and Air.Flow? Remember the {ppcor} package?
spartCor1 <- spcor.test(x = stackloss$Air.Flow,
y = stackloss$stack.loss,
z = stackloss[ , c("Water.Temp", "Acid.Conc.")],
method = "pearson")
print(spartCor1)
## estimate p.value statistic n gp Method
## 1 0.4631864 0.04580372 2.154858 21 2 pearson
The unique contribution of Air.Flow is then:
spartCor1$estimate
## [1] 0.4631864
spartCor1$statistic
## [1] 2.154858
spartCor1$p.value
## [1] 0.04580372
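The recipe described at the beginning of this section can also be reproduced by hand, and it should closely match the spcor.test() estimate above:
# - Step 1: regress stack.loss on the remaining predictors...
resY <- residuals(lm(stack.loss ~ Water.Temp + Acid.Conc.,
                     data = stackloss))
# - Steps 2 and 3: ...and correlate the residuals with Air.Flow
cor(resY, stackloss$Air.Flow) # ~.46, cf. spartCor1$estimate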
Further Readings
R Markdown
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.