Session 10 Multiple Linear Regression
Goran S. Milovanovic, PhD
Session 10: Introduction to Estimation Theory. Multiple Linear Regression. Model diagnostics. The role of part correlation in multiple regression. Dummy coding of categorical variables in R. Nested models.
Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course, 2020/21.
What do we want to do today?
In today’s session we go beyond the discussion of a simple relationship between two variables: one predictor and one criterion. While Simple Linear Regression is extremely useful from a didactic perspective - to introduce statistical learning methods as such - it is only seldom used in practice. The real world is far more complex than simple relationships can capture, and mathematical models in practice almost necessarily deal with many predictors, and with the mutual information that exists among them, in an attempt to predict the state of the outcome variable. We will introduce the Multiple Linear Regression model in this session and discuss a more complex scenario involving several predictors. We will also introduce the method of dummy coding, used when categorical predictors are present in the Multiple Linear Regression scenario. We are also laying the groundwork for even more complex Generalized Linear Models that can handle categorization problems beyond regression. Finally: comparing nested regression models.
0. Prerequisites
Install:
install.packages('lattice')
Setup:
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(Hmisc)
library(ppcor)
library(car)
library(datasets)
library(broom)
library(lattice)
1. Multiple Linear Regression: the exposition of the problem
1.1 Iris, again
The `iris` dataset again offers everything that we need to introduce a new statistical model. One might wonder how, but this small and (manifestly!) simple dataset can also be used to introduce statistical models even more complex than Multiple Linear Regression!
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
And back to the Petal.Length ~ Sepal.Length regression problem that we have already discussed:
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(size = 2, colour = "blue") +
geom_smooth(method = 'lm', linewidth = .25) +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
… where everything looks nice until…
ggplot(data = iris,
aes(x = Sepal.Length,
y = Petal.Length,
color = Species,
group = Species)) +
geom_point(size = 2) +
geom_smooth(method = 'lm', linewidth = .25) +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5))
Not to mention the model assumptions that we have discussed in the previous Session.
Let’s study the problem a bit.
plot(iris[ , c(1,3,5)],
main =
"Inspect: Sepal vs. Petal Length \nfollowing the discovery of the Species...",
cex.main = .75,
cex = .6)
Did we ever mention {lattice}, the hero of data visualization in R before {ggplot2}?
# - {lattice} xyplot
xyplot(Petal.Length ~ Sepal.Length | Species,
data = iris,
xlab = "Sepal Length",
ylab = "Petal Length"
)
Consider the conditional densities of Petal.Length given Species (using {lattice} again):
# - {lattice} densityplot
densityplot(~ Petal.Length | Species,
data = iris,
plot.points = FALSE,
xlab = "Petal Length",
ylab = "Density",
main = "P(Petal Length|Species)",
col.line = 'red'
)
And now consider the conditional densities of Sepal.Length given Species:
# - {lattice} densityplot
densityplot(~ Sepal.Length | Species,
data = iris,
plot.points=FALSE,
xlab = "Sepal Length", ylab = "Density",
main = "P(Sepal Length|Species)",
col.line = 'blue'
)
I have an idea: why not run a series of separate Simple Linear Regressions in the subgroups defined by Species and inspect the results? Let’s do it:
# - setosa
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1])
reg <- lm(Petal.Length ~ Sepal.Length,
data = iris[w1, ])
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.803 0.344 2.34 0.0238
## 2 Sepal.Length 0.132 0.0685 1.92 0.0607
# - versicolor
w2 <- which(iris$Species == species[2])
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w2,])
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.185 0.514 0.360 7.20e- 1
## 2 Sepal.Length 0.686 0.0863 7.95 2.59e-10
# - virginica
w3 <- which(iris$Species == species[3])
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w3,])
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.610 0.417 1.46 1.50e- 1
## 2 Sepal.Length 0.750 0.0630 11.9 6.30e-16
I have used broom::tidy() to tidy up the model summaries. The {broom} package offers many useful functions to deal with the potentially messy outputs of R’s modeling functions such as lm().
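As an aside, the three per-group fits above can be obtained in a single pipeline; a minimal sketch, assuming a {dplyr} version that provides group_modify():
# - all three per-Species regressions in one pipeline
iris %>%
  group_by(Species) %>%
  group_modify(~ tidy(lm(Petal.Length ~ Sepal.Length, data = .x))) %>%
  ungroup()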
So, Species obviously has some effect on Petal.Length, and that effect possibly goes even beyond the effect of Sepal.Length. How do we incorporate another predictor into the regression model?
The broom::glance() function is similar to summary() but gives us the model overview in a tidy format:
broom::glance(reg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.747 0.742 0.281 142. 6.30e-16 1 -6.37 18.7 24.5 3.78 48 50
1.2 The predictor is categorical: Dummy Coding
We will now try to predict Petal.Length
from
Species
alone in a Simple Linear Regression model.
First:
is.factor(iris$Species)
## [1] TRUE
Ok, and the levels?
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
Regression with one categorical predictor:
reg <- lm(Petal.Length ~ Species,
data = iris)
tidy(reg)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.46 0.0609 24.0 9.30e-53
## 2 Speciesversicolor 2.80 0.0861 32.5 5.25e-69
## 3 Speciesvirginica 4.09 0.0861 47.5 4.11e-91
What effects are present? Let’s see: Speciesversicolor, Speciesvirginica, ok, but what happened to Speciessetosa..? It is our baseline, see:
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
Never forget what the regression coefficient of a dummy variable means: it tells us about the effect of moving from the baseline (reference) level to the respective level. Here: baseline = setosa (compare levels(iris$Species) vs. the output of tidy(reg)). Hence: always pay attention to the order of levels in linear models!
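A quick way to see this: with dummy (treatment) coding, the intercept equals the mean of the baseline group, and each dummy coefficient equals the difference between the respective group mean and the baseline mean. A small check against the output above:
# - group means of Petal.Length per Species
with(iris, tapply(Petal.Length, Species, mean))
# - setosa (~1.46) is the Intercept; versicolor - setosa (~2.80) and
# - virginica - setosa (~4.09) are the two dummy coefficients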
For example, we can change the baseline in Species to versicolor:
# - Levels: setosa versicolor virginica
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
iris$Species <- factor(iris$Species,
levels = c("versicolor",
"virginica",
"setosa")
)
levels(iris$Species)
## [1] "versicolor" "virginica" "setosa"
Regression again:
# - baseline is now: versicolor
reg <- lm(Petal.Length ~ Species,
data = iris)
tidy(reg)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.26 0.0609 70.0 8.91e-115
## 2 Speciesvirginica 1.29 0.0861 15.0 1.81e- 31
## 3 Speciessetosa -2.80 0.0861 -32.5 5.25e- 69
1.3 Understanding dummy coding
Here is another way to perform dummy coding of categorical variables in R:
# - ...just to fix the order of Species back to default
rm(iris); data(iris)
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
In order to understand what dummy coding really is:
contr.treatment(3, base = 1)
## 2 3
## 1 0 0
## 2 1 0
## 3 0 1
And then specifically applied to iris$Species:
contrasts(iris$Species) <- contr.treatment(3, base = 1)
contrasts(iris$Species)
## 2 3
## setosa 0 0
## versicolor 1 0
## virginica 0 1
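To see the design matrix that lm() actually builds from a factor predictor, we can inspect model.matrix() directly; a small sketch (the first column is the intercept, the remaining columns are the dummies defined by the contrasts above):
# - the design matrix used by lm() for a factor predictor
head(model.matrix(~ Species, data = iris))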
Do not forget that:
class(iris$Species)
## [1] "factor"
Now let’s play with level ordering:
iris$Species <- factor(iris$Species,
levels = c ("virginica",
"versicolor",
"setosa"))
levels(iris$Species)
## [1] "virginica" "versicolor" "setosa"
contrasts(iris$Species) <- contr.treatment(3, base = 1)
# - baseline is now: virginica
contrasts(iris$Species)
## 2 3
## virginica 0 0
## versicolor 1 0
## setosa 0 1
# - consider carefully what you need to do
levels(iris$Species)
## [1] "virginica" "versicolor" "setosa"
2. Multiple Linear Regression: the problem solved
2.1 One categorical + one continuous predictor
Now we run a multiple linear regression model with Sepal.Length and Species (dummy coded) as predictors of Petal.Length:
# - Petal.Length ~ Species (Dummy Coding) + Sepal.Length
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length,
data = iris)
tidy(reg)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.70 0.230 -7.40 1.01e-11
## 2 Speciesversicolor 2.21 0.0705 31.4 9.65e-67
## 3 Speciesvirginica 3.09 0.0912 33.9 4.92e-71
## 4 Sepal.Length 0.632 0.0453 14.0 1.12e-28
glance(reg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.975 0.974 0.283 1890. 1.46e-116 3 -21.2 52.5 67.5 11.7 146 150
N.B. Since is.factor(iris$Species) == TRUE, R does the dummy coding in lm() internally for us!
Let’s now compare these results with the simple linear regression model:
reg <- lm(Petal.Length ~ Sepal.Length, data=iris)
tidy(reg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -7.10 0.507 -14.0 6.13e-29
## 2 Sepal.Length 1.86 0.0859 21.6 1.04e-47
glance(reg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.760 0.758 0.868 469. 1.04e-47 1 -191. 387. 396. 111. 148 150
2.2 Nested models
We will now specify two regression models: reg1
defined
as Petal.Length ~ Sepal.Length
and reg2
defined as Petal.Length ~ Species + Sepal.Length
.
Obviously, reg2
encompasses reg1
in some way,
right? Of course: the predictors in one model are a subset of predictors
in another. Such models are called nested models. In this
terminological game, reg2
would also be called a
full model: a terminology will be used quite often in
Binary Logistic Regression, the first Generalized Linear Model that we
will meet in our next session.
Note on nested models: there is always a choice of coefficients under which the full model (reg2) reduces to the nested model (e.g. reg1), so the nested model can be expressed in terms of the full model. Can you figure it out?
# - reg1 is nested under reg2
reg1 <- lm(Petal.Length ~ Sepal.Length,
data = iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length,
data = iris)
We can use the partial F-test to compare nested models:
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
## Analysis of Variance Table
##
## Model 1: Petal.Length ~ Sepal.Length
## Model 2: Petal.Length ~ Species + Sepal.Length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 111.459
## 2 146 11.657 2 99.802 624.99 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
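The partial F statistic can also be reconstructed by hand from the residual sums of squares of the two models; a small check that should match the anova() output above:
# - partial F-test by hand
rss1 <- deviance(reg1)                            # RSS of the nested model (~111.46)
rss2 <- deviance(reg2)                            # RSS of the full model (~11.66)
df_num <- df.residual(reg1) - df.residual(reg2)   # 2 additional parameters
df_den <- df.residual(reg2)                       # 146 residual degrees of freedom
((rss1 - rss2) / df_num) / (rss2 / df_den)        # ~625, as reported by anova()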
2.3 Model diagnostics
We can use the same kind of Influence Plot to search for influential cases in Multiple Linear Regression as we did in the case of Simple Linear Regression (except that this time the computation of the relevant indicators is considerably more involved):
infReg <- as.data.frame(influence.measures(reg2)$infmat)
head(infReg)
## dfb.1_ dfb.Spcsvrs dfb.Spcsvrg dfb.Sp.L dffit cov.r cook.d hat
## 1 -0.0042534291 0.039006718 0.0322621358 -0.0065866808 -0.062200067 1.043687 9.726254e-04 0.02022682
## 2 0.0007436033 -0.001261697 -0.0008756326 -0.0003059107 0.002565673 1.049149 1.657015e-06 0.02028843
## 3 0.0082584030 -0.005771089 -0.0026594814 -0.0055585852 0.016970329 1.051064 7.248827e-05 0.02240362
## 4 0.0947351624 -0.044182086 -0.0116298931 -0.0695391019 0.166410161 1.021607 6.917630e-03 0.02423132
## 5 -0.0053433214 0.016680580 0.0128198872 0.0002014036 -0.029629847 1.047607 2.209301e-04 0.02000092
## 6 0.0015019224 0.004708480 0.0044527232 -0.0025208970 -0.006184663 1.053087 9.628358e-06 0.02398489
This time we will use broom::augment() to obtain the influence measures:
regFrame <- broom::augment(reg2)
head(regFrame)
## # A tibble: 6 × 9
## Petal.Length Species Sepal.Length .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.4 setosa 5.1 1.52 -0.121 0.0202 0.283 0.000973 -0.434
## 2 1.4 setosa 4.9 1.39 0.00500 0.0203 0.284 0.00000166 0.0179
## 3 1.3 setosa 4.7 1.27 0.0314 0.0224 0.284 0.0000725 0.112
## 4 1.5 setosa 4.6 1.21 0.295 0.0242 0.282 0.00692 1.06
## 5 1.4 setosa 5 1.46 -0.0582 0.0200 0.283 0.000221 -0.208
## 6 1.7 setosa 5.4 1.71 -0.0111 0.0240 0.284 0.00000963 -0.0396
Produce the Influence Plot:
plotFrame <- data.frame(residual = regFrame$.std.resid,
leverage = regFrame$.hat,
cookD = regFrame$.cooksd)
ggplot(plotFrame,
aes(y = residual,
x = leverage)) +
geom_point(size = plotFrame$cookD * 100,
shape = 1, color = "blue") +
ylab("Standardized Residual") +
xlab("Leverage") +
ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(size = 8,
face = "bold",
hjust = .5))
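For comparison, base R produces a similar diagnostic view out of the box: calling plot() on an lm object with which = 5 gives the standardized residuals vs. leverage panel, with Cook’s distance contours overlaid.
# - base R diagnostic: Residuals vs Leverage (with Cook's distance contours)
plot(reg2, which = 5)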
3. Several continuous predictors
3.1 The stackloss problem
The following example is a modification of the multiple-linear-regression section from R Tutorial.
data(stackloss)
str(stackloss)
## 'data.frame': 21 obs. of 4 variables:
## $ Air.Flow : num 80 80 75 62 62 62 62 62 58 58 ...
## $ Water.Temp: num 27 27 25 24 22 23 24 24 23 18 ...
## $ Acid.Conc.: num 89 88 90 87 87 87 93 93 87 80 ...
## $ stack.loss: num 42 37 37 28 18 18 19 20 15 14 ...
The description of the stackloss dataset is found in the documentation:
- Water.Temp is the temperature of cooling water circulated through coils in the absorption tower;
- Air.Flow is the flow of cooling air;
- Acid.Conc. is the concentration of the acid circulating;
- stack.loss (the outcome variable) is 10 times the percentage of the ingoing ammonia to the plant that escapes from the absorption column unabsorbed; that is, an (inverse) measure of the overall efficiency of the plant.
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
data = stackloss)
summary(stacklossModel)
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
glance(stacklossModel)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.914 0.898 3.24 59.9 0.00000000302 3 -52.3 115. 120. 179. 17 21
tidy(stacklossModel)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -39.9 11.9 -3.36 0.00375
## 2 Air.Flow 0.716 0.135 5.31 0.0000580
## 3 Water.Temp 1.30 0.368 3.52 0.00263
## 4 Acid.Conc. -0.152 0.156 -0.973 0.344
Prediction for a single new data point:
# predict new data
obs <- data.frame(Air.Flow = 72,
Water.Temp = 20,
Acid.Conc. = 85)
predict(stacklossModel,
obs)
## 1
## 24.58173
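If we also need a measure of uncertainty around this prediction, predict() accepts an interval argument: interval = "prediction" returns a prediction interval for a single new observation, while interval = "confidence" returns a confidence interval for the mean response (95% by default):
# - point prediction with a 95% prediction interval
predict(stacklossModel, obs, interval = "prediction")
# - ... and with a 95% confidence interval for the mean response
predict(stacklossModel, obs, interval = "confidence")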
The confint() function works as usual, for a 95% CI…
confint(stacklossModel, level = .95) # 95% CI
## 2.5 % 97.5 %
## (Intercept) -65.0180339 -14.8213150
## Air.Flow 0.4311143 1.0001661
## Water.Temp 0.5188228 2.0717495
## Acid.Conc. -0.4818741 0.1776291
… as well as for a 99% CI:
confint(stacklossModel, level = .99) # 99% CI
## 0.5 % 99.5 %
## (Intercept) -74.3970156 -5.4423333
## Air.Flow 0.3247901 1.1064903
## Water.Temp 0.2286670 2.3619053
## Acid.Conc. -0.6050987 0.3008536
3.2 Multicollinearity in Multiple Regression
That is the crazy thing about multiple regression: if the predictors are not correlated at all, why not just run a series of simple linear regressions? On the other hand, if the predictors are highly correlated, problems with the estimates arise… John Fox’s {car} package allows us to compute the Variance Inflation Factor quite easily:
VIF <- vif(stacklossModel)
VIF
## Air.Flow Water.Temp Acid.Conc.
## 2.906484 2.572632 1.333587
The Variance Inflation Factor (VIF) measures the increase in the variance of a regression coefficient due to collinearity. Its square root (sqrt(VIF)) tells us how much larger the standard error of a regression coefficient is compared to a hypothetical situation in which that predictor were uncorrelated with all other predictors in the model. NOTE: the lower bound of VIF is 1; there is no upper bound, but VIF > 2 typically indicates that one should be concerned.
sqrt(VIF)
## Air.Flow Water.Temp Acid.Conc.
## 1.704841 1.603943 1.154811
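Where does the VIF come from? For a given predictor it equals 1 / (1 - R²), where R² is obtained by regressing that predictor on all of the remaining predictors. A small check for Air.Flow, which should match the vif() output above:
# - VIF of Air.Flow by hand: regress it on the other two predictors
r2_air <- summary(lm(Air.Flow ~ Water.Temp + Acid.Conc.,
                     data = stackloss))$r.squared
1 / (1 - r2_air) # ~2.906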
3.3 Part correlation in multiple regression
In multiple regression, it is the semi-partial (or part) correlation that you need to inspect:
- assume a model with X1, X2, X3 as predictors, and Y as a criterion;
- you need the semi-partial correlation of X1 and Y following the removal of X2 and X3 from Y;
- it goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
- in Step 2, you take the residuals of Y, call them RY;
- in Step 3, you regress (correlate) RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation that you are looking for (see the hand-made check at the end of this section)!
Recall our model…
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
data = stackloss)
summary(stacklossModel)
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
What is the semi-partial (part) correlation of stack.loss and Air.Flow? Remember the {ppcor} package?
spartCor1 <- spcor.test(x = stackloss$Air.Flow,
y = stackloss$stack.loss,
z = stackloss[ , c("Water.Temp", "Acid.Conc.")],
method = "pearson")
print(spartCor1)
## estimate p.value statistic n gp Method
## 1 0.4631864 0.04580372 2.154858 21 2 pearson
The unique contribution of Air.Flow is then:
spartCor1$estimate
## [1] 0.4631864
spartCor1$statistic
## [1] 2.154858
spartCor1$p.value
## [1] 0.04580372
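The recipe described at the beginning of this section can also be reproduced by hand, and it should closely match the spcor.test() estimate above:
# - Step 1: regress stack.loss on the remaining predictors...
resY <- residuals(lm(stack.loss ~ Water.Temp + Acid.Conc.,
                     data = stackloss))
# - Steps 2 and 3: ...and correlate the residuals with Air.Flow
cor(resY, stackloss$Air.Flow) # ~.46, cf. spartCor1$estimate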
Further Readings
R Markdown
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.