Session 09: Partial and Part Correlation, Linear Model Assumptions, Bootstrap Methods, and Estimator Bias and Variance

Partial and Part Correlation, Linear Model Assumptions, Parametric Bootstrap, and Estimator Bias and Variance


What do we want to do today?

We introduce the concepts of partial and part correlation, which are of essential importance in multiple linear regression, and take a look at what R has to offer in the {ppcor} package in that respect. Then we go back to the simple linear regression model and discuss its assumptions. All statistical models make some assumptions about the data, and only if those assumptions are satisfied is the application of the model to the data justified. We then introduce the concept of the bias of a statistical estimate, together with a way to estimate it via numerical simulation: the parametric bootstrap.

0. Prerequisites

Install:

install.packages('car')
install.packages('ppcor')

Setup:

library(tidyverse)
library(data.table)
library(Hmisc)
library(ppcor)
library(car)
dataDir <- paste0(getwd(), "/_data/")

1. Residuals, Partial and Part Correlation

The concepts of partial and part correlation are useful in the description of mediation. We have two RVs, \(X\) and \(Y\), and we are interested in the strength of their linear relationship. However, there is also another variable (or a set of variables), \(Z\), that is related to both \(X\) and \(Y\), and we ask: how does this additional variable \(Z\) affect the relationship between \(X\) and \(Y\)?

Partial correlation presents the most straightforward answer to this question. It is the coefficient of linear correlation that one obtains between \(X\) and \(Y\) after removing the shared variance of \(X\) and \(Z\), and of \(Y\) and \(Z\).

We will use the {ppcor} package to compute partial correlations in the following example. Before that: can we explain partial correlation conceptually? It turns out that it is not difficult to explain what partial correlation is once the simple linear regression model is introduced.

linFit <- lm(data = iris,
             Petal.Length ~ Sepal.Length)
linFitPlot <- data.frame(
  x = iris$Sepal.Length,
  y = iris$Petal.Length,
  predicted = linFit$fitted.values,
  residuals = linFit$residuals
)
ggplot(data = linFitPlot,
       aes(x = x, y = y)) +
  geom_smooth(method = lm, se = F, color = "red", linewidth = .25) +
  geom_segment(aes(x = x, 
                   y = predicted, 
                   xend = x, 
                   yend = predicted + residuals),
               color = "black", linewidth = .2, linetype = "dashed") +
  geom_point(aes(x = x, y = y), color = "black", size = 2) +
  geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
  geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
  xlab("Sepal.Length") + ylab("Petal.Length") +
  theme_bw() + 
  theme(panel.border = element_blank())

We have Sepal.Length on the x-axis and Petal.Length on the y-axis, producing a scatter plot of the sort we have already seen many times before. If the relationship between the two variables were perfectly linear, all points would fall on the regression line. In this plot, the red dots represent the predictions of the best-fitting line in the plane spanned by the two variables of interest. However, the relationship is not perfectly linear, as can be observed from the vertical deviations of the data points from the line. The distance between what the linear model predicts (red points) and the actual data (white points) is called a residual. Residuals are represented in this plot by the dashed vertical lines connecting the model predictions to the actual data points. The best-fitting line is precisely the one that minimizes these residuals, which are treated as model errors.
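
A quick numerical check of this definition, using the linFitPlot data frame built above: each residual is simply the observed value minus the model prediction.

# - sanity check: residual = observed - predicted
head(linFitPlot$y - linFitPlot$predicted)
head(linFitPlot$residuals)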

Take \(X\) to be Sepal.Length, \(Y\) to be Petal.Length, and \(Z\) to be Sepal.Width: how does the correlation between \(Z\) and \(X\), on the one hand, and \(Z\) and \(Y\), on the other, affect the correlation between \(X\) and \(Y\)? Let’s plot \(Z\) vs. \(X\) and \(Z\) vs. \(Y\):

linFit <- lm(data = iris,
             Sepal.Length ~ Sepal.Width)
linFitPlot1 <- data.frame(
  x = iris$Sepal.Width,
  y = iris$Sepal.Length,
  predicted = linFit$fitted.values,
  residuals = linFit$residuals
)
linFit <- lm(data = iris,
             Petal.Length ~ Sepal.Width)
linFitPlot2 <- data.frame(
  x = iris$Sepal.Width,
  y = iris$Petal.Length,
  predicted = linFit$fitted.values,
  residuals = linFit$residuals
)
linFitPlot <- rbind(linFitPlot1, linFitPlot2)
linFitPlot$Plot <- factor(c(rep("Sepal.Length",150), rep("Petal.Length",150)),
                             levels = c("Sepal.Length", "Petal.Length"))

ggplot(data = linFitPlot,
       aes(x = x, y = y)) +
  geom_smooth(method = lm, se = F, color = "red", linewidth = .25) +
  geom_segment(aes(x = x, 
                   y = predicted, 
                   xend = x, 
                   yend = predicted + residuals),
               color = "black", linewidth = .2, linetype = "dashed") +
  geom_point(aes(x = x, y = y), color = "black", size = 2) +
  geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
  geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
  xlab("Sepal.Width") + ylab("") +
  theme_classic() +
  facet_grid(. ~ Plot) + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(strip.background = element_blank()) +
  theme(axis.text.x = element_text(size = 6)) + 
  theme(axis.text.y = element_text(size = 6)) 

Sepal.Width has some correlation with both Sepal.Length and Petal.Length; upon plotting the best-fitting lines, we can observe some residuals in both panels too. The partial correlation of Sepal.Length and Petal.Length, controlling for Sepal.Width, is nothing other than the correlation between the residuals of Sepal.Length and of Petal.Length after regressing each of them on Sepal.Width:

partialCor <- cor(linFitPlot$residuals[1:150],  # Sepal.Length residuals
                  linFitPlot$residuals[151:300] # Petal.Length residuals
)
partialCor
## [1] 0.9153894

In comparison to:

cor(iris$Sepal.Length, iris$Petal.Length)
## [1] 0.8717538

we can conclude that the coefficient of linear correlation between these two variables increases after controlling for the effect of Sepal.Width!
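
The same number also follows from the closed-form expression for the partial correlation of \(X\) and \(Y\) given \(Z\), which uses only the three pairwise correlations (a quick check):

# - partial correlation from the three pairwise correlations:
# - r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
r_xy <- cor(iris$Sepal.Length, iris$Petal.Length)
r_xz <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_yz <- cor(iris$Petal.Length, iris$Sepal.Width)
(r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))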

In {ppcor}, the same partial correlation would be computed in the following way:

# partial correlation w. {ppcor}
dataSet <- iris
dataSet$Species <- NULL
partialCor1 <- pcor.test(dataSet$Sepal.Length, 
                         dataSet$Petal.Length,
                         dataSet$Sepal.Width,
                         method = "pearson")
partialCor1$estimate
## [1] 0.9153894

And of course:

partialCor1$p.value
## [1] 5.847914e-60
partialCor1$statistic
## [1] 27.56916

For the matrix of partial correlations, where the correlation of each pair of variables is computed after controlling for the effects of all the remaining variables, {ppcor} offers:

#### partial correlation in R
dataSet <- iris
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method = "pearson")
irisPCor$estimate # partial correlations
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
## Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
## Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
## Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000
irisPCor$p.value # results of significance tests
##              Sepal.Length  Sepal.Width Petal.Length  Petal.Width
## Sepal.Length 0.000000e+00 1.199846e-17 7.656980e-25 2.412876e-05
## Sepal.Width  1.199846e-17 0.000000e+00 8.753029e-17 1.104958e-05
## Petal.Length 7.656980e-25 8.753029e-17 0.000000e+00 7.332477e-47
## Petal.Width  2.412876e-05 1.104958e-05 7.332477e-47 0.000000e+00

The corresponding t-tests have \(n - 2 - k\) degrees of freedom, where \(k\) is the number of variables conditioned on:

irisPCor$statistic
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length     0.000000    9.765380    12.502483   -4.362929
## Sepal.Width      9.765380    0.000000    -9.431189    4.553279
## Petal.Length    12.502483   -9.431189     0.000000   21.398708
## Petal.Width     -4.362929    4.553279    21.398708    0.000000
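
As a quick check, these t-statistics can be reproduced from the partial correlations themselves, using \(t = r\sqrt{(n - 2 - k)/(1 - r^2)}\) with \(k = 2\) conditioning variables here (a minimal sketch; nObs, kCond, and tStat are just illustrative names):

# - reproduce the t-statistics from the partial correlations
nObs <- dim(dataSet)[1]
kCond <- 2
r <- irisPCor$estimate
tStat <- r * sqrt((nObs - 2 - kCond) / (1 - r^2))
diag(tStat) <- 0 # - the diagonal (r = 1) is not defined
round(tStat, 6)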

Good. And now, what would a part - also known as semi-partial - correlation be? Take another look at the previous plot, where Sepal.Width predicts Sepal.Length in the left panel and Petal.Length in the right panel; residuals from both linear regressions are present. The partial correlation of Sepal.Length and Petal.Length was obtained by removing the effect of Sepal.Width from both variables; in effect, all we had to do to obtain it was to compute the correlation coefficient from the residuals - that is, from what remains of these two variables after removing what Sepal.Width predicts. A semi-partial, or part, correlation is obtained if we remove the effect of Sepal.Width from the second variable only: Petal.Length in this case. It is the correlation between (a) Sepal.Length and (b) what is left of Petal.Length (the residuals) after controlling for the effect of Sepal.Width:

partCor <- cor(iris$Sepal.Length,  # Sepal.Length in itself
            linFitPlot$residuals[151:300] # Petal.Length residuals
            )
partCor
## [1] 0.9090408
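
Again there is a closed-form expression: because the effect of \(Z\) is removed from the second variable only, only that variable’s correlation with \(Z\) enters the denominator (a quick check):

# - part (semi-partial) correlation from the three pairwise correlations:
# - r_x(y.z) = (r_xy - r_xz * r_yz) / sqrt(1 - r_yz^2)
r_xy <- cor(iris$Sepal.Length, iris$Petal.Length)
r_xz <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_yz <- cor(iris$Petal.Length, iris$Sepal.Width)
(r_xy - r_xz * r_yz) / sqrt(1 - r_yz^2)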

In {ppcor}, this part correlation is obtained by:

partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                      dataSet$Sepal.Width,
                      method = "pearson")
partCor$estimate
## [1] 0.9090408

As ever, the p-value:

partCor$p.value
## [1] 9.407918e-58

and the t-test:

partCor$statistic
## [1] 26.44911

If we’re interested in a matrix of semi-partial correlations, where the first variable - the one from which no effects of any other variables will be removed - is found in the rows, and the second variable - the one from which the effects of all the remaining variables in the data set will be removed - is found in the columns, {ppcor} offers:

irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length   1.00000000  0.30389212    0.3890689  -0.1357714
## Sepal.Width    0.55758743  1.00000000   -0.5385056   0.2599849
## Petal.Length   0.18506103 -0.13959991    1.0000000   0.3167424
## Petal.Width   -0.09001634  0.09394365    0.4415000   1.0000000
irisSPCor$p.value
##              Sepal.Length  Sepal.Width Petal.Length  Petal.Width
## Sepal.Length 0.000000e+00 0.0001734384 1.023577e-06 9.989683e-02
## Sepal.Width  1.823304e-13 0.0000000000 1.672234e-12 1.417359e-03
## Petal.Length 2.433732e-02 0.0906073062 0.000000e+00 8.776581e-05
## Petal.Width  2.765857e-01 0.2560833579 1.944768e-08 0.000000e+00
irisSPCor$statistic
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length     0.000000    3.854222     5.103228   -1.655866
## Sepal.Width      8.116139    0.000000    -7.722074    3.253281
## Petal.Length     2.275404   -1.703473     0.000000    4.034967
## Petal.Width     -1.092105    1.140168     5.945498    0.000000

2. Simple Linear Regression: Model Assumptions

We will spend some time inspecting the validity of this linear regression model as a whole. Usually termed model diagnostics, the following procedures are carried out to ensure that the model assumptions hold. Unfortunately, even for a model as simple as simple linear regression, testing the assumptions tends to get nasty all the way down… Most of the criteria cannot be judged by simply assessing the values of the respective statistics, and one should generally treat the model diagnostics step as a mini-study of its own - one that can easily take more time than reaching the conclusions about the model’s performance on the data set in the first place - because none of one’s conclusions about the data truly hold if they come from a model whose assumptions are not met. Sadly, this fact is overlooked far too often in contemporary applied statistics.

2.1 The Linearity Assumption

In fact, we have already tested for this:

#### Test 1: Linearity assumption
# Predictor vs Criterion {ggplot2}
ggplot(data = iris, aes(x = Sepal.Length,
                        y = Petal.Length,
                        color = Species)
       ) + 
  geom_point(size = 1.5) +
  geom_smooth(method = 'lm', linewidth = .25, se = F) + 
  ggtitle('Sepal Length vs Petal Length') + 
  theme_bw() + 
  theme(panel.border = element_blank())

The linearity assumption is obviously not satisfied!

2.2 The Normality of Residuals

We have already looked into this too, only in a different way:

reg <- lm(Petal.Length ~ Sepal.Length, 
          data = iris)
resStReg <- rstandard(reg)
qqnorm(resStReg)
qqline(resStReg, col = "red")

What is rstandard()? See: standardized residual. We will discuss this in the session as we introduce the concept of leverage.
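
A minimal sketch of what rstandard() computes for a model like ours: each raw residual is divided by the residual standard error, adjusted for that observation’s leverage (hat value).

# - standardized residual_i = residual_i / (sigma * sqrt(1 - h_i)),
# - where h_i is the leverage (hat value) of observation i
manualStRes <- residuals(reg) / (sigma(reg) * sqrt(1 - hatvalues(reg)))
head(cbind(manual = manualStRes, rstandard = rstandard(reg)))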

Let’s see what the Shapiro-Wilk test tells us:

shapiro.test(reg$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  reg$residuals
## W = 0.99437, p-value = 0.831

It seems like this assumption is met.

N.B. In the 19th century, statisticians like Gauss and Legendre assumed that errors (residuals) followed a normal distribution. This assumption facilitated the derivation of the least squares method and the development of inferential statistics. However, the Gauss-Markov theorem, formulated later, established that the OLS estimator is the Best Linear Unbiased Estimator (BLUE) under certain conditions, without requiring normality of errors.

2.3 Constant variance (or Homoscedasticity)

The model error (i.e. the variance, computed from the model residuals) should be constant across all levels of the criterion:

# Predicted vs. residuals {ggplot2}
predReg <- predict(reg) # get predictions from reg
resReg <- residuals(reg) # get residuals from reg
plotFrame <- data.frame(predicted = predReg,
                        residual = resReg)
# plot w. {ggplot2}
ggplot(data = plotFrame,
       aes(x = predicted, y = residual)) +
  geom_point(size = 1.5, 
             colour = 'blue') +
  geom_smooth(method = 'lm',
              linewidth = .25,
              alpha = .25, 
              color = "red") + 
  ggtitle("Predicted vs Residual") + 
  xlab("Predicted") + ylab("Residual") + 
  theme_bw() +
  theme(legend.position = "none") + 
  theme(panel.border = element_blank())

This can be confusing if one does not recall that we have fitted the model from only one sample of observations. Imagine repeatedly drawing samples of Sepal.Length and Petal.Length values and predicting one from the other with a previously fitted simple linear regression model. We would quite probably observe varying residuals (model errors) for different draws of Petal.Length obtained at the same level of Sepal.Length. However, the distribution of these residuals - more precisely, its variance - must be constant across all levels of the predicted Petal.Length. That is why we inspect the scatter plot of the predicted Petal.Length values vs. their respective residuals. Essentially, one should not be able to observe any regularity in this plot; if any pattern emerges, i.e. if it is possible to predict the residuals from the levels of the criterion, the simple linear model should be abandoned.

Our simple linear regression model obviously suffers from heteroscedasticity, or a lack of constant variance across the measurements. The cause of the heteroscedasticity in this case is related to the existence of clusters of related observations, determined by the type of flower in the iris data set.
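
The visual impression can be backed by a formal test; for example, {car} (already loaded above) provides a score test for non-constant error variance, where a small p-value indicates heteroscedasticity (a minimal sketch):

# - score test for non-constant error variance (Breusch-Pagan type) from {car}
ncvTest(reg)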

2.4 No outliers or influential cases

There are plenty of proposed procedures to detect influential cases in simple linear regression. The influence.measures() function will return most of the interesting statistics in this respect:

infMeas <- influence.measures(reg)
class(infMeas)
## [1] "infl"
str(infMeas)
## List of 3
##  $ infmat: num [1:150, 1:6] -0.09604 -0.07337 -0.04819 0.00823 -0.08682 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:150] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:6] "dfb.1_" "dfb.Sp.L" "dffit" "cov.r" ...
##  $ is.inf: logi [1:150, 1:6] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:150] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:6] "dfb.1_" "dfb.Sp.L" "dffit" "cov.r" ...
##  $ call  : language lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##  - attr(*, "class")= chr "infl"

What we need to extract from infMeas now is the $infmat field:

# as data.frame
infReg <- as.data.frame(influence.measures(reg)$infmat)
head(infReg)
##         dfb.1_     dfb.Sp.L        dffit     cov.r       cook.d         hat
## 1 -0.096043082  0.083847894 -0.125287420 1.0083451 0.0078334226 0.012074844
## 2 -0.073373683  0.065952424 -0.087630174 1.0226287 0.0038527553 0.015376584
## 3 -0.048187987  0.044163711 -0.054467481 1.0316711 0.0014919270 0.019461346
## 4  0.008234714 -0.007603596  0.009126234 1.0361863 0.0000419263 0.021797361
## 5 -0.086815960  0.077030572 -0.107779326 1.0160009 0.0058144466 0.013627836
## 6 -0.078953428  0.063136724 -0.133418625 0.9944456 0.0088373084 0.008590398

Sometimes, people focus on Cook’s distances: they are used to detect influential cases by inspecting the effect of the removal of each data point on the linear model. Cook and Weisberg (1982) consider values greater than 1 to be troublesome:

wCook <- which(infReg$cook.d > 1)
wCook # we seem to be fine here
## integer(0)

The leverage tells us how influential a data point is by measuring how unusual or atypical the combination of predictor values is for that observation - in the case of multiple linear regression; in the case of simple linear regression, think of it simply as a measure of how “lonely” a particular point is on the predictor scale. For an introductory text I recommend: Influential Observations, from David M. Lane’s Online Statistics Education: An Interactive Multimedia Course of Study; for details, see: Leverage in simple linear regression.

# - Leverage: hat values
# - Average Leverage = (k+1)/n, k - num. of predictors, n - num. observations
# - Also termed: hat values, range: 0 - 1
# - see: https://en.wikipedia.org/wiki/Leverage_%28statistics%29
# - Various criteria (twice the average leverage, three times the average leverage...)
# - Say, twice the leverage:
k <- 1 # - number of predictors
n <- dim(iris)[1] # - number of observations
wLev <- which(infReg$hat > 2*((k + 1)/n))
print(wLev)
##  [1]   9  14  39  43 106 108 118 119 123 131 132 136
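
The hat values can also be obtained directly from the fitted model via hatvalues(); since they always sum to the number of estimated parameters, their average equals (k + 1)/n, which is where the “twice the average leverage” criterion comes from (a quick check, reusing k and n from the code above):

# - leverage straight from the model; the average hat value is (k + 1)/n
head(hatvalues(reg))
mean(hatvalues(reg))
(k + 1)/n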

Finally, to inspect the influential cases visually, we can produce the Influence Plot, combining information on standardized residuals, leverage, and Cook’s distances:

## Influence plot
plotFrame <- data.frame(residual = resStReg,
                        leverage = infReg$hat,
                        cookD = infReg$cook.d)
ggplot(plotFrame,
       aes(y = residual,
           x = leverage)) +
  geom_point(size = plotFrame$cookD*100, shape = 1, color = 'blue') +
  ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
  theme(plot.title = element_text(size=8, face="bold")) +
  ylab("Standardized Residual") + xlab("Leverage") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(hjust = .5))

2.5 No autocorrelation in residuals

The final twist is related to the assumption that the model errors are not autocorrelated. Autocorrelation of a variable exists when its previously measured values are correlated with its subsequent measurements. The autocorrelation can be computed at different values of the lag, which defines how far apart the “present” and “past” observations of the variable are assumed to be. For example, given \(X = x_1, x_2, ..., x_n\), one can compute the autocorrelation at a lag of 1 by correlating \(X_i\) with \(X_{i-1}\), or at a lag of 2 by correlating \(X_i\) with \(X_{i-2}\), etc.
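
For illustration, here is the lag-1 autocorrelation of our model residuals computed directly; note that acf() normalizes slightly differently (using the overall mean and variance of the series), so its value will be close to, but not identical with, the plain correlation of the lagged series (a minimal sketch):

# - lag-1 autocorrelation of the residuals "by hand"
res <- residuals(reg)
nRes <- length(res)
cor(res[1:(nRes - 1)], res[2:nRes])
# - compare with acf(), which uses a slightly different normalization
acf(res, lag.max = 1, plot = FALSE)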

In the setting of simple linear regression, the most frequently encountered autocorrelation test is based on the Durbin-Watson statistic:

# - D-W Statistic < 1 --> problematic {car}
durbinWatsonTest(reg)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.3869188      1.203885       0
##  Alternative hypothesis: rho != 0

The Durbin-Watson statistic will have a value of 2 if no autocorrelation at all is present in the model residuals. It tests the null hypothesis that the autocorrelation is zero, so that its statistical significance (\(p < .05\), conventionally) indicates that the autocorrelation in residuals is present.
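
A rough rule of thumb connects the two numbers in the output above: the Durbin-Watson statistic is approximately \(2(1 - r_1)\), where \(r_1\) is the lag-1 autocorrelation of the residuals.

# - rough check: D-W statistic is approximately 2 * (1 - r1),
# - using the lag-1 autocorrelation reported above
2 * (1 - 0.3869188)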

3. Bias and variance of a statistical estimate. Parametric bootstrap.

Let’s estimate the linear regression model with Petal.Length as a criterion and Sepal.Length as a predictor again:

regModel <- lm(Petal.Length ~ Sepal.Length,
               data = iris)
coefs <- coefficients(regModel)
print(coefs)
##  (Intercept) Sepal.Length 
##    -7.101443     1.858433
linFitPlot <- data.frame(
  x = iris$Sepal.Length,
  y = iris$Petal.Length,
  predicted = regModel$fitted.values,
  residuals = regModel$residuals
)
ggplot(data = linFitPlot,
       aes(x = x, y = y)) +
  geom_smooth(method = lm, se = F, color = "red", linewidth = .25) +
  geom_segment(aes(x = x, 
                   y = predicted, 
                   xend = x, 
                   yend = predicted + residuals),
               color = "black", linewidth = .2, linetype = "dashed") +
  geom_point(aes(x = x, y = y), color = "black", size = 2) +
  geom_point(aes(x = x, y = y), color = "white", size = 1.5) +
  geom_point(aes(x = x, y = predicted), color = "red", size = 1) +
  xlab("Sepal.Length") + ylab("Petal.Length") +
  theme_bw() + 
  theme(panel.border = element_blank())

summary(regModel)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  -7.101443 0.50666229 -14.01613 6.133586e-29
## Sepal.Length  1.858433 0.08585565  21.64602 1.038667e-47

Ok. Now, we need to learn about the variation in the model error. Let’s compute the standard deviation of the model residuals:

residuals <- residuals(regModel)
std_res <- sd(residuals)
print(std_res)
## [1] 0.8648977
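
A small caveat: sd() divides by \(n - 1\), while the residual standard error reported by the model divides by the residual degrees of freedom, \(n - 2\). The difference is negligible for our simulation purposes, but the two are not identical (a quick comparison):

# - residual standard error (divides by n - 2) vs. sd of residuals (divides by n - 1)
sigma(regModel)
sd(residuals(regModel))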

Enter the sim-fit loop: the parametric bootstrap for the simple linear regression model.

  • In each iteration we use the original values of the predictor - iris$Sepal.Length - and the estimated model coefficients to compute the predictions: the y_hat values.
  • Then we introduce randomness based on the estimated standard deviation of the residuals - std_res - by drawing one observation from Normal(mean = y_hat, sd = std_res) and pairing that value with the respective value of the predictor. That ends the sim step.
  • Then we estimate the linear regression model from the bootstrapped sample and pick up its coefficients: the fit step.
  • The distribution of the bootstrapped model coefficients is quite telling, as we will observe.

The population is to the sample as the sample is to the bootstrap samples. – John Fox, 2002, Bootstrapping Regression Models, Appendix to An R and S-PLUS Companion to Applied Regression.

We will use 10,000 bootstrap samples.

n_bootstrap_samples <- 1:10000
bootstrap_estimates <- lapply(n_bootstrap_samples, function(x) {
  # - bootstrap! 
  y_hats <- coefs[2] * iris$Sepal.Length + coefs[1]
  y_hats <- sapply(y_hats, function(y) {
    rnorm(1, y, std_res)
  })
  # - bootstrapped data:
  boot_frame <- data.frame(Sepal.Length = iris$Sepal.Length, 
                           Petal.Length = y_hats)
  # - estimate linear model
  boot_regModel <- lm(Petal.Length ~ Sepal.Length,
                      data = boot_frame)
  # - extract coefficients
  boot_coefs <- coefficients(boot_regModel)
  return(as.data.frame(t(boot_coefs)))
})
bootstrap_estimates <- rbindlist(bootstrap_estimates)
head(bootstrap_estimates)
##    (Intercept) Sepal.Length
##          <num>        <num>
## 1:   -7.170689     1.869170
## 2:   -7.560429     1.947850
## 3:   -7.516519     1.917274
## 4:   -6.996106     1.852557
## 5:   -7.017008     1.855717
## 6:   -7.935476     1.998799
ggplot(bootstrap_estimates, 
       aes(x = Sepal.Length)) + 
  geom_histogram(binwidth = .001, 
                 fill = 'darkorange', 
                 color = 'darkorange') +
  ggtitle("Bootstrap estimates of the slope") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(hjust = .5))

From the regModel model object, slope of linear regression:

coefs[2]
## Sepal.Length 
##     1.858433

Slope of linear regression estimated as the mean of bootstrapped sample model estimates:

mean(bootstrap_estimates$Sepal.Length)
## [1] 1.858136

The bias of the slope estimate is then:

coefs[2] - mean(bootstrap_estimates$Sepal.Length)
## Sepal.Length 
## 0.0002966771

And the variance of the slope estimate is:

sd(bootstrap_estimates$Sepal.Length)
## [1] 0.08688962
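
The bootstrap distribution also gives us interval estimates essentially for free; for example, a 95% percentile interval for the slope (a minimal sketch):

# - 95% bootstrap percentile interval for the slope
quantile(bootstrap_estimates$Sepal.Length, probs = c(.025, .975))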

Let’s see about the intercept then:

ggplot(bootstrap_estimates, 
       aes(x = `(Intercept)`)) + 
  geom_histogram(binwidth = .001, 
                 fill = 'darkred', 
                 color = 'darkred') +
  ggtitle("Bootstrap estimates of the intercept") + 
  theme_bw() + 
  theme(panel.border = element_blank()) + 
  theme(plot.title = element_text(hjust = .5))

From the model:

coefs[1]
## (Intercept) 
##   -7.101443

From bootstrap:

mean(bootstrap_estimates$`(Intercept)`)
## [1] -7.09903

Variance from bootstrap:

sd(bootstrap_estimates$`(Intercept)`)
## [1] 0.5120923

And the bias of the intercept estimate is then:

coefs[1] - mean(bootstrap_estimates$`(Intercept)`)
##  (Intercept) 
## -0.002413072

R Markdown

R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.


License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.