Simple Linear Regression Models and Analysis of Residuals

1. Mother and Daughter Heights

Data involve Mother and Daughter heights (we already used these data for the first exercise).

#load packages
require(alr4)

## Caricamento del pacchetto richiesto: alr4

## Warning: il pacchetto 'alr4' è stato creato con R versione 4.3.2

## Caricamento del pacchetto richiesto: car

## Warning: il pacchetto 'car' è stato creato con R versione 4.3.2

## Caricamento del pacchetto richiesto: carData

## Caricamento del pacchetto richiesto: effects

## Warning: il pacchetto 'effects' è stato creato con R versione 4.3.2

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

require(ggplot2)

## Caricamento del pacchetto richiesto: ggplot2

## Warning: il pacchetto 'ggplot2' è stato creato con R versione 4.3.2

#load data
data("Heights")
head(Heights)

There are two variables: mheight (mother’s height) and dheight (daughter’s height). Heights are expressed in inches.

Main summary statistics about the two variables:

summary(Heights)

##     mheight         dheight     
##  Min.   :55.40   Min.   :55.10  
##  1st Qu.:60.80   1st Qu.:62.00  
##  Median :62.40   Median :63.60  
##  Mean   :62.45   Mean   :63.75  
##  3rd Qu.:63.90   3rd Qu.:65.60  
##  Max.   :70.80   Max.   :73.10

nrow(Heights)

## [1] 1375

There are 1375 observations.

Our goal: We would like to find out if there exists a relationship between dheight and mheight

Let’s see the scatter plot of the above two variables

theme_set(theme_bw())  ## sets default white background theme for ggplot
ggplot(data = Heights, aes(x=mheight, y=dheight)) + geom_point(col="red", alpha = 0.5) + labs(x="Mother's height", y="Daughter's height", title = "Plot Mother and Daughter Heights")

The relationship among variables seems to be linear.

Let’s add the regression line

ggplot(data = Heights, aes(x=mheight, y=dheight)) + geom_point(col="red", alpha = 0.5) + geom_smooth(method = "lm") + 
  labs(x="Mother's height", y="Daughter's height", title = "Plot Mother and Daughter Heights")

## `geom_smooth()` using formula = 'y ~ x'

We are interesting in the following regression model \[ dheight = \beta_1 + \beta_2* mheight + \epsilon\] (check the assumptions in the previous exercises).

Inference

Let’s estimate the linear regression model

mod <- lm(dheight ~ mheight, data=Heights)
summary(mod)

## 
## Call:
## lm(formula = dheight ~ mheight, data = Heights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.397 -1.529  0.036  1.492  9.053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.91744    1.62247   18.44   <2e-16 ***
## mheight      0.54175    0.02596   20.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.266 on 1373 degrees of freedom
## Multiple R-squared:  0.2408, Adjusted R-squared:  0.2402 
## F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16

Description of the output:

Residuals: In the summary, we can find the main statistics about the residual (first line). The median is around zero and the distribution seems to be simmetric.
Coefficients: In this section, we have 4 columns (Estimate, Std. Error, t-value, p-value). Estimate refers to the estimates of regression coefficients. Std. Error is the square root of the estimated variance of each estimator. The t-value is the test statistic of t-test with the related p-value.

The estimated regression line can be represented by \[ \hat{dheight} = 29.917 + 0.542*mheight\] where \(\sqrt{\hat{Var}(\hat{\beta}_1)} = 1.6225\) and \(\sqrt{\hat{Var}(\hat{\beta}_2)} = 0.026\). Considering the following system of hypothesis \[\begin{cases} \mbox{H0: } \beta_r = 0\\ \mbox{H1: } \beta_r \ne 0 \end{cases}\]

for \(r=1,2\). The t-values for \(\beta_1\) and \(\beta_2\) are \[t_1^{oss} = \frac{\hat{\beta}_1}{\sqrt{\hat{Var}(\hat{\beta}_1)}} = 18.44\] \[t_2^{oss} = \frac{\hat{\beta}_2}{\sqrt{\hat{Var}(\hat{\beta}_2)}} = 20.87\] The p-values of both t-test are less than 0.01: we reject the null hypothesis \(\forall r\) (You can also obtain suggestions from R by looking the legend of significance codes).

Goodness of fit: The last part of the output refers to checking the goodness of fit. In this case, \(R^2 = 0.2408\) and then we can also evaluate the F-test. Considering the following system of hypothesis \[\begin{cases} \mbox{H0: } \beta_2 = 0\\ \mbox{H1: } \beta_2 \ne 0 \end{cases}\]

The f-value is \[f^{oss} = t^2_2 = 435.5\]

and the p-value is \(\alpha^{oss}= P(F_{1, 1373} > f^{oss}) \approx 0\). We reject the null hypothesis.

The residual standard error corresponds to \(\sqrt{SSE/(n-2)}\) where \(n-2 = 1373\) (d.o.f.).

Interpretation of regression coefficients

\(\hat{\beta}_1 = 29.91744\): intercept of the regression line. In this case, it does not make sense to consider the case of mheight=0 implying no relevant information from the intercept.
\(\hat{\beta}_2 = 0.54175\): slope of the regression line. Interpretation: the mean of dheight increases by 0.54175 inches (\(\approx\) 1,376045 cm) as the value of mheight increases by one-unit.

Confidence Interval

Confidence Intervals for \(\beta_1\) and \(\beta_2\) with 95% confidence level

confint(mod)

##                  2.5 %     97.5 %
## (Intercept) 26.7346495 33.1002241
## mheight      0.4908201  0.5926739

For each \(\beta_r\), \(r=1,2\), we can compute the confidence interval with \(1-\alpha=0.95\) (you can verify the confidence level by looking the first line).

Remember that all of the estimates, intervals, and hypothesis tests arising in a regression analysis have been developed assuming that the model is correct. For this reason, it is important checking if the assumptions are reasonable within the data.

Analysis of Residuals

par(mfrow=c(2,2))
plot(mod)

Comments:

Residual vs Fitted, is used to understand the relationship between residuals and fitted values. We can observe a constant dispersion of the points (no patterns).
Q-Q Residuals, is used to verify the Gaussian assumption about errors. The dashed line represents the theoretical Gaussian distribution while the dots the empirical distribution (within our data). The empirical distribution seems to be similar to the theoretical one, which means that the Gaussian assumption of the errors would be reasonable.
Scale-Location, is used to check the homoscedasticity assumption of errors. In the plot we can observe that there are not patterns. Even the homoscedasticity assumption should be reasonable.
Residuals vs Leverage, is used to check other issues (such as outliers and leverage points). In our case, no influencial points are detected.

Prediction with simple linear regression model

Let consider new data for mheight

new_data <- data.frame(mheight=c(70.4, 60.6, 68, 57.4))

Although the prediction can be computed for any value of mheight, it may be unreliable if the new values are outside the range of mheight. Let’s check

cat('Min:', min(Heights$mheight), 'Max:', max(Heights$mheight))

## Min: 55.4 Max: 70.8

Each value of new data belongs to (55.4, 70.8).

Confidence Interval for the mean of dheight:

(prev_ic <- predict(mod, new_data, interval = "confidence"))

##        fit      lwr      upr
## 1 68.05643 67.63431 68.47854
## 2 62.74731 62.59473 62.89988
## 3 66.75623 66.44934 67.06312
## 4 61.01372 60.72983 61.29760

For each new data \(x^*\), the above confidence interval correspond to:

\[( \hat{y}_* - t_{n-2, 1-\alpha/2} \sqrt{s^2\left(\frac{1}{n} + \frac{(x^*- \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}, \hat{y}_* + t_{n-2, 1-\alpha/2} \sqrt{s^2\left(\frac{1}{n} + \frac{(x^*- \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)})\]

where \(\hat{y}_*= \hat{\beta}_1 + \hat{\beta}_2x^*\) which can be rewritten \(\hat{y}_*= \bar{y} + \hat{\beta}_2(x^*- \bar{x})\).

Plot with the previous confidence interval:

plot(Heights$mheight, Heights$dheight, xlim=range(new_data$mheight), ylab = "Daughter's Height", xlab = "Mother's Height")
matlines(new_data$mheight, prev_ic, lty=c(1,2,2), col="violet")

2. Simulated data - Analysis of Residuals

In the next sections, we are going to explore some cases in which the assumptions related to the linear model are not satisfied from the data.

To show several behavior, we can start by simulating our data

set.seed(123)
x <- sort(runif(50))

2.1 Issues with the form of the regression model

The linear regression model is generally defined as \(y = \beta_1 + \beta_2 x + \epsilon\) leading to estimate a regression line. However, sometimes the residuals have non-linear patterns.

y1 <- x^2+rnorm(50,sd=.05)
y2 <- -x^2+rnorm(50,sd=.05)
y3 <- (x-.5)^3+rnorm(50,sd=.01)

plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")

In these cases, we can see a systematic pattern which highlights a non-linear pattern. Indeed, we know the data generative mechanism and we can see that in the firsts two plot, a quadratic term should be added in the model while in the last plot, the pattern suggests a cubed term.

par(mfrow=c(2, 2))
plot(lm(y1~x))

par(mfrow=c(2, 2))
plot(lm(y1~poly(x, 2)))

2.2 Issues with Homoscedasticity assumption

Recall of Homoscedasticity assumption: \(Var(\epsilon_i)= \sigma^2 > 0, \quad \forall i\).

y1 <- x+x*rnorm(50,sd=.03)
y2 <- x+(1-x)*rnorm(50,sd=.03)
y3 <- x+c(rnorm(15,sd=.1),rnorm(20,sd=.4),rnorm(15,sd=.02))
y4 <- x+c(rnorm(20,sd=.5),rnorm(15,sd=.1),rnorm(15,sd=.4))

plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y4~x)),xlab="X",ylab="residuals")

In the above plots, we can observe a different variability for each range of x values.

2.3 Issues with Gaussian assumption

Recall of Gaussian assumption: \(\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\).

y1 <- x+.05*rt(50,df=2)
y2 <- x+.03*(rchisq(50,1)-.5)
y3 <- x-.03*(rchisq(50,1)-.5)
y4 <- x+.07*rt(50,df=2)

plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y4~x)),xlab="X",ylab="residuals")

These plots highlight some issues with the Gaussian assumption of the errors. Let’s also check the QQ-plot and histogram of residuals for the firsts two plot (just for an example).

qqnorm(residuals(lm(y1~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(y1~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(y1~x)),main="Histogram",xlab="residuals", breaks=seq(min(residuals(lm(y1~x))),max(residuals(lm(y1~x))),length=10),ylab="")

qqnorm(residuals(lm(y2~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(y2~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(y2~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(y2~x))),max(residuals(lm(y2~x))),length=10),ylab="")

Let assume that we don’t know anything about the data generative mechanism, what can we do? A way could be to transform our dependent variable, for instance:

qqnorm(residuals(lm(log(y2)~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(log(y2)~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(log(y2)~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(log(y2)~x))),max(residuals(lm(log(y2)~x))),length=10),ylab="")

qqnorm(residuals(lm(sqrt(y2)~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(sqrt(y2)~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(sqrt(y2)~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(sqrt(y2)~x))),max(residuals(lm(sqrt(y2)~x))),length=10),ylab="")

Simple Linear Regression Models and Analysis of Residuals

Valentina Zangirolami (valentina.zangirolami@unimib.it)

2023-12-5