1. Mother and Daughter Heights

The data involve mother and daughter heights (we already used these data in the first exercise).

#load packages
require(alr4)
## Loading required package: alr4
## Warning: package 'alr4' was built under R version 4.3.2
## Loading required package: car
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Loading required package: effects
## Warning: package 'effects' was built under R version 4.3.2
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
#load data
data("Heights")
head(Heights)

There are two variables: mheight (mother’s height) and dheight (daughter’s height). Heights are expressed in inches.

Main summary statistics about the two variables:

summary(Heights)
##     mheight         dheight     
##  Min.   :55.40   Min.   :55.10  
##  1st Qu.:60.80   1st Qu.:62.00  
##  Median :62.40   Median :63.60  
##  Mean   :62.45   Mean   :63.75  
##  3rd Qu.:63.90   3rd Qu.:65.60  
##  Max.   :70.80   Max.   :73.10
nrow(Heights)
## [1] 1375

There are 1375 observations.

Our goal: we would like to find out whether there is a relationship between dheight and mheight.
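
A quick numerical check (added here as a sketch, not part of the original code) is the sample correlation between the two heights; given the R-squared reported later (about 0.24), it should be roughly 0.49.

cor(Heights$mheight, Heights$dheight) # sample correlation between mheight and dheight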

Let’s see the scatter plot of the above two variables

theme_set(theme_bw())  ## sets default white background theme for ggplot
ggplot(data = Heights, aes(x=mheight, y=dheight)) + geom_point(col="red", alpha = 0.5) + labs(x="Mother's height", y="Daughter's height", title = "Plot Mother and Daughter Heights")

The relationship between the two variables seems to be linear.

Let’s add the regression line

ggplot(data = Heights, aes(x=mheight, y=dheight)) + geom_point(col="red", alpha = 0.5) + geom_smooth(method = "lm") + 
  labs(x="Mother's height", y="Daughter's height", title = "Plot Mother and Daughter Heights")
## `geom_smooth()` using formula = 'y ~ x'

We are interested in the following regression model \[ dheight = \beta_1 + \beta_2* mheight + \epsilon\] (check the assumptions in the previous exercises).

Inference

Let’s estimate the linear regression model

mod <- lm(dheight ~ mheight, data=Heights)
summary(mod)
## 
## Call:
## lm(formula = dheight ~ mheight, data = Heights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.397 -1.529  0.036  1.492  9.053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.91744    1.62247   18.44   <2e-16 ***
## mheight      0.54175    0.02596   20.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.266 on 1373 degrees of freedom
## Multiple R-squared:  0.2408, Adjusted R-squared:  0.2402 
## F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16

Description of the output:

The estimated regression line can be represented by \[ \hat{dheight} = 29.917 + 0.542*mheight\] where \(\sqrt{\hat{Var}(\hat{\beta}_1)} = 1.6225\) and \(\sqrt{\hat{Var}(\hat{\beta}_2)} = 0.026\). Consider the following system of hypotheses \[\begin{cases} \mbox{H0: } \beta_r = 0\\ \mbox{H1: } \beta_r \ne 0 \end{cases}\]

for \(r=1,2\). The t-values for \(\beta_1\) and \(\beta_2\) are \[t_1^{oss} = \frac{\hat{\beta}_1}{\sqrt{\hat{Var}(\hat{\beta}_1)}} = 18.44\] \[t_2^{oss} = \frac{\hat{\beta}_2}{\sqrt{\hat{Var}(\hat{\beta}_2)}} = 20.87\] The p-values of both t-tests are less than 0.01: we reject the null hypothesis \(\forall r\) (you can also obtain hints from R by looking at the legend of significance codes).
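
As a sketch (not in the original notes), the t-values and two-sided p-values can be recomputed directly from the coefficient table:

cf <- summary(mod)$coefficients
cf[, "Estimate"] / cf[, "Std. Error"]  # t values: 18.44 (intercept) and 20.87 (mheight)
2 * pt(abs(cf[, "t value"]), df = df.residual(mod), lower.tail = FALSE)  # two-sided p-values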

The observed value of the F statistic is \[f^{oss} = t^2_2 = 435.5\]

and the p-value is \(\alpha^{oss}= P(F_{1, 1373} > f^{oss}) \approx 0\). We reject the null hypothesis.
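
In simple linear regression the F statistic equals the squared t statistic of the slope, which we can verify (a sketch, not in the original code):

t2 <- summary(mod)$coefficients["mheight", "t value"]
t2^2  # about 435.5, the F statistic
pf(t2^2, df1 = 1, df2 = df.residual(mod), lower.tail = FALSE)  # p-value, essentially 0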

The residual standard error corresponds to \(\sqrt{SSE/(n-2)}\) where \(n-2 = 1373\) (d.o.f.).
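
This quantity can be reproduced by hand (a short sketch):

sse <- sum(residuals(mod)^2)  # sum of squared residuals (SSE)
sqrt(sse / df.residual(mod))  # residual standard error, about 2.266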

Interpretation of regression coefficients

Confidence Interval

Confidence Intervals for \(\beta_1\) and \(\beta_2\) with 95% confidence level

confint(mod)
##                  2.5 %     97.5 %
## (Intercept) 26.7346495 33.1002241
## mheight      0.4908201  0.5926739

For each \(\beta_r\), \(r=1,2\), we can compute the confidence interval with \(1-\alpha=0.95\) (you can verify the confidence level from the column headers, 2.5 % and 97.5 %).
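
As a sketch (not part of the original code), the same intervals can be built by hand as estimate ± t quantile × standard error:

est <- coef(summary(mod))[, "Estimate"]
se  <- coef(summary(mod))[, "Std. Error"]
tq  <- qt(0.975, df = df.residual(mod))  # t quantile with n - 2 degrees of freedom
cbind(lower = est - tq * se, upper = est + tq * se)  # should match confint(mod)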

Remember that all of the estimates, intervals, and hypothesis tests arising in a regression analysis have been developed assuming that the model is correct. For this reason, it is important to check whether the assumptions are reasonable for the data.

Analysis of Residuals

par(mfrow=c(2,2))
plot(mod)

Comments:

Prediction with simple linear regression model

Let's consider new values of mheight

new_data <- data.frame(mheight=c(70.4, 60.6, 68, 57.4))

Although a prediction can be computed for any value of mheight, it may be unreliable if the new values lie outside the observed range of mheight. Let's check:

cat('Min:', min(Heights$mheight), 'Max:', max(Heights$mheight))
## Min: 55.4 Max: 70.8

Each new value lies within the interval (55.4, 70.8).
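
The same check can be done programmatically (a small sketch, not in the original code):

all(new_data$mheight >= min(Heights$mheight) &
    new_data$mheight <= max(Heights$mheight))  # TRUE: all new values are inside the observed range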

Confidence Interval for the mean of dheight:

(prev_ic <- predict(mod, new_data, interval = "confidence"))
##        fit      lwr      upr
## 1 68.05643 67.63431 68.47854
## 2 62.74731 62.59473 62.89988
## 3 66.75623 66.44934 67.06312
## 4 61.01372 60.72983 61.29760

For each new value \(x^*\), the above confidence interval corresponds to:

\[( \hat{y}_* - t_{n-2, 1-\alpha/2} \sqrt{s^2\left(\frac{1}{n} + \frac{(x^*- \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}, \hat{y}_* + t_{n-2, 1-\alpha/2} \sqrt{s^2\left(\frac{1}{n} + \frac{(x^*- \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)})\]

where \(\hat{y}_*= \hat{\beta}_1 + \hat{\beta}_2x^*\) which can be rewritten \(\hat{y}_*= \bar{y} + \hat{\beta}_2(x^*- \bar{x})\).
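
As a sketch (not in the original code), the first interval of prev_ic can be reproduced by hand for \(x^* = 70.4\):

x_star <- 70.4
n      <- nrow(Heights)
s2     <- summary(mod)$sigma^2                      # estimate of the error variance
xbar   <- mean(Heights$mheight)
Sxx    <- sum((Heights$mheight - xbar)^2)
y_hat  <- coef(mod)[1] + coef(mod)[2] * x_star      # fitted mean at x*
se_fit <- sqrt(s2 * (1/n + (x_star - xbar)^2 / Sxx))
y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_fit   # should match the first row of prev_ic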

Plot with the previous confidence interval:

plot(Heights$mheight, Heights$dheight, xlim=range(new_data$mheight), ylab = "Daughter's Height", xlab = "Mother's Height")
ord <- order(new_data$mheight)  # sort by mheight so the lines are drawn left to right
matlines(new_data$mheight[ord], prev_ic[ord, ], lty=c(1,2,2), col="violet")
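
If instead we want to predict the height of an individual daughter (rather than the mean height), a prediction interval can be used; it is wider because it also accounts for the error variance. This is a sketch, not part of the original analysis:

(prev_pi <- predict(mod, new_data, interval = "prediction"))  # wider than the confidence intervals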

2. Simulated data - Analysis of Residuals

In the next sections, we are going to explore some cases in which the assumptions of the linear model are not satisfied by the data.

To illustrate several possible behaviors, we start by simulating our data:

set.seed(123)
x <- sort(runif(50))

2.1 Issues with the form of the regression model

The linear regression model is generally defined as \(y = \beta_1 + \beta_2 x + \epsilon\), which leads to estimating a regression line. However, sometimes the residuals show non-linear patterns.

y1 <- x^2+rnorm(50,sd=.05)
y2 <- -x^2+rnorm(50,sd=.05)
y3 <- (x-.5)^3+rnorm(50,sd=.01)

plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")

In these cases, we can see a systematic structure in the residuals, which highlights a non-linear relationship. Indeed, since we know the data-generating mechanism, we can see that in the first two plots a quadratic term should be added to the model, while in the last plot the pattern suggests a cubic term.

par(mfrow=c(2, 2))
plot(lm(y1~x))

par(mfrow=c(2, 2))
plot(lm(y1~poly(x, 2)))
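
Analogously (as a sketch, not in the original code), a cubic term can be added for y3:

par(mfrow=c(2, 2))
plot(lm(y3~poly(x, 3)))  # residual diagnostics after adding a cubic term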

2.2 Issues with Homoscedasticity assumption

Recall of Homoscedasticity assumption: \(Var(\epsilon_i)= \sigma^2 > 0, \quad \forall i\).

y1 <- x+x*rnorm(50,sd=.03)
y2 <- x+(1-x)*rnorm(50,sd=.03)
y3 <- x+c(rnorm(15,sd=.1),rnorm(20,sd=.4),rnorm(15,sd=.02))
y4 <- x+c(rnorm(20,sd=.5),rnorm(15,sd=.1),rnorm(15,sd=.4))

plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y4~x)),xlab="X",ylab="residuals")

In the above plots, the variability of the residuals changes across the range of x values.
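
A complementary formal check (added here as a sketch, not in the original notes) is the score test for non-constant error variance, ncvTest() from the car package (loaded together with alr4); small p-values suggest heteroscedasticity:

ncvTest(lm(y1~x))
ncvTest(lm(y3~x))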

2.3 Issues with Gaussian assumption

Recall of Gaussian assumption: \(\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\).

y1 <- x+.05*rt(50,df=2)
y2 <- x+.03*(rchisq(50,1)-.5)
y3 <- x-.03*(rchisq(50,1)-.5)
y4 <- x+.07*rt(50,df=2)

plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")

plot(x,residuals(lm(y4~x)),xlab="X",ylab="residuals")

These plots highlight some issues with the Gaussian assumption on the errors. Let's also check the QQ-plot and histogram of the residuals for the first two models (just as an example).

qqnorm(residuals(lm(y1~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(y1~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(y1~x)),main="Histogram",xlab="residuals", breaks=seq(min(residuals(lm(y1~x))),max(residuals(lm(y1~x))),length=10),ylab="")

qqnorm(residuals(lm(y2~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(y2~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(y2~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(y2~x))),max(residuals(lm(y2~x))),length=10),ylab="")
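
A formal check (a sketch, not in the original notes) is the Shapiro-Wilk normality test on the residuals; small p-values suggest departures from normality:

shapiro.test(residuals(lm(y1~x)))
shapiro.test(residuals(lm(y2~x)))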

Let's assume that we don't know anything about the data-generating mechanism: what can we do? One option is to transform the dependent variable, for instance:

qqnorm(residuals(lm(log(y2)~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(log(y2)~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(log(y2)~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(log(y2)~x))),max(residuals(lm(log(y2)~x))),length=10),ylab="")

qqnorm(residuals(lm(sqrt(y2)~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(sqrt(y2)~x)), col = "steelblue", lwd = 2)

hist(residuals(lm(sqrt(y2)~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(sqrt(y2)~x))),max(residuals(lm(sqrt(y2)~x))),length=10),ylab="")
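
To see whether the transformation actually helps (a sketch, not in the original notes), we can compare the Shapiro-Wilk p-values of the residuals before and after transforming the response:

shapiro.test(residuals(lm(y2~x)))$p.value        # raw response
shapiro.test(residuals(lm(log(y2)~x)))$p.value   # log-transformed response
shapiro.test(residuals(lm(sqrt(y2)~x)))$p.value  # square-root-transformed response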