1. Mother and Daughter Heights
Data involve Mother and Daughter heights (we already used these data for the first exercise).
#load packages
require(alr4)
## Caricamento del pacchetto richiesto: alr4
## Warning: il pacchetto 'alr4' è stato creato con R versione 4.3.2
## Caricamento del pacchetto richiesto: car
## Warning: il pacchetto 'car' è stato creato con R versione 4.3.2
## Caricamento del pacchetto richiesto: carData
## Caricamento del pacchetto richiesto: effects
## Warning: il pacchetto 'effects' è stato creato con R versione 4.3.2
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
require(ggplot2)
## Caricamento del pacchetto richiesto: ggplot2
## Warning: il pacchetto 'ggplot2' è stato creato con R versione 4.3.2
#load data
data("Heights")
head(Heights)
There are two variables: mheight (mother’s height) and dheight (daughter’s height). Heights are expressed in inches.
Main summary statistics about the two variables:
summary(Heights)
## mheight dheight
## Min. :55.40 Min. :55.10
## 1st Qu.:60.80 1st Qu.:62.00
## Median :62.40 Median :63.60
## Mean :62.45 Mean :63.75
## 3rd Qu.:63.90 3rd Qu.:65.60
## Max. :70.80 Max. :73.10
nrow(Heights)
## [1] 1375
There are 1375 observations.
Our goal: We would like to find out if there exists a relationship between dheight and mheight
Let’s see the scatter plot of the above two variables
theme_set(theme_bw()) ## sets default white background theme for ggplot
ggplot(data = Heights, aes(x=mheight, y=dheight)) + geom_point(col="red", alpha = 0.5) + labs(x="Mother's height", y="Daughter's height", title = "Plot Mother and Daughter Heights")
The relationship among variables seems to be linear.
Let’s add the regression line
ggplot(data = Heights, aes(x=mheight, y=dheight)) + geom_point(col="red", alpha = 0.5) + geom_smooth(method = "lm") +
labs(x="Mother's height", y="Daughter's height", title = "Plot Mother and Daughter Heights")
## `geom_smooth()` using formula = 'y ~ x'
We are interesting in the following regression model \[ dheight = \beta_1 + \beta_2* mheight + \epsilon\] (check the assumptions in the previous exercises).
Inference
Let’s estimate the linear regression model
mod <- lm(dheight ~ mheight, data=Heights)
summary(mod)
##
## Call:
## lm(formula = dheight ~ mheight, data = Heights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.397 -1.529 0.036 1.492 9.053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.91744 1.62247 18.44 <2e-16 ***
## mheight 0.54175 0.02596 20.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.266 on 1373 degrees of freedom
## Multiple R-squared: 0.2408, Adjusted R-squared: 0.2402
## F-statistic: 435.5 on 1 and 1373 DF, p-value: < 2.2e-16
Description of the output:
The estimated regression line can be represented by \[ \hat{dheight} = 29.917 + 0.542*mheight\] where \(\sqrt{\hat{Var}(\hat{\beta}_1)} = 1.6225\) and \(\sqrt{\hat{Var}(\hat{\beta}_2)} = 0.026\). Considering the following system of hypothesis \[\begin{cases} \mbox{H0: } \beta_r = 0\\ \mbox{H1: } \beta_r \ne 0 \end{cases}\]
for \(r=1,2\). The t-values for \(\beta_1\) and \(\beta_2\) are \[t_1^{oss} = \frac{\hat{\beta}_1}{\sqrt{\hat{Var}(\hat{\beta}_1)}} = 18.44\] \[t_2^{oss} = \frac{\hat{\beta}_2}{\sqrt{\hat{Var}(\hat{\beta}_2)}} = 20.87\] The p-values of both t-test are less than 0.01: we reject the null hypothesis \(\forall r\) (You can also obtain suggestions from R by looking the legend of significance codes).
The f-value is \[f^{oss} = t^2_2 = 435.5\]
and the p-value is \(\alpha^{oss}= P(F_{1, 1373} > f^{oss}) \approx 0\). We reject the null hypothesis.
The residual standard error corresponds to \(\sqrt{SSE/(n-2)}\) where \(n-2 = 1373\) (d.o.f.).
Interpretation of regression coefficients
\(\hat{\beta}_1 = 29.91744\): intercept of the regression line. In this case, it does not make sense to consider the case of mheight=0 implying no relevant information from the intercept.
\(\hat{\beta}_2 = 0.54175\): slope of the regression line. Interpretation: the mean of dheight increases by 0.54175 inches (\(\approx\) 1,376045 cm) as the value of mheight increases by one-unit.
Confidence Interval
Confidence Intervals for \(\beta_1\) and \(\beta_2\) with 95% confidence level
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 26.7346495 33.1002241
## mheight 0.4908201 0.5926739
For each \(\beta_r\), \(r=1,2\), we can compute the confidence interval with \(1-\alpha=0.95\) (you can verify the confidence level by looking the first line).
Remember that all of the estimates, intervals, and hypothesis tests arising in a regression analysis have been developed assuming that the model is correct. For this reason, it is important checking if the assumptions are reasonable within the data.
Analysis of Residuals
par(mfrow=c(2,2))
plot(mod)
Comments:
Residual vs Fitted, is used to understand the relationship between residuals and fitted values. We can observe a constant dispersion of the points (no patterns).
Q-Q Residuals, is used to verify the Gaussian assumption about errors. The dashed line represents the theoretical Gaussian distribution while the dots the empirical distribution (within our data). The empirical distribution seems to be similar to the theoretical one, which means that the Gaussian assumption of the errors would be reasonable.
Scale-Location, is used to check the homoscedasticity assumption of errors. In the plot we can observe that there are not patterns. Even the homoscedasticity assumption should be reasonable.
Residuals vs Leverage, is used to check other issues (such as outliers and leverage points). In our case, no influencial points are detected.
Prediction with simple linear regression model
Let consider new data for mheight
new_data <- data.frame(mheight=c(70.4, 60.6, 68, 57.4))
Although the prediction can be computed for any value of mheight, it may be unreliable if the new values are outside the range of mheight. Let’s check
cat('Min:', min(Heights$mheight), 'Max:', max(Heights$mheight))
## Min: 55.4 Max: 70.8
Each value of new data belongs to (55.4, 70.8).
Confidence Interval for the mean of dheight:
(prev_ic <- predict(mod, new_data, interval = "confidence"))
## fit lwr upr
## 1 68.05643 67.63431 68.47854
## 2 62.74731 62.59473 62.89988
## 3 66.75623 66.44934 67.06312
## 4 61.01372 60.72983 61.29760
For each new data \(x^*\), the above confidence interval correspond to:
\[( \hat{y}_* - t_{n-2, 1-\alpha/2} \sqrt{s^2\left(\frac{1}{n} + \frac{(x^*- \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}, \hat{y}_* + t_{n-2, 1-\alpha/2} \sqrt{s^2\left(\frac{1}{n} + \frac{(x^*- \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)})\]
where \(\hat{y}_*= \hat{\beta}_1 + \hat{\beta}_2x^*\) which can be rewritten \(\hat{y}_*= \bar{y} + \hat{\beta}_2(x^*- \bar{x})\).
Plot with the previous confidence interval:
plot(Heights$mheight, Heights$dheight, xlim=range(new_data$mheight), ylab = "Daughter's Height", xlab = "Mother's Height")
matlines(new_data$mheight, prev_ic, lty=c(1,2,2), col="violet")
2. Simulated data - Analysis of Residuals
In the next sections, we are going to explore some cases in which the assumptions related to the linear model are not satisfied from the data.
To show several behavior, we can start by simulating our data
set.seed(123)
x <- sort(runif(50))
2.1 Issues with the form of the regression model
The linear regression model is generally defined as \(y = \beta_1 + \beta_2 x + \epsilon\) leading to estimate a regression line. However, sometimes the residuals have non-linear patterns.
y1 <- x^2+rnorm(50,sd=.05)
y2 <- -x^2+rnorm(50,sd=.05)
y3 <- (x-.5)^3+rnorm(50,sd=.01)
plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")
In these cases, we can see a systematic pattern which highlights a non-linear pattern. Indeed, we know the data generative mechanism and we can see that in the firsts two plot, a quadratic term should be added in the model while in the last plot, the pattern suggests a cubed term.
par(mfrow=c(2, 2))
plot(lm(y1~x))
par(mfrow=c(2, 2))
plot(lm(y1~poly(x, 2)))
2.2 Issues with Homoscedasticity assumption
Recall of Homoscedasticity assumption: \(Var(\epsilon_i)= \sigma^2 > 0, \quad \forall i\).
y1 <- x+x*rnorm(50,sd=.03)
y2 <- x+(1-x)*rnorm(50,sd=.03)
y3 <- x+c(rnorm(15,sd=.1),rnorm(20,sd=.4),rnorm(15,sd=.02))
y4 <- x+c(rnorm(20,sd=.5),rnorm(15,sd=.1),rnorm(15,sd=.4))
plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y4~x)),xlab="X",ylab="residuals")
In the above plots, we can observe a different variability for each range of x values.
2.3 Issues with Gaussian assumption
Recall of Gaussian assumption: \(\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\).
y1 <- x+.05*rt(50,df=2)
y2 <- x+.03*(rchisq(50,1)-.5)
y3 <- x-.03*(rchisq(50,1)-.5)
y4 <- x+.07*rt(50,df=2)
plot(x,residuals(lm(y1~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y2~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y3~x)),xlab="X",ylab="residuals")
plot(x,residuals(lm(y4~x)),xlab="X",ylab="residuals")
These plots highlight some issues with the Gaussian assumption of the errors. Let’s also check the QQ-plot and histogram of residuals for the firsts two plot (just for an example).
qqnorm(residuals(lm(y1~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(y1~x)), col = "steelblue", lwd = 2)
hist(residuals(lm(y1~x)),main="Histogram",xlab="residuals", breaks=seq(min(residuals(lm(y1~x))),max(residuals(lm(y1~x))),length=10),ylab="")
qqnorm(residuals(lm(y2~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(y2~x)), col = "steelblue", lwd = 2)
hist(residuals(lm(y2~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(y2~x))),max(residuals(lm(y2~x))),length=10),ylab="")
Let assume that we don’t know anything about the data generative mechanism, what can we do? A way could be to transform our dependent variable, for instance:
qqnorm(residuals(lm(log(y2)~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(log(y2)~x)), col = "steelblue", lwd = 2)
hist(residuals(lm(log(y2)~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(log(y2)~x))),max(residuals(lm(log(y2)~x))),length=10),ylab="")
qqnorm(residuals(lm(sqrt(y2)~x)), pch = 1, frame = FALSE)
qqline(residuals(lm(sqrt(y2)~x)), col = "steelblue", lwd = 2)
hist(residuals(lm(sqrt(y2)~x)),main="Histogram",xlab="residuals",
breaks=seq(min(residuals(lm(sqrt(y2)~x))),max(residuals(lm(sqrt(y2)~x))),length=10),ylab="")