#load libraries
library(car)
## Caricamento del pacchetto richiesto: carData
library(ggplot2)
pkmn <- read.csv("pkmn.csv")
Dataset: Pokémon
Pokémon are creatures either wild or under the coaching direction of Pokémon trainers who exercise their Pokemon in sports manly battle to increase their stats.
head(pkmn)
## Name Type Total HP Attack Speed
## 1 Slowbro Water 416.5443 95 75 30
## 2 Kecleon Normal 408.1936 60 90 40
## 3 Remoraid Water 309.6430 35 65 65
## 4 Ditto Normal 242.7398 48 48 48
## 5 Stoutland Normal 474.0814 85 110 80
## 6 Golduck Water 423.0775 80 82 85
Attack, Speed and HP are the base stats for evaluating each pokémon in fights. Total refers to the sum of base stats (Attack, Speed, HP, Special Attack, Special Defence, Defence). Speed determines the timelaps of making the first attack. Type highlights the strength and weakness of pokémon (water wins against fire, etc.). HP means Health points.
At the beginning, we want to exclude the variable Type. Therefore, an appropriate regression model is represented by
\[Total_i = \beta_1 + \beta_2 HP_i + \beta_3 Attack_i + \beta_4 Speed_i + \epsilon_i\] with the following assumptions:
\(\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\), \(i=1, \ldots, 150\)
Linear independence of covariates
Question: In the case of \(k \ge 1\) covariates, how can we show graphically the relationship between the response variable Total and our covariates?
Answer: When we have 2 covariates, the “right” plot should be the 3-dimensional plot. However, the main issue is the lower intepretability and it is just available in the case of 2 covariates. One way would be to represent our data through a Scatter matrix, keeping in mind we are just considering the marginal effect of each independent variable on the response (although with multiple regression we want to explain the combined effect). But it can be used with \(k \ge 2\) covariates.
Scatter plot Matrix
Scatter_Matrix <- GGally::ggpairs(pkmn,columns = c(3:6),
title = "Scatter Plot Matrix for Pokémon Dataset",
axisLabels = "show")
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Marginal interpretation
Total vs Attack: the relationship among these two variables appears to be linear. Indeed, the linear correlation is really high.
Total vs HP: these two variables are less correlated. We can observe that as Total increases the increment of HP is less proportional.
Total vs Speed: these two variables are less correlated and the dispersion is increased. However, we can observe that as Speed increases in general Total increases as well.
Relationship among independent variables: we can observe that the correlation among independent variables (Attack-Speed, Attack-HP, HP-Speed) is not high, which represents a positive aspect in our case. Indeed, a marked linear dependence between independent variables leads to issues such as in the regression matrix X the corresponding columns are almost linearly dependent (multicollinearity). In the presence of multicollinearity, the estimates may be numerically unstable because the matrix \((X^TX)\) is nearly singular (a matrix is invertible iff it is nonsingular). \(\implies\) a good model should not include pairs of highly correlated variables.
mod <- lm(Total~HP + Attack + Speed, data=pkmn)
summary(mod)
##
## Call:
## lm(formula = Total ~ HP + Attack + Speed, data = pkmn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.644 -21.173 -4.357 13.146 114.869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.27971 9.37876 2.695 0.00786 **
## HP 0.25649 0.08163 3.142 0.00203 **
## Attack 4.13193 0.09544 43.295 < 2e-16 ***
## Speed 0.10078 0.10813 0.932 0.35286
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.31 on 146 degrees of freedom
## Multiple R-squared: 0.9481, Adjusted R-squared: 0.947
## F-statistic: 888.8 on 3 and 146 DF, p-value: < 2.2e-16
Added-Variable Plots
avPlots(mod)
These plots allow us to visualize the relationship between each individual independent variable and the dependent variable in a model while holding other independent variables constant.
The estimated model corresponds to
\[ \widehat{Total}_i = 25.27971 + 0.25649 HP_i + 4.13193 Attack_i + 0.10078 Speed_i\]
The estimated standard errors of each coefficient are \(\sqrt{\hat{Var}(\hat{\beta}_1)} = 9.37876\), \(\sqrt{\hat{Var}(\hat{\beta}_2)} = 0.08163\), \(\sqrt{\hat{Var}(\hat{\beta}_3)} = 0.09544\) and \(\sqrt{\hat{Var}(\hat{\beta}_4)} = 0.10813\). Considering the following system of hypothesis \[\begin{cases} \mbox{H0: } \beta_r = 0\\ \mbox{H1: } \beta_r \ne 0 \end{cases}\]
for \(r=1,2,3,4\). The t-values for \(\beta_1\), \(\beta_2\), \(\beta_3\) and \(\beta_4\) are
\[t_1^{oss} = \frac{\hat{\beta}_1}{\sqrt{\hat{Var}(\hat{\beta}_1)}} = 2.695\]
\[t_2^{oss} = \frac{\hat{\beta}_2}{\sqrt{\hat{Var}(\hat{\beta}_2)}} = 3.142\]
\[t_3^{oss} = \frac{\hat{\beta}_3}{\sqrt{\hat{Var}(\hat{\beta}_3)}} = 43.295\]
\[t_4^{oss} = \frac{\hat{\beta}_4}{\sqrt{\hat{Var}(\hat{\beta}_4)}} = 0.932\]
Significance of coefficients: \(\beta_1\) and \(\beta_1\) are significant at 1% level. \(\beta_3\) is significant for each level of the significance codes. The null hypothesis related to the \(t_4\)- test is not rejected, which means we can conclude that \(\beta_4\) is not significant.
The f-value is \(f^{oss} = 888.8\) and the p-value is \(\alpha^{oss}= P(F_{3, 146} > f^{oss}) \approx 0\). We reject the null hypothesis.
The residual standard error corresponds to \(\sqrt{SSE/(n-4)}\) where \(n-4 = 146\) (degree of freedom).
The coefficients of determination correspond to \(R^2= 0.9481\) and \(R_{(adj)}^2=0.947\). The adjusted-\(R^2\) is penalized by the number of covariates.
Interpretation Coefficients
\(\hat{\beta}_1 = 25.28\), the mean of Total stats is equal to 25.28 while keeping the available covariates equal to zero. But it is not particular meaningful, for instance: having HP=0 makes no sense; there cannot be a pokémon without available health points! This situation can be extended to the other variables (Attack and Speed) as well.
\(\hat{\beta}_2 = 0.26\), the mean of Total stats increases by 0.26 points when HP increases by one point while holding other independent variables constant
\(\hat{\beta}_3 = 4.13\), the mean of Total stats increases by 4.13 points when Attack increases by one point while holding other independent variables constant
\(\hat{\beta}_4 = 0.10\), the mean of Total stats increases by 0.10 points when Speed increases by one point while holding other independent variables constant
Diagnostic plot
plot(mod)
Comments:
Residual vs Fitted: we can observe a slight pattern but it is not really marked
Q-Q Residuals: highlights a skewed empirical distribution of residuals
Scale-Location: the homoscedasticity assumption should be reasonable
Residuals vs Leverage: the observation 52 has an high leverage and we can also observe that there are some outliers (e.g. observation 31). However, these points are not influential for estimates (looking the Cook’s distance).
Test about a subset of regression coefficients
Let’s consider two model:
\[\mathcal{M}_1: \quad Total_i = \beta_1 + \beta_2 Attack_i + \epsilon_i\]
\[\mathcal{M}_3: \quad Total_i = \beta_1 + \beta_2 Attack_i + \beta_3 HP_i + \beta_4 Speed_i + \epsilon_i\]
We want to perfom a test with the following system of hypothesis
\[\begin{cases} \mbox{H0: } \beta_3= \beta_4 = 0\\ \mbox{H1: } \exists j \in \{3,4\}: \beta_j \ne 0 \end{cases}\]
mod_1 <- lm(Total~Attack, data=pkmn) #restricted model
anova(mod_1, mod)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack
## Model 2: Total ~ HP + Attack + Speed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 133990
## 2 146 125388 2 8601.4 5.0077 0.00788 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual Sum of Squares
\(\mathcal{M}_1\): \(SSE= 133990\)
\(\mathcal{M}_3\): \(SSE= 125388\)
Test statistic
\[F^{obs}= \frac{(SSE_{\mathcal{M}_1}-SSE_{\mathcal{M}_3})/(p_3 - p_1)}{SSE_{\mathcal{M}_3}/(n- p_3)}= \frac{(133990 - 125388)/(4-2)}{125388/(150-4)}=5.0077\] and the p-value is \(\alpha^{obs}= 0.00788\) \(\implies\) We reject our null hypothesis \(H0\).
We can also perform a F-test with \(H0: \beta_4 = 0\) vs \(H0: \beta_4 \ne 0\)
mod_2 <- lm(Total~Attack +HP, data=pkmn) #unconstrained model
anova(mod_2, mod)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack + HP
## Model 2: Total ~ HP + Attack + Speed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 147 126134
## 2 146 125388 1 746.05 0.8687 0.3529
In this case, we cannot reject \(H_0\) (as is in the t-test). From this point, we will exclude Speed.
ANCOVA
Now, we want to add the variable Type in our model. Let’s start by visualizing the marginal relationship between Total and Type
theme_set(theme_bw())
ggplot(pkmn, aes(x=Type, y= Total, color=Type)) +
geom_boxplot(outlier.colour="black", outlier.shape=8, outlier.size=4)
Water Pokémon reach higher Total points than Normal Pokémon. We can observe the minimum of Total points is lower for Water Pokémon. In general, Water Pokémon get higher Total points than Normal Pokémon.
Let’s start by considering the following model
\[Total_i = \beta_1 + \beta_2 Attack_i + \beta_3 D_i + \epsilon_i\]
where
\[D_i = \begin{cases} 1,\quad \mbox{ if } Type_i =Water\\ 0,\quad \mbox{ Otherwise } \end{cases}\]
mod_anc_1 <- lm(Total~Attack +Type, data=pkmn)
summary(mod_anc_1)
##
## Call:
## lm(formula = Total ~ Attack + Type, data = pkmn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.373 -18.310 -3.538 11.418 103.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.10802 6.39629 4.238 3.96e-05 ***
## Attack 4.24972 0.07495 56.704 < 2e-16 ***
## TypeWater 28.28463 4.35591 6.493 1.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.61 on 147 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9563
## F-statistic: 1631 on 2 and 147 DF, p-value: < 2.2e-16
The estimated model corresponds to
\[ \widehat{Total}_i = 27.10802 + 4.24972 Attack_i + 28.28463D_i\]
The two estimated regression line for each Type
ggplot(pkmn, aes(x=Attack, y=Total, color=Type))+ geom_point() + geom_smooth(method = "lm", aes(fill=Type))
## `geom_smooth()` using formula = 'y ~ x'
which correspond to
\[ (Normal) \quad \widehat{Total}_i= 27.10802 + 4.24972 Attack_i\]
\[ (Water) \quad \widehat{Total}_i= 55.39265 + 4.24972 Attack_i\]
Adding an interaction
mod_anc_2 <- lm(Total~ Attack + Type + Type:Attack, data=pkmn)
summary(mod_anc_2)
##
## Call:
## lm(formula = Total ~ Attack + Type + Type:Attack, data = pkmn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.944 -16.457 -3.552 12.232 103.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.7010 8.3341 4.884 2.7e-06 ***
## Attack 4.0661 0.1043 38.971 < 2e-16 ***
## TypeWater 1.1075 11.7470 0.094 0.9250
## Attack:TypeWater 0.3660 0.1473 2.484 0.0141 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.16 on 146 degrees of freedom
## Multiple R-squared: 0.9586, Adjusted R-squared: 0.9578
## F-statistic: 1128 on 3 and 146 DF, p-value: < 2.2e-16
The estimated model corresponds to
\[ \widehat{Total}_i = 40.7010 + 4.0661 Attack_i + 1.1075D_i + 0.36029 Attack_i \times D_i\]
From the previous output, we can observe that adding the interaction the coefficient related to \(D_i\) becomes not significant.
Thus, the two estimated regression line for each Type correspond to
\[ (Normal) \quad \widehat{Total}_i= 40.7010 + 4.0661 Attack_i\]
\[ (Water) \quad \widehat{Total}_i= 41.8085 + 4.42639 Attack_i\]
We can try to evaluate the interaction through F-test
anova(mod_anc_1, mod_anc_2)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack + Type
## Model 2: Total ~ Attack + Type + Type:Attack
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 147 104124
## 2 146 99900 1 4223.4 6.1723 0.01411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject \(H_0: \mbox{interaction no significant}\) at 5% level of significance.
Let’s check \(H_0: \mu_W = \mu_N\)
anova(mod_1, mod_anc_2)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack
## Model 2: Total ~ Attack + Type + Type:Attack
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 133990
## 2 146 99900 2 34089 24.91 4.923e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject \(H0\): The mean of Total for Water Pokémon differs from the mean of Total for Normal Pokémon.
Diagnostic plot
plot(mod_anc_2)
Comments:
Residual vs Fitted: we can observe a slight pattern but it is not really marked
Q-Q Residuals: highlights a skewed empirical distribution of residuals
Scale-Location: the homoscedasticity assumption should be reasonable
Residuals vs Leverage: No influential points are detected (looking the Cook’s distance)
ANOVA
active_data <- read.csv("active.csv")
head(active_data)
## site age edu group booster sex reason ufov hvltt hvltt2 hvltt3 hvltt4 mmse id
## 1 1 76 12 1 1 1 28 16 28 28 17 22 27 1
## 2 1 67 10 1 1 2 13 20 24 22 20 27 25 2
## 3 6 67 13 3 1 2 24 16 24 24 28 27 27 3
## 4 5 72 16 1 1 2 33 16 35 34 32 34 30 4
## 5 4 69 12 4 0 2 30 16 35 29 34 34 28 5
## 6 1 70 13 1 1 1 35 23 29 27 26 29 23 6
We are interested in finding out if there exists a relationship between hvltt2, group and sex.
Description of our data
The Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) trial is designed to determine whether cognitive training interventions (memory, reasoning, and speed of information processing), which have previously been found to be successful at improving mental abilities under laboratory or small-scale field conditions, can affect cognitively based measures of daily functioning.
hvltt2 represents the Hopkins Verbal Learning Test total score
group contains 4 categories: (1) “Memory”, (2) “Reasoning”, (3) “Speed”, (4) “Control”
sex is a binary variable: (1) Male, and (2) Female
Plot of our variables
ggplot(active_data, aes(x = factor(group), y = hvltt2, fill = factor(sex))) + geom_bar(stat="identity", position = "dodge", width = 0.7) + labs(x = "Group", y = "Memory performance", fill="Sex") + scale_x_discrete(breaks=c("1", "2", "3", "4"), labels=c("Memory", "Reasoning", "Speed", "Control")) + scale_fill_discrete(breaks=c("1", "2"), labels=c("Male", "Female"))
In general, we can observe that the memory performance is higher with Sex=Female rather than Sex=Male. For each Group, there is a slight difference in the change of memory performance.
Linear regression model
At the beginning, we are interested in estimating a linear regression model as
\[Y_i = \beta_1 + \beta_2 x_{i2}+ \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i\] where \(Y_i = hvltt2_i\),
\[x_{i2} = \begin{cases} 1,\quad \mbox{ if } Group_i =2\\ 0,\quad \mbox{ Otherwise } \end{cases} \quad x_{i3} = \begin{cases} 1,\quad \mbox{ if } Group_i =3\\ 0,\quad \mbox{ Otherwise } \end{cases}\quad x_{i4} = \begin{cases} 1,\quad \mbox{ if } Group_i =4\\ 0,\quad \mbox{ Otherwise } \end{cases}\]
and
\[x_{i5} = \begin{cases} 1,\quad \mbox{ if } Sex_i =2\\ 0,\quad \mbox{ Otherwise } \end{cases}\]
model<-lm(hvltt2~group +sex,data=active_data)
summary(model)
##
## Call:
## lm(formula = hvltt2 ~ group + sex, data = active_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4928 -3.4928 0.5072 3.8749 11.8749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.6908 0.3628 68.047 < 2e-16 ***
## group2 -1.5657 0.3786 -4.135 3.73e-05 ***
## group3 -1.2290 0.3739 -3.287 0.00103 **
## group4 -1.8429 0.3787 -4.866 1.25e-06 ***
## sex2 1.8019 0.3140 5.739 1.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.283 on 1570 degrees of freedom
## Multiple R-squared: 0.03899, Adjusted R-squared: 0.03654
## F-statistic: 15.93 on 4 and 1570 DF, p-value: 8.712e-13
The estimated regression model corresponds to
\[\widehat{Y}_i = 24.6908 - 1.5657 x_{i2} - 1.2290 x_{i3} - 1.8429 x_{i4} + 1.8019 x_{i5}\]
where the averages for Group-Male are \(\bar{y}_{A-M} = 24.6908\), \(\bar{y}_{B-M} = 23.1251\), \(\bar{y}_{C-M} = 23.4618\), and \(\bar{y}_{D-M} = 22.8479\); the averages for Group-Female are \(\bar{y}_{A-F} = 26.4927\), \(\bar{y}_{B-F} = 24.927\), \(\bar{y}_{C-F} = 25.2637\), and \(\bar{y}_{D-F} = 24.6498\).
Using the t-test for each regression coefficient, we can conclude that we reject \(H_0\) (i.e. each coefficient is significant).
The value of \(R^2 = 0.039\) (and, in this case, almost equivalent \(R_{adj}^2\)) is near to zero, which means that our model is not good (btw, we are considering a simple case to study ANOVA).
Evaluating interactions (two-way ANOVA with interaction)
Let’s evaluate interactions through the interaction plot.
ggplot(active_data,aes(group, hvltt2, fill=sex)) +
stat_summary(aes(group = sex, lty = sex), fun = "mean", geom = "line")+ labs(x = "Group", y = "Memory performance - Average", fill="Sex") + scale_x_discrete(breaks=c("1", "2", "3", "4"), labels=c("Memory", "Reasoning", "Speed", "Control")) + scale_fill_discrete(breaks=c("1", "2"), labels=c("Male", "Female"))
The plot suggests “absence of interaction” (parallel lines).
However, let’s try (just as an example on how to going on) to add interactions in our model.
mod_int<-lm(hvltt2~group +sex+ group:sex,data=active_data)
summary(mod_int)
##
## Call:
## lm(formula = hvltt2 ~ group + sex + group:sex, data = active_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.5375 -3.5334 0.4625 3.8204 11.6522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.5294 0.5733 42.784 < 2e-16 ***
## group2 -1.1816 0.7952 -1.486 0.13753
## group3 -0.7437 0.8132 -0.915 0.36059
## group4 -1.9848 0.7604 -2.610 0.00913 **
## sex2 2.0080 0.6479 3.100 0.00197 **
## group2:sex2 -0.4982 0.9045 -0.551 0.58182
## group3:sex2 -0.6142 0.9159 -0.671 0.50257
## group4:sex2 0.2199 0.8775 0.251 0.80219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.286 on 1567 degrees of freedom
## Multiple R-squared: 0.03975, Adjusted R-squared: 0.03546
## F-statistic: 9.266 on 7 and 1567 DF, p-value: 2.799e-11
which corresponds to evaluate the following model:
\[Y_i = \beta_1 + \beta_2 x_{i2}+ \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i2}*x_{i5}+ \beta_7 x_{i3}*x_{i5} + \beta_8 x_{i4}*x_{i5} + \epsilon_i\]
From the output of our model, we can observe that the coefficients related to the interactions are no-significant.