#load libraries
library(car)
## Loading required package: carData
library(ggplot2)
pkmn <- read.csv("pkmn.csv")
Dataset: Pokémon
Pokémon are creatures, either wild or under the direction of Pokémon trainers, who train their Pokémon (mainly through battles) to increase their stats.
head(pkmn)
## Name Type Total HP Attack Speed
## 1 Slowbro Water 416.5443 95 75 30
## 2 Kecleon Normal 408.1936 60 90 40
## 3 Remoraid Water 309.6430 35 65 65
## 4 Ditto Normal 242.7398 48 48 48
## 5 Stoutland Normal 474.0814 85 110 80
## 6 Golduck Water 423.0775 80 82 85
Attack, Speed and HP are the base stats used to evaluate each Pokémon in fights. Total is the sum of all base stats (HP, Attack, Defence, Special Attack, Special Defence, Speed). Speed determines which Pokémon attacks first. Type indicates the strengths and weaknesses of a Pokémon (Water beats Fire, etc.). HP stands for Health Points.
At the beginning, we want to exclude the variable Type. Therefore, an appropriate regression model is represented by
$\text{Total}_i = \beta_1 + \beta_2\,\text{HP}_i + \beta_3\,\text{Attack}_i + \beta_4\,\text{Speed}_i + \epsilon_i$, with the following assumptions:
$\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$, $i = 1, \ldots, 150$
Linear independence of covariates
Question: In the case of $k \geq 1$ covariates, how can we graphically show the relationship between the response variable Total and our covariates?
Answer: With 2 covariates, the “right” plot would be a 3-dimensional plot. However, it is harder to interpret and it is only available for exactly 2 covariates. An alternative is to represent the data through a scatter plot matrix, keeping in mind that it only shows the marginal effect of each independent variable on the response (whereas with multiple regression we want to explain their combined effect). On the other hand, it can be used with any $k \geq 2$ covariates.
Scatter plot Matrix
Scatter_Matrix <- GGally::ggpairs(pkmn, columns = 3:6,
                                  title = "Scatter Plot Matrix for Pokémon Dataset",
                                  axisLabels = "show")
Scatter_Matrix  # display the plot
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Marginal interpretation
Total vs Attack: the relationship between these two variables appears to be linear; indeed, their linear correlation is very high.
Total vs HP: these two variables are less correlated; the increase of Total with HP is less than proportional.
Total vs Speed: these two variables are even less correlated and the dispersion is larger. Nevertheless, as Speed increases, Total tends to increase as well.
Relationship among independent variables: the correlations among the independent variables (Attack-Speed, Attack-HP, HP-Speed) are not high, which is a positive aspect in our case. A marked linear dependence between independent variables would make the corresponding columns of the regression matrix $X$ almost linearly dependent (multicollinearity). In the presence of multicollinearity the estimates may be numerically unstable, because the matrix $X^T X$ is nearly singular (a matrix is invertible iff it is nonsingular). $\Longrightarrow$ a good model should not include pairs of highly correlated covariates.
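As a quick numerical check of this point, we can look at the pairwise correlations among the covariates (a minimal sketch; the column names are those shown by head(pkmn)):
# Pairwise correlations among the covariates: values far from +/-1 suggest
# that multicollinearity should not be a concern here
cor(pkmn[, c("HP", "Attack", "Speed")])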
mod <- lm(Total~HP + Attack + Speed, data=pkmn)
summary(mod)
##
## Call:
## lm(formula = Total ~ HP + Attack + Speed, data = pkmn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.644 -21.173 -4.357 13.146 114.869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.27971 9.37876 2.695 0.00786 **
## HP 0.25649 0.08163 3.142 0.00203 **
## Attack 4.13193 0.09544 43.295 < 2e-16 ***
## Speed 0.10078 0.10813 0.932 0.35286
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.31 on 146 degrees of freedom
## Multiple R-squared: 0.9481, Adjusted R-squared: 0.947
## F-statistic: 888.8 on 3 and 146 DF, p-value: < 2.2e-16
Added-Variable Plots
avPlots(mod)
These plots allow us to visualize the relationship between each individual independent variable and the dependent variable in a model while holding other independent variables constant.
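As a sketch of the idea behind these plots, the HP panel can be reproduced by hand: regress both Total and HP on the remaining covariates and plot the two sets of residuals against each other.
# Added-variable (partial regression) plot for HP, constructed manually
res_total <- resid(lm(Total ~ Attack + Speed, data = pkmn))  # Total adjusted for the other covariates
res_hp    <- resid(lm(HP ~ Attack + Speed, data = pkmn))     # HP adjusted for the other covariates
plot(res_hp, res_total, xlab = "HP | others", ylab = "Total | others")
abline(lm(res_total ~ res_hp))  # the slope of this line equals the HP coefficient in the full model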
The estimated model corresponds to
$\widehat{\text{Total}}_i = 25.27971 + 0.25649\,\text{HP}_i + 4.13193\,\text{Attack}_i + 0.10078\,\text{Speed}_i$
The estimated standard errors of the coefficients are $\sqrt{\widehat{Var}(\hat\beta_1)} = 9.37876$, $\sqrt{\widehat{Var}(\hat\beta_2)} = 0.08163$, $\sqrt{\widehat{Var}(\hat\beta_3)} = 0.09544$ and $\sqrt{\widehat{Var}(\hat\beta_4)} = 0.10813$. Consider the following system of hypotheses
$H_0: \beta_r = 0$ vs $H_1: \beta_r \neq 0$
for $r = 1, 2, 3, 4$. The observed t-values for $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are
$t_{obs,1} = \hat\beta_1 / \sqrt{\widehat{Var}(\hat\beta_1)} = 2.695$
$t_{obs,2} = \hat\beta_2 / \sqrt{\widehat{Var}(\hat\beta_2)} = 3.142$
$t_{obs,3} = \hat\beta_3 / \sqrt{\widehat{Var}(\hat\beta_3)} = 43.295$
$t_{obs,4} = \hat\beta_4 / \sqrt{\widehat{Var}(\hat\beta_4)} = 0.932$
Significance of coefficients: $\beta_1$ and $\beta_2$ are significant at the 1% level. $\beta_3$ is significant at every level of the significance codes. The null hypothesis of the $t_4$ test is not rejected, so we conclude that $\beta_4$ is not significant.
The observed F-value is $F_{obs} = 888.8$ and the p-value is $\alpha_{obs} = P(F_{3,146} > F_{obs}) \approx 0$: we reject the null hypothesis $H_0: \beta_2 = \beta_3 = \beta_4 = 0$.
The residual standard error corresponds to $\sqrt{SSE/(n-4)}$ with $n - 4 = 146$ degrees of freedom.
The coefficients of determination are $R^2 = 0.9481$ and $R^2_{adj} = 0.947$; the adjusted $R^2$ includes a penalty for the number of covariates.
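All of the quantities above can be recovered directly from the fitted object; a minimal sketch using base R extractor functions:
# Recomputing the main quantities reported by summary(mod)
se    <- sqrt(diag(vcov(mod)))             # estimated standard errors of the coefficients
t_obs <- coef(mod) / se                    # observed t statistics
sse   <- sum(resid(mod)^2)                 # residual sum of squares
sigma_hat <- sqrt(sse / df.residual(mod))  # residual standard error (146 degrees of freedom)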
Interpretation of the Coefficients
$\hat\beta_1 = 25.28$: the mean of Total is equal to 25.28 when all the covariates are equal to zero. This is not particularly meaningful here; for instance, HP = 0 makes no sense, since there cannot be a Pokémon without health points! The same remark applies to the other variables (Attack and Speed).
$\hat\beta_2 = 0.26$: the mean of Total increases by 0.26 points when HP increases by one point, holding the other covariates constant.
$\hat\beta_3 = 4.13$: the mean of Total increases by 4.13 points when Attack increases by one point, holding the other covariates constant.
$\hat\beta_4 = 0.10$: the mean of Total increases by 0.10 points when Speed increases by one point, holding the other covariates constant.
Diagnostic plot
plot(mod)
Comments:
Residuals vs Fitted: we can observe a slight pattern, but it is not very marked.
Q-Q Residuals: highlights a skewed empirical distribution of the residuals.
Scale-Location: the homoscedasticity assumption seems reasonable.
Residuals vs Leverage: observation 52 has a high leverage, and there are some outliers (e.g. observation 31). However, judging by Cook's distance, these points are not influential for the estimates.
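Leverage and Cook's distance can also be inspected numerically with base R extractor functions (a minimal sketch):
# Leverage (hat values) and Cook's distance for each observation
lev  <- hatvalues(mod)
cook <- cooks.distance(mod)
head(sort(lev,  decreasing = TRUE))   # observations with the highest leverage
head(sort(cook, decreasing = TRUE))   # observations with the largest Cook's distance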
Test about a subset of regression coefficients
Let’s consider two models:
$M_1: \text{Total}_i = \beta_1 + \beta_2\,\text{Attack}_i + \epsilon_i$
$M_3: \text{Total}_i = \beta_1 + \beta_2\,\text{Attack}_i + \beta_3\,\text{HP}_i + \beta_4\,\text{Speed}_i + \epsilon_i$
We want to perform a test with the following system of hypotheses
$H_0: \beta_3 = \beta_4 = 0$ vs $H_1: \exists\, j \in \{3,4\}: \beta_j \neq 0$
mod_1 <- lm(Total~Attack, data=pkmn) #restricted model
anova(mod_1, mod)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack
## Model 2: Total ~ HP + Attack + Speed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 133990
## 2 146 125388 2 8601.4 5.0077 0.00788 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual Sum of Squares
$M_1$: $SSE_{M_1} = 133990$
$M_3$: $SSE_{M_3} = 125388$
Test statistic
$F_{obs} = \dfrac{(SSE_{M_1} - SSE_{M_3})/(p_3 - p_1)}{SSE_{M_3}/(n - p_3)} = \dfrac{(133990 - 125388)/(4-2)}{125388/(150-4)} = 5.0077$, and the p-value is $\alpha_{obs} = 0.00788$ $\Longrightarrow$ we reject the null hypothesis $H_0$.
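The same statistic can be reproduced by hand from the two fitted models (a minimal sketch, with $n = 150$):
# Partial F-test computed from the residual sums of squares
sse_m1 <- sum(resid(mod_1)^2)   # restricted model
sse_m3 <- sum(resid(mod)^2)     # full model
F_obs  <- ((sse_m1 - sse_m3) / 2) / (sse_m3 / (150 - 4))
pf(F_obs, df1 = 2, df2 = 146, lower.tail = FALSE)   # p-value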
We can also perform an F-test with $H_0: \beta_4 = 0$ vs $H_1: \beta_4 \neq 0$
mod_2 <- lm(Total~Attack +HP, data=pkmn) #restricted model (without Speed)
anova(mod_2, mod)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack + HP
## Model 2: Total ~ HP + Attack + Speed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 147 126134
## 2 146 125388 1 746.05 0.8687 0.3529
In this case we cannot reject $H_0$, consistently with the t-test (for a single coefficient the F statistic is the square of the t statistic: $0.932^2 \approx 0.869$). From this point on, we exclude Speed from the model.
ANCOVA
Now, we want to add the variable Type in our model. Let’s start by visualizing the marginal relationship between Total and Type
theme_set(theme_bw())
ggplot(pkmn, aes(x=Type, y= Total, color=Type)) +
geom_boxplot(outlier.colour="black", outlier.shape=8, outlier.size=4)
In general, Water Pokémon reach higher Total points than Normal Pokémon, although the minimum Total is lower for Water Pokémon.
Let’s start by considering the following model
$\text{Total}_i = \beta_1 + \beta_2\,\text{Attack}_i + \beta_3 D_i + \epsilon_i$
where
$D_i = \begin{cases} 1, & \text{if } \text{Type}_i = \text{Water} \\ 0, & \text{otherwise} \end{cases}$
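This is exactly the dummy coding that lm() builds internally when Type enters the formula (Normal is the baseline level, being first in alphabetical order). As an illustration, the dummy can also be constructed explicitly:
# Explicit construction of the dummy D_i (lm() does this automatically for Type)
D <- ifelse(pkmn$Type == "Water", 1, 0)
head(data.frame(Type = pkmn$Type, D = D))          # check the coding
head(model.matrix(~ Attack + Type, data = pkmn))   # same coding built internally by lm()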
mod_anc_1 <- lm(Total~Attack +Type, data=pkmn)
summary(mod_anc_1)
##
## Call:
## lm(formula = Total ~ Attack + Type, data = pkmn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.373 -18.310 -3.538 11.418 103.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.10802 6.39629 4.238 3.96e-05 ***
## Attack 4.24972 0.07495 56.704 < 2e-16 ***
## TypeWater 28.28463 4.35591 6.493 1.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.61 on 147 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9563
## F-statistic: 1631 on 2 and 147 DF, p-value: < 2.2e-16
The estimated model corresponds to
$\widehat{\text{Total}}_i = 27.10802 + 4.24972\,\text{Attack}_i + 28.28463\,D_i$
The two estimated regression lines, one for each Type:
ggplot(pkmn, aes(x=Attack, y=Total, color=Type))+ geom_point() + geom_smooth(method = "lm", aes(fill=Type))
## `geom_smooth()` using formula = 'y ~ x'
which correspond to
(Normal) $\widehat{\text{Total}}_i = 27.10802 + 4.24972\,\text{Attack}_i$
(Water) $\widehat{\text{Total}}_i = 55.39265 + 4.24972\,\text{Attack}_i$
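The Water intercept is simply the baseline intercept plus the TypeWater coefficient; a quick check:
# Intercept of the Water line = baseline intercept + TypeWater effect
unname(coef(mod_anc_1)["(Intercept)"] + coef(mod_anc_1)["TypeWater"])   # about 55.39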
Adding an interaction
mod_anc_2 <- lm(Total~ Attack + Type + Type:Attack, data=pkmn)
summary(mod_anc_2)
##
## Call:
## lm(formula = Total ~ Attack + Type + Type:Attack, data = pkmn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.944 -16.457 -3.552 12.232 103.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.7010 8.3341 4.884 2.7e-06 ***
## Attack 4.0661 0.1043 38.971 < 2e-16 ***
## TypeWater 1.1075 11.7470 0.094 0.9250
## Attack:TypeWater 0.3660 0.1473 2.484 0.0141 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.16 on 146 degrees of freedom
## Multiple R-squared: 0.9586, Adjusted R-squared: 0.9578
## F-statistic: 1128 on 3 and 146 DF, p-value: < 2.2e-16
The estimated model corresponds to
$\widehat{\text{Total}}_i = 40.7010 + 4.0661\,\text{Attack}_i + 1.1075\,D_i + 0.3660\,(\text{Attack}_i \times D_i)$
From the previous output we can observe that, once the interaction is added, the coefficient of $D_i$ is no longer significant.
Thus, the two estimated regression lines, one for each Type, correspond to
(Normal) $\widehat{\text{Total}}_i = 40.7010 + 4.0661\,\text{Attack}_i$
(Water) $\widehat{\text{Total}}_i = 41.8085 + 4.4321\,\text{Attack}_i$
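The Water intercept and slope can be recovered from the coefficients of the interaction model:
# Intercept of the Water line = baseline intercept + TypeWater effect
unname(coef(mod_anc_2)["(Intercept)"] + coef(mod_anc_2)["TypeWater"])
# Slope of the Water line = baseline Attack slope + interaction effect
unname(coef(mod_anc_2)["Attack"] + coef(mod_anc_2)["Attack:TypeWater"])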
We can also evaluate the interaction through an F-test
anova(mod_anc_1, mod_anc_2)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack + Type
## Model 2: Total ~ Attack + Type + Type:Attack
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 147 104124
## 2 146 99900 1 4223.4 6.1723 0.01411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject $H_0$ (no interaction) at the 5% level of significance.
Let’s also check $H_0: \mu_W = \mu_N$, i.e. that Type has no effect on the mean of Total:
anova(mod_1, mod_anc_2)
## Analysis of Variance Table
##
## Model 1: Total ~ Attack
## Model 2: Total ~ Attack + Type + Type:Attack
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 133990
## 2 146 99900 2 34089 24.91 4.923e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject H0: The mean of Total for Water Pokémon differs from the mean of Total for Normal Pokémon.
Diagnostic plot
plot(mod_anc_2)
Comments:
Residuals vs Fitted: we can observe a slight pattern, but it is not very marked.
Q-Q Residuals: highlights a skewed empirical distribution of the residuals.
Scale-Location: the homoscedasticity assumption seems reasonable.
Residuals vs Leverage: no influential points are detected (judging by Cook's distance).
ANOVA
active_data <- read.csv("active.csv")
head(active_data)
## site age edu group booster sex reason ufov hvltt hvltt2 hvltt3 hvltt4 mmse id
## 1 1 76 12 1 1 1 28 16 28 28 17 22 27 1
## 2 1 67 10 1 1 2 13 20 24 22 20 27 25 2
## 3 6 67 13 3 1 2 24 16 24 24 28 27 27 3
## 4 5 72 16 1 1 2 33 16 35 34 32 34 30 4
## 5 4 69 12 4 0 2 30 16 35 29 34 34 28 5
## 6 1 70 13 1 1 1 35 23 29 27 26 29 23 6
We are interested in finding out if there exists a relationship between hvltt2, group and sex.
Description of our data
The Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) trial is designed to determine whether cognitive training interventions (memory, reasoning, and speed of information processing), which have previously been found to be successful at improving mental abilities under laboratory or small-scale field conditions, can affect cognitively based measures of daily functioning.
hvltt2 represents the Hopkins Verbal Learning Test total score
group contains 4 categories: (1) “Memory”, (2) “Reasoning”, (3) “Speed”, (4) “Control”
sex is a binary variable: (1) Male, and (2) Female
Plot of our variables
ggplot(active_data, aes(x = factor(group), y = hvltt2, fill = factor(sex))) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  labs(x = "Group", y = "Memory performance", fill = "Sex") +
  scale_x_discrete(breaks = c("1", "2", "3", "4"),
                   labels = c("Memory", "Reasoning", "Speed", "Control")) +
  scale_fill_discrete(breaks = c("1", "2"), labels = c("Male", "Female"))
In general, we can observe that memory performance is higher for Sex = Female than for Sex = Male, and this difference is similar across the four groups.
Linear regression model
At the beginning, we are interested in estimating a linear regression model as
$Y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i$, where $Y_i = \text{hvltt2}_i$,
$x_{i2} = \begin{cases} 1, & \text{if } \text{group}_i = 2 \\ 0, & \text{otherwise} \end{cases}$, $x_{i3} = \begin{cases} 1, & \text{if } \text{group}_i = 3 \\ 0, & \text{otherwise} \end{cases}$, $x_{i4} = \begin{cases} 1, & \text{if } \text{group}_i = 4 \\ 0, & \text{otherwise} \end{cases}$
and
$x_{i5} = \begin{cases} 1, & \text{if } \text{sex}_i = 2 \\ 0, & \text{otherwise} \end{cases}$
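Note that in active.csv the variables group and sex are stored as integers (see head(active_data) above); to obtain the dummy coding described here, they have to be treated as categorical, for example by converting them to factors before fitting the model (a sketch of the required step):
# Convert group and sex to factors so that lm() builds the dummies x_i2, ..., x_i5
active_data$group <- factor(active_data$group)
active_data$sex   <- factor(active_data$sex)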
model<-lm(hvltt2~group +sex,data=active_data)
summary(model)
##
## Call:
## lm(formula = hvltt2 ~ group + sex, data = active_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4928 -3.4928 0.5072 3.8749 11.8749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.6908 0.3628 68.047 < 2e-16 ***
## group2 -1.5657 0.3786 -4.135 3.73e-05 ***
## group3 -1.2290 0.3739 -3.287 0.00103 **
## group4 -1.8429 0.3787 -4.866 1.25e-06 ***
## sex2 1.8019 0.3140 5.739 1.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.283 on 1570 degrees of freedom
## Multiple R-squared: 0.03899, Adjusted R-squared: 0.03654
## F-statistic: 15.93 on 4 and 1570 DF, p-value: 8.712e-13
The estimated regression model corresponds to
$\hat{Y}_i = 24.6908 - 1.5657 x_{i2} - 1.2290 x_{i3} - 1.8429 x_{i4} + 1.8019 x_{i5}$
where the fitted means for the Group-Male combinations are $\bar{y}_{A,M} = 24.6908$, $\bar{y}_{B,M} = 23.1251$, $\bar{y}_{C,M} = 23.4618$ and $\bar{y}_{D,M} = 22.8479$, and the fitted means for the Group-Female combinations are $\bar{y}_{A,F} = 26.4927$, $\bar{y}_{B,F} = 24.9270$, $\bar{y}_{C,F} = 25.2637$ and $\bar{y}_{D,F} = 24.6498$ (A = Memory, B = Reasoning, C = Speed, D = Control).
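These fitted means can be obtained directly with predict(); a minimal sketch, assuming group and sex have been converted to factors as above:
# Fitted mean of hvltt2 for each group x sex combination under the additive model
grid <- expand.grid(group = factor(1:4), sex = factor(1:2))
cbind(grid, fitted_mean = predict(model, newdata = grid))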
Using a t-test for each regression coefficient, we reject $H_0$ in every case (i.e. each coefficient is significant).
The value $R^2 = 0.039$ (and, in this case, the almost identical $R^2_{adj}$) is close to zero, which means that the model explains very little of the variability of hvltt2 (note, however, that we are considering a deliberately simple case in order to study ANOVA).
Evaluating interactions (two-way ANOVA with interaction)
Let’s evaluate interactions through the interaction plot.
ggplot(active_data, aes(group, hvltt2, fill = sex)) +
  stat_summary(aes(group = sex, lty = sex), fun = "mean", geom = "line") +
  labs(x = "Group", y = "Memory performance - Average", fill = "Sex") +
  scale_x_discrete(breaks = c("1", "2", "3", "4"),
                   labels = c("Memory", "Reasoning", "Speed", "Control")) +
  scale_fill_discrete(breaks = c("1", "2"), labels = c("Male", "Female"))
The plot suggests “absence of interaction” (parallel lines).
However, let’s add the interaction to the model anyway, just as an example of how one would proceed.
mod_int<-lm(hvltt2~group +sex+ group:sex,data=active_data)
summary(mod_int)
##
## Call:
## lm(formula = hvltt2 ~ group + sex + group:sex, data = active_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.5375 -3.5334 0.4625 3.8204 11.6522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.5294 0.5733 42.784 < 2e-16 ***
## group2 -1.1816 0.7952 -1.486 0.13753
## group3 -0.7437 0.8132 -0.915 0.36059
## group4 -1.9848 0.7604 -2.610 0.00913 **
## sex2 2.0080 0.6479 3.100 0.00197 **
## group2:sex2 -0.4982 0.9045 -0.551 0.58182
## group3:sex2 -0.6142 0.9159 -0.671 0.50257
## group4:sex2 0.2199 0.8775 0.251 0.80219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.286 on 1567 degrees of freedom
## Multiple R-squared: 0.03975, Adjusted R-squared: 0.03546
## F-statistic: 9.266 on 7 and 1567 DF, p-value: 2.799e-11
which corresponds to the following model:
$Y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i2} x_{i5} + \beta_7 x_{i3} x_{i5} + \beta_8 x_{i4} x_{i5} + \epsilon_i$
From the model output, we can observe that the coefficients related to the interactions are not significant.
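A joint test of the three interaction coefficients, $H_0: \beta_6 = \beta_7 = \beta_8 = 0$, can be carried out with a partial F-test, mirroring what was done for the Pokémon models:
# Partial F-test comparing the additive model with the interaction model
anova(model, mod_int)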