Chapter 7 Two-way analysis of variance
library(tidyverse)
library(emmeans)
library(vtable)
7.1 When to use this model?
A two-way ANOVA is used to estimate how the mean of a quantitative variable changes according to the levels of two categorical variables. You can use a two-way ANOVA when you want to know how two independent variables, in combination, affect a dependent variable. In this example we will use two-way anova to see if applying fertilizer to flowers at the florist or/and at the customer effects the lifespan of a rose.
7.2 Organizing data
<- read.table("data/3_1_roser.txt", header = TRUE, sep = "", dec = ".", colClasses = c(rep("factor",4),"numeric"))
df print(as_tibble(df))
## # A tibble: 24 × 5
## gardener florist customer flor time
## <fct> <fct> <fct> <fct> <dbl>
## 1 0 0 0 1 10.1
## 2 0 0 0 2 8.9
## 3 0 0 0 3 11.4
## 4 1 0 0 1 10.9
## 5 1 0 0 2 6.9
## 6 1 0 0 3 11.2
## 7 0 1 0 1 9.6
## 8 0 1 0 2 9.9
## 9 0 1 0 3 11.3
## 10 0 0 1 1 11.8
## # ℹ 14 more rows
As this chapter is about two-way-ANOVA we will only look at two categorial variables and a responsvariable. For this data our respons variable is time
and we choose florist
and customer
as our explanatory variables.
= subset(df, select = c("florist","customer","time"))
df print(as_tibble(df))
## # A tibble: 24 × 3
## florist customer time
## <fct> <fct> <dbl>
## 1 0 0 10.1
## 2 0 0 8.9
## 3 0 0 11.4
## 4 0 0 10.9
## 5 0 0 6.9
## 6 0 0 11.2
## 7 1 0 9.6
## 8 1 0 9.9
## 9 1 0 11.3
## 10 0 1 11.8
## # ℹ 14 more rows
To understand the data better its a good idea to make a table of variables. Here are two examples: One using standard R and one using the package vtable:
str(df)
## 'data.frame': 24 obs. of 3 variables:
## $ florist : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 1 ...
## $ customer: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ time : num 10.1 8.9 11.4 10.9 6.9 11.2 9.6 9.9 11.3 11.8 ...
<- c("If fertilizer was applied at florist","If fertilizer was applied at customer","Days before the rose withered")
Description <- c("Explanatory variable", "Explanatory variable", "Response variable")
Use cbind(vt(df, out = "return"),data.frame(Description,Use))
## Name Class Values Description Use
## 1 florist factor '0' '1' If fertilizer was applied at florist Explanatory variable
## 2 customer factor '0' '1' If fertilizer was applied at customer Explanatory variable
## 3 time numeric Num: 6.9 to 15.2 Days before the rose withered Response variable
We can see that florist
and customer
are both factor variables taking either 0 or 1 as values, where a 1 indicates that fertilizer has been applied and 0 indicates that fertilizer has not been applied. i.e. if florist
has a value of 1 the fertilizer was applied at the florist. time
is a numerical variable ranging from 6.9 to 15.2 and indicates how many days the rose survived.
7.3 Data exploration
It is convenient to compute summarizes of the outcome divided by the factors:
<- df %>%
sum_customer group_by(customer) %>%
summarise(
n = n(),
mean_time = mean(time),
sd_time = sd(time),
median_time = median(time))
sum_customer
## # A tibble: 2 × 5
## customer n mean_time sd_time median_time
## <fct> <int> <dbl> <dbl> <dbl>
## 1 0 12 10.3 1.42 10.5
## 2 1 12 12.4 2.13 13.0
This table suggests that applying fertilizer at the costumers has a possitive effective on the lifespan of the roses.
<- df %>%
sum_florist group_by(florist) %>%
summarise(
n = n(),
mean_time = mean(time),
sd_time = sd(time),
median_time = median(time))
sum_florist
## # A tibble: 2 × 5
## florist n mean_time sd_time median_time
## <fct> <int> <dbl> <dbl> <dbl>
## 1 0 12 10.4 1.88 10.9
## 2 1 12 12.2 1.94 12.6
This table suggests that applying fertilizer at the florist has a possitive effective on the lifespan of the roses.
We can also create a table of the combinations of the categorial variables to see if there is a possible interaction effect:
<- df %>%
sum_df group_by(florist, customer) %>%
summarise(
count = n(),
mean = mean(time, na.rm = TRUE),
sd = sd(time, na.rm = TRUE),
median = median(time, na.rm = TRUE)
)
## `summarise()` has grouped output by 'florist'. You can override using the `.groups` argument.
sum_df
## # A tibble: 4 × 6
## # Groups: florist [2]
## florist customer count mean sd median
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 0 0 6 9.9 1.73 10.5
## 2 0 1 6 11.0 2.03 11.4
## 3 1 0 6 10.6 1.06 10.4
## 4 1 1 6 13.8 0.965 13.4
This table suggest that applying fertilizer at both the florist and the costumer simultaniously has a large possitive effect on the lifespan of the roses, i.e. it suggests there is an interaction effect.
%>%
sum_customer ggplot(aes(x = customer, y = mean_time, fill = customer)) +
geom_bar(stat="identity",
position=position_dodge(), width = .65) +
geom_errorbar(aes(ymin = mean_time- sd_time, ymax = mean_time+ sd_time), width=.2,
position=position_dodge(.9)) +
labs(title="Barplot of sample means with error bars (+/- SD) for `customer`",
x="Fertilizer added at customer", y = "time", fill = "customer") +
scale_x_discrete(breaks=c("0","1"), labels=c("No", "Yes")) +
scale_fill_discrete(breaks=c("0","1"),labels=c("No", "Yes"))
%>%
sum_florist ggplot(aes(x = florist, y = mean_time, fill = florist)) +
geom_bar(stat="identity",
position=position_dodge(), width = .65) +
geom_errorbar(aes(ymin = mean_time- sd_time, ymax = mean_time + sd_time), width=.2,
position=position_dodge(.9)) +
labs(title="Barplot of sample means with error bars (+/- SD) for `florist`", x="Fertilizer added at florist", y = "time", fill = "florist") +
scale_x_discrete(breaks=c("0","1"), labels=c("No", "Yes")) +
scale_fill_discrete(breaks=c("0","1"),labels=c("No", "Yes"))
%>%
sum_df ggplot(aes(x = florist, y = mean, fill = customer)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.5) +
geom_errorbar(aes(
ymin=mean-sd,
ymax=mean+sd),
position = position_dodge(0.9),
width = 0.25) +
labs(
title="Barplot of sample means with error bars (+/- SD) for `florist`",
x="Fertilizer added at florist", y = "time")+
scale_x_discrete(
breaks=c("0","1"),
labels=c("No", "Yes")) +
scale_fill_discrete(
breaks=c("0","1"),
labels=c("No", "Yes"))
A boxplot can be good to see if there is an interaction effect (If applying fertilizer only have an effect if it is done both places)
%>%
df ggplot(
aes(x=florist,
y=time,
color = customer)) +
geom_boxplot(
outlier.colour="red",
outlier.shape=8,
outlier.size=4)+
scale_x_discrete(
breaks=c("0","1"),
labels=c("No", "Yes")) +
scale_color_discrete(
breaks=c("0","1"),
labels=c("No", "Yes"))
Here we can see that applying the fertilizer at the costumer and not the dealer, has a small positive effect, but applying both at the dealer and the costumers seems to have a large positive effect.
7.4 Fitting model
<- lm(time ~ florist*customer, data = df) m1_interaction
Doing a drop1 test on our models we can see which terms in the model are considered significant, and should be kept in the model
drop1(m1_interaction, test = "F")
## Single term deletions
##
## Model:
## time ~ florist * customer
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 45.757 23.487
## florist:customer 1 6.8267 52.583 24.824 2.9839 0.09951 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that florist and customer are significant terms while the interaction between the two is not significant.
<- lm(time ~ florist + customer, data = df) m1
drop1(m1, test = "F")
## Single term deletions
##
## Model:
## time ~ florist + customer
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 52.583 24.824
## florist 1 19.440 72.023 30.374 7.7637 0.011063 *
## customer 1 27.735 80.318 32.991 11.0764 0.003193 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
7.5 Validating the model
7.5.1 Why do I need to valide the model?
The two-way ANOVA can only be used given these three assumption:
- Independence of observations: Observations within each group are independent
- Homogeneity of variance: The variation around the mean for each group being compared should be similar among all groups
- Normal-distributed dependent variable: The depended variable should be normally distributed
All that we extract from the two-way ANOVA model about uncertainty of estimates (later also tests) is only valid if the model assumptions are valid.
7.5.2 How do I validate the model assumptions?
The plot()
function makes it easy to check the latter two assumptions about homoscedasticity and normal-distributed dependent variable. To check if the variation around the mean for each group being compared is similar among all groups we can do the following on the fitted model (here we use the model with an interaction effect):
#homogeneity of variance assumption
plot(m1, 1)
Here the assumption is fulfilled if the red line is horizontal. In this example it is almost perfectly horizontal and we would say that the assumption is fulfilled.
To check if the dependent variable is normally distributed we can make a qq-plot of the residuals:
#normality assumpttion
plot(m1, 2)
For the assumption to be fulfilled the dots need to follow the dotted line.
7.6 Hypothesis test
To do the two-way-ANOVA we need our fitted model and then do the test using joint_tests()
.
First we will do a two-way anova on the linear model with an interaction effect time ~ florist + customer + florist:customer
, as our data exploration suggests that there may be one. The null hypothesis for this model are the same two as before plus an additional one: Applying fertilizer at the shop has no effect on the effect of applying fertilizer at the costumer.
joint_tests(m1_interaction)
## model term df1 df2 F.ratio p.value
## florist 1 20 8.497 0.0086
## customer 1 20 12.123 0.0024
## florist:customer 1 20 2.984 0.0995
From the summary we get a p-value of 0.0995 on the interaction effect, which means we cannot reject the null hypothesis based on our data.
Now we will do a two-way anova on the linear model time ~ florist + customer
. The null hypothesis for this model are: There is no significant difference on the lifespan of the roses between applying fertilizer and not applying fertilizer at the florist, and there is no significant difference on the lifespan of the roses between applying fertilizer and not applying fertilizer at costumer.
joint_tests(m1)
## model term df1 df2 F.ratio p.value
## florist 1 21 7.764 0.0111
## customer 1 21 11.076 0.0032
From the summary we can see the p-value for applying fertilizer at the shop is 0.011, which means based on this test we can reject the null-hypothesis, i.e. this test suggest that there is a significant difference on whether you apply or dont apply fertilizer at the florist. The same conclusion can be drawn from the second p-value as it is 0.003.
7.7 Extracting estimates with emmeans()
Here we use the function emmeans()
to calculate the estimated marginal means for all possible values of the factor variables:
<- emmeans(m1_interaction, ~ florist*customer, test = "F")
m m
## florist customer emmean SE df lower.CL upper.CL
## 0 0 9.9 0.617 20 8.61 11.2
## 1 0 10.6 0.617 20 9.35 11.9
## 0 1 11.0 0.617 20 9.70 12.3
## 1 1 13.8 0.617 20 12.56 15.1
##
## Confidence level used: 0.95
Now we want to compare the estimated means between values of the factor variables. This is called the constrasts:
%>% pairs(simple = "florist") m
## customer = 0:
## contrast estimate SE df t.ratio p.value
## florist0 - florist1 -0.733 0.873 20 -0.840 0.4110
##
## customer = 1:
## contrast estimate SE df t.ratio p.value
## florist0 - florist1 -2.867 0.873 20 -3.283 0.0037
Here we see that the mean for Florist = 0 is not significantly different from the mean for Florist = 1 when Customer = 0, but they are significantly different when customer is 1.
%>% pairs(simple = "customer") m
## florist = 0:
## contrast estimate SE df t.ratio p.value
## customer0 - customer1 -1.08 0.873 20 -1.241 0.2291
##
## florist = 1:
## contrast estimate SE df t.ratio p.value
## customer0 - customer1 -3.22 0.873 20 -3.683 0.0015
Here we see that the mean for Customer = 0 is not significantly different from the mean for Customer = 1 when Florist = 0, but they are significantly different when Florist is 1.
In can be of interest to take the contrast of the contrasts, to see if the effect of florist
is different depending on customer
(i.e. the interaction effect).
%>%
m pairs(simple = "florist") %>% ## Order of florist and customer does not matter
pairs(simple = "customer",by = NULL)
## contrast = florist0 - florist1:
## contrast1 estimate SE df t.ratio p.value
## customer0 - customer1 2.13 1.23 20 1.727 0.0995
Here we see that that the interaction effect is not significant.
7.8 Writing article/report
7.8.1 How to write method section
In a method section one should clearly state the methods used to produce the results presented in the paper, such that anyone reading the paper can reproduce the result given the same data. Here is an example:
To test whether applying fertilizer to the roses at the florist or the customer, has an effect on their lifespan, we have choosen to do a two-way-ANOVA test. This will also be done on their interaction effect to see if applying fertilizer both places has a bigger effect.
7.8.2 How to write results section
Doing the two-way-ANOVA test we find that both florist
and customer
are significant, while their interaction is not, meaning, based on the test, that there is a significant difference on whether you apply or don’t apply fertilizer at the florist or at the customer, and no significant difference between applying it at both at the same time.
7.9 Miscellaneous
7.9.1 Why to not use ANOVA tables
The classical ANOVA-table only works if the dataset is balanced (there are equal number of observations of each label within a factor). In our case this is true and we can see that anova()
and joint_test()
outputs the same p values (except for some rounding):
joint_tests(m1_interaction)
## model term df1 df2 F.ratio p.value
## florist 1 20 8.497 0.0086
## customer 1 20 12.123 0.0024
## florist:customer 1 20 2.984 0.0995
anova(m1_interaction)
## Analysis of Variance Table
##
## Response: time
## Df Sum Sq Mean Sq F value Pr(>F)
## florist 1 19.440 19.4400 8.4971 0.008563 **
## customer 1 27.735 27.7350 12.1228 0.002352 **
## florist:customer 1 6.827 6.8267 2.9839 0.099509 .
## Residuals 20 45.757 2.2878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now i remove the 3 first observations in our dataset and we can see that the two methods output different p-values
<- lm(time ~ florist*customer, data = df[-1:-3,])
m_test joint_tests(m_test)
## model term df1 df2 F.ratio p.value
## florist 1 17 7.086 0.0164
## customer 1 17 9.910 0.0059
## florist:customer 1 17 1.741 0.2045
anova(m_test)
## Analysis of Variance Table
##
## Response: time
## Df Sum Sq Mean Sq F value Pr(>F)
## florist 1 14.814 14.8143 5.9533 0.025939 *
## customer 1 30.176 30.1761 12.1265 0.002851 **
## florist:customer 1 4.332 4.3320 1.7409 0.204526
## Residuals 17 42.303 2.4884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now the order in which our variables are written in the model also changes the results in the anova()
table while it is the same for joint_test()
<- lm(time ~ customer*florist, data = df[-1:-3,])
m_test2 joint_tests(m_test2)
## model term df1 df2 F.ratio p.value
## customer 1 17 9.910 0.0059
## florist 1 17 7.086 0.0164
## customer:florist 1 17 1.741 0.2045
anova(m_test2)
## Analysis of Variance Table
##
## Response: time
## Df Sum Sq Mean Sq F value Pr(>F)
## customer 1 22.800 22.8002 9.1625 0.007605 **
## florist 1 22.190 22.1902 8.9174 0.008295 **
## customer:florist 1 4.332 4.3320 1.7409 0.204526
## Residuals 17 42.303 2.4884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1