Analysis of Accuracy Data using ANOVA and binomial GLMMs

Henrik Singmann

2021-01-12

Overview

This vignette shows how accuracy data can be analysed with afex using either ANOVA or a binomial generalized linear mixed model (i.e., a mixed model that uses the appropriate distributional family for such data). Accuracy data means data where each observation can be categorized as either a 0, which indicates failure, miss, or an erroneous response, or 1 which indicates success or a correct response.

We begin by loading the packages needed here and set a somewhat nicer ggplot2 theme.

library("afex")
library("emmeans")
library("dplyr")
library("ggplot2")
theme_set(theme_bw(base_size = 15) + 
            theme(legend.position="bottom", 
                  panel.grid.major.x = element_blank()))
library("cowplot")
library("ggbeeswarm")

Data and Research Question

The data we are looking at is from Lin and colleagues (2020) and investigates ego depletion using a novel paradigm. Ego depletion is a (social) psychological concept originating from Roy Baumeister’s work which can be summed up by the phrase ‘self-control/willpower is a muscle’: An initial use of self-control, such as performing a demanding task or resisting a temptation, depletes the available self-control resources so that subsequent tasks only have limited self-control resources available. The paper introducing the concept was published in 1998 (references can be found on Wikipedia). The ego depletion concept was extremely popular until the height of the replication crisis in psychology in which researcher also struggled to replicate the key ego depletion phenomenon just outlined. Recently, Lin and colleagues developed a new paradigm for investigation ego depletion. Their primary interest was on analyzing the data using a diffusion model, which will not be of interest here.

Lin and colleagues (2020) task followed the usual approach for ego depletion tasks. In a first phase, participants either worked on a high-demand task or a low-demand task (variable condition). Then, participants had to work on a Stroop task. The Stroop task consist of colour words (i.e., “red”, “blue”, or “yellow”) displayed in a colour (i.e., red, blue, or yellow). In each trial, one of the three words is displayed in one of the three colours (e.g., the word “red” could be displayed in red or blue). The task of the participants is to press a button corresponding to the three colours the word is displayed in (i.e., participants have to ignore the meaning of the word but focus on the colour the letters have).

The Stroop effect is the finding that it is easier to select the correct colour in a congruent trial in which the meaning of the word and the colour the word is displayed in match (e.g., the word “red” in red). In contrast, people are slower and make more errors in an incongruent trial in which there is a mismatch between word meaning and word colour (e.g., the word “red” in blue). In other words, it is difficult for people to ignore the meaning of a word when having to decide which colour the word is displayed in. The variable determining match between word meaning and world colour is congruency.

The hypothesis of the ego depletion concept is that it moderates or interacts with the congruency effect (i.e., difference in performance between congruent and incongruent trials). In particular, the congruency effect is smaller if participants are ego depleted compared to when they are not. We analyse this Stroop data in the following.

One of the new features of Lin et al.’s (2020) study compared to earlier ego-depletion studies is that both variables, condition and congruency, vary within-subjects. That is, each participants once performed a high-demand task followed by the Stroop task and then also a low-demand task followed by a Stroop task. This was usually done on two different days with a week time in between in counterbalanced order.

We then load the data, called stroop, which comes with afex. For simplicity we only focus on the data from Lin et al.’s (2020) Experiment 1 (the data of all their 4 studies is part of stroop). We focus on the accuracy of participants which is coded as either 0 (error) or 1 (correct response) in column acc. We also remove all NAs in the response variable and drop unused factor levels. This streamlines a few of the calls later on.

data("stroop")
## extract data from experiment 1 and remove NAs
stroop_e1 <- stroop %>%
  filter(!is.na(acc)) %>% 
  filter(study == "1") %>% 
  droplevels()

A look at the resulting data.frame reveals that we have trial-wise data. That is, each observation (i.e., row) is one response to a Stroop trial. The data is not yet aggregated so that each participant has one observation per design cell (i.e., denoting the average accuracy for each participant in this cell). We also see we still have quite a bit of data left, from 253 participants.

head(stroop_e1)
##    pno condition study trialnum  congruency acc    rt
## 1 s1_1   deplete     1        1   congruent   1 0.626
## 2 s1_1   deplete     1        2   congruent   1 0.550
## 3 s1_1   deplete     1        3 incongruent   1 0.872
## 4 s1_1   deplete     1        4   congruent   1 0.635
## 5 s1_1   deplete     1        5   congruent   1 0.660
## 6 s1_1   deplete     1        6 incongruent   1 0.667
str(stroop_e1)
## 'data.frame':    84667 obs. of  7 variables:
##  $ pno       : Factor w/ 253 levels "s1_1","s1_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ condition : Factor w/ 2 levels "control","deplete": 2 2 2 2 2 2 2 2 2 2 ...
##  $ study     : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trialnum  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ congruency: Factor w/ 2 levels "congruent","incongruent": 1 1 2 1 1 2 1 2 1 1 ...
##  $ acc       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ rt        : num  0.626 0.55 0.872 0.635 0.66 0.667 0.538 0.927 0.583 0.588 ...

ANOVA Analysis

We begin with an analysis of the data using standard repeated-measures ANOVA. ANOVA is probably the most common approach to analyse such data in psychology, but its use is somewhat questionable. The reason is that ANOVA assumes that response variable is normally distributed (or more precisely, the conditional distribution of the data after taking the model into account, i.e., the residuals, are assumed to follow a normal distribution). Here, our data are only 0s and 1s which do not follow a normal but rather a binomial or Bernoulli distribution. Nevertheless, analysis of such data using models assuming normal distribution such as ANOVA is not uncommon and in many cases leads to the same conclusions than the more complicated model discussed below. However, as we will see below, the results can also change.

We set up the model using aov_ez and specify both factors, congruency and condition, correctly as within factors.

e1_anova <- aov_ez(
  id = "pno", 
  dv = "acc", 
  data = stroop_e1,
  within = c("congruency", "condition")
)
## Warning: More than one observation per cell, aggregating the data using mean (i.e,
## fun_aggregate = mean)!

Fitting this model produces a warning message telling us what we already know. There is more than one observation per participant and cell of the design (i.e., combination of our two factors) and the data is averaged before calculating the ANOVA.

Note that if we would not have removed the NAs from the data, this call would have failed with an error as all participants have NAs so all would have been removed. In that case, we could have added na.rm = TRUE to the aov_ez() call to ignore the NAs when aggregating the data.

If we take a look at the ANOVA table, we see a clear main effect of congruency, a somewhat weaker main effect of condition, but no interaction.

e1_anova
## Anova Table (Type 3 tests)
## 
## Response: acc
##                 Effect     df  MSE          F   ges p.value
## 1           congruency 1, 252 0.01 242.95 ***  .247   <.001
## 2            condition 1, 252 0.00     5.43 *  .003    .021
## 3 congruency:condition 1, 252 0.00       0.10 <.001    .757
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

The observed pattern of effects is in line with most recent failures to replicate ego depletion. The main effect of condition suggests there be some general effect, such as more errors after a demanding task, but not the predicted interaction of condition with the congruency effect. The original prediction of ego-depletion is that it reduces self-control, thus resulting in attenuated Stroop effects after depletion. This pattern would result in a significant interaction.

We can look at the effects in turn using emmeans. Because it is a repeated-measures design we use the multivariate model for those tests. We begin by looking at the Stroop effect, that is the effect of congruency.

emmeans(e1_anova, "congruency", model = "multivariate")
##  congruency  emmean       SE  df lower.CL upper.CL
##  congruent   0.9763 0.001201 252   0.9739   0.9787
##  incongruent 0.8867 0.005876 252   0.8752   0.8983
## 
## Results are averaged over the levels of: condition 
## Confidence level used: 0.95

We can see the usual Stroop effect. Accuracy is higher for congruent compared to incongruent items.

We also look at the effect of condition.

emmeans(e1_anova, "condition", model = "multivariate")
##  condition emmean      SE  df lower.CL upper.CL
##  control    0.936 0.00345 252    0.929    0.943
##  deplete    0.927 0.00375 252    0.920    0.935
## 
## Results are averaged over the levels of: congruency 
## Confidence level used: 0.95

We see that accuracy is roughly 1% lower in the ego-depletion compared to the control condition.

We can also plot both effects.

plot_grid(
  afex_plot(e1_anova, "congruency", error = "within", 
            data_geom = geom_quasirandom, data_alpha = 0.3) + 
    coord_cartesian(ylim = c(0.25, 1)),
  afex_plot(e1_anova, "condition", error = "within", 
            data_geom = geom_quasirandom, data_alpha = 0.3) +
    coord_cartesian(ylim = c(0.25, 1))
)
## NOTE: Results may be misleading due to involvement in interactions
## NOTE: Results may be misleading due to involvement in interactions

Mixed Model Analysis

A model with a more appropriate conditional distribution (instead of the normal distribution used by ANOVA) are generalized linear models with a binomial family. Because we have repeated measures both within participants within each design cell and across cells (i.e., we have within-subject factors) we need to use a mixed model. Thus, we have to employ a generalized linear mixed model (GLMM) with a binomial family. Here, we use the canonical link function, the logit (an alternative would be probit).

In our call to mixed we indicate the fact that we want to estimate a binomial GLMM (in contrast to a regular LMM) by passing the correct family, family = binomial. We could further specify the link function (e.g., family = binomial(link = "probit")), but because we use the canonical link here, this is not necessary.

As for all mixed models, we should begin our analysis with the “maximal random effect structure justified by the design” (Barr et al., 2013), the maximal model. Because all factors vary within-subjects in the present case and we only consider one random-effects grouping factor (participant), the maximum random effect structure involves by-participant random intercepts as well as by-participant random slopes for factors congruency, condition, and their interaction, as well as the correlation among the random effect parameters.

If you need an introduction to the topic of the random effect structure, consider reading our introduction to mixed models (Singmann & Kellen, 2019). A worked example of how to reduce the maximal model in case of converging warnings is provided in the mixed model vignette.

Another important decision when fitting a mixed model with afex::mixed is the choice of the method for testing model terms. The default method is the Kenward-Roger approximation. However, this is not possible for GLMMs (i.e., it only applies to LMMs). For GLMMs we only have two methods available, likelihood ratio tests (method = "LRT") and parametric bootstrap (method = "PB"). Parametric bootstrap is generally the preferable procedure, especially in settings with small N, specifically low numbers of levels for the random effects grouping factor (i.e., few participants in the present case). However, parametric bootstrap can be prohibitively time consuming as it requires refitting the model several times (suggested is at last a 1000 times per model term/effect). The main factor influencing the time is therefore the complexity of the random effect structure. Even with only a moderately complex structure, such as in the present case, it can take days (even though computations can be distributed on a multi-core machine as shown below).

One rule of thumb (for which I have never seen any actual data) is that with more than around 50 levels for each random effects grouping factor, likelihood ratio tests are probably okay and parametric bootstrap not necessary. Luckily, in the present case we have over 250 participants (i.e., way more than 50), so we decide to use likelihood ratio tests as a method for testing model terms (i.e., method = "LRT").

Setting-Up and Fitting the Models

There are two ways on how to set up such models in afex::mixed (lme4 provides another third variant that is not possible in afex::mixed). Before describing both ways, the somewhat surprising fact is that the second way, based on the aggregated data, is considerably faster than the first way. In the present case the first way takes more than an hour on my laptop whereas the second way takes only around one minute!

To understand the two different ways, it might help to recall some basics about the binomial distribution. It is defined by the number of observations and the number of successes and has one parameter, the success probability. Because we want to estimate the parameter of the binomial distribution, the success probability, we have to pass two pieces of information to the mixed function; the number of successes and the total number of trials. The easiest way to do so with the current data is by directly passing the accuracy column acc. It only consists of zeros and ones so the total number of observations is the number of trials, and the number of ones is the number of successes. The following code sets up the maximal model as discussed above using this approach. Note that fitting this model takes more than an hour (likely several hours).

e1_mixed1_v1 <- mixed(
  acc ~ congruency*condition + (congruency*condition|pno), 
  data = stroop_e1, 
  method = "LRT", 
  family = binomial
)

Before looking at the results, let us first discuss the alternative way of specifying the model. Note that the second way is usually considerably faster. This way consists of two steps. We first need to aggregate the data within each cell of the design and participant by calculating the proportion of correct responses and the number of trials for each cell. Then we can pass these two pieces of information to mixed, the proportion of successes as the dependent variable and the total number of successes as a weight. These two pieces of information again provide all information required for the binomial distribution. However, note that it is important that proportion times the weight provide an integer number, the actual number of successes which cannot have a decimal part.

At first, we need to aggregate the data. For this, we can use the following dplyr syntax.

stroop_e1_agg <- stroop_e1 %>% 
  group_by(condition, congruency, pno) %>% 
  summarise(acc = mean(acc), 
            n = n())

Then, we can set up the model. Note that we need to pass the total number of trials as the weight argument here. mixed now supports the exact same syntax as glmer so we can pass simply the unquoted name of the column holding the weights, weight = n.

The call to mixed is then pretty much the same as above, the only differences are that we have to use the data.frame with the aggregated data, potentially a new name of response variable (here I decided to use the same name, acc for the aggregated accuracy), and the weight argument.

e1_mixed1_v2 <- mixed(
  acc ~ congruency*condition + (congruency*condition|pno), 
  data = stroop_e1_agg, 
  method = "LRT", 
  family = binomial,
  weight = n
)

Dealing with Fit Warnings

Before looking at the results, we see that fitting produces quite a few warnings. These are reproduced below in the way we would also get them when simply typing in the object name (e.g., e1_mixed1_v2). The following are the warnings for the first variant, e1_mixed1_v1.

## Warning: lme4 reported (at least) the following warnings for 'full':
##   * Model failed to converge with max|grad| = 0.00378454 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'congruency':
##   * Model failed to converge with max|grad| = 0.146845 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'condition':
##   * Model failed to converge with max|grad| = 0.0860405 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'congruency:condition':
##   * Model failed to converge with max|grad| = 0.0181415 (tol = 0.002, component 1)

And these are the warnings for the second variant, e1_mixed1_v2.

## Warning: lme4 reported (at least) the following warnings for 'full':
##   * Model failed to converge with max|grad| = 0.0142323 (tol = 0.002, component 1)
## Warning: lme4 reported (at least) the following warnings for 'congruency':
##   * Model failed to converge with max|grad| = 0.134205 (tol = 0.002, component 1)
##   * Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'condition':
##   * Model failed to converge with max|grad| = 0.120362 (tol = 0.002, component 1)
##   * Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'congruency:condition':
##   * Model failed to converge with max|grad| = 0.0534769 (tol = 0.002, component 1)

Both variants show quite a few warnings. The warnings are given separately for the different models estimated (i.e., full model with all terms and one model for each term removed from the full model). However, none of the warnings is the dreaded “singular fit” warning that clearly suggests an over-parameterised model. We therefore will be working with this random effect structure going forward.

Here, we have two different warnings suggesting some numerical issue at the maximum likelihood estimate. One warning tells us that for the result returned from the fitting (i.e., optimization) algorithm the absolute gradient of the deviance function is not small (i.e., there might be a better global optimum that was not discovered). Another warning relates to the size of the eigenvalue which suggests something similar, there might be better solutions in the parameter space. More on these warnings can be found in the lme4 help at ?troubleshooting (or help("troubleshooting", package = "lme4")).

Given that both types of warnings are not too dramatic, one way to address them is by trying different optimizers. mixed makes this easy via the all_fit argument, if TRUE several optimizers are applied to the output from the regular run (make sure to install packages optimx and dfoptim to use all possible available optimizers).

e1_mixed1_v2_allfit <- mixed(
  acc ~ congruency*condition + (congruency*condition|pno), 
  data = stroop_e1_agg, 
  method = "LRT", 
  family = binomial,
  weight = n, 
  all_fit = TRUE 
)
## Warning: lme4 reported (at least) the following warnings for 'full':
##   * Model failed to converge with max|grad| = 0.123381 (tol = 0.002, component 1)
##   * Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'congruency':
##   * Model failed to converge with max|grad| = 0.137517 (tol = 0.002, component 1)
##   * Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'condition':
##   * Model failed to converge with max|grad| = 0.128127 (tol = 0.002, component 1)
##   * Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Warning: lme4 reported (at least) the following warnings for 'congruency:condition':
##   * Model failed to converge with max|grad| = 0.130338 (tol = 0.002, component 1)
##   * Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

Somewhat surprisingly the fit using all optimizers produces more convergence warnings than the previous fit. However, all warnings are of the same type as for the other variants.

Another way to deal with the warnings is by fitting a variant of the model with reduced random effect structure. Hopefully this model will not produce the warning, but similar results. If warnings are more severe than the current warning, fitting a reduced model is surely indicated. Critical warnings that make it necessary to reduce the model are for example a singular fit warning or a warning that the algorithm did not converge within the allowed number of iterations. Nevertheless, if this were a real data analysis for a manuscript it would probably still make sense to reduce the random effect structure, beginning with the correlation among the random effect parameters, until the warnings are gone. Then, we would compare the final (i.e., reduced) model with the maximal model to ensure that the pattern of significant and non-significant effects is the same in both cases (if not, this needs to be reported transparently). An example of such an iterative reduction of the random effect structure is given in the mixed model vignette.

ANOVA Table for GLMMs

We can now print the model to get the resulting ANOVA table, that is the table of model terms (i.e., main effects and interactions), as for the regular ANOVA model. Doing so would normally also reproduces the warnings shown above but is suppressed here.

We show the results of the three variants in turn.

e1_mixed1_v1 ## variant 1
## Mixed Model Anova Table (Type 3 tests, LRT-method)
## 
## Model: acc ~ congruency * condition + (congruency * condition | pno)
## Data: stroop_e1
## Df full model: 14
##                 Effect df      Chisq p.value
## 1           congruency  1 321.00 ***   <.001
## 2            condition  1  11.08 ***   <.001
## 3 congruency:condition  1     4.23 *    .040
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
e1_mixed1_v2  ## variant 2
## Mixed Model Anova Table (Type 3 tests, LRT-method)
## 
## Model: acc ~ congruency * condition + (congruency * condition | pno)
## Data: stroop_e1_agg
## Df full model: 14
##                 Effect df      Chisq p.value
## 1           congruency  1 321.05 ***   <.001
## 2            condition  1  11.10 ***   <.001
## 3 congruency:condition  1     4.23 *    .040
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
e1_mixed1_v2_allfit  ## variant 2 with all_fit = TRUE
## Mixed Model Anova Table (Type 3 tests, LRT-method)
## 
## Model: acc ~ congruency * condition + (congruency * condition | pno)
## Data: stroop_e1_agg
## Df full model: 14
##                 Effect df      Chisq p.value
## 1           congruency  1 320.96 ***   <.001
## 2            condition  1  11.06 ***   <.001
## 3 congruency:condition  1     4.23 *    .040
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1

As can be seen, the results of the three different variants are extremely similar. Especially the fact that the results from variant 2 and the variant with all_fit = TRUE are pretty much the same suggests that we can have some confidence in the results.

As for the ANOVA we see significant main effects of congruency and condition. However, we also see a significant congruency by condition interaction. As a reminder, this interaction is the precondition of the central ego depletion prediction.

Follow-Up Analysis and Plots

Let us first look at the main effects as for the ANOVA model. For this, we will always use the model with all_fit = TRUE, e1_mixed1_v2_allfit. For easier interpretability, we also always set type = "response" in the call to emmeans. This provides estimated marginal means on the response scale (i.e., in probability units after applying the inverse link function). In case of the default (i.e., type = "link") marginal mean are given on the linear scale which in this case is the logit scale.

emmeans(e1_mixed1_v2_allfit, "congruency", type = "response")
## NOTE: Results may be misleading due to involvement in interactions
##  congruency      prob          SE  df asymp.LCL asymp.UCL
##  congruent   0.982434 1.44579e-05 Inf  0.982406  0.982463
##  incongruent 0.909378 6.89756e-05 Inf  0.909242  0.909513
## 
## Results are averaged over the levels of: condition 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

We again see the Stroop effect. Accuracy is higher for congruent compared to incongruent items.

We also look at the effect of condition.

emmeans(e1_mixed1_v2_allfit, "condition", type = "response")
## NOTE: Results may be misleading due to involvement in interactions
##  condition    prob         SE  df asymp.LCL asymp.UCL
##  control   0.96299 2.9834e-05 Inf   0.96293   0.96305
##  deplete   0.95570 3.5437e-05 Inf   0.95563   0.95577
## 
## Results are averaged over the levels of: congruency 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

We see that accuracy is roughly 1% lower in the ego-depletion compared to the control condition.

We can also plot both effects as above. We essentially only have to replace the model name in the calls to afex_plot. However, we also remove the error = "within" to show model-based error bars.

plot_grid(
  afex_plot(e1_mixed1_v2_allfit, "congruency", 
            data_geom = geom_quasirandom, data_alpha = 0.3) + 
    coord_cartesian(ylim = c(0.25, 1)),
  afex_plot(e1_mixed1_v2_allfit, "condition",  
            data_geom = geom_quasirandom, data_alpha = 0.3) +
    coord_cartesian(ylim = c(0.25, 1))
)
## Aggregating data over: pno
## NOTE: Results may be misleading due to involvement in interactions
## Aggregating data over: pno
## NOTE: Results may be misleading due to involvement in interactions

Finally, we can also have a look at the interaction. One way to do so is by looking at the reference grid of the two variables.

emmeans(e1_mixed1_v2_allfit, c("congruency", "condition"), type = "response")
##  congruency  condition     prob          SE  df asymp.LCL asymp.UCL
##  congruent   control   0.984783 1.77534e-05 Inf  0.984748  0.984817
##  incongruent control   0.912742 9.42464e-05 Inf  0.912558  0.912927
##  congruent   deplete   0.979731 2.35033e-05 Inf  0.979685  0.979777
##  incongruent deplete   0.905897 1.00894e-04 Inf  0.905699  0.906094
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

Alternatively, we can look at the effect of congruency conditional on condition. As will be shown below, this has some benefits. And because we will use this emmeans object later, we assign it to emm_inter_1.

emm_inter_1 <- emmeans(e1_mixed1_v2_allfit, "congruency", 
                       by = "condition", type = "response")
emm_inter_1
## condition = control:
##  congruency      prob          SE  df asymp.LCL asymp.UCL
##  congruent   0.984783 1.77534e-05 Inf  0.984748  0.984817
##  incongruent 0.912742 9.42464e-05 Inf  0.912558  0.912927
## 
## condition = deplete:
##  congruency      prob          SE  df asymp.LCL asymp.UCL
##  congruent   0.979731 2.35033e-05 Inf  0.979685  0.979777
##  incongruent 0.905897 1.00894e-04 Inf  0.905699  0.906094
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

With this object, it is now particularly easy to see if there are indeed differences in the Stroop effects across conditions. We can simply use the pairs function which will only calculate pairwise comparisons for each level of the by factor `conditions. Because of the significant interaction this shows that there is evidence for a reduced Stroop effect in the deplete condition. The odds ratio are around 5 in the deplete condition and over 6 in the control condition.

pairs(emm_inter_1)
## condition = control:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      6.187 0.010362 Inf 1088.115 <.0001 
## 
## condition = deplete:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      5.021 0.008404 Inf  964.109 <.0001 
## 
## Tests are performed on the log odds ratio scale

We can of course also plot the interaction. We use violin plots for the data to make this plot somewhat appealing.

afex_plot(e1_mixed1_v2_allfit, "condition", "congruency", 
          data_geom = geom_violin)
## Aggregating data over: pno

For reference, we can also make the plot based on the model using variant 1 (i.e., based on trial-wise non-aggregated data) which produces the same figure.

afex_plot(e1_mixed1_v1, "condition", "congruency", 
          data_geom = geom_violin)
## Aggregating data over: pno

Analysis of Two Studies with GLMM

The stroop data set contains all four experiments of Lin et al. (2020). So far we only looked at Experiment 1. Here, we want to look at Experiments 1 and 2 to show some additional functionality of analysing accuracy data with GLMMs. Let us begin by preparing the data.

## extract data from experiment 1 and remove NAs
stroop_e12 <- stroop %>%
  filter(!is.na(acc)) %>% 
  filter(study %in% c("1", "2")) %>% 
  droplevels()

We then have a look at the number of participants per experiment. As can be seen below, there is a clear imbalance with Experiment 1 having a lot more participants than Experiment 2. We will see below what we can do with this information.

stroop_e12 %>% 
  group_by(study) %>% 
  summarise(n = length(unique(pno)))
## # A tibble: 2 x 2
##   study     n
##   <fct> <int>
## 1 1       253
## 2 2       132

But first, we aggregate the data for analysing it using the second variant introduced above as it is dramatically faster.

stroop_e12_agg <- stroop_e12 %>% 
  group_by(study, condition, congruency, pno) %>% 
  summarise(acc = mean(acc), 
            n = n())

We are now ready to fit the model. However, because we now have to fit 8 models in total and we also have more data at hand, fitting will take quite a bit longer even when using the aggregated data. Especially, because based on our experience with the data from Experiment 1, it makes sense to set all_fit = TRUE here.

Estimating Models in Parallel

Therefore, it makes sense to reduce our fitting time by distributing the estimation across the available CPU cores using the parallel package support provided by mixed. Note that this simply distributes the different models across cores (i.e., each individual fit is still run on a single core). For this, we need to set up a cluster, cl, and then pass this cluster to mixed.

library("parallel")
nc <- detectCores() # number of cores
cl <- makeCluster(rep("localhost", nc)) # make cluster

We can then fit the model using the new data in a similar manner as above. The main changes are that we add *study to the fixed effects part of the model formula which estimates main effects and all interactions with study. We do not estimate any random slopes for study because participants are in only one study (i.e., study is a between-subjects factor). We also enable multi-core estimation by setting cl = cl.

e12_mixed1 <- mixed(
  acc ~ congruency*condition*study + (congruency*condition|pno), 
  data = stroop_e12_agg, 
  method = "LRT", 
  family = binomial,
  weight = n, 
  all_fit = TRUE, 
  cl = cl
)

Type III versus Type II Sums of Squares

The default type argument when using mixed (and the other afex functions) is 3 which estimates so-called type III sums of squares tests of model terms. There is a somewhat heated discussion on these topics in the literature which I do not want to rehash here (more can be found in our chapter). In short, the distinction between Type III (default in afex) and Type II (recommended by some vocal statisticians) is about how to deal with imbalance (i.e., different group sizes). Type III sums of squares assume that the imbalance is random and estimates a model in which all groups are assumed to have equal sizes. In contrast, using Type II sums of squares the differences in group sizes are assumed to be meaningful (e.g., as a consequence of different group sizes in the environment) and the model is set-up such that the differences in group sizes are represented.

Remember that the first experiment (study 1) had a lot more participants than the second experiment (study 2). However, the default Type III tests treat both studies as having the same sample size. In this case it might therefore make sense to also look at the results with Type II tests. For this, we simply need to add type = 2 (or equivalently type = "II") to the call as shown below.

e12_mixed1_t2 <- mixed(
  acc ~ congruency*condition*study + (congruency*condition|pno), 
  data = stroop_e12_agg, 
  method = "LRT", 
  family = binomial,
  weight = n, 
  all_fit = TRUE, 
  cl = cl,
  type = 2
)

Fitting both models shows a number of warnings similar to the previous model which we do not reproduce here. Let us instead take a look at the results.

We begin with the Type III model.

e12_mixed1
## Mixed Model Anova Table (Type 3 tests, LRT-method)
## 
## Model: acc ~ congruency * condition * study + (congruency * condition | 
## Model:     pno)
## Data: stroop_e12_agg
## Df full model: 18
##                       Effect df      Chisq p.value
## 1                 congruency  1 453.65 ***   <.001
## 2                  condition  1   10.43 **    .001
## 3                      study  1       0.14    .705
## 4       congruency:condition  1       1.72    .190
## 5           congruency:study  1       0.33    .566
## 6            condition:study  1       1.60    .205
## 7 congruency:condition:study  1       1.46    .226
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1

The results are now more in line with the results from the ANOVA analysis. We only see significant main effects of congruency and condition, but no congruency by condition interaction. Furthermore, no effect involving study is significant.

Let us now take a look at the Type II results.

e12_mixed1_t2
## Mixed Model Anova Table (Type 2 tests, LRT-method)
## 
## Model: acc ~ congruency * condition * study + (congruency * condition | 
## Model:     pno)
## Data: stroop_e12_agg
## Df full model(s): 14, 14, 14, 17, 17, 17, 18
##                       Effect df      Chisq p.value
## 1                 congruency  1 482.39 ***   <.001
## 2                  condition  1  11.57 ***   <.001
## 3                      study  1       0.21    .646
## 4       congruency:condition  1     2.86 +    .091
## 5           congruency:study  1       0.28    .598
## 6            condition:study  1       1.42    .234
## 7 congruency:condition:study  1       1.46    .226
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1

These look quite similar with one difference. The congruency by condition interaction is still not significant, however, the p-value is below .1.

Taken together the results somewhat suggest that the congruency by condition interaction is a pattern mainly found in (the larger) Experiment 1, but not in Experiment 2. In the Type III analysis in which both studies are weighed equally the interaction therefore does not reach significance. In contrast, in the Type II analysis in which the additional participants of Experiment 1 have a larger influence on the interaction, the p-value drops and is nearer the .05 cut-off. However, the difference in the interaction between the experiments is also not too large as the three way interaction of congruency by condition by study is clearly not significant (and as the highest order effect, this effect has to be the same for both types of sums of squares).

Follow-Up Tests for Type II Models

We can now of course also plot the data. Let us plot the congruency by condition interaction as before. We begin with the plot of the Type III model.

afex_plot(e12_mixed1, "condition", "congruency", 
          data_geom = geom_violin)
## Aggregating data over: pno
## NOTE: Results may be misleading due to involvement in interactions

We then plot the interaction of the Type II model.

afex_plot(e12_mixed1_t2, "condition", "congruency", 
          data_geom = geom_violin)
## Aggregating data over: pno
## emmeans are based on full model which includes all effects.
## NOTE: Results may be misleading due to involvement in interactions

Both look pretty much the same and actually are the same. The reason for this is also given as a message when producing the second plot. When using mixed Type II models, all follow-up tests (which includes plotting via afex_plot) are based on the full model. However, Type II model tests are not all based on the full model. Rather, for tests of lower-order effects higher order effects are not part of the comparison (i.e., tests of two-way interactions such as the congruency by condition interaction are compared against a reference model that only includes all two-way interactions and not the three-way interaction).

We can also see this when comparing the corresponding emmeans for both models:

emmeans(e12_mixed1, c("congruency", "condition"), type = "response")
## NOTE: Results may be misleading due to involvement in interactions
##  congruency  condition     prob          SE  df asymp.LCL asymp.UCL
##  congruent   control   0.983752 1.54389e-05 Inf  0.983722  0.983782
##  incongruent control   0.913573 7.61899e-05 Inf  0.913423  0.913722
##  congruent   deplete   0.980078 1.88459e-05 Inf  0.980040  0.980114
##  incongruent deplete   0.905551 8.25395e-05 Inf  0.905390  0.905713
## 
## Results are averaged over the levels of: study 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale
emmeans(e12_mixed1_t2, c("congruency", "condition"), type = "response")
## NOTE: Results may be misleading due to involvement in interactions
##  congruency  condition     prob          SE  df asymp.LCL asymp.UCL
##  congruent   control   0.983752 1.54389e-05 Inf  0.983722  0.983782
##  incongruent control   0.913573 7.61899e-05 Inf  0.913423  0.913722
##  congruent   deplete   0.980078 1.88459e-05 Inf  0.980040  0.980114
##  incongruent deplete   0.905551 8.25395e-05 Inf  0.905390  0.905713
## 
## Results are averaged over the levels of: study 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

We again see that they are identical. Thus, the Type II outputs do not actually reflect Type II tests.

A more appropriate way to look at this two-way interaction would be to use a model that reflects the Type II tests; that is, a model without the three-way interaction. A variant of this model is part of the e12_mixed1_t2 object. However, because of the way afex::mixed creates the set of tests, this model is parameterized in a different way so we cannot pass it to emmeans for tests. Therefore, we have to refit the model without the three-way interaction. We do so by changing the fixed effects part of the formula to (congruency+condition+study)^2 which means all main effects and up to two-way interactions.

e12_mixed1_t2_red <- mixed(
  acc ~ (congruency+condition+study)^2 + (congruency*condition|pno), 
  data = stroop_e12_agg, 
  method = "LRT", 
  family = binomial,
  weight = n, 
  all_fit = TRUE, 
  cl = cl,
  type = 2
)
e12_mixed1_t2_red
## Mixed Model Anova Table (Type 2 tests, LRT-method)
## 
## Model: acc ~ (congruency + condition + study)^2 + (congruency * condition | 
## Model:     pno)
## Data: stroop_e12_agg
## Df full model(s): 14, 14, 14, 17, 17, 17
##                 Effect df      Chisq p.value
## 1           congruency  1 482.39 ***   <.001
## 2            condition  1  11.57 ***   <.001
## 3                study  1       0.21    .646
## 4 congruency:condition  1     2.86 +    .091
## 5     congruency:study  1       0.28    .598
## 6      condition:study  1       1.42    .234
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1

We first have a look at the resulting ANOVA table (note, we suppress the warnings again here). As expected, the Type II tests are the same as in e12_mixed1_t2, because they are based on the exact same models.

We then take a look at the marginal means for the two-way interaction of congruency by condition.

emmeans(e12_mixed1_t2_red,  c("congruency", "condition"), type = "response") 
## emmeans are based on full model which includes all effects.
##  congruency  condition     prob          SE  df asymp.LCL asymp.UCL
##  congruent   control   0.983835 1.51410e-05 Inf  0.983806  0.983865
##  incongruent control   0.912916 7.55995e-05 Inf  0.912768  0.913064
##  congruent   deplete   0.979895 1.87376e-05 Inf  0.979858  0.979931
##  incongruent deplete   0.905955 8.10300e-05 Inf  0.905796  0.906114
## 
## Results are averaged over the levels of: study 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

We can see these are now a bit different from the previous one, but not by a lot. A better way to see the difference is to use the same approach as above and look at the estimated Stroop effects for each condition.

For the correct Type II model these are given by the following call:

emmeans(e12_mixed1_t2_red, "congruency", by = "condition", 
        type = "response") %>% 
  pairs()
## emmeans are based on full model which includes all effects.
## condition = control:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      5.806 0.007814 Inf 1306.888 <.0001 
## 
## condition = deplete:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      5.059 0.006805 Inf 1205.405 <.0001 
## 
## Results are averaged over the levels of: study 
## Tests are performed on the log odds ratio scale

For the wrong Type II model these are given below:

emmeans(e12_mixed1_t2, "congruency", by = "condition", type = "response") %>% 
  pairs()
## emmeans are based on full model which includes all effects.
## NOTE: Results may be misleading due to involvement in interactions
## condition = control:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      5.728 0.007817 Inf 1278.953 <.0001 
## 
## condition = deplete:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      5.131 0.006999 Inf 1198.815 <.0001 
## 
## Results are averaged over the levels of: study 
## Tests are performed on the log odds ratio scale

We can see that as expected the magnitude of the difference between the Stroop effects is smaller in e12_mixed1_t2 (5.73 vs. 5.13) than in e12_mixed1_t2_red (5.81 vs. 5.06).

One way to approximate the latter behaviour without actually refitting the model is by passing submodel = "minimal" (which in the present case is identical to submodel = ~congruency*condition, see also the corresponding emmeans vignette). This does not produce exact Type II marginal means as when actually refitting the model. But at least approximates those.

emmeans(e12_mixed1_t2, "congruency", by = "condition", type = "response", 
        submodel = "minimal") %>% 
  pairs()
## emmeans are based on full model which includes all effects.
## NOTE: Results may be misleading due to involvement in interactions
## condition = control:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      5.828 0.008192 Inf 1253.988 <.0001 
## 
## condition = deplete:
##  contrast                odds.ratio       SE  df z.ratio  p.value
##  congruent / incongruent      5.110 0.007365 Inf 1131.878 <.0001 
## 
## Results are averaged over the levels of: study 
## Tests are performed on the log odds ratio scale

Finally, we can also use this model to plot the two-way interaction.

afex_plot(e12_mixed1_t2_red, "condition", "congruency", 
          data_geom = geom_violin)
## Aggregating data over: pno
## emmeans are based on full model which includes all effects.

References