Overview of MPT Multiverse: An Example Application

Marius Barth and Henrik Singmann

2018-11-11

Package Overview

MPTmultiverse provides a single function, fit_mpt(), that performs a multiverse analysis for multinomial processing tree models. The idea of a multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) is to perform and report all possible combinations of reasonable modeling choices. This contrasts with a more common analysis approach in which only one specific analysis (i.e., one path through a garden of forking paths) is reported.

fit_mpt performs a multiverse analysis combining two different factors.

For the frequentist approaches, no pooling (with and without parametric or nonparametric bootstrap) and complete pooling are implemented using MPTinR (Singmann & Kellen, 2013). For the Bayesian approaches, no pooling, complete pooling, and three different variants of partial pooling are implemented using TreeBUGS (Heck, Arnold, & Arnold, 2017).

Prerequisites

First of all, make sure that you have installed the latest version of R, of all necessary R packages, and of JAGS. To install JAGS, go to mcmc-jags.sourceforge.net and follow installation instructions. After that, install or update the required R packages:

install.packages("MPTmultiverse")
update.packages(ask = FALSE)

Example Data: Bayen & Kuhlmann (2011)

Here we use parts of the data from Experiment 1 reported in Bayen and Kuhlmann (2011) investigating source-memory of participants under two different cognitive load conditions. The 24 participants in the load condition generated a random number between one and nine every time they heard a tone, which happened every 2 s during the entire study phase, and said it out loud. The 24 participants in the no_load condition performed the study phase without a secondary task. Participants and both conditions performed the test phase in the same manner.

We use the same model as Bayen and Kuhlman (2011), model variant 4 of the 2-high threshold source-memory model (2HTSM) introduced by Bayen, Murnane, and Erdfelder (1996). Model variant 4 of the 2HTSM separates observed source-memory data into 4 parameters: Parameter \(D\) (item recognition); parameter \(d\) (source memory); parameter \(b\) (probability of guessing that an item is old); and parameter \(g\) (probability of guessing that an item comes from source A versus source B).

Both data and model come with package MPTmultiverse so their location can be obtained using function system.file. In other applications, the file paths need to be provided by the user to match the location of data and model on their file system. It often makes sense to set the working directory to the directory in which the data and model file resides, either via setwd() or via the menu.

# load packages
library("MPTmultiverse")

# If you're running the analysis from an .rmd file, you only need to ensure that
# the .rmd, .eqn, and .csv files are all in the same directory.

# ------------------------------------------------------------------------------
# MPT model definition & data

EQN_FILE <- system.file("extdata", "2HTSM_Submodel4.eqn", package = "MPTmultiverse")
DATA_FILE <- system.file("extdata", "Kuhlmann_dl7.csv", package = "MPTmultiverse")

# if .csv format uses semicolons ";" (German format):
data <- read.csv2(DATA_FILE, fileEncoding = "UTF-8-BOM")  ## use correct encoding if necessary

# if .csv format uses commata "," (international format):
# data <- read.csv(DATA_FILE, fileEncoding = "UTF-8-BOM")

# We first take a look at the data using head()
head(data)
#>   Subject ExpCond Yee Yeu Yen Yue Yuu Yun Yne Ynu Ynn
#> 1     110       1   6  12   5   2  18   5   0   2  14
#> 2     138       1  11   8   6   3  16   4   2   2  12
#> 3     141       1   9  10   7   1  16   5   2   1  13
#> 4     149       1  10   9   4   0  19   6   0   0  16
#> 5     102       1  12   4   8   5  11   8   0   0  16
#> 6     105       1  14   2   5   1  19   7   2   0  14

## We then plot the response frequencies using plotFreq from the TreeBUGS package
TreeBUGS::plotFreq(data, boxplot = FALSE, eqn = EQN_FILE)

The look at the data.frame tells us which columns contain the subject identifier and the variable encoding group membership. We need to record these variables for the use in fit_mpt.

The plot of the individual response frequencies shows quite some individual variability, but nothing concerning.

Next, we prepare the data for the fitting.

COL_ID <- "Subject"         # name of the variable encoding subject ID
COL_CONDITION <- "ExpCond"  # name of the variable encoding group membership


# Experimental conditions should be labeled in a meaningful way. To accomplish
# this, you may want to use the `factor` function:
unique(data[, COL_CONDITION])
#> [1] 1 2

data[[COL_CONDITION]] <- factor(
  data[[COL_CONDITION]]
  , levels = c(1:2)
  , labels = c("no_load", "load")
)

### check input data frame
head(data)
#>   Subject ExpCond Yee Yeu Yen Yue Yuu Yun Yne Ynu Ynn
#> 1     110 no_load   6  12   5   2  18   5   0   2  14
#> 2     138 no_load  11   8   6   3  16   4   2   2  12
#> 3     141 no_load   9  10   7   1  16   5   2   1  13
#> 4     149 no_load  10   9   4   0  19   6   0   0  16
#> 5     102 no_load  12   4   8   5  11   8   0   0  16
#> 6     105 no_load  14   2   5   1  19   7   2   0  14

Options

Every time the package MPTmultiverse is loaded, it automatically sets some more or less useful defaults for model estimation, usage of multiple processor cores, number of posterior predictive samples, etc. By calling mpt_options() without any parameters, you can inspect these default values. If you want to change them, call mpt_options with the respective parameter specified, i.e. mpt_options(n.iter = 1000). For testing purposes, you can also specify mpt_options("test"), which is a shorthand for setting fast, but highly unreliable settings. You can set options to defaults, again, by typing the shorthand mpt_options("default").

# How to change a single option:
mpt_options(n.iter = 1e3)

# For testing purposes, you can use this shorthand to set fast, but unreliable options:
mpt_options("test")

# List all options that were set for the different analysis approaches:
mpt_options()

Model Fitting

The main computations are performed with a call to fit_mpt. In the default settings, the ten analysis options offered by MPTmultiverse are performed. Type ?fit_mpt in the R console if you want to see what these options are and find out more about the parameters of the function. The help page also contains a comprehensive overview of the results object returned by fit_mpt.

Before fitting, we set a seed to make the analysis reproducible and set the options to the default settings.

set.seed(42)
mpt_options("default")

results <- fit_mpt(
  model = EQN_FILE
  , dataset = DATA_FILE
  , data = data
  , id = COL_ID
  , condition = COL_CONDITION
  , core = c("D", "d")
)

After fitting it is a good idea to save the results as a binary R data file for later. This is easiest done using save(). To save all information necessary to recreate the analysis we only need to save the results tibble as it contains both data and model as attributes (see str(results, 1)).

We can automatically create a file name for the file holding the results based on the model and data file.

save(results, file = paste0(EQN_FILE, "-", DATA_FILE, ".RData"))

In the current example this would usually lead to quite a long filename (e.g., see EQN_FILE), so one can also use a custom filename.

save(results, file = "results_bayen_kuhlmann_2HTSM4.RData")

One can also directly save in a subfolder of the current working directory (if the subfolder exists).

save(results, file = "fits/results_bayen_kuhlmann_2HTSM4.RData")

Checking the Fit

After computations finished, which may take between a couple of hours to days, check if model estimation worked by using the function check_results.

check_results(results)
#> ## MPTinR: no pooling
#> Based on asymptotic method, proportion of participants with non-identified parameters:
#>   condition proportion
#> 1 load          0.0208
#> 2 no_load       0.0208
#> 
#> Based on asymptotic CIs, table of non-identified parameters:
#>   condition Var1      n
#> 1 load      d         2
#> 2 no_load   d         1
#> 3 no_load   g         1
#> 
#> Based on PB/MLE method, proportion of participants with non-identified parameters:
#>   condition proportion
#> 1 load          0.177 
#> 2 no_load       0.0312
#> 
#> Based on PB/MLE CIs, table of non-identified parameters:
#>   condition Var1      n
#> 1 load      d        17
#> 2 no_load   g         3
#> 
#> Based on NPB/MLE method, proportion of participants with non-identified parameters:
#>   condition proportion
#> 1 load          0.167 
#> 2 no_load       0.0312
#> 
#> Based on NPB/MLE CIs, table of non-identified parameters:
#>   condition Var1      n
#> 1 load      d        16
#> 2 no_load   g         3
#> 
#> 
#> ## MPTinR: complete pooling
#> No convergence problems.
#> 
#> 
#> ## TreeBUGS, no, simple:
#> All Rhat < 10 .
#> All effect sample sizes > 2 .
#> 
#> 
#> ## TreeBUGS, complete, simple:
#> All Rhat < 10 .
#> All effect sample sizes > 2 .
#> 
#> 
#> ## TreeBUGS, partial, trait:
#> All Rhat < 10 .
#> All effect sample sizes > 2 .
#> 
#> 
#> ## TreeBUGS, partial, trait_uncorrelated:
#> All Rhat < 10 .
#> All effect sample sizes > 2 .
#> 
#> 
#> ## TreeBUGS, partial, beta:
#> All Rhat < 10 .
#> All effect sample sizes > 2 .
#> 
#> 
#> ## TreeBUGS, partial, betacpp:
#> All Rhat < 10 .
#> All effect sample sizes > 2 .

In this example, for the no-pooling asymptotic approaches the rate of participants with non-identified parameters is very low. For the bootstrap-based approaches the results pattern is different. Here we see that the rate of participants with non-identified parameters in the load condition is considerably higher, around .17 versus .03 in the no_load condition. Particularly, the \(d\) parameter shows problematic behavior.

For the Bayesian approaches, the betacpp did not reach an effective sample size \(\mathit{ESS} > 2{,}000\). Increasing the number of iterations by typing mpt_options(n.iter = 2e5), and re-running, should solve this problem.

Returned Object

fit_mpt returns a tibble with an additional class, multiverseMPT, with one row per estimation method. When using the default setting and if all methods succeed, fit_mpt uses ten estimation methods and thus this tibble contains ten rows. The first five columns contain information about data and method and the remaining columns contain the results. Most of the results columns are list columns and inspection of the content is performed most conveniently using packages dplyr and tidyr. We therefore load these packages before taking a glimpse at the columns.

library("dplyr")
library("tidyr")
glimpse(results)
#> Observations: 10
#> Variables: 17
#> $ model            <chr> "2HTSM_Submodel4.eqn", "2HTSM_Submodel4.eqn",...
#> $ dataset          <chr> "Kuhlmann_dl7.csv", "Kuhlmann_dl7.csv", "Kuhl...
#> $ pooling          <chr> "complete", "no", "no", "no", "no", "complete...
#> $ package          <chr> "MPTinR", "MPTinR", "MPTinR", "MPTinR", "Tree...
#> $ method           <chr> "asymptotic", "asymptotic", "PB/MLE", "NPB/ML...
#> $ est_group        <list> [<# A tibble: 8 x 9,   condition parameter c...
#> $ est_indiv        <list> [# A tibble: 0 x 0, <# A tibble: 192 x 11,  ...
#> $ est_rho          <list> [<# A tibble: 0 x 3, # ... with 3 variables:...
#> $ test_between     <list> [<# A tibble: 4 x 11,   parameter core  cond...
#> $ gof              <list> [<# A tibble: 1 x 6,   type  focus stat_obs ...
#> $ gof_group        <list> [<# A tibble: 2 x 7,   condition type  focus...
#> $ gof_indiv        <list> [# A tibble: 0 x 0, <# A tibble: 48 x 8,    ...
#> $ fungibility      <list> [<# A tibble: 0 x 3, # ... with 3 variables:...
#> $ test_homogeneity <list> [<# A tibble: 2 x 4,   condition chisq    df...
#> $ convergence      <list> [<# A tibble: 3 x 4,   condition  rank.fishe...
#> $ estimation       <list> [<# A tibble: 4 x 2,   condition     time_di...
#> $ options          <list> [<# A tibble: 1 x 13,   bootstrap_sampl… n.o...

Bayen and Kuhlman (2011) report a difference in the \(g\) parameter across conditions (they actually report an interaction with a further within-subjects factor, but this is not considered here). Thus, we can take a look at the difference in \(g\) parameter across conditions and methods in the following manner: We first select the column containing the results of the between-condition tests, unnest this column, and then select only data containing the relevant parameter.

results %>% 
  select(pooling:method, test_between) %>% 
  unnest() %>% 
  filter(parameter == "g") %>% 
  print(width = 150)
#> # A tibble: 10 x 14
#>    pooling  package  method             parameter core  condition1
#>    <chr>    <chr>    <chr>              <chr>     <lgl> <chr>     
#>  1 complete MPTinR   asymptotic         g         FALSE no_load   
#>  2 no       MPTinR   asymptotic         g         FALSE no_load   
#>  3 no       MPTinR   PB/MLE             g         FALSE no_load   
#>  4 no       MPTinR   NPB/MLE            g         FALSE no_load   
#>  5 no       TreeBUGS simple             g         FALSE no_load   
#>  6 complete TreeBUGS simple             g         FALSE no_load   
#>  7 partial  TreeBUGS trait              g         FALSE no_load   
#>  8 partial  TreeBUGS trait_uncorrelated g         FALSE no_load   
#>  9 partial  TreeBUGS beta               g         FALSE no_load   
#> 10 partial  TreeBUGS betacpp            g         FALSE no_load   
#>    condition2 est_diff     se       p ci_0.025 ci_0.1  ci_0.9 ci_0.975
#>    <chr>         <dbl>  <dbl>   <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#>  1 load         -0.186 0.0316 NA        -0.248 -0.227 -0.146  -0.125  
#>  2 load         -0.151 0.0781  0.0587   -0.304 -0.251 -0.0511  0.00189
#>  3 load         -0.133 0.0808  0.104    -0.292 -0.237 -0.0298  0.0250 
#>  4 load         -0.133 0.0808  0.104    -0.292 -0.237 -0.0298  0.0250 
#>  5 load         -0.144 0.0286  0        -0.200 -0.181 -0.107  -0.0880 
#>  6 load         -0.185 0.0317  0        -0.248 -0.226 -0.145  -0.123  
#>  7 load         -0.187 0.100   0.0722   -0.378 -0.313 -0.0573  0.0157 
#>  8 load         -0.198 0.0944  0.0438   -0.378 -0.316 -0.0758 -0.00549
#>  9 load         -0.152 0.0706  0.0348   -0.288 -0.242 -0.0607 -0.0109 
#> 10 load         -0.154 0.0706  0.032    -0.290 -0.243 -0.0641 -0.0138

Inspecting the differences across the analysis multiverse shows that the estimated difference is negative in each case and the 95% credibility/confidence intervals around the estimate do not include zero for 6 out of the 10 methods. Only the CIs for the no-pooling frequentist methods as well as the most sophisticated model, the latent trait model, include 0. However, the 80% CIs do not include zero for all methods. Taken together, this provides evidence that the \(g\) parameter is larger in the load compared to the no_load condition.

Plotting Results

The analysis output results is an object of class multiverseMPT, that has its own plot() method. The plot method returns ggplot2 objects, which allows further customization (such as using themes). Type ?plot.multiverseMPT to see the documentation of possible arguments to this method.

To plot group-level parameter estimates use:

plot(results, save = FALSE, "est")

To plot between-subjects comparisons across all parameters:

plot(results, save = FALSE, "test_between")

To plot overall goodness-of-fit use:

plot(results, save = FALSE, "gof1")

To plot group-wise goodness-of-fit use:

plot(results, save = FALSE, "gof2")

References