# Purpose of this vignette

The main purpose of PUMP is to estimate power, MDES, and sample size requirements. As a separate task, PUMP provides extra functions that allow a user to generate data that simulates multi-level RCTs. These functions may not be relevant for most users, but we provide them for users who may find it useful. This vignette documents how to use the simulation functions.

# Choosing parameter values

The user first sets parameter values that will inform the data generating process. For full explanations of parameters, see the Technical Appendix.

Note that some of these parameters directly influence things like power calculations, while others are "nuisance" parameters that are nevertheless needed to generate a full dataset.

model.params.list <- list(
M = 3                             # number of outcomes
, J = 30                          # number of schools
, K = 10                          # number of districts
# (for two-level model, set K = 1)
, nbar = 50                       # number of individuals per school
, rho.default = 0.5               # default rho value (optional)
################################################## impact
, MDES = 0.125                    # minimum detectable effect size
################################################## level 3: districts
, R2.3 = 0.1                      # percent of district variation
# explained by district covariates
, ICC.3 = 0.2                     # district intraclass correlation
, omega.3 = 0.1                   # ratio of district effect size variability
# to random effects variability
################################################## level 2: schools
, R2.2 = 0.1                      # percent of school variation
# explained by school covariates
, ICC.2 = 0.2                     # school intraclass correlation
, omega.2 = 0.1                   # ratio of school effect size variability
# to random effects variability
################################################## level 1: individuals
, R2.1 = 0.1                      # percent of indiv variation explained
# by indiv covariates
)

Above is the minimum set of parameters the user can provide. However, the user can also make additional choices that influence the simulated dat. A couple of notes about possible choices: - By default, the function will generate a vector of school and district assignments S.id and D.id that are evenly split, e.g. with an equal number of schools in each district and an equal number of students in each school. However, the user can also provide their own vector of assignments if they require a specific setup. - If the user specifies a rho.default value, all matrices will be populated using the assumed $$\rho$$. The user can instead provide their own $$\rho$$ matrices.

The full set of parameters the user can specify is below.

M <- 3
rho.default <- 0.5
default.rho.matrix <- gen_corr_matrix(M = M, rho.scalar = rho.default)
default.kappa.matrix <- matrix(0, M, M)

model.params.list <- list(
M = 3                             # number of outcomes
, J = 30                          # number of schools
, K = 10                          # number of districts
# (for two-level model, set K = 1)
, nbar = 50                       # number of individuals per school
, S.id = NULL                     # N-length vector of school assignments
, D.id = NULL                     # N-length vector of district assignments
################################################## grand mean outcome and impact
, Xi0 = 0                         # scalar grand mean outcome under no treatment
, MDES = rep(0.125, M)            # minimum detectable effect size
################################################## level 3: districts
, R2.3 = rep(0.1, M)              # percent of district variation
# explained by district covariates
, rho.V = default.rho.matrix      # MxM correlation matrix of district covariates
, ICC.3 = rep(0.2, M)             # district intraclass correlation
, omega.3 = rep(0.1, M)           # ratio of district effect size variability
# to random effects variability
, rho.w0 = default.rho.matrix     # MxM matrix of correlations for district random effects
, rho.w1 = default.rho.matrix     # MxM matrix of correlations for district impacts
, kappa.w =  default.kappa.matrix # MxM matrix of correlations between district
# random effects and impacts
################################################## level 2: schools
, R2.2 = rep(0.1, M)              # percent of school variation
# explained by school covariates
, rho.X = default.rho.matrix      # MxM correlation matrix of school covariates
, ICC.2 = rep(0.2, M)             # school intraclass correlation
, omega.2 = rep(0.1, M)           # ratio of school effect size variability
# to random effects variability
, rho.u0 = default.rho.matrix     # MxM matrix of correlations for school random effects
, rho.u1 = default.rho.matrix     # MxM matrix of correlations for school impacts
, kappa.u = default.kappa.matrix  # MxM matrix of correlations between school
# random effects and impacts
################################################## level 1: individuals
, R2.1 = rep(0.1, M)              # percent of indiv variation explained
# by indiv covariates
, rho.C = default.rho.matrix      # MxM correlation matrix of individual covariates
, rho.r = default.rho.matrix      # MxM matrix of correlations for individual residuals
)

# Simulate data

Once the user has chosen the model parameters, the only remaining choice is Tbar, the proportion of units assigned to the treatment. This parameter is not considered a modeling parameter as it instead informs the randomization process.

sim.data <- gen_sim_data(d_m = 'd3.3_m3rc2rc', model.params.list, Tbar = 0.5)

The final output contains a list of the following vectors: - potential outcomes Y0 and Y1. - observed outcomes Yobs. - treatment assignment T.x. - covariates at each level: level 3 V.k, level 2 X.jk, level 1 C.ijk. - ID assignments at each level: level 2 S.id, level 3 D.id.

Note that the simulated data contains both observed parameters and unobserved parameters (the unobserved potential outcome, depending on treatment assignment).

# Simulation process

We briefly walk through the steps conducted by the gen_sim_data function in case the user wants to inspect intermediate steps of the process.

First, the user-given parameters are then converted into parameters that inform the data-generating process (DGP). For example, a certain value of $$R^2$$ is converted in a coefficient value.

dgp.params.list <- convert_params(model.params.list)

Next, we generate a set of full simulation data, but without assuming any treatment assignment has occurred. The simulated data includes both unobserved and unobserved quantities, such as both potential outcomes.

sim.data <- gen_base_sim_data(dgp.params.list)

Finally, we generate the treatment assignment, and the observed outcomes $$Y^{obs}$$. At this point, we need to specify the design and Tbar to generate the correct treatment assignment.

d_m <- 'd3.3_m3rc2rc'
sim.data$T.x <- gen_T.x( d_m = d_m, S.id = sim.data$ID$S.id, D.id = sim.data$ID$D.id, Tbar = 0.5 ) sim.data$Yobs <- gen_Yobs(sim.data, T.x = sim.data\$T.x)