In this vignette we introduce the functionality of the
`gFormulaMI`

. The package implements an approach to
constructing a G-formula estimator using multiple imputation methods and
software, as described by Bartlett et al (2023).

First we use the simulated dataset provided in the package to
demonstrate how to impute potential outcomes using
`gFormulaImpute`

. We then show how the resulting imputed
datasets can be analysed using the `syntheticPool`

function.
Lastly, we illustrate how the functions can be used when the dataset has
missing values to begin with (i.e. regular missing data).

`gFormulaImpute`

First, we load the package

`library(gFormulaMI)`

The first few rows of the simulated dataset that ships with the package look like

```
head(simDataFullyObs)
#> l0 a0 l1 a1 l2 a2 y
#> 1 -0.6332580 1 0.59545963 1 2.7644551 1 3.4091077
#> 2 0.1182498 0 0.62444135 0 -0.9677095 1 -0.8586307
#> 3 0.4833364 1 -0.53836723 0 -0.7088896 0 -0.4378790
#> 4 0.6775931 0 0.57759241 1 2.0477487 1 3.8760967
#> 5 0.4278554 1 1.50188037 1 3.4155576 1 4.4027209
#> 6 0.6437421 1 -0.01322485 1 2.2178911 1 4.1302417
```

Here the `l`

variables correspond to a confounder measured
at baseline (`l0`

) and two follow-up time points
(`l1`

and `l2`

). The time-varying binary treatment
variable is stored in `a0`

, `a1`

and
`a2`

. The final outcome variable is `y`

.

Next we use `gFormulaImpute`

to impute potential outcomes
under two treatment regimes of interest. To do this,
`gFormulaImpute`

uses the `mice`

multiple
imputation function from the `mice`

package. If passed a
regular data frame, as here, `gFormulaImpute`

expects (and
requires) there to be no missing values in the data frame. To run, we
have to tell `gFormulaImpute`

which variables correspond to
the time-varying treatment and the treatment regime(s) that we want it
to create imputations for:

```
set.seed(7626)
#impute synthetic datasets under two regimes of interest
<- gFormulaImpute(data=simDataFullyObs,M=10,trtVars=c("a0","a1","a2"),
imps trtRegimes=list(c(0,0,0),c(1,1,1)))
#> [1] "Input data is a regular data frame."
#> [1] "Variables imputed using:"
#> l0 a0 l1 a1 l2 a2 y regime
#> "norm" "" "norm" "" "norm" "" "norm" ""
#> [1] "Predictor matrix is set to:"
#> l0 a0 l1 a1 l2 a2 y regime
#> l0 0 0 0 0 0 0 0 0
#> a0 1 0 0 0 0 0 0 0
#> l1 1 1 0 0 0 0 0 0
#> a1 1 1 1 0 0 0 0 0
#> l2 1 1 1 1 0 0 0 0
#> a2 1 1 1 1 1 0 0 0
#> y 1 1 1 1 1 1 0 0
#> regime 1 1 1 1 1 1 1 0
```

Here we have called `gFormulaImpute`

requesting 10
imputations to be generated for two regimes of interest, corresponding
to no treatment at each time point (0,0,0) and treatment at each time
point (1,1,1).

The printed output above tells us what imputation methods have been
used to impute (simulate) the time-varying confounders and outcome. By
default, `gFormulaImpute`

specifies to impute numeric
variables using normal linear regression, which is why here
`l0`

, `l1`

, `l2`

and `y`

are
being imputed using method `norm`

. If there had been binary
factors as time-varying confounders, these would be imputed using
logistic regression. Factor time-varying confounders which are unordered
are imputed using polytomous regression and a proportional odds model is
used for ordered factors. If you want to modify these defaults, you can
pass a `method`

vector to `gFormulaImpute`

, which
specifies which of the imputation methods available in the
`mice`

package to use when imputing each column of the data
frame.

The output also shows the `predictorMatrix`

argument used
when calling the `mice`

imputation function. This shows, in a
given row, which other variables (indicated by a 1) will be used as
covariates in the imputation model for the variable labelled in that
row. Thus here, `l1`

is being imputed using `l0`

and `a0`

. For variables which are not being imputed (as
indicated by an empty entry in the vector of imputation methods
printed), the corresponding row in the predictor matrix has no effect on
anything. In particular, the treatment variables `a0`

,
`a1`

and `a2`

are not being imputed based on a
regression model.

You will notice that the printed `predictorMatrix`

has 1s
in the lower diagonal and 0s in the upper diagonal. This is because
g-formula and hence `gFormulaImpute`

, imputes sequentially
forwards in time, using the past (i.e. to the left) variables as
covariates. Because of this, the data frame you pass to
`gFormulaImpute`

must have the variables ordered in time as
in the example above, so that the correct covariates are used in each
model. If you wish to modify the predictor matrix used, you can pass a
custom one using the `predictorMatrix`

argument of
`gFormulaImpute`

.

Note that unlike in the usual missing data situation, no iteration is
required when `gFormulaImpute`

calls the `mice`

imputation function, and thus when `gFormulaImpute`

calls
`mice`

it sets its `maxit`

argument to 1.

In our call above we asked `gFormulaImpute`

to create 10
imputations. Later when we analyse the imputations, the special pooling
rules we will apply can, with a small number of imputations, produce
negative variances for parameter estimates. To avoid this, we recommend
using at least 25 imputations in general (we used 10 here for the sake
of speed).

In our call above we did not specify the `nSim`

argument.
This argument specifies how many individuals to simulate longitudinal
histories for in each of the synthetic imputed datasets. Since we did
not specify a value, the default is to simulate the same number as the
number of rows in the data frame passed to `gFormulaImpute`

(here 5,000). If more than one treatment regime is specified,
`nSim`

rows are simulated for each regime.

`gFormulaImpute`

creates a set of synthetic imputed
datasets. That is, the imputed datasets created contain imputed
(simulated) longitudinal histories based on the fitted models, under the
treatment regimes specified. The original rows in the data frame passed
to `gFormulaImpute`

do not appear in the imputed datasets
returned to us. The first few rows of the first imputed dataset are:

```
head(mice::complete(imps))
#> l0 a0 l1 a1 l2 a2 y regime
#> 1 0.6177218 0 0.6397720 0 -0.0003761256 0 0.5996190 1
#> 2 0.2044563 0 -0.6073790 0 -1.7856679957 0 -1.7246571 1
#> 3 0.4327859 0 -0.3925652 0 -0.7555212806 0 -1.5840459 1
#> 4 0.6059553 0 2.4764117 0 0.7312018102 0 0.6342567 1
#> 5 0.2985589 0 0.9223022 0 0.0734605439 0 -1.1511625 1
#> 6 -0.1763264 0 0.6122369 0 0.8838681514 0 2.0858243 1
```

We see that in the first rows of the first imputed dataset,
`a0`

, `a1`

and `a2`

are always zero,
because the first treatment regime we specified to impute for was
(0,0,0). We also have an additional variable, called
`regime`

. This factor variable records which of the specified
treatment regimes each row corresponds to. Thus here, in the first few
rows of the data frame, the regime is 1.

To analyse the imputed datasets we first run our desired analysis of
the imputed longitudinal histories. Here we will simply compare the mean
of the final outcome variable `y`

between the two regimes we
have imputed for. To do this, we fit a linear model with `y`

as the dependent variable and `regime`

as covariate:

`<- with(imps, lm(y~factor(regime))) fits `

Because the imputed datasets we have analysed are fully synthetic, we
cannot (or rather should not) pool our estimates using Rubin’s standard
combination rules (e.g. as implemented in `pool`

in the
`mice`

package). Doing so results in variances which are
larger than they should be. Instead we use the synthetic imputation
combination rules developed by Raghunathan et al (2003), implemented in
the `syntheticPool`

function:

```
syntheticPool(fits)
#> Estimate Within Between Total df
#> (Intercept) -0.02071539 0.0008125695 0.001763004 0.001126735 3.038045
#> factor(regime)2 2.96593502 0.0016251389 0.004203106 0.002998278 3.784951
#> 95% CI L 95% CI U p
#> (Intercept) -0.1267874 0.08535658 5.803156e-01
#> factor(regime)2 2.8104386 3.12143146 1.304839e-06
```

Here the intercept corresponds to the mean of `y`

under
our first regime (0,0,0). The `factor(regime)2`

coefficient
corresponds to difference in the mean of `y`

between the
second regime and the first. Here we see that the second regime has a
mean outcome about 3 higher than the first. As well as the point
estimate, we see the within, between and total imputation variances.
Here we can see that the total variances are less than the between,
which never happens with Rubin’s regular pooling rules. This is because
in the synthetic pooling rules of Raghunthan et al (2003), the total
variance is the between minus the within imputation variance (plus a
correction for the finite number of imputations performed).

Often the longitudinal dataset we have has some missing values. In
this context, one approach is to use multiple imputation to impute these
missing values using imputation software, assuming missing at random.
Having imputed these, we can then pass the imputed datasets to
`gFormulaImpute`

to perform the synthetic imputation
step.

To illustrate these steps, let’s now create a new dataset from the simulated one, making some values missing:

```
<- simDataFullyObs
simDataMissing $l0[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$l1[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$l2[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$y[runif(nrow(simDataMissing))<0.2] <- NA simDataMissing
```

Here we have simply made around 20% of values missing in
`l0`

, `l1`

, `l2`

and `y`

,
completely randomly.

Next, we impute the `simDataMissing`

data frame using the
`mice`

function:

`<- mice::mice(simDataMissing,m=10,printFlag=FALSE) simDataMissingImps `

In a real substantive analysis we should take more care to think
about how we specify the imputation models. For more discussion of this
point, see Bartlett et al
(2023). In particular, note that the default behaviour of the
`mice`

function is to impute numeric variables using the
predictive mean matching method, rather than normal linear regression
(as in the `gFormulaImpute`

function).

We can now pass our multiply imputed datasets to
`gFormulaImpute`

to create the synthetic imputations as
before, but this time let’s try and generate 20 synthetic imputated
datasets:

```
<- gFormulaImpute(data=simDataMissingImps,M=20,trtVars=c("a0","a1","a2"),
imps2 trtRegimes=list(c(0,0,0),c(1,1,1)))
#> [1] "Input data is a mice created multiple imputation object."
#> [1] "Value passed to M being ignored."
#> [1] "Number of synthetic imputations to be generated set to 10 as in mids object passed to gFormulaImpute."
#> [1] "Variables imputed using:"
#> l0 a0 l1 a1 l2 a2 y regime
#> "norm" "" "norm" "" "norm" "" "norm" ""
#> [1] "Predictor matrix is set to:"
#> l0 a0 l1 a1 l2 a2 y regime
#> l0 0 0 0 0 0 0 0 0
#> a0 1 0 0 0 0 0 0 0
#> l1 1 1 0 0 0 0 0 0
#> a1 1 1 1 0 0 0 0 0
#> l2 1 1 1 1 0 0 0 0
#> a2 1 1 1 1 1 0 0 0
#> y 1 1 1 1 1 1 0 0
#> regime 1 1 1 1 1 1 1 0
```

The output looks much the same as the first time we called
`gFormulaImpute`

, apart from the first few lines of output.
`gFormulaImpute`

can see that there are only 10 imputations
of the original dataset, and it refuses to generate a different number
of imputations to the number in the `mids`

multiple
imputation dataset object we have passed to it. The `imps2`

object thus contains 10 imputations, which can be analysed using
`syntheticPool`

in the same way as before.

Bartlett JW, Olarte Parra C, Daniel RM. G-formula for causal inference via multiple imputation. 2023 arXiv 2301.12026

Raghunathan TE, Reiter JP, Rubin DB. 2003. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), p.1-16.