After assessing balance and deciding on a matching specification, it comes time to estimate the effect of the treatment in the matched sample. How the effect is estimated and interpreted depends on the desired estimand and the type of model used (if any). In addition to estimating effects, estimating the uncertainty of the effects is critical in communicating them and assessing whether the observed effect is compatible with there being no effect in the population. This guide explains how to estimate effects after various forms of matching and with various outcome types. There may be situations that are not covered here for which additional methodological research may be required, but some of the recommended methods here can be used to guide such applications.

This guide is structured as follows: first, information on the concepts related to effect and standard error estimation is presented below. Then, instructions for how to estimate effects and standard errors are described. This section is split up first by the type of matching method performed (matching without replacement, matching with replacement, full matching, and subclassification) and then by the type of outcome (continuous, binary, and survival). Finally recommendations for reporting results and tips to avoid making common mistakes are presented.

Before an effect is estimated, the estimand must be specified and clarified. Although some aspects of the estimand depend not only on how the effect is estimated after matching but also on the matching method itself, other aspects must be considered at the tine of effect estimation and interpretation. Here, we consider three aspects of the estimand: the population the effect is meant to generalize to (the target population), the effect measure, and whether the effect is marginal or conditional.

**The target population.** Different matching methods allow you to estimate effects that can generalize to different target populations. The most common estimand in matching the average treatment effect in the treated (ATT), which is the average effect of treatment for those who receive treatment. This estimand is estimable for matching methods that do not change the treated units (i.e., by weighting or discarding units) and is requested in `matchit()`

by setting `estimand = "ATT"`

(which is the default). The average treatment effect in the population (ATE) is the average effect of treatment for the population from which the sample is a random sample. This estimand is estimable only for methods that allow the ATE and do not discard units from the sample, which in `MatchIt`

is limited to full matching and subclassification when setting `estimand = "ATE"`

. When treated units are discarded (e.g., through the use of common support restrictions, calipers, or [coarsened] exact matching), the estimand corresponds to neither the population ATT nor the population ATE, but rather to an average treatment effect in the remaining matched sample (ATM), which may not correspond to any specific target population.

**Marginal and conditional effects.** A marginal effect is a comparison between the expected potential outcome under treatment and the expected potential outcome under control. This is the same quantity estimated in randomized trials without blocking or covariate adjustment and is particularly useful for quantifying the overall effect of a policy or population-wide intervention. A conditional effect is the comparison between the expected potential outcomes in the treatment groups within strata. This is useful for identifying the effect of a treatment for an individual patient or a subset of the population.

**Effect measures.** The outcome types we consider here are continuous, with the effect measured by the mean difference; binary, with the effect measured by the risk difference (RD), risk ratio (RR), or odds ratio (OR); and time-to-event (i.e., survival), with the effect measured by the hazard ratio (HR). The RR, OR, and HR are *noncollapsible* effect measures, which means the marginal effect on that scale is not a (possibly) weighted average of the conditional effects within strata, even if the stratum-specific effects are of the same magnitude. For these effect measures, it is critical to distinguish between marginal and conditional effects because different statistical methods target different types of effects. The mean difference and RD are *collapsible* effect measures, so the same methods can be used to estimate marginal and conditional effects.

In this guide, we will provide examples for estimating the ATT, ATE, and ATM for each of the three outcomes types. Our primary focus will be on marginal effects, which are appropriate for all effect measures, easily interpretable, and require few modeling assumptions. We will include a short section on estimating the conditional OR. The “Common Mistakes” section includes examples of commonly used methods that estimate conditional rather than marginal effects and should not be used when marginal effects are desired.

The type and form of the model used to estimate the treatment effect depend on the effect measure desired, whether a marginal or conditional effect is desired, whether covariates are to be included in the model, and whether effect modification by the covariates is present. For continuous outcomes, one can use a linear model regressing the outcome on the treatment; for binary outcomes, one can use a generalized linear model with a link function appropriate for the desired effect measure (e.g., a logistic link for the OR); for time-to-event outcomes, one can use a Cox proportional hazards model.

An additional decision to make is whether (and how) to include covariates in the outcome model. One may ask, why use matching at all if you are going to model the outcome with covariates anyway? Matching reduces the dependence of the effect estimate on correct specification of the outcome model; this is the central thesis of Ho et al. (2007). Including covariates in the outcome model after matching has several functions: it can increase precision in the effect estimate, reduce the bias due to residual imbalance, and make the effect estimate “doubly robust,” which means it is consistent if either the matching reduces sufficient imbalance in the covariates or if the outcome model is correct. For these reasons, we recommend covariate adjustment after matching when possible. There is some evidence that covariate adjustment is most helpful for covariates with standardized mean differences greater than .1 (Nguyen et al. 2017), so these covariates and covariates thought to be highly predictive of the outcome should be prioritized in treatment effect models if not all can be included due to sample size constraints.

For continuous outcomes, we can simply include the covariates in the regression of the outcome on the treatment in the matched sample (i.e., using the matching weights). Because the mean difference is collapsible, the effect estimate conditioning on the covariates is still a marginal effect estimate. This is not the case with binary and time-to-event outcomes; including covariates in the outcome model makes the treatment effect a conditional effect. To recover a marginal effect while including covariates in an outcome model, one has to perform a marginal effects procedure, also known as g-computation (Snowden, Rose, and Mortimer 2011) or regression estimation (Schafer and Kang 2008), which involves simulating the average potential outcomes under each treatment level and computing the effect as a contrast between those average potential outcomes. G-computation can also be used to estimate marginal effects when modeling effect modification by the covariates. We demonstrate examples of this below.

Although there are many possible ways to include covariates (i.e., not just main effects but interactions, smoothing terms like splines, or other nonlinear transformations), it is important not to engage in specification search (i.e., trying many outcomes models in search of the “best” one). Doing so can invalidate results and yield a conclusion that fails to replicate. For this reason, we recommend only including the same terms included in the propensity score model unless there is a strong *a priori* and justifiable reason to model the outcome differently. Second, it is important not to interpret the coefficients and tests of the other covariates in the outcome model. These are not causal effects and their estimates may be severely confounded. Only the treatment effect estimate can be interpreted as causal assuming the relevant assumptions about unconfoundedness are met. Inappropriately interpreting the coefficients of covariates in the outcome model is known as the Table 2 fallacy (Westreich and Greenland 2013). To avoid this, in all examples that incorporate covariates in the outcome model, we restrict the output of outcome regression models to just the treatment coefficient.

Uncertainty estimation (i.e., of standard errors, confidence intervals, and p-values) may consider the variety of sources of uncertainty present in the analysis, including (but not limited to!) estimation of the propensity score (if used), matching (i.e., because treated units might be matched to different control units if others had been sampled), and estimation of the treatment effect (i.e., because of sampling error). In general, there are no analytic solutions to all these issues, so much of the research done on uncertainty estimation after matching has relied on simulation studies. The two primary methods that have been shown to perform well in matched samples are using cluster-robust standard errors and the bootstrap.

**Robust standard errors.** Also known as sandwich standard errors (due to the form of the formula for computing them), heteroscedasticity-consistent standard errors, or Huber-White standard errors, robust standard errors are an adjustment to the usual maximum likelihood or ordinary least squares standard errors that are robust to violations of some of the assumptions required for usual standard errors to be valid (MacKinnon and White 1985). Although there has been some debate about their utility (King and Roberts 2015), robust standard errors rarely degrade inferences and often improve them. Generally, robust standard errors **must** be used when any non-uniform weights are included in the estimation (e.g., with full matching or inverse probability weighting).

**Cluster-robust standard errors.** A version of robust standard errors known as cluster-robust standard errors (Liang and Zeger 1986) can be used to account for dependence between observations within clusters (e.g., matched pairs). Abadie and Spiess (2019) demonstrate analytically that cluster-robust standard errors are generally valid after matching, whereas regular robust standard errors can over- or under-estimate the true sampling variability of the effect estimator depending on the specification of the outcome model (if any) and degree of effect modification. A plethora of simulation studies have further confirmed the validity of cluster-robust standard errors after matching (e.g., Austin 2009, 2013a; Austin and Small 2014; Gayat et al. 2012; Wan 2019). Given this evidence favoring the use of cluster-robust standard errors, we recommend them in most cases and use them judiciously in the this guide^{1}.

Here, we will use linear and generalized linear models as implemented by `lm()`

and `glm()`

and use `lmtest::coeftest()`

and `lmtest::coefci()`

along with `sandwich::vcovHC()`

for robust standard errors and `sandwich::vcovCL()`

for cluster-robust standard errors. There are several “types” (e.g., HC0, HC1, etc.) of robust standard errors that adjust the original standard errors in different ways; although more research is required on the benefits of using different types for estimating standard errors after matching, we use the default types here.

Bootstrapping is a technique used to simulate the sampling distribution of an estimator by repeatedly drawing samples with replacement and estimating the effect in each bootstrap sample (Efron and Tibshirani 1993). From the bootstrap distribution, standard errors and confidence interval can be computed in several ways, including using the standard deviation of the bootstrap estimates as the standard error estimate or using the 2.5 and 97.5 percentiles as 95% confidence interval bounds. Bootstrapping tends to be most useful when no analytic estimator of a standard error is possible or has been derived yet. Although Abadie and Imbens (2008) found analytically that the bootstrap is inappropriate for matched samples, simulation evidence has found it to be adequate in many cases (Hill and Reiter 2006; Austin and Small 2014; Austin and Stuart 2017). It is the most accessible method for computing standard errors after g-computation for nonlinear models.

Typically, bootstrapping involves performing the entire estimation process in each bootstrap sample, including propensity score estimation, matching, and effect estimation. This tends to be the most straightforward route, though intervals from this method may be conservative in some cases (i.e., they are wider than necessary to achieve nominal coverage) (Austin and Small 2014). Less conservative and more accurate intervals have been found when using different forms of the bootstrap, including the wild bootstrap develop by Bodory et al. (2020) and the matched bootstrap described by Austin and Small (2014) and Abadie and Spiess (Abadie and Spiess 2019). The block bootstrap involves sampling matched pairs/strata of units from the matched sample and performing the analysis within each sample composed of the sampled pairs. Abadie and Spiess (Abadie and Spiess 2019) derived analytically that the block bootstrap is valid for estimating SEs and CIs in the same circumstances cluster robust SEs are; indeed, the block bootstrap SE is known to approximate the cluster-robust SE (Cameron and Miller 2015). We use `boot()`

from the `boot`

package to implement bootstrapping.

With bootstrapping, more bootstrap replications are always better but can take time and increase the chances that at least one error will occur within the bootstrap analysis (e.g., a bootstrap sample with zero treated units or zero units with an event). In general, numbers of replications upwards of 999 are recommended, with values one less than a multiple of 100 preferred to avoid interpolation when using the percentiles as confidence interval limits (MacKinnon 2006). There are several methods of computing bootstrap confidence intervals, but the bias-corrected accelerated (BCa) bootstrap confidence interval often performs best (Austin and Small 2014; Carpenter and Bithell 2000) and is easy to implement, simply by setting `type = "bca"`

in the call to `boot.ci()`

after running `boot()`

^{2}.

Below, we describe effect estimation after several methods of matching. We consider four broad types of matching that require their own specific methods for estimation effects: 1) pair matching without replacement, 2) pair matching with replacement, 3) full matching, and 4) stratification. In some cases, methods for estimating effects are similar across methods, and we err on the side of redundancy here in our instructions. We also consider three different outcome types: 1) continuous, 2) binary, and 3) survival.

We’ll be using a simulated toy dataset `d`

with several outcome types. Code to generate the dataset is at the end of this document. The focus here is not on evaluating the methods but simply on demonstrating them. In all cases, the correct propensity score model is used. Below we display the first six rows of `d`

:

`head(d)`

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S
## 1 0 0.1725 -1.4283 -0.4103 -2.36059 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46
## 2 0 -1.0959 0.8463 0.2456 -0.12333 1 -2.2687 -1.4491 -0.5514 -0.31439 0.15619 0 330.63
## 3 0 0.1768 0.7905 -0.8436 0.82366 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94
## 4 0 -0.4595 0.1726 1.9542 -0.62661 1 -0.4019 -0.8294 -0.5384 0.20729 -2.35184 0 91.06
## 5 1 0.3563 -1.8121 0.8135 -0.67189 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73
## 6 0 -2.4313 -1.7984 -1.2940 0.04609 1 -1.2419 -1.1252 -1.8659 -0.56513 -5.62260 0 2563.73
```

`A`

is the treatment variable, `X1`

through `X9`

are covariates, `Y_C`

is a continuous outcome, `Y_B`

is a binary outcome, and `Y_S`

is a survival outcome.

We will need to the following packages to perform the desired analyses:

`lmtest`

provides the`coeftest()`

and`coefci()`

functions for estimating coefficients, standard errors, and confidence intervals incorporating robust standard errors`sandwich`

provides robust and cluster robust standard errors through the`vcovHC()`

and`vcovCL()`

functions, respectively`boot`

provides the`boot()`

and`boot.ci()`

functions for performing bootstrapping and estimating bootstrap effects, standard errors, and confidence intervals.`survival`

provides the functions`survdiff()`

and`coxph()`

to perform the log-rank test for differences in survival curves and estimate the coefficients in a Cox-proportional hazards model for the marginal hazard ratio, respectively, which we will use for survival outcomes.

Of course, we also need `MatchIt`

to perform the matching.

```
library("MatchIt")
library("lmtest")
library("sandwich")
library("boot")
library("survival")
```

Other packages may be of use but are not used here. The `margins`

package can be useful for computing marginal effects and standard errors without bootstrapping. Some examples using `margins`

are presented. The `survey`

package can be used to estimate robust standard errors incorporating weights and provides functions for survey-weighted generalized linear models and Cox-proportional hazards models. It is often used with propensity score weighting. The `Zelig`

package provides a broad interface to estimating marginal and conditional effects using simulation and was included in the original documentation for `MatchIt`

. The `lme4`

(and `lmerTest`

) package performs mixed effects modeling, which can be useful for accounting for pair membership or other clustering features of the data.

Because different matching methods require different treatments, instructions for each method are organized in the following sections, as designated by the table below. It is important to ensure the right methods are used with the matching specification used in order for the estimated effects and standard errors to be valid.

Section | Matching Method |
---|---|

After Pair Matching Without Replacement | Nearest neighbor matching without replacement |

- | Optimal matching |

- | Genetic matching without replacement |

- | Coarsened exact matching with `k2k = TRUE` |

After Pair Matching With Replacement | Nearest neighbor matching with replacement |

- | Genetic matching with replacement |

After Full Matching | Full matching |

After Stratum Matching | Propensity score subclassification |

- | Exact matching |

- | Coarsened exact matching with `k2k = FALSE` (the default) |

Pair matching without replacement yields the simplest way to estimate treatment effects and standard errors. In general, whether a caliper, common support restriction, exact matching specification, or \(k\):1 matching specification is used, estimating the effect in the matched dataset (i.e., the output of `match.data()`

) is straightforward and involves fitting a model for the outcome that incorporates the matching weights^{3}.

First, we will perform nearest 1:1 neighbor propensity score matching without replacement.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNN + X7 + X8 + X9, data = d)
X6
mNN
```

```
## A matchit object
## - method: 1:1 nearest neighbor matching without replacement
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 882 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
<- match.data(mNN)
md
head(md)
```

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights subclass
## 1 0 0.17254 -1.4283 -0.4103 -2.3606 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 0.08461 1 167
## 3 0 0.17677 0.7905 -0.8436 0.8237 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 0.22210 1 210
## 5 1 0.35631 -1.8121 0.8135 -0.6719 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 0.43291 1 322
## 9 0 0.78080 1.3137 0.6580 0.8540 1 0.9495 -0.5731 -0.2362 -0.14580 15.89771 1 67.53 0.15751 1 3
## 10 1 -0.56513 -0.1053 -0.1369 1.6233 1 -0.5304 -0.3342 0.4184 0.46308 1.07888 1 113.70 0.16697 1 1
## 11 0 0.07053 -0.1320 -1.8882 0.3077 0 -1.6558 0.8405 -0.5723 -1.01382 -6.55597 0 1632.69 0.45623 1 298
```

Typically one would assess balance and ensure that this matching specification works, but we will skip that step here to focus on effect estimation. See `vignette("MatchIt")`

and `vignette("assessing-balance")`

for more information on this necessary step. Because we did not use a caliper, the target estimand is the ATT.

For continuous outcomes, estimating effects is fairly straightforward using linear regression. We perform all analyses using the matched dataset, `md`

, which contains only units retained in the sample.

**Without covariate adjustment.** To estimate the effect without covariate adjustment, we can run the following:

```
#Linear model without covariates
<- lm(Y_C ~ A, data = md, weights = weights) fit1
```

First, we will estimate the standard errors while accounting for pair membership using cluster-robust standard errors.

```
#Cluster-robust standard errors
coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.779 0.303 5.88 5.8e-09 ***
## A 2.114 0.409 5.18 2.8e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Using the bootstrap.** We demonstrate bootstrapping here using the block bootstrap as recommended by Abadie and Spiess (2019). The process is generalizable to other matching methods for which cluster-robust standard errors are valid and to other outcome types. We first need to write a function, `est_fun`

, which takes in a vector of pair identifiers and an ordering and outputs an effect estimate. This function should include the effect estimation, though no standard error is required.

```
#Block bootstrap confidence interval
# library(boot)
<- levels(md$subclass)
pair_ids
<- function(pairs, i) {
est_fun
#Compute number of times each pair is present
<- table(pairs[i])
numreps
#For each pair p, copy corresponding md row indices numreps[p] times
<- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
ids function(p) rep(which(md$subclass == p),
numreps[p])))
#Subset md with block bootstrapped ids
<- md[ids,]
md_boot
#Effect estimation
<- lm(Y_C ~ A, data = md_boot, weights = weights)
fit_boot
#Return the coefficient on treatment
return(coef(fit_boot)["A"])
}
<- boot(pair_ids, est_fun, R = 499)
boot_est boot_est
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = pair_ids, statistic = est_fun, R = 499)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.114 0.003777 0.3862
```

`boot.ci(boot_est, type = "bca")`

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 499 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_est, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 1.317, 2.758 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
```

The value `t1*`

in the `Bootstrap Statistics`

table contains the effect estimate and its bootstrap standard error, and the interval in the confidence interval output is the bootstrap confidence interval computed using the specified type (here, `"bca"`

).

**With covariate adjustment.** Including covariates in the outcome model is straightforward with a continuous outcome. We can include main effects for each variable and use the coefficient on the treatment as the treatment effect estimate. It can also be helpful to include interactions between the covariates and treatment if effect modification is suspected, though it is important to center the covariates at their means in the focal (i.e., treated) group if the coefficient on the treatment is to be interpreted as an average treatment effect estimate (which we demonstrate with full matching). A marginal effects procedure can also be used, which we demonstrate with binary outcomes. Below we simply include main effects of each covariate, which is the most typical way to include covariates.

```
#Linear model with covariates
<- lm(Y_C ~ A + X1 + X2 + X3 + X4 + X5 +
fit3 + X7 + X8 + X9, data = md,
X6 weights = weights)
coeftest(fit3, vcov. = vcovCL, cluster = ~subclass)["A",,drop=FALSE]
```

```
## Estimate Std. Error t value Pr(>|t|)
## A 2.029 0.3202 6.338 3.735e-10
```

As previously mentioned, it is important to avoid interpreting coefficients in the outcome model on predictors other than the treatment, as the coefficients do no correspond to causal effects and are likely confounded. In the code above, we restricted the output to just the treatment effect. Note that sometimes the intercept of the outcome model can be useful as an estimate of the average potential outcome under control, but the covariates in the model (if any) must be centered at their means in the target population to allow for this interpretation. Without centering the covariates, the intercept is uninterpretable.

For binary outcomes, effect estimation can be a bit more challenging, especially when including covariates. There are several measures of the effect one can consider, which include the odds ratio (OR), risk ratio/relative risk (RR), and risk difference (RD).

**Without covariate adjustment.** When omitting covariates from the outcome model, effect and standard error estimation is fairly straightforward. Below we demonstrate how to estimate the marginal log OR after nearest neighbor matching without replacement.

```
#Generalized linear model without covariates
<- glm(Y_B ~ A, data = md, weights = weights,
fit4 family = binomial(link = "logit"))
```

By specifying `link = "logit"`

, we fit a logistic regression model to estimate the marginal OR for treatment. To estimate the marginal RR we can specify `link = "log"`

, and to estimate the RD we can specify `link = "identity"`

or use `lm()`

instead of `glm()`

. Below we estimate cluster-robust standard errors, though it should be noted that these standard errors can be biased for the OR and should be used with caution (Austin 2007). Because we want the OR and the effects are estimated on the log OR scale, we have to exponentiate the coefficient on treatment to arrive at the OR (one would do this when estimating the OR or RR, but not the RD).

```
#Cluster-robust standard errors
coeftest(fit4, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.939 0.106 -8.85 < 2e-16 ***
## A 0.834 0.143 5.82 5.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit4)) #OR`

```
## (Intercept) A
## 0.3912 2.3030
```

**With covariate adjustment and bootstrapping.** If we want to include covariates in the model, we have to do some additional work to estimate the effect and its standard error. This is because the coefficient on treatment in a nonlinear model with covariates corresponds to the conditional rather than marginal effect. To estimate a marginal effect, we have to perform a marginal effects procedure, which is equivalent to g-computation in the matched set. First we estimate an outcome model including the covariates, and then we use the predictions from the outcome model to compute the contrast of the average potential outcomes under treatment and control. Bootstrapping is the most straightforward method to estimate standard errors. We demonstrate this below using the block bootstrap as recommended by Abadie and Spiess (2019).

```
#Block bootstrap confidence interval
# library(boot)
<- levels(md$subclass)
pair_ids
<- function(pairs, i) {
est_fun
#Compute number of times each pair is present
<- table(pairs[i])
numreps
#For each pair p, copy corresponding md row indices numreps[p] times
<- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
ids function(p) rep(which(md$subclass == p),
numreps[p])))
#Subset md with block bootstrapped ids
<- md[ids,]
md_boot
#Fitting outcome the model
<- glm(Y_B ~ A + X1 + X2 + X3 + X4 + X5 +
fit_boot + X7 + X8 + X9, data = md_boot,
X6 family = binomial(link = "logit"),
weights = weights)
#Estimate potential outcomes for each unit
#Under control
$A <- 0
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P0 w = md_boot$weights)
<- P0 / (1 - P0)
Odds0
#Under treatment
$A <- 1
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P1 w = md_boot$weights)
<- P1 / (1 - P1)
Odds1
#Return marginal odds ratio
return(Odds1 / Odds0)
}
<- boot(pair_ids, est_fun, R = 499)
boot_est boot_est
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = pair_ids, statistic = est_fun, R = 499)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.225 0.0291 0.2727
```

`boot.ci(boot_est, type = "bca")`

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 499 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_est, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 1.785, 2.827 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
```

To use bootstrapping for the RR or RD, the part of the code that computes the marginal OR can be replaced with code to compute the marginal RR (`P1 / P0`

) or marginal RD (`P1 - P0`

). Interactions between the treatment and covariates can be included in the outcome model; no other part of the procedure needs to be changed. Doing so can add robustness when effect modification is possible.

There are several measures of effect size for survival outcomes. When using the Cox proportional hazards model, the quantity of interest is the HR between the treated and control groups. As with the OR, the HR is non-collapsible, which means the estimated HR will only be a valid estimate of the marginal HR when no other covariates are included in the model. Other effect measures, such as the difference in mean survival times or probability of survival after a given time, can be treated just like continuous and binary outcomes as previously described. Here we describe estimating the marginal HR.

**Without covariate adjustment.** Below we demonstrate estimation of the marginal hazard ratio using `coxph()`

from the `survival`

package. To request cluster-robust standard errors as recommended by Austin (2013b), we need to supply pair membership (stored in the `subclass`

column of `md`

) to the `cluster`

argument and set `robust = TRUE`

.

```
#Cox Regression for marginal HR
coxph(Surv(Y_S) ~ A, data = md, robust = TRUE,
weights = weights, cluster = subclass)
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = md, weights = weights,
## robust = TRUE, cluster = subclass)
##
## coef exp(coef) se(coef) robust se z p
## A 0.43 1.53 0.07 0.07 6 1e-09
##
## Likelihood ratio test=39 on 1 df, p=5e-10
## n= 882, number of events= 882
```

The `coef`

column contains the log HR, and `exp(coef)`

contains the HR. Remember to always use the `robust se`

for the standard error of the log HR. The displayed z-test p-value results from using the robust standard error.

**Using the bootstrap.** Austin and Small (2014) found bootstrap confidence intervals to work well for marginal hazard ratios. Below we demonstrate this using similar code as before to implement the block bootstrap:

```
#Block bootstrap confidence interval
# library(boot)
<- levels(md$subclass)
pair_ids
<- function(pairs, i) {
est_fun
#Compute number of times each pair is present
<- table(pairs[i])
numreps
#For each pair p, copy corresponding md row indices numreps[p] times
<- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
ids function(p) rep(which(md$subclass == p),
numreps[p])))
#Subset md with block bootstrapped ids
<- md[ids,]
md_boot
#Effect estimation
<- coxph(Surv(Y_S) ~ A, data = md_boot, weights = weights)
cox_fit_boot
#Compute the marginal HR by exponentiating the coefficient
#on treatment
<- exp(coef(cox_fit_boot)["A"])
HR
#Return the HR
return(HR)
}
<- boot(pair_ids, est_fun, R = 499)
boot_est
boot_estboot.ci(boot_est, type = "bca")
```

**With covariate adjustment.** As with binary outcomes, if covariates are included in the model, the resulting hazard ratios will be conditional rather than marginal. Austin, Thomas, and Rubin (2020) describe several ways of including covariates in a model to estimate the marginal HR, but because they do not develop standard errors and little research has been done on this method, we will not present it here.

Pair matching with replacement makes estimating effects and standard errors a bit less straightforward. Control units paired with multiple treated units belong to multiple pairs at the same time and appear multiple times in the matched dataset. Effect and standard error estimation need to account for control unit multiplicity (i.e., repeated use) and within-pair correlations (Hill and Reiter 2006; Austin and Cafri 2020).

`MatchIt`

provides two interfaces for extracting the matched dataset after matching with replacement. The first uses `match.data()`

and provides the same output as when used after matching without replacement, except that the `subclass`

column is absent because units are not assigned to a single subclass but rather to several. Control units will have weights that differ from 1 to reflect their use as matches for multiple treated units. When using the `match.data()`

interface, including the weights in the estimation of effects is crucial. Because pair membership is omitted, accounting for it (if desired) must be done by conditioning on covariates used to match the pairs or by bootstrapping the entire process.

The second interface uses `get_matches()`

, which functions similarly to `match.data()`

except that the dataset contains one row per unit per pair, so control units matched to multiple treated units will have multiple rows in the dataset, and a `subclass`

column is included denoting pair membership. In the `get_matches()`

output, each reused control unit will have multiple rows, identical to each other except with different `subclass`

membership. When performing k:1 matching, weights must also be included in effect estimation when using `get_matches()`

.

Here we will demonstrate the estimation of effects using both `match.data()`

and `get_matches()`

output. The `match.data()`

output is preferred when pair membership is not directly included in the analysis, and the `get_matches()`

output is preferred when pair membership is to be included. Here will will use 3:1 matching with replacement and with a caliper; note that because a caliper was used, the estimand corresponds to neither the ATT nor the ATE but rather to an ATM.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNNr + X7 + X8 + X9, data = d,
X6 link = "linear.logit", caliper = .1,
ratio = 3, replace = TRUE)
mNNr
```

```
## A matchit object
## - method: 3:1 nearest neighbor matching with replacement
## - distance: Propensity score [caliper]
## - estimated with logistic regression and linearized
## - caliper: <distance> (0.13)
## - number of obs.: 2000 (original), 1040 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
#match.data output
<- match.data(mNNr)
md nrow(md)
```

`## [1] 1040`

`head(md)`

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights
## 1 0 0.1725 -1.4283 -0.4103 -2.3606 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 -2.381 0.9571
## 3 0 0.1768 0.7905 -0.8436 0.8237 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 -1.253 0.4785
## 5 1 0.3563 -1.8121 0.8135 -0.6719 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 -0.270 1.0000
## 7 0 1.8402 1.7601 -1.0746 -1.6428 1 1.4482 0.7131 0.6972 -0.94673 4.28651 1 97.49 -2.281 0.4785
## 9 0 0.7808 1.3137 0.6580 0.8540 1 0.9495 -0.5731 -0.2362 -0.14580 15.89771 1 67.53 -1.677 0.4785
## 10 1 -0.5651 -0.1053 -0.1369 1.6233 1 -0.5304 -0.3342 0.4184 0.46308 1.07888 1 113.70 -1.607 1.0000
```

```
#get_matches output
<- get_matches(mNNr)
gm nrow(gm)
```

`## [1] 1681`

`head(gm)`

```
## id subclass weights A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance
## 1 5 1 1.0000 1 0.35631 -1.8121 0.81349 -0.6719 1 -0.8297 1.7297 -0.6439 -0.02648 0.6806 0 182.73 -0.2700
## 2 740 1 0.3333 0 0.02725 -0.4570 -0.32506 1.1203 1 0.3795 0.6468 -0.9992 0.01917 -0.9042 1 215.25 -0.2736
## 3 939 1 0.3333 0 0.92888 -0.2562 0.39077 0.3305 1 0.2654 1.2038 0.1720 1.21006 7.8923 0 83.85 -0.2736
## 4 860 1 0.3333 0 -0.88047 0.9117 -0.38871 1.0696 0 1.7953 0.2470 -1.6115 1.49297 5.3808 0 236.01 -0.2752
## 5 10 2 1.0000 1 -0.56513 -0.1053 -0.13685 1.6233 1 -0.5304 -0.3342 0.4184 0.46308 1.0789 1 113.70 -1.6073
## 6 399 2 0.3333 0 0.80396 -0.8760 0.03938 -0.5895 1 0.2819 -0.6540 -0.5423 1.16852 4.4816 1 117.28 -1.6065
```

The `get_matches()`

output provides some additional information about the match. We can count how many times control units are reused and how many units are in each match strata (not all will have 3 control units due to the caliper).

```
#Number of time control units are rematched
table(table(gm$id[gm$A == 0]))
```

```
##
## 1 2 3 4 5 6 7 8 9 10 13 14
## 332 138 62 29 19 9 14 4 2 2 1 1
```

Here we can see that 332 control units were only used in one pair each, and one control unit was paired with 14 treated units (i.e., heavily reused).

```
#Number of control units in each match stratum
table(table(gm$subclass[gm$A == 0]))
```

```
##
## 1 2 3
## 9 9 409
```

Here we can see that 409 treated units have three matches, nine have two matches, and nine only have one match. The caliper did not end up restricting too many matches.

**Without covariate adjustment.** For continuous outcomes, we can regress the outcome on the treatment and include the weights in the estimation. We do this regardless of whether we are using the `match.data()`

output or the `get_matches()`

output (if we were doing 1:1 matching, the weights would not be necessary when using the `get_matches()`

output, but they don’t change the results if included).

If we don’t mind ignoring pair membership, we can use the `match.data()`

output to estimate the effect and standard errors. Here we use `vcovHC()`

to estimate regular robust standard errors that ignoring pairing. Hill and Reiter (2006) found these standard errors to be conservative for continuous outcomes.

```
#match.data() output
<- lm(Y_C ~ A, data = md, weights = weights)
fit1md
coeftest(fit1md, vcov. = vcovHC)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.798 0.420 4.28 2.0e-05 ***
## A 2.065 0.514 4.02 6.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

If we want to incorporate pair membership into the standard error estimation, we have to use the `get_matches()`

output. In addition to supplying pair membership (`subclass`

) to the standard error estimator, we also supply unit ID (`id`

) to account for the fact that several rows may refer to the same control unit.

```
#get_matches() output
<- lm(Y_C ~ A, data = gm, weights = weights)
fit1gm
coeftest(fit1gm, vcov. = vcovCL, cluster = ~subclass + id)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.798 0.416 4.32 1.7e-05 ***
## A 2.065 0.510 4.05 5.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Note that the effect estimates are identical; only the standard errors and p-values differ between the approaches^{4}.

**Using the bootstrap.** We can also use bootstrapping to estimate standard errors and confidence intervals. Although Abadie and Imbens (2008) demonstrated analytically that bootstrap standard errors may be invalid for matching with replacement, simulation work by Hill and Reiter (2006) and Bodory et al. (2020) has found that bootstrap standard errors are adequate and generally slightly conservative. Given this disagreement, it may be worth proceeding with caution when using bootstrapping to estimate standard errors after matching with replacement. Simulation experiments with matching with replacement have only considered the full bootstrap and found it to yield confidence intervals with approximately nominal coverage, so we recommend this approach over the block bootstrap for now and demonstrate it below.

```
#Full bootstrap confidence interval
# library(boot)
<- function(data, i) {
est_fun #Matching function
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNNr_boot + X7 + X8 + X9, data = data[i,],
X6 link = "linear.logit", caliper = .1,
ratio = 3, replace = TRUE)
<- match.data(mNNr_boot)
md_boot
#Effect estimation
<- lm(Y_C ~ A, data = md_boot, weights = weights)
fit_boot
#Return the coefficient on treatment
return(coef(fit_boot)["A"])
}
<- boot(d, est_fun, R = 499)
boot_est boot_est
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = d, statistic = est_fun, R = 499)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.065 -0.1287 0.4398
```

`boot.ci(boot_est, type = "perc")`

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 499 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_est, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 1.049, 2.796 )
## Calculations and Intervals on Original Scale
```

**With covariate adjustment.** We can include covariates in the outcome model just as we could when matching without replacement. Hill and Reiter (2006) found that standard errors that fail to account for pair membership can have lower the nominal confidence interval coverage, so we recommend using the `get_matches()`

output to address both multiplicity and clustering.

```
<- lm(Y_C ~ A + X1 + X2 + X3 + X4 + X5 +
fit2md + X7 + X8 + X9, data = gm,
X6 weights = weights)
coeftest(fit1gm, vcov. = vcovCL, cluster = ~subclass + id)["A",,drop = FALSE]
```

```
## Estimate Std. Error t value Pr(>|t|)
## A 2.065 0.51 4.048 5.392e-05
```

Remember that the coefficients and tests on the predictors other than the treatment should not be interpreted because they may be subject to confounding even if the treatment is not, so we omit them here.

The primary difference between dealing with binary and continuous outcomes is the noncollapsibility of the effect measures for binary outcomes, meaning that including covariates in the outcome model is less straightforward because the coefficient on treatment does not correspond to the marginal treatment effect. Similar to continuous outcomes, when estimating the treatment effect, we can use either the output of `match.data()`

or the output of `get_matches()`

, only the latter of which allows us to account both for multiplicity in the control units and for pair membership. Below we’ll demonstrate estimating the marginal OR accounting for pair membership using the `get_matches()`

output. Then we will demonstrate using the bootstrap to estimate standard errors that include covariates in the model. Note that there has been little research on the use of matching with replacement with binary outcomes, so one should proceed with caution and consider using a better-studied matching method.

**Without covariate adjustment.** We include the weights in the call to `glm()`

and we include both `subclass`

and `id`

as clustering variables in computing the cluster-robust standard errors, just as we would with a continuous outcome. We can use `family = quasibinomial`

instead of `family = binomial`

to avoid a warning due to the use of weights; the estimates will be the same either way. To estimate the marginal RR or RD, `"logit"`

would be replaced with `"log"`

or `"identity"`

, respectively.

```
<- glm(Y_B ~ A, data = gm, weights = weights,
fit3gm family = quasibinomial(link = "logit"))
coeftest(fit3gm, vcov. = vcovCL, cluster = ~ subclass + id)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.766 0.139 -5.50 3.9e-08 ***
## A 0.639 0.169 3.78 0.00016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit3gm)) #OR`

```
## (Intercept) A
## 0.4648 1.8954
```

**With covariate adjustment and bootstrapping.** To include covariates, we can use the bootstrap as we did before when matching without replacement. It doesn’t matter whether the `match.data()`

or `get_matches()`

output is used because, as we saw before, both yield the same effect estimate in each bootstrap replication. Here we use the full bootstrap rather than the block bootstrap because the performance of the block bootstrap after matching with replacement has not been evaluated.

```
#Bootstrap confidence intervals
# library(boot)
<- function(data, i) {
est_fun #Matching function
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mNNr_boot + X7 + X8 + X9, data = data[i,],
X6 link = "linear.logit", caliper = .1,
ratio = 3, replace = TRUE)
<- match.data(mNNr_boot)
md_boot
#Fitting the model
<- glm(Y_B ~ A + X1 + X2 + X3 + X4 + X5 +
fit_boot + X7 + X8 + X9, data = md_boot,
X6 family = quasibinomial(link = "logit"),
weights = weights)
#Estimate potential outcomes for each unit
$A <- 0
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P0 w = md_boot$weights)
<- P0 / (1 - P0)
Odds0
$A <- 1
md_boot<- weighted.mean(predict(fit_boot, md_boot, type = "response"),
P1 w = md_boot$weights)
<- P1 / (1 - P1)
Odds1
#Return marginal odds ratio
return(Odds1 / Odds0)
}
<- boot(d, est_fun, R = 4999)
boot_est
boot_estboot.ci(boot_est, type = "bca")
```

As before, to use bootstrapping for the RR or RD, the part of the code that computes the marginal OR can be replaced with code to compute the marginal RR (`P1 / P0`

) or marginal RD (`P1 - P0`

). The outcome model can additionally include treatment-covariate interactions if desired.

Standard error estimation for the marginal HR after matching with replacement is not a well-studied area, with Austin and Cafri (2020) providing the sole examination into appropriate methods for doing so. With survival outcomes, other matching methods may be more appropriate until matching with replacement is better understood. Here we provide an example that implements the recommendations by Austin and Cafri (2020). Any other methods (e.g., bootstrap) should be used with caution until they have been formally evaluated.

**Without covariate adjustment.** According to the results of Austin and Cafri’s (2020) simulation studies, when prevalence of the treatment is low (<30%), a standard error that does not involve pair membership is sufficient. When treatment prevalence is higher, the standard error that ignores pair membership may be too low, and the authors recommend a custom standard error estimator that uses information about both multiplicity and pairing.

For the continuous and binary outcomes, accounting for both multiplicity and pair membership is fairly straightforward thanks to the ability of the `sandwich`

package functions to include multiple sources of clustering. Unfortunately, this must be done manually for survival models. We perform this analysis below, adapting code from the appendix of Austin and Cafri (2020) to the `get_matches()`

output.

```
#Austin & Cafri's (2020) SE estimator
<- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE,
fs weights = weights, cluster = subclass)
<- fs$var
Vs <- nlevels(gm$subclass)
ks
<- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE,
fi weights = weights, cluster = id)
<- fi$var
Vi <- length(unique(gm$id))
ki
<- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE,
fc weights = weights)
<- fc$var
Vc <- nrow(gm)
kc
#Compute the variance
<- (ks/(ks-1))*Vs + (ki/(ki-1))*Vi - (kc/(kc-1))*Vc
V
#Sneak it back into the fit object
$var <- V
fc
fc
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = gm, weights = weights,
## robust = TRUE)
##
## coef exp(coef) se(coef) robust se z p
## A 0.44 1.55 0.07 0.07 6 1e-09
##
## Likelihood ratio test=40 on 1 df, p=2e-10
## n= 1681, number of events= 1681
```

The `robust se`

column contains the computed standard error, and the reported Z-test uses this standard error. The `se(coef)`

column should be ignored.

Full matching presents fairly straightforward methods of effect and standard error estimation. The most common and recommended way to estimate effects after full matching is to use the computed matching weights to estimate weighted effects. These matching weights function essentially like inverse probability weights and can be treated as such in all analyses, including with respect to standard error estimation. For empirical comparisons between full matching and propensity score weighting, see Austin and Stuart (2015b, 2017, 2015a).

Standard error estimation involves using a cluster-robust standard error as recommended by Austin and Stuart (2017). Including covariates can improve precision and add robustness when valid. Note that the methods described here also work for other estimators that rely on matching weights, including matching without replacement. The main difference is that full matching can be used to estimate the ATE, which we demonstrate below.

Below, we perform optimal full propensity score matching. Here we use full matching to estimate the ATE, but the procedure is nearly identical when estimating the ATT, and we point out any differences.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mF + X7 + X8 + X9, data = d,
X6 method = "full", estimand = "ATE")
mF
```

```
## A matchit object
## - method: Optimal full matching
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 2000 (matched)
## - target estimand: ATE
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
<- match.data(mF)
md
head(md)
```

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights subclass
## 1 0 0.1725 -1.4283 -0.4103 -2.36059 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 0.08461 0.8166 58
## 2 0 -1.0959 0.8463 0.2456 -0.12333 1 -2.2687 -1.4491 -0.5514 -0.31439 0.15619 0 330.63 0.01855 0.7899 132
## 3 0 0.1768 0.7905 -0.8436 0.82366 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 0.22210 1.0393 97
## 4 0 -0.4595 0.1726 1.9542 -0.62661 1 -0.4019 -0.8294 -0.5384 0.20729 -2.35184 0 91.06 0.04180 0.8445 136
## 5 1 0.3563 -1.8121 0.8135 -0.67189 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 0.43291 0.3307 204
## 6 0 -2.4313 -1.7984 -1.2940 0.04609 1 -1.2419 -1.1252 -1.8659 -0.56513 -5.62260 0 2563.73 0.04998 0.7990 168
```

A benefit of full matching is that no units are discarded, which has the potential to improve precision and prevent bias due to incomplete matching. However, the “effective” sample size implied by the matching weights is lower than the actual remaining sample size, so one should not always expect full matching to yield more precise estimates than other forms of matching.

**Without covariate adjustment.** Estimating effects and standard errors for continuous outcomes after full matching involves including the matching weights in the outcome model and using a cluster-robust standard error.

```
<- lm(Y_C ~ A, data = md, weights = weights)
fit1
coeftest(fit1, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.436 0.250 5.75 1e-08 ***
## A 1.892 0.519 3.65 0.00027 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Using the bootstrap.** Computing bootstrap standard errors and confidence intervals after full matching is identical to doing so after pair matching without replacement, so we do not demonstrate it again here. Covariates can be included in the outcome model used in the bootstrap replications. See below for a note on centering the covariates when including treatment-covariate interactions, as this must be done when using the bootstrap as well.

**With covariate adjustment.** As previously mentioned, it is generally acceptable to include just the main effects, but including interactions between the treatment and covariates can be beneficial when effect modification by the covariates may be present. Because we have not demonstrated this strategy so far, we demonstrate it below.

In order to interpret the coefficient on treatment as a marginal effect estimate, we need to center the covariates at their means in the target population (i.e., the original sample for the ATE, the treated units for the ATT, or the retained units for an ATM); we could also use a marginal effects procedure as has been demonstrated with binary outcomes for other matching methods. Below we use the strategy of centering the covariates at their means. Note that when factor predictors are present, they need to be split into dummy (0/1) variables prior to centering. The `splitfactor()`

function in `cobalt`

can make this straightforward. Although this procedure is more involved compared to simply including main effects, it can provide extra precision and robustness.

```
#Estimating a covariate-adjusted marginal effect
#with treatment-covariate interactions
#Create a new dataset for centered variables
<- md
md_cen
<- c("X1", "X2", "X3", "X4", "X5",
covs_to_center "X6", "X7", "X8", "X9")
<- scale(md_cen[covs_to_center],
md_cen[covs_to_center] scale = FALSE)
#Fit the model with every covariate interacting with treatment
<- lm(Y_C ~ A * (X1 + X2 + X3 + X4 + X5 +
fit2 + X7 + X8 + X9),
X6 data = md_cen, weights = weights)
#Only output the intercept and coefficient on treatment
coeftest(fit2, vcov. = vcovCL, cluster = ~subclass)[1:2,]
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.400 0.1645 8.511 3.349e-17
## A 1.979 0.4169 4.748 2.207e-06
```

Remember not to interpret the coefficients on the covariates or the treatment-covariate interactions, as they are likely confounded. To keep the output clean, above we restricted the output to just the intercept and coefficient on treatment. Another benefit of centering the covariates is that we can interpret the intercept as an estimate of the average potential outcome under control.

Note that the above strategy can be applied to all matching methods when analyzing continuous outcomes (but not binary or survival outcomes, which require bootstrapping to validly estimate standard errors with covariate adjustment). It is critical to center the covariates at their means in the target group, which may require some additional programming for estimands other than the ATE.

Using full matching with binary outcomes was described by Austin and Stuart (2017). In general, the procedures look similar to how they do with other matching methods.

**Without covariate adjustment.** We can use a weighted generalized linear model regressing the outcome on the treatment with a link function appropriate to the effect measure of interest. Below we demonstrate estimating the marginal OR after full matching in a model without covariates:

```
<- glm(Y_B ~ A, data = md, weights = weights,
fit3 family = quasibinomial(link = "logit"))
coeftest(fit3, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9146 0.0826 -11.07 <2e-16 ***
## A 0.5545 0.1722 3.22 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit3)) #OR`

```
## (Intercept) A
## 0.4007 1.7410
```

As with matching with replacement, we include weights in the call to `glm()`

and set `family = quasibinomial()`

to prevent a warning that occurs when using weights with binomial regression models (though the results do not differ). Setting `link = "logit"`

provides the marginal OR; for the marginal RR, we would replace `"logit"`

with `"log"`

, and for the marginal RD, we would replace `"logit"`

with `"identity"`

and not exponentiate the coefficient.

**With covariate adjustment and bootstrapping.** To include covariates in the model, a marginal effects procedure must be used with bootstrapping to recover the marginal effect because the coefficient on treatment in such a model corresponds to a conditional effect. A bootstrap must be used to estimate the standard error and confidence interval of the marginal effect. Either the full bootstrap or block bootstrap can be used, though the performance of the latter has not been formally evaluated (but because it is an approximation to cluster-robust standard errors, which are valid, it is likely valid). The code for the full bootstrap with full matching is identical to the code for bootstrapping with binary outcomes for pair matching with replacement (except that the call to `matchit()`

in the `est_fun`

function must be adjusted to perform full matching), and the code for the block bootstrap with full matching is identical to the code for bootstrapping with binary outcomes for pair matching without replacement, so we do not repeat it here.

Austin and Stuart (2015b) describe the use of the full matching with survival outcomes.

**Without covariate adjustment.** To estimate the marginal HR, we can regress the outcome on the treatment in a Cox regression model weighted by the matching weights and including subclasses as a cluster. Below we demonstrate how to estimate the marginal HR and its standard error after full matching.

```
coxph(Surv(Y_S) ~ A, data = md, robust = TRUE,
weights = weights, cluster = subclass)
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = md, weights = weights,
## robust = TRUE, cluster = subclass)
##
## coef exp(coef) se(coef) robust se z p
## A 0.42 1.52 0.05 0.09 5 2e-06
##
## Likelihood ratio test=54 on 1 df, p=2e-13
## n= 2000, number of events= 2000
```

To perform the log-rank test or compute survival curves after full matching, functions designed for performing these tasks with inverse probability weights can be used with the matching weights; the `RISCA`

package offers functionality for this purpose.

Including covariates in the model is less straightforward because the resulting HR estimate is conditional rather than marginal.

Stratum matching includes exact matching, coarsened exact matching, and propensity score subclassification. There are two natural ways to estimate marginal effects after stratum matching: the first is to estimate stratum-specific treatment effects and pool them, and the second is to use the stratum weights to estimate a single marginal effect. This latter approach is also known as marginal mean weighting through stratification (MMWS), and is described in detail by Hong (2010)^{5}. When done properly, both methods should yield similar or identical estimates of the treatment effect. MMWS is generally preferable because it is far simpler to implement and avoids issues of noncollapsibility with non-continuous outcomes. All of the methods described above for use with full matching also work with MMWS because the formation of the weights is the same; the only difference is that it is not appropriate to use cluster-robust standard errors with MMWS because of how few clusters are present.

Unless exact matching is used, estimating stratum-specific treatment effects can be fraught because balance may not be achieved within strata even if balance is achieved across strata. Stratum-specific effects should be interpreted with caution. Stratum-specific effects are conditional effects, but conditional on stratum membership, which may not always be a useful conditioning variable.

For each outcome type, we focus on estimating marginal effects using MMWS and using the strata directly. Below, we perform propensity score subclassification for the ATT using 8 subclasses.

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mS + X7 + X8 + X9, data = d,
X6 method = "subclass", estimand = "ATT",
subclass = 8)
mS
```

```
## A matchit object
## - method: Subclassification (8 subclasses)
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 2000 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9
```

```
<- match.data(mS)
md
head(md)
```

```
## A X1 X2 X3 X4 X5 X6 X7 X8 X9 Y_C Y_B Y_S distance weights subclass
## 1 0 0.1725 -1.4283 -0.4103 -2.36059 1 -1.1199 0.6398 -0.4840 -0.59385 0.07104 0 278.46 0.08461 0.2458 1
## 2 0 -1.0959 0.8463 0.2456 -0.12333 1 -2.2687 -1.4491 -0.5514 -0.31439 0.15619 0 330.63 0.01855 0.2458 1
## 3 0 0.1768 0.7905 -0.8436 0.82366 1 -0.2221 0.2971 -0.6966 -0.69516 -0.85180 1 369.94 0.22210 1.0742 3
## 4 0 -0.4595 0.1726 1.9542 -0.62661 1 -0.4019 -0.8294 -0.5384 0.20729 -2.35184 0 91.06 0.04180 0.2458 1
## 5 1 0.3563 -1.8121 0.8135 -0.67189 1 -0.8297 1.7297 -0.6439 -0.02648 0.68058 0 182.73 0.43291 1.0000 5
## 6 0 -2.4313 -1.7984 -1.2940 0.04609 1 -1.2419 -1.1252 -1.8659 -0.56513 -5.62260 0 2563.73 0.04998 0.2458 1
```

For continuous outcomes, we can use either MMWS or compute the weighted average of within-subclass effects. First we illustrate weighting.

**Without covariate adjustment.** With weighting, we can supply the weights that are in the `match.data()`

output to a call to `lm()`

to perform weighted least squares regression, as we did with full matching. We need a robust standard error estimator to account for the weights. Note that the subclasses don’t even need to enter this analysis; they are fully incorporated through the MMWS weights^{6}. We use `vcovHC()`

to estimate the regular robust standard error instead of the cluster-robust standard error used with other methods.

```
<- lm(Y_C ~ A, data = md, weights = weights)
fit1
coeftest(fit1, vcov. = vcovHC)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.964 0.287 6.85 9.8e-12 ***
## A 1.930 0.407 4.74 2.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We can also fit a model within each subclass to estimate the within-stratum treatment effects and then compute a weighted average of them to be used as the marginal effect. The stratum weights in the weighted average must be equal to the proportion of treated units in each subclass if the ATT is targeted. If the ATE is targeted, the weights must be equal to the proportion of all units in each subclass. There are other ways to construct weight to minimize variance at the expense of losing the original target population (Rudolph et al. 2016).

Instead of fitting separate models for each subclass, we can fit a single model that fully interacts the treatment with subclass membership and then perform a linear hypothesis test. To do so, we use the form `Y ~ S + S:A - 1`

in the call to `lm()`

. This includes main effects for subclass and treatment interaction terms for each subclass and omits an intercept. The fit of this model is equivalent to that of a traditional full factorial model, so no information is lost using this parameterization and using it makes it easier to construct the stratum weights. This estimates the subclass-specific intercept and subclass-specific treatment effect in each subclass. We would use a robust standard error to account for different residual variance across subclasses.

```
<- lm(Y_C ~ subclass + subclass:A - 1, data = md)
fit2
#Within-subclass effects
# coeftest(fit2, vcov. = vcovHC)
```

The within-subclass effects should only be trusted if balance is achieved in each subclass. In this example, balance has not been achieved within some subclasses, so we would not interpret these effects. Next we construct the weights to form the weighted average of the subclass effects. The weights take the form of a linear contrast matrix with zeroes for the subclass-specific intercepts and the subclass proportions for the corresponding subclass-specific treatment effect coefficients.

```
#Subclass weights for ATT
<- with(md, c(rep(0, nlevels(subclass)),
sub_w table(subclass[A==1])/sum(A==1)))
#Subclass weights for ATE (requires estimand = "ATE" in matchit())
# sub_w <- with(md, c(rep(0, nlevels(subclass)),
# table(subclass)/nrow(md)))
#Marginal effect
<- weighted.mean(coef(fit2), sub_w)) (est
```

`## [1] 1.93`

```
#SE of marginal effect
<- sqrt(drop(sub_w %*% vcovHC(fit2) %*% sub_w))) (se
```

`## [1] 0.4036`

```
#CI
c(ci_low = est - 1.96*se, ci_hi = est + 1.96*se)
```

```
## ci_low ci_hi
## 1.139 2.721
```

The following lines would have produced the same output but require the `margins`

package:

```
#Using margins() from margins
summary(margins::margins(fit2, variables = "A",
data = md[md$A == 1,],
vcov = vcovHC(fit2)))
#For ATE, omit the second line.
```

**With covariate adjustment.** To include covariates in the model when using MMWS, we can modify the code used for weighting after full matching, the only difference being that regular robust standard errors should be used with MMWS. As before, treatment-covariate interactions are optional but can reduce bias and improve precision when there effect modification by the covariates. When including these interactions in the outcome model, it is important to center the covariates at their means in the target population (i.e., the full sample for the ATE and the treated units for the ATT) in order to interpret the coefficient on treatment as a marginal effect.

To include covariates in the model when combining subclass-specific effect estimates, it can be challenging to correctly parameterize the model so that the linear contrast matrix method works as expected. The simplest way would be to include covariates as desired, which include as main effects, interactions with treatment, interactions with subclass, or all three, and use the `margins()`

code above, which should automatically provide the correct output.

Using stratification with binary outcomes is slightly more complicated than it is with continuous outcomes. This is because the OR and RR are not collapsible, so the marginal OR and RR cannot be computed as the weighted average of the stratum-specific effects. Instead, one must compute the average of the predicted stratum-specific risks under each treatment and then compute the marginal effect estimate from these marginal risks. Although stratum-specific conditional ORs are valid effect measures, they generally do not correspond to meaningful subpopulation effects unless the strata themselves are meaningful subpopulations. After exact matching or coarsened exact matching, strata may be meaningful because they correspond to specific combinations of covariates that may come close to designating specific patient attributes, but after propensity score subclassification, the strata correspond to propensity score bins, which are generally not meaningful. Although some researchers have interpreted stratum-specific effects after propensity score subclassification as representing effects at various risks or propensities for treatment, because the primary purpose of the propensity score is as a balancing score and not as an accurate estimate of propensity for treatment, such an interpretation should be regarded with caution.

Austin (2007) compared several methods of propensity score adjustment for estimating marginal ORs, including two methods based on propensity score stratification, one of which involved the stratified Mantel-Haenszel estimator, and the other of which involved averaging stratum-specific effects. Both of these estimate a common conditional OR, not a marginal OR (Forbes and Shortreed 2008; Stampf et al. 2010), and both yielded positively biased effect estimates for non-null treatment effects, a common pattern when using conditional effect estimates as estimates of marginal effects. Given the difficulties in estimating marginal ORs after stratification, the most straightforward way to do so is to use MMWS. We do, however, also demonstrate estimating the marginal odds ratio and its standard error using bootstrapping in a way that can incorporate covariate adjustment and allow for differential effect modification across strata.

**Without covariate adjustment.** As before, we can supply the stratification weights to a weighted generalized linear model with just the treatment as the sole predictor, and the coefficient on treatment will correspond to a marginal treatment effect. We use `vcovHC()`

to estimate the regular robust standard error.

```
<- glm(Y_B ~ A, data = md, weights = weights,
fit3 family = quasibinomial(link = "logit"))
coeftest(fit3, vcov. = vcovHC)
```

```
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.831 0.108 -7.71 1.3e-14 ***
## A 0.726 0.144 5.04 4.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`exp(coef(fit3))`

```
## (Intercept) A
## 0.4357 2.0675
```

As with other matching methods, we include weights in the call to `glm()`

and set `family = quasibinomial()`

to prevent a warning that occurs when using weights with binomial regression models (though the results do not differ). Setting `link = "logit"`

provides the marginal OR; for the marginal RR, we would replace `"logit"`

with `"log"`

, and for the marginal RD, we would replace `"logit"`

with `"identity"`

and not exponentiate the coefficient.

**With covariate adjustment and bootstrapping.** We can use bootstrapping to estimate the marginal OR and its standard error by estimating the average of the stratum-specific risks under each treatment level, computing the marginal risks under each treatment, and computing marginal effects from the marginal risks. This also makes it fairly straightforward to include covariates in the model. Below we illustrate bootstrapping for the marginal OR with covariates included in the outcome model.

```
#Bootstrap confidence intervals
library(boot)
<- function(data, i) {
est_fun #Subclassification function
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mS_boot + X7 + X8 + X9, data = data[i,],
X6 method = "subclass", estimand = "ATT",
subclass = 8)
<- match.data(mS_boot)
md_boot
#Fitting the model
<- glm(Y_B ~ A * (subclass + X1 + X2 + X3 + X4 + X5 +
fit_boot + X7 + X8 + X9), data = md_boot,
X6 family = quasibinomial(link = "logit"))
#Estimate potential outcomes for each unit
## Subset to just the treated for the ATT; remove this for the ATE
<- md_boot[md_boot$A == 1,]
md_boot ##
$A <- 0
md_boot<- mean(predict(fit_boot, md_boot, type = "response"))
P0 <- P0 / (1 - P0)
Odds0
$A <- 1
md_boot<- mean(predict(fit_boot, md_boot, type = "response"))
P1 <- P1 / (1 - P1)
Odds1
#Return marginal odds ratio
return(Odds1 / Odds0)
}
<- boot(d, est_fun, R = 4999)
boot_est
boot_estboot.ci(boot_est, type = "bca")
```

In this example, we included interactions between treatment and subclass and between treatment and each covariate. Note that because we are interested in the ATT, we restricted the sample used to compute the predicted marginal risks (`P0`

) and (`P1`

) to just those with `A = 1`

. If we were instead estimating the ATE, we would supply `"ATE"`

that to the `estimand`

argument in the call to `matchit()`

and skip the step of restricting the data used for prediction of the marginal risks.

As with other methods, to estimate the marginal RR or RD using the above code, the returned object can instead be specified as `P1 / P0`

or `P1 - P0`

, respectively.

Like ORs, HRs are not collapsible, so it is not straightforward to estimate marginal HRs using within-stratum HRs. Austin (2013a) examined the performance of several propensity score subclassification-based estimators of the marginal HR and found all to be positively biased for non-null effects, consistent with the use of conditional effect estimates as estimates of marginal effects; indeed, the subclassification methods examined all relied on pooling stratum-specific effects. Given these difficulties, the most straightforward method to estimate marginal HRs is to use MMWS weights. We demonstrate this below using essentially the same syntax as used with full matching, only omitting subclass membership as a clustering variable.

```
coxph(Surv(Y_S) ~ A, data = md, robust = TRUE,
weights = weights)
```

```
## Call:
## coxph(formula = Surv(Y_S) ~ A, data = md, weights = weights,
## robust = TRUE)
##
## coef exp(coef) se(coef) robust se z p
## A 0.46 1.58 0.05 0.07 7 2e-11
##
## Likelihood ratio test=64 on 1 df, p=1e-15
## n= 2000, number of events= 2000
```

The robust standard error must be used because of the MMWS weights.

As discussed previously, with binary and time-to-event outcomes, marginal and conditional effects generally differ. Though we have primarily focused on marginal effects, we offer some guidance here on estimating conditional effect as well. Estimating conditional effects typically involves modeling the outcome, and inferences are dependent on this model being correct, even in the presence is perfect covariate balance. Matching provides robustness to some forms of misspecification, however, by limiting the range of the covariate space. So, even though outcome modeling is required, matching prior to modeling the outcome can be beneficial in terms of reducing bias due to model misspecification.

On some effect measures, conditional effects typically vary depending on the covariate values at which the conditional effect is to be estimated; for example, the decrease in the risk of death (i.e., on the RD scale) corresponding to receipt of a treatment for a patient with a high baseline risk of death will by higher than that for a patient with a low baseline risk of death. In this sense, there is necessarily effect modification of the RD by baseline risk (which depends on a patient’s covariate values). Similarly, for an exposure that increases the risk by a factor (e.g., doubles the risk, for a RR of 2), a patient with a high baseline risk cannot experience the same change in risk as a patient with a low baseline risk could; for example, a RR of 2 is impossible for a patient with a baseline risk of .6 but is certainly plausible for a patient with a baseline risk of .02. In this sense, there is necessarily effect modification of the RR by baseline risk. This is not true for the OR; it is plausible that the effect on the OR scale could be consistent across levels of levels of the predictors because any OR is compatible with possible baselines risks. For this reason, we only consider estimating the conditional OR and not other effect measures.

Although several methods exist for estimating conditional ORs after matching or subclassification, the one that generally works well is to run a logistic regression of the outcome on the treatment and covariates and use the coefficient on treatment as an estimate of the effect on the log OR scale. We demonstrate this below using full matching for the ATT:

```
<- matchit(A ~ X1 + X2 + X3 + X4 + X5 +
mF + X7 + X8 + X9, data = d,
X6 method = "full", estimand = "ATT")
<- match.data(mF)
md
<- lm(Y_B ~ A + X1 + X2 + X3 + X4 + X5 +
fitcond + X7 + X8 + X9, data = md,
X6 weights = weights)
coeftest(fitcond, vcov. = vcovCL, cluster = ~subclass)["A",,drop = FALSE]
```

```
## Estimate Std. Error t value Pr(>|t|)
## A 0.1219 0.04171 2.922 0.003516
```

`exp(coef(fitcond)["A"])`

```
## A
## 1.13
```

As with other covariate-adjusted models, it is important not to interpret the coefficients on covariates other than treatment to avoid committing the table 2 fallacy. Other causes of the outcome can be included in the outcome model even if they were not the focus of matching as long as they are not (even possibly) caused by the treatment.

Alternative methods of estimating conditional effects include conditional logistic regression after matching and estimating stratum-specific effects after subclassification. We recommend using covariate-adjusted logistic regression models instead because the conditional effect is better defined (i.e., conditional on the specific covariates included in the model) and depends less on the estimand targeted (Forbes and Shortreed 2008).

Moderation analysis involves determining whether a treatment effect differs across levels of another variable. The use of matching with moderation analysis is described in Stuart and Green (2014). The goal is to achieve balance within each subgroup of the potential moderating variable, and there are several ways of doing so. Broadly, one can either perform matching in the full dataset, perhaps requiring exact matching on the moderator, or one can perform completely separate analyses in each subgroup. We’ll demonstrate both approaches below. The chosen approach should be that which achieves the best balance, though we don’t demonstrate assessing balance here to maintain focus on effect estimation. We’ll consider the binary variable `X5`

to be the potential moderator of the effect of `A`

on `Y_C`

.

The first approach involves pooling information across subgroups. This could involve estimating propensity scores using a single model for both groups but exact matching on the potential moderator. The propensity score model could include moderator-by-covariate interactions to allow the propensity score model to vary across subgroups on some covariates. Below, we’ll estimate a propensity score using a single propensity score model with a few moderator-by-covariate interactions and exact matching on the moderator, `X5`

. We’ll perform nearest neighbor matching on the propensity score.

```
<- matchit(A ~ X1 + X2 + X5*X3 + X4 +
mP *X6 + X7 + X5*X8 + X9, data = d,
X5exact = ~X5, method = "nearest")
mP
```

```
## A matchit object
## - method: 1:1 nearest neighbor matching without replacement
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 2000 (original), 882 (matched)
## - target estimand: ATT
## - covariates: X1, X2, X5, X3, X4, X6, X7, X8, X9
```

Although it is straightforward to assess balance overall using `summary()`

, it is more challenging to assess balance within subgroups. The easiest way to check subgroup balance would be to use `cobalt::bal.tab()`

, which has a `cluster`

argument that can be used to assess balance within subgroups, e.g., by `cobalt::bal.tab(mP, cluster = "X5")`

. See the vignette “Appendix 2: Using cobalt with Clustered, Multiply Imputed, and Other Segmented Data” on the `cobalt`

website for details.

If we are satisfied with balance, we can then estimate the subgroup effects using an outcome model with an interaction between the treatment and the moderator.

```
<- match.data(mP)
mdP
<- lm(Y_C ~ A * X5, data = mdP, weights = weights)
fitP
coeftest(fitP, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7500 0.4515 1.66 0.0970 .
## A 2.2073 0.6703 3.29 0.0010 **
## X5 1.7413 0.6070 2.87 0.0042 **
## A:X5 -0.0275 0.8791 -0.03 0.9751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

A second approach is to perform the matching analyses separately and then combine the matched samples. We demonstrate this below. First, we split the data by levels of `X5`

.

```
<- subset(d, X5 == 0)
d_X5_0 <- subset(d, X5 == 1) d_X5_1
```

Next we perform separate matching analyses in each new dataset,

```
<- matchit(A ~ X1 + X2 + X3 + X4 +
mS0 + X7 + X8 + X9, data = d_X5_0,
X6 method = "nearest")
<- matchit(A ~ X1 + X2 + X3 + X4 +
mS1 + X7 + X8 + X9, data = d_X5_1,
X6 method = "nearest")
```

It is straightforward to assess balance within each subgroup using `summary()`

, and `cobalt`

functions can be used as well. We omit this step here, but you should not!

To estimate the subgroup effects, we need to combine the matched datasets, which we can do using `rbind.matchdata()`

(a special `rbind()`

method for `matchdata`

objects). Then we estimate the moderation effect just as we did previously.

```
<- match.data(mS0)
mdS0 <- match.data(mS1)
mdS1
<- rbind(mdS0, mdS1)
mdS
<- lm(Y_C ~ A * X5, data = mdS, weights = weights)
fitS
coeftest(fitS, vcov. = vcovCL, cluster = ~subclass)
```

```
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.720 0.297 2.42 0.016 *
## A 2.237 0.453 4.94 8.4e-07 ***
## X5 2.160 0.414 5.22 2.0e-07 ***
## A:X5 -0.446 0.593 -0.75 0.452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

There are benefits to using either approach, and Stuart and Green (2014) find that either can be successful at balancing the subgroups. The first approach may be most effective with small samples, where separate propensity score models would be fit with greater uncertainty and an increased possibility of perfect prediction or failure to converge (Wang et al. 2018). The second approach may be more effective with larger samples or with matching methods that target balance in the matched sample, such as genetic matching (Kreif et al. 2012). With genetic matching, separate subgroup analyses ensure balance is optimized within each subgroup rather than just overall.

It is important to be as thorough and complete as possible when describing the methods of estimating the treatment effect and the results of the analysis. This improves transparency and replicability of the analysis. Results should at least include the following:

- a description of the outcome model used (e.g., logistic regression, a linear model with treatment-covariate interactions and mean-centered covariates, a Cox proportional hazards model with the matching weights applied)
- the way the effect was estimated (e.g., as the coefficient on treatment in the outcome model, as the result of a marginal effects procedure)
- the way standard errors and confidence intervals were estimated (e.g., using robust standard errors, using cluster-robust standard errors with pair membership as the cluster, using the BCa bootstrap with 4999 bootstrap replications and the entire process of matching and effect estimation included in each replication)
- R packages and functions used in estimating the effect and its standard error (e.g.,
`glm()`

in base R,`vcovCL()`

in`sandwich`

,`boot`

and`boot.ci`

in`boot`

) - The effect and its standard error and confidence interval

All this is in addition to information about the matching method, propensity score estimation procedure (if used), balance assessment, etc. mentioned in the other vignettes.

There are a few common mistakes that should be avoided. It is important not only to avoid these mistakes in one’s own research but also to be able to spot these mistakes in others’ analyses.

Several methods involve weights that are to be used in estimating the treatment effect. With full matching and stratification matching (when analyzed using MMWS), the weights do the entire work of balancing the covariates across the treatment groups. Omitting weights essentially ignores the entire purpose of matching. Some cases are less obvious. When performing matching with replacement and estimating the treatment effect using the `match.data()`

output, weights must be included to ensure control units matched to multiple treated units are weighted accordingly. Similarly, when performing k:1 matching where not all treated units receive k matches, weights are required to account for the differential weight of the matched control units. The only time weights can be omitted after pair matching is when performing 1:1 matching without replacement. Including weights even in this scenario will not affect the analysis and it can be good practice to always include weights to prevent this error from occurring. There are some scenarios where weights are not useful because the conditioning occurs through some other means, such as when using the pooling strategy rather than MMWS for estimating marginal effects after stratification.

Robust standard errors are required when using weights to estimate the treatment effect. The model-based standard errors resulting from weighted least squares or maximum likelihood are inaccurate when using matching weights because they assume weights are frequency weights rather than probability weights. Cluster-robust standard errors account for both the matching weights and pair membership and should be used when appropriate (i.e., with all matching methods other than stratification matching). Sometimes, researchers use functions in the `survey`

package to estimate robust standard errors, especially with inverse probability weighting; this is a valid way to compute robust standard errors and will give similar results to `sandwich::vcovHC()`

.

The distinction between marginal and conditional effects is not always clear both in methodological and applied papers. Some statistical methods are valid only for estimating conditional effects and they should not be used to estimate marginal effects (without further modification). Sometimes conditional effects are desirable, and such methods may be useful for them, but when marginal effects are the target of inference, it is critical not to inappropriately interpret estimates resulting from statistical methods aimed at estimating conditional effects as marginal effects. Although this issue is particularly salient with binary and survival outcomes due to the general noncollapsibility of the OR, RR, and HR, this can also occur with linear models for continuous outcomes or the RD.

The following methods estimate **conditional effects** for binary or survival outcomes (with noncollapsible effect measures) and should **not** be used to estimate marginal effects:

- Logistic regression or Cox proportional hazards model with covariates and/or the propensity score included, using the coefficient on treatment as the effect estimate
- Conditional logistic regression after matching
- Stratified Cox regression after matching
- Averaging stratum-specific effect estimates after stratification, including using Mantel-Haenszel OR pooling
- Including pair or stratum fixed or random effects in a logistic regression model, using the coefficient on treatment as the effect estimate

In addition, with continuous outcomes, conditional effects can be mistakenly interpreted as marginal effect estimates when treatment-covariate interactions are present in the outcome model. If the covariates are not centered at their mean in the target population (e.g., the treated group for the ATT, the full sample for the ATE, or the remaining matched sample for an ATM), the coefficient on treatment will not correspond to the marginal effect in the target population; it will correspond to the effect of treatment when the covariate values are equal to zero, which may not be meaningful or plausible. Marginal effects procedures (e.g., using the `margins`

package or manually with bootstrapping as demonstrated above) are always the safest way to include covariates in the outcome model, especially in the presence of treatment-covariate interactions. Appropriately centering the covariates is a shortcut that is required when using the coefficient on treatment as a marginal effect estimate for continuous outcomes (demonstrated previously for full matching).

Abadie, Alberto, and Guido W. Imbens. 2008. “On the Failure of the Bootstrap for Matching Estimators.” *Econometrica* 76 (6): 1537–57. https://www.jstor.org/stable/40056514.

Abadie, Alberto, and Jann Spiess. 2019. “Robust Post-Matching Inference,” January, 34.

Austin, Peter C. 2007. “The Performance of Different Propensity Score Methods for Estimating Marginal Odds Ratios.” *Statistics in Medicine* 26 (16): 3078–94. https://doi.org/10.1002/sim.2781.

———. 2009. “Type i Error Rates, Coverage of Confidence Intervals, and Variance Estimation in Propensity-Score Matched Analyses.” *The International Journal of Biostatistics* 5 (1). https://doi.org/10.2202/1557-4679.1146.

———. 2013a. “The Performance of Different Propensity Score Methods for Estimating Marginal Hazard Ratios.” *Statistics in Medicine* 32 (16): 2837–49. https://doi.org/10.1002/sim.5705.

———. 2013b. “The Use of Propensity Score Methods with Survival or Time-to-Event Outcomes: Reporting Measures of Effect Similar to Those Used in Randomized Experiments.” *Statistics in Medicine* 33 (7): 1242–58. https://doi.org/10.1002/sim.5984.

Austin, Peter C., and Guy Cafri. 2020. “Variance Estimation When Using Propensity-Score Matching with Replacement with Survival or Time-to-Event Outcomes.” *Statistics in Medicine* 39 (11): 1623–40. https://doi.org/10.1002/sim.8502.

Austin, Peter C., and Dylan S. Small. 2014. “The Use of Bootstrapping When Using Propensity-Score Matching Without Replacement: A Simulation Study.” *Statistics in Medicine* 33 (24): 4306–19. https://doi.org/10.1002/sim.6276.

Austin, Peter C., and Elizabeth A. Stuart. 2015a. “The Performance of Inverse Probability of Treatment Weighting and Full Matching on the Propensity Score in the Presence of Model Misspecification When Estimating the Effect of Treatment on Survival Outcomes.” *Statistical Methods in Medical Research* 26 (4): 1654–70. https://doi.org/10.1177/0962280215584401.

———. 2015b. “Optimal Full Matching for Survival Outcomes: A Method That Merits More Widespread Use.” *Statistics in Medicine* 34 (30): 3949–67. https://doi.org/10.1002/sim.6602.

———. 2017. “Estimating the Effect of Treatment on Binary Outcomes Using Full Matching on the Propensity Score.” *Statistical Methods in Medical Research* 26 (6): 2505–25. https://doi.org/10.1177/0962280215601134.

Austin, Peter C., Neal Thomas, and Donald B. Rubin. 2020. “Covariate-Adjusted Survival Analyses in Propensity-Score Matched Samples: Imputing Potential Time-to-Event Outcomes.” *Statistical Methods in Medical Research* 29 (3): 728–51. https://doi.org/10.1177/0962280218817926.

Bodory, Hugo, Lorenzo Camponovo, Martin Huber, and Michael Lechner. 2020. “The Finite Sample Performance of Inference Methods for Propensity Score Matching and Weighting Estimators.” *Journal of Business & Economic Statistics* 38 (1): 183–200. https://doi.org/10.1080/07350015.2018.1476247.

Cameron, A. Colin, and Douglas L. Miller. 2015. “A Practitioner’s Guide to Cluster-Robust Inference.” *Journal of Human Resources* 50 (2): 317–72. https://doi.org/10.3368/jhr.50.2.317.

Carpenter, James, and John Bithell. 2000. “Bootstrap Confidence Intervals: When, Which, What? A Practical Guide for Medical Statisticians.” *Statistics in Medicine* 19 (9): 1141–64. https://doi.org/10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F.

Desai, Rishi J., Kenneth J. Rothman, Brian T. Bateman, Sonia Hernandez-Diaz, and Krista F. Huybrechts. 2017. “A Propensity-Score-Based Fine Stratification Approach for Confounding Adjustment When Exposure Is Infrequent:” *Epidemiology* 28 (2): 249–57. https://doi.org/10.1097/EDE.0000000000000595.

Efron, Bradley, and Robert J. Tibshirani. 1993. *An Introduction to the Bootstrap*. Springer US.

Forbes, Andrew, and Susan Shortreed. 2008. “Inverse Probability Weighted Estimation of the Marginal Odds Ratio: Correspondence Regarding ‘The Performance of Different Propensity Score Methods for Estimating Marginal Odds Ratios’ by P. Austin, Statictics in Medicine, 2007; 26:30783094.” *Statistics in Medicine* 27 (26): 5556–59. https://doi.org/10.1002/sim.3362.

Gayat, Etienne, Matthieu Resche-Rigon, Jean-Yves Mary, and Raphaël Porcher. 2012. “Propensity Score Applied to Survival Data Analysis Through Proportional Hazards Models: A Monte Carlo Study.” *Pharmaceutical Statistics* 11 (3): 222–29. https://doi.org/10.1002/pst.537.

Green, Kerry M., and Elizabeth A. Stuart. 2014. “Examining Moderation Analyses in Propensity Score Methods: Application to Depression and Substance Use.” *Journal of Consulting and Clinical Psychology*, Advances in Data Analytic Methods, 82 (5): 773–83. https://doi.org/10.1037/a0036515.

Hill, Jennifer, and Jerome P. Reiter. 2006. “Interval Estimation for Treatment Effects Using Propensity Score Matching.” *Statistics in Medicine* 25 (13): 2230–56. https://doi.org/10.1002/sim.2277.

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2007. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” *Political Analysis* 15 (3): 199–236. https://doi.org/10.1093/pan/mpl013.

Hong, Guanglei. 2010. “Marginal Mean Weighting Through Stratification: Adjustment for Selection Bias in Multilevel Data.” *Journal of Educational and Behavioral Statistics* 35 (5): 499–531. https://doi.org/10.3102/1076998609359785.

King, Gary, and Margaret E. Roberts. 2015. “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It.” *Political Analysis* 23 (2): 159–79. https://doi.org/10.1093/pan/mpu015.

Kreif, Noemi, Richard Grieve, Rosalba Radice, Zia Sadique, Roland Ramsahai, and Jasjeet S. Sekhon. 2012. “Methods for Estimating Subgroup Effects in Cost-Effectiveness Analyses That Use Observational Data.” *Medical Decision Making* 32 (6): 750–63. https://doi.org/10.1177/0272989X12448929.

Liang, Kung-Yee, and Scott L. Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” *Biometrika* 73 (1): 13–22. https://doi.org/10.1093/biomet/73.1.13.

MacKinnon, James G. 2006. “Bootstrap Methods in Econometrics*.” *Economic Record* 82 (s1): S2–18. https://doi.org/10.1111/j.1475-4932.2006.00328.x.

MacKinnon, James G., and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” *Journal of Econometrics* 29 (3): 305–25. https://doi.org/10.1016/0304-4076(85)90158-7.

Nguyen, Tri-Long, Gary S. Collins, Jessica Spence, Jean-Pierre Daurès, P. J. Devereaux, Paul Landais, and Yannick Le Manach. 2017. “Double-Adjustment in Propensity Score Matching Analysis: Choosing a Threshold for Considering Residual Imbalance.” *BMC Medical Research Methodology* 17: 78. https://doi.org/10.1186/s12874-017-0338-0.

Rudolph, Kara E., K. Ellicott Colson, Elizabeth A. Stuart, and Jennifer Ahern. 2016. “Optimally Combining Propensity Score Subclasses.” *Statistics in Medicine* 35 (27): 4937–47. https://doi.org/10.1002/sim.7046.

Schafer, Joseph L., and Joseph Kang. 2008. “Average Causal Effects from Nonrandomized Studies: A Practical Guide and Simulated Example.” *Psychological Methods* 13 (4): 279–313. https://doi.org/10.1037/a0014268.

Snowden, Jonathan M., Sherri Rose, and Kathleen M. Mortimer. 2011. “Implementation of G-Computation on a Simulated Data Set: Demonstration of a Causal Inference Technique.” *American Journal of Epidemiology* 173 (7): 731–38. https://doi.org/10.1093/aje/kwq472.

Stampf, Susanne, Erika Graf, Claudia Schmoor, and Martin Schumacher. 2010. “Estimators and Confidence Intervals for the Marginal Odds Ratio Using Logistic Regression and Propensity Score Stratification.” *Statistics in Medicine* 29 (7-8): 760–69. https://doi.org/10.1002/sim.3811.

Wan, Fei. 2019. “Matched or Unmatched Analyses with Propensity-Scorematched Data?” *Statistics in Medicine* 38 (2): 289–300. https://doi.org/10.1002/sim.7976.

Wang, Shirley V., Yinzhu Jin, Bruce Fireman, Susan Gruber, Mengdong He, Richard Wyss, HoJin Shin, et al. 2018. “Relative Performance of Propensity Score Matching Strategies for Subgroup Analyses.” *American Journal of Epidemiology* 187 (8): 1799–1807. https://doi.org/10.1093/aje/kwy049.

Westreich, D., and S. Greenland. 2013. “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients.” *American Journal of Epidemiology* 177 (4): 292–98. https://doi.org/10.1093/aje/kws412.

```
#Generatng data similar to Austin (2009) for demonstrating treatment effect estimation
<- function(n) {
gen_X <- matrix(rnorm(9 * n), nrow = n, ncol = 9)
X 5] <- as.numeric(X[,5] < .5)
X[,
X
}
#~20% treated
<- function(X) {
gen_A <- - 1.2 + log(2)*X[,1] - log(1.5)*X[,2] + log(2)*X[,4] - log(2.4)*X[,5] + log(2)*X[,7] - log(1.5)*X[,8]
LP_A <- plogis(LP_A)
P_A rbinom(nrow(X), 1, P_A)
}
# Continuous outcome
<- function(A, X) {
gen_Y_C 2*A + 2*X[,1] + 2*X[,2] + 2*X[,3] + 1*X[,4] + 2*X[,5] + 1*X[,6] + rnorm(length(A), 0, 5)
}#Conditional:
# MD: 2
#Marginal:
# MD: 2
# Binary outcome
<- function(A, X) {
gen_Y_B <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6]
LP_B <- plogis(LP_B)
P_B rbinom(length(A), 1, P_B)
}#Conditional:
# OR: 2.4
# logOR: .875
#Marginal:
# RD: .144
# RR: 1.54
# logRR: .433
# OR: 1.92
# logOR .655
# Survival outcome
<- function(A, X) {
gen_Y_S <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6]
LP_S sqrt(-log(runif(length(A)))*2e4*exp(-LP_S))
}#Conditional:
# HR: 2.4
# logHR: .875
#Marginal:
# HR: 1.57
# logHR: .452
set.seed(19599)
<- 2000
n <- gen_X(n)
X <- gen_A(X)
A
<- gen_Y_C(A, X)
Y_C <- gen_Y_B(A, X)
Y_B <- gen_Y_S(A, X)
Y_S
<- data.frame(A, X, Y_C, Y_B, Y_S) d
```

Because they are only appropriate with a large number of clusters, cluster-robust standard errors are generally not used with subclassification methods. Regular robust standard errors are valid with these methods when using the subclassification weights to estimate marginal effects.↩︎

Sometimes, an error will occur with this method, which usually means more bootstrap replications are required. The number of replicates must be greater than the original sample size when using the full bootstrap and greater than the number of pairs/strata when using the block bootstrap.↩︎

The matching weights are not necessary when performing 1:1 matching, but we include them here for generality. When weights are not necessary, including them does not affect the estimates. Because it may not always be clear when weights are required, we recommend always including them.↩︎

It is possible to exactly reproduce the

`match.data()`

standard error using the`get_matches()`

data, but doing so may require some fiddling due to the defaults in`sandwich`

.↩︎It is also known as fine stratification weighting, described by Desai et al. (2017).↩︎

Including subclass as a main effect in the MMWS-weighted regression will not change the effect estimate but may slightly decrease its estimated standard error.↩︎