Introduction to stratamatch

Rachael C. Aikens


Summary: In a block-randomized controlled trial, the individuals in a study are divided based on important characteristics (for example age group, sex, or smoking status). By stratifying the data in this way prior to randomization, the researchers can reduce the heterogeneity of their treatment and control groups, thereby reducing the variance of their estimate of the effect of their treatment. The stratamatch package is designed to extend this approach to the observational setting by providing functions to allow users to easily divide an observational data set into strata of observations with similar baseline characteristics. The treated and control individuals within each stratum of the data can then by matched (for example, based on propensity score), in order to recapitulate the block-randomized trial from observational data. In order to select an effective stratification scheme for the data, stratamatch applies a pilot design approach to estimate a quantity called the prognostic score, and divides the data into blocks of individuals with similar scores. The potential benefits of such an approach are twofold. First, stratifying the data effectively allows for more computationally efficient matching of large data sets. Second, early studies of the prognostic score show that using the prognostic score to inform the matching process may help reduce the variance in the estimated treatment effect and increase the robustness of an observational study to bias from unmeasured confounding factors.


This document contains the following sections

  1. Introduction and Background
  2. Generating an Example Data Set
  3. How to Stratify a Data Set
  4. Diagnostics for Stratified Data
  5. Matching Stratified Data

After briefly introducing the goals of stratamatch and the approaches it implements on a statistical level, this document will walk through an example of how stratamatch might be used on a toy data set. We will show how to stratify a data set, how to diagnose the quality of the stratification, and how to match a data set within strata.

Introduction and Background

In a fully-randomized controlled experiment, treatment assignments are made independently of each individual’s baseline covariates, allowing for unbiased estimation of the treatment effect. In observational studies, researchers may attempt to account for differences in their subject’s baseline covariates by matching treated and control subjects with similar characteristics. In particular, the popular approach of propenisty score matching pairs individuals who had similar estimated probabilities of recieving the treatment based on their baseline characteristics. By balancing the treated and control groups in this way, the propensity score matching approach attempts to coerce the data set into a form that resembles a randomized controlled trial. Importantly, however, propensity score matching can only address bias to due measured baseline covariates. Imbalances in unmeasured baseline covariates may still bias the effect estimate after propensity score matching. For this reason, researchers often carry out a sensitivity analysis to address concerns about unmeasured confounding.

However, the fully-randomized experiment is not the only experimental study design. In a block-randomized controlled experiment, subjects are stratified based on important covariates (e.g. sex, age group, smoking status) before randomization into treatment groups. This helps reduce the heterogeneity between treatment and control groups. In the experimental context, reducing the heterogeneity between compared groups helps to reduce the variance of the effect estimator. In observational settings however, this reduction in heterogeneity has the added benefit of reducing the sensitivity of the study results to unobserved confounding (see Rosenbaum (2005), for a great discussion on this point). Moreover, if the sample size of an observational study is very large, stratifying the sample and matching separately within the strata in this way may be much faster than matching the entire sample at once. Stratamatch helps to facilitate this process by supplying tools stratify an observational dataset and match within each stratum, thereby emulating an observational version of the block-randomized controlled trial.

Once we have decided to stratify our data set, the question becomes: how should strata be determined? One option is to select the covariates on which to stratify by hand based on expert knowledge. Another option is to use knowledge from previous experiments to select the best substrata. In the experimental setting, this may be done with a pilot study. Using this approach, researchers set aside some of their resources before running the main experiment for the purpose of running a smaller, ‘pilot’ experiment. By examining the outcomes of the pilot study, they can gather information which they can use to inform the design of the main experiment. Importantly, after the pilot study is run, the individuals in the pilot experiment are not reused in the main experiment, so the observations of the outcomes from this earlier analysis are not allowed to bias the results of the main study.

Aikens et al (2019) extend the idea of the pilot study to the observational setting. Using an observational pilot design, the researchers may set aside a ‘pilot set’ of their data. Outcome information on this pilot set can be observed, and the information gained can be used to inform the study design. Subsequently, in order to preserve the separation of the study design from the study analysis, the observations from the pilot set are omitted from the main analysis.

How stratamatch Works

Stratamatch uses a pilot design to estimate a quantity called the prognostic score (Hansen (2008)), defined here as an individual’s expected outcome under the control assignment, based on their baseline covariates. Balancing observational data sets based on the prognostic score can the reduce heterogeneity between matched individuals, decreasing variance and diminishing the sensitivity of the study results to unobserved confounding (See Aikens et al (2019); Antonelli et al (2017); and Leacy and Stuart (2014)) . Moreover, since the prognostic score is often continuous, strata can be easily determined using prognostic score quantiles to select evenly sized bins for the data. This circumvents common problems with stratification based on expert knowledge, since that process often generates strata which are too large, too small, or too poorly balanced between treatment and control observations.

The stratamatch function, auto_stratify, carries out the prognostic score stratification pilot design described above. Although there are many additional options available when running this function, the most basic procedure does the following:

  1. Partition the data set into a pilot data set and an analysis data set

  2. Fit a model for the prognostic score from the observations in the pilot set

  3. Estimate prognostic scores for the analysis set using the prognostic model

  4. Stratify the analysis set based on prognostic score quantiles.

Once the data set has been stratified, stratamatch suggests several diagnostic plots and tables that can be used to assess the size and balance of the strata produced. If the strata are satisfactory, the treatment and control individuals can then be matched. At this point, the reader may choose to match the data set within each stratum using the strata_match function, or they may select and implement their own matching scheme.

Generating an Example Data Set

We demonstrate the functionality of stratamatch with a toy example. The stratamatch package contains its own function, make_sample_data, for just this purpose.


# make sample data set of 5000 observations
dat <- make_sample_data(n = 5000)

# print the first few rows of the sample_data
##            X1         X2 B1 B2 C1 treat outcome
## 1  0.93332697 0.22256993  1  1  b     0       0
## 2 -0.52503178 1.07623706  1  0  c     1       0
## 3  1.81443979 2.83989785  1  1  b     0       0
## 4  0.08304562 0.04686229  0  0  a     1       0
## 5  0.39571880 0.42609036  0  0  b     0       0
## 6 -2.19366962 1.28452442  1  1  c     0       1

Let’s assume that the data in dat are an observational data set, and we would like to estimate the effect of some binary treatment on a binary outcome. Suppose the column treat in this data gives the treatment assigment (where a 0 means untreated and a 1 means treated), and the column outcome gives information on who had the outcome of interest and who didn’t (similarly, let’s call 0 the negative outcome and 1 the positive outcome). However, since this is an observational data set, we didn’t get to choose who got the treatment and who didn’t. Instead, individuals self-selected into treatment or control groups, perhaps in some way which is influenced by their background characteristics. Additionally, an individual’s probability of having the positive outcome may be influenced by the treatment or by other baseline factors, but we don’t know which. In addition to treatment assignments and outcomes, we have measured five baseline covariates for each individual: X1, X2, B1, B2, and C1. X1 and X2 are continuous, B1 and B2 are binary, and C1 is a categorical variable which takes on possible values "a", "b", and "c".

We should also note that dat is a fairly large data set (5000 observations), but only about 1/5 of the individuals in the data set recieved the treatment.

How to Stratify a Data Set

Manual Stratification

Suppose we wanted to stratify the data manually based on covariates that our experts have selected. Importantly, we can only stratify the data based on categorical or binary variables. This means that we cannot use X1 or X2 directly (this would cause manual_stratify to throw an error). Below, I stratify the data set based on C1, B1, and B2.

## manual_strata object from package stratamatch.
## Function call:
## manual_stratify(data = dat, strata_formula = treat ~ B1 + B2 + 
##     C1)
## Analysis set dimensions: 5000 X 8
## Number of strata: 12 
##  Min size: 33    Max size: 1533
## Strata with Potential Issues: 8

Above, we see that manual_stratify returns a manual strata object. The printed summary above shows that manual_stratify has divided our dataset according to our instructions and produced 12 strata.

We can extract important information from m.strat with $.

The most important piece of m.strat is the analysis set. This is the stratified data which we can later match and use for effect estimation.

## # A tibble: 6 x 8
##        X1     X2    B1    B2 C1    treat outcome stratum
##     <dbl>  <dbl> <int> <int> <chr> <int>   <int>   <int>
## 1  0.933  0.223      1     1 b         0       0      11
## 2 -0.525  1.08       1     0 c         1       0       9
## 3  1.81   2.84       1     1 b         0       0      11
## 4  0.0830 0.0469     0     0 a         1       0       1
## 5  0.396  0.426      0     0 b         0       0       2
## 6 -2.19   1.28       1     1 c         0       1      12

Since we didn’t use a pilot design for this example, our analysis set just has all of the same data in dat that we put in to manual_stratify. The only difference is that now there is an additional column, stratum, which says which stratum each individual has been sorted into.

Automatic Stratification

In contrast to manual_stratify, the auto_stratify function automatically creates the stratified data set based on estimated prognostic scores. In this process, we specify what model we want to use for the prognostic score (the prognosis argument) and what percent of the control reserve we want to use to do the fitting (the pilot_fraction) argument. We also need to tell auto_stratify which column of our data set designates the treatment assignment (treat). The final argument, size, specifies about how large we would like our strata to be. I pick size = 400 here so that we get roughly the same number of strata as when we manually stratified our data set.

In practice, pilot_fraction and size have defaults, so we don’t need to specify them every time. Run help(auto_stratify) for more information.

## Constructing a pilot set by subsampling 10% of controls.
## Fitting prognostic model via logistic regression: outcome ~ X1 + X2
## auto_strata object from package stratamatch.
## Function call:
## auto_stratify(data = dat, treat = "treat", prognosis = outcome ~ 
##     X1 + X2, size = 400, pilot_fraction = 0.1)
## Analysis set dimensions: 4619 X 8
## Pilot set dimensions: 381 X 7
## Prognostic Score Formula:
## outcome ~ X1 + X2
## Number of strata: 12 
##  Min size: 384   Max size: 385
## Strata with Potential Issues: 2

Our call to auto_stratify has created auto_strata object. An auto_strata object is much like a manual_strata object, except that it contains somewhat more information, since the stratification process was done differently. Importantly, auto_strata has partitioned 10% of the individuals in the control group to be in the pilot set. These individuals were used to build the prognostic score, and were extracted from the analysis set. In order to prevent overfitting, the observations in the pilot set should not be included in later analyses, and should not be reused when making the final effect estimate.

We can access the pilot set and the analysis set using $ with a.strat. The command a.strat$analysis_set gives the analysis set and a.strat$pilot_set gives the pilot set. We can also view the prognostic score estimates for the individuals in the analysis set with a.strat$prognostic_scores.

Alternative methods for automatic stratification.

Above, we showed the default method used by auto_stratify. By providing other arguments, it is possible to construct the strata using various other methods. For example, one could fit the prognostic scores for the analysis set separately and provide them as an argument to auto_stratify. Alternatively, the researcher could specify their own pilot set that they would like to use to build the prognostic model. Run help(auto_stratify) to see all of the possible inputs to this function.

Diagnostics for Stratified Data Sets

Strata Tables

Both manual_stratify and auto_stratify objects have a strata table, which shows the rules defining each stratum. Just like when we extracted the analysis set, we can extract the strata_table with $. Below, I show the strata table for m.strat, our manually stratified data. Each row describes the definitions used to make one stratum. For example, below we see that stratum 1 contains all individuals who have B1 = 0 and B2 = 0, and C1 = "a" :

## # A tibble: 12 x 5
## # Groups:   B1, B2 [4]
##       B1    B2 C1    stratum  size
##    <int> <int> <chr>   <int> <int>
##  1     0     0 a           1   359
##  2     0     0 b           2   360
##  3     0     0 c           3  1492
##  4     0     1 a           4    39
##  5     0     1 b           5    51
##  6     0     1 c           6   150
##  7     1     0 a           7   175
##  8     1     0 b           8    40
##  9     1     0 c           9    33
## 10     1     1 a          10  1533
## 11     1     1 b          11   383
## 12     1     1 c          12   385

The strata table for an automatically stratified data set is somewhat different:

## # A tibble: 12 x 3
##    stratum quantile_bin    size
##      <int> <chr>          <int>
##  1       1 [0.0461,0.241)   385
##  2       2 [0.2409,0.307)   385
##  3       3 [0.3068,0.352)   385
##  4       4 [0.3524,0.400)   385
##  5       5 [0.4004,0.445)   385
##  6       6 [0.4454,0.482)   385
##  7       7 [0.4825,0.526)   385
##  8       8 [0.5264,0.569)   385
##  9       9 [0.5685,0.615)   385
## 10      10 [0.6149,0.668)   385
## 11      11 [0.6676,0.735)   385
## 12      12 [0.7353,0.946]   384

The strata table for a.strat shows what prognostic score quantiles were used to divide the strata. For example, individuals in stratum 1 have the lowest prognostic scores. This means that, if the individuals in stratum 1 were in the control group, we would predict them to have a very low probability of having the positive outcome, based solely on their values of X1 and X2.

Issue Tables

Another important diagnostic for both manually and automatically stratified data is the issue_table.

## # A tibble: 12 x 6
##    Stratum Treat Control Total Control_Proporti… Potential_Issues          
##      <int> <int>   <int> <int>             <dbl> <chr>                     
##  1       1   125     234   359             0.652 none                      
##  2       2   105     255   360             0.708 none                      
##  3       3   451    1041  1492             0.698 none                      
##  4       4     2      37    39             0.949 Too few samples; Not enou…
##  5       5     3      48    51             0.941 Too few samples; Not enou…
##  6       6    12     138   150             0.92  Not enough treated samples
##  7       7   104      71   175             0.406 none                      
##  8       8    25      15    40             0.375 Too few samples           
##  9       9    18      15    33             0.455 Too few samples           
## 10      10   241    1292  1533             0.843 Not enough treated samples
## 11      11    51     332   383             0.867 Not enough treated samples
## 12      12    55     330   385             0.857 Not enough treated samples

This table shows each of the strata that we obtained through manual stratification and some of the potential issues we may face when trying to match within these strata. In particular, this table highlights some strata which are too small and/or have many more treated individuals than controls. The Total column shows the total number of observations in each stratum. Some of the strata here are quite a bit larger than others. In practice, data sets of about 4000 start to take a long time to match, so that’s about the maximum size we would like for our strata. In data sets of fewer than 100 individuals, it can be hard to find good matches between our treated and control individuals.

One benefit of automatic stratification is that it tends to create strata of roughly equal size, so that no strata are extremely small or extremely large. We can see this by comparing the Total column in the issue tables of the manually and automatically stratified data sets:

## # A tibble: 12 x 6
##    Stratum Treat Control Total Control_Proportion Potential_Issues         
##      <int> <int>   <int> <int>              <dbl> <chr>                    
##  1       1   133     252   385              0.655 none                     
##  2       2   131     254   385              0.660 none                     
##  3       3   110     275   385              0.714 none                     
##  4       4   112     273   385              0.709 none                     
##  5       5   106     279   385              0.725 none                     
##  6       6   124     261   385              0.678 none                     
##  7       7    78     307   385              0.797 none                     
##  8       8    75     310   385              0.805 Not enough treated sampl…
##  9       9    97     288   385              0.748 none                     
## 10      10    81     304   385              0.790 none                     
## 11      11    84     301   385              0.782 none                     
## 12      12    61     323   384              0.841 Not enough treated sampl…


Size-Ratio Plots

Plots are an important way that we can check how our statification went. There are two types of plots that we can make automatically with either a manual_strata object or an auto_strata object. The first is a size-ratio plot. This plots each stratum in the analysis set based on its size and the percentage of control observations. If a stratum is too small or is very imbalanced in its treat:control ratio, finding quality matches may be difficult (shaded yellow areas). If a stratum is too large, matching may be very computationally taxing (shaded red areas). We can see here that a few strata from our manual stratification are quite small, and some have nearly entirely control individuals.

We can compare this with a size-ratio plot from automatic stratification:

This plot clearly shows that the strata generated by auto_stratify are all about the same size. However, some of the strata still have a treat:control ratio much less than 1:1. Some of this is unavoidable because our data set naturally had a treat:control ratio of about 1:5. However, we should try to avoid having very extreme treat:control ratios in any of our strata.

Propensity Score Histograms

The second plot we can make to diagnose issues both manually and automatically stratified data is an overlap histogram. Below, I’ve made an overlap histogram for stratum 1. This shows the propensity score distributions of the treated and control populations. In order to make this histogram, we must tell the plot function what the propensity scores are for our data set by specifying the propensityargument. We have three options: propensity can be a propensity score formula (which will then be fit on the analysis set), a glm modeling propensity score, or a vector of propensity scores. Here, I just pass in a formula.

A similar command works for automatically stratified data:

Fisher-Mill Plots for Automatically Stratified Data

In addition to the two plot types above, there are some plots that can be used to visualize auto_strata objects only: "residual" and "FM" plots.

Setting type = "FM" in the plot command produces a Fisher-Mill plot for a given stratum. This shows each individual in the stratum in terms of their estimated propensity score and their estimated prognostic score (for more details, see Aikens et al (2019)). This allows us to check how well the treated and control samples overlap not only in their probability for treatment but in their expected outcome under the control assignment. Just as with the overlap histograms, we need to provide a stratum number and information on the propensity score. Below, I use the same propensity formula as in the histogram above. Below, we see relatively good overlap. Prognostic score has a hard upper limit because we are looking at the first stratum, which means these are the lowest prognostic scores in our data set.

Prognostic Model Plots for Automatically Stratified Data

Finally, if we fit a prognostic model, then we can run diagnostics on that prognostic model with type = "residual". This is essentially just a wrapper for the glm plotting method. It produces all the familiar diagnostic plots that we can obtain from calling plot on a fitted model from glm.

Prognostic Model Diagnostics for Automatically Stratified Data

The plot command with type = "residual" gives us a one-line command that shows us standard diagnostic plots for the prognostic score model that we fit on the pilot set. If we want to run extra diagnostics on our prognostic model, we can do that too. Our a.strat stores a copy of the prognostic score model, which we can extract in the usual way with $. By extracting the prognostic model (prognostic_model), we can run any of the usual diagnostics for our fitted model.

## Call:
## glm(formula = prognostic_formula, family = "binomial", data = dat)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7627  -1.0509  -0.6498   1.0808   1.9692  
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.17106    0.15821  -1.081    0.280    
## X1          -0.77962    0.12341  -6.318 2.66e-10 ***
## X2           0.09660    0.09498   1.017    0.309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 527.73  on 380  degrees of freedom
## Residual deviance: 480.76  on 378  degrees of freedom
## AIC: 486.76
## Number of Fisher Scoring iterations: 4

Based on the coefficients from the prognostic model, we can tell that our prognostic score estimates are primarily determined by each individual’s X2 baseline value.

Matching Stratified Data

Now that we have stratified the data, we can match the individuals within each stratum.

The most flexible strategy - albeit the most work intensive one - is for the researcher to do this step by hand, using the strata designations in the analysis set returned by either one of the stratification methods. If the researcher plans to do something very specific or nuanced in the matching process, this may be the best approach. For example, users proficient with the matching package optmatch will note that adding + strata(stratum) to the matching formula supplied to optmatch will request that optmatch perform the matching within each stratum in the analysis set (see the optmatch documentation for further details). Another approach is to divide the analysis_set into separate data frames and match on those individually, perhaps distributing over serval computing clusters or using a different matching scheme in each stratum (for example different \(k\) in \(1:k\) matching)

If the goal is something more simple, i.e. a 1-to-1 or 1-to-\(k\) optimal propensity score matching within strata, the strata_match function can do that automatically. Below, I match two control individuals to each treatment individual within strata in my analysis set provided by a.strat. I use the propensity score formula treat ~ X1 + X2 + B1 + B2.

# match the automatically stratified data
mymatch <- strata_match(a.strat, treat ~ X1 + X2 + B1 + B2, k = 1)
## This function makes essential use of the optmatch package, which has an academic license.
## For more information, run optmatch::relaxinfo()
## Fitting propensity model: treat ~ X1 + X2 + B1 + B2 + strata(stratum)
# summarize matching results
## Structure of matched sets:
##  1:1  0:1 
## 1192 2235 
## Effective Sample Size:  1192 
## (equivalent number of matched pairs).

This function in turn calls optmatch to do the matching, stratified by the strata assignments. It returns an optmatch matching. Inessence, each individual is given an identification code for its match. <NA> indicates that the individual was not matched, and an identifier such as 3.15 indicates that this individual was in stratum 3 and it was placed in match 15.

# add match information as a column in the data set
matched_data <- a.strat$analysis_set
matched_data$match <- as.character(mymatch)

##            X1         X2 B1 B2 C1 treat outcome stratum match
## 1  0.93332697 0.22256993  1  1  b     0       0       2 2.125
## 2 -0.52503178 1.07623706  1  0  c     1       0       9  9.22
## 3  1.81443979 2.83989785  1  1  b     0       0       1  <NA>
## 4  0.08304562 0.04686229  0  0  a     1       0       5  5.82
## 5 -0.36031653 1.69169297  0  0  c     0       1       8  <NA>
## 6  0.14285392 0.68549042  1  1  a     0       0       6  <NA>

Effect estimation can then be done using the matched analysis set via researcher’s method of choice.