Simulation using `simstudy`

has two primary steps. First, the user **defines** the data elements of a data set. Second, the user **generates** the data, using the definitions in the first step. Additional functionality exists to simulate observed or randomized **treatment assignment/exposures**, to generate **survival** data, to create **longitudinal/panel** data, to create **multi-level/hierarchical** data, to create datasets with **correlated variables** based on a specified covariance structure, to **merge** datasets, to create data sets with **missing** data, and to create non-linear relationships with underlying **spline** curves.

This vignette provides a brief introduction to the basics of generating data. For information on more elaborate data generating mechanisms, please visit https://www.rdatagen.net/page/simstudy/.

The key to simulating data in `simstudy`

is the creation of series of data definition tables that look like this:

varname | formula | variance | dist | link |
---|---|---|---|---|

nr | 7 | 0e+00 | nonrandom | identity |

x1 | 10;20 | 0e+00 | uniform | identity |

y1 | nr + x1 * 2 | 8e+00 | normal | identity |

y2 | nr - 0.2 * x1 | 0e+00 | poisson | log |

xnb | nr - 0.2 * x1 | 5e-02 | negBinomial | log |

xCat | 0.3;0.2;0.5 | 0e+00 | categorical | identity |

g1 | 5+xCat | 1e+00 | gamma | log |

b1 | 1+0.3*xCat | 1e+00 | beta | logit |

a1 | -3 + xCat | 0e+00 | binary | logit |

a2 | -3 + xCat | 1e+02 | binomial | logit |

These *definition* tables can be generated two ways. One option is to to use any external editor that allows the creation of `csv`

files, which can be read in with a call to `defRead`

. An alternative is to make repeated calls to the function `defData`

. Here, we illustrate the R code that builds this definition table internally:

```
def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1",
link = "log")
def <- defData(def, varname = "xnb", dist = "negBinomial", formula = "nr - 0.2 * x1",
variance = 0.05, link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1,
link = "log")
def <- defData(def, varname = "b1", dist = "beta", formula = "1+0.3*xCat", variance = 1,
link = "logit")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat",
link = "logit")
def <- defData(def, varname = "a2", dist = "binomial", formula = "-3 + xCat",
variance = 100, link = "logit")
```

The first call to `defData`

without specifying a definition name (in this example the definition name is *def*) creates a **new** data.table with a single row. An additional row is added to the table `def`

each time the function `defData`

is called. Each of these calls is the definition of a new field in the data set that will be generated. In this example, the first data field is named ‘nr’, defined as a constant with a value to be 7. In each call to `defData`

the user defines a variable name, a distribution (the default is ‘normal’), a mean formula (if applicable), a variance parameter (if applicable), and a link function for the mean (defaults to ‘identity’).

The possible distributions include **normal**, **gamma**, **poisson**, **zero-truncated poisson**, **negative binomial**, **binary**, **binomial**, **beta**, **uniform**, **uniform integer**, **categorical**, and **deterministic/non-random**. For all of these distributions, key parameters defining the distribution are entered in the `formula`

, `variance`

, and `link`

fields.

In the case of the **normal**, **gamma**, **beta**, and **negative binomial** distributions, the formula specifies the mean. The formula can be a scalar value (number) or a string that represents a function of previously defined variables in the data set definition (or, as we will see later, in a previously generated data set). In the example, the mean of `y1`

, a normally distributed value, is declared as a linear function of `nr`

and `x1`

, and the mean of `g1`

is a function of the category defined by `xCat`

. The `variance`

field is defined only for normal, gamma, beta, and negative binomial random variables, and can only be defined as a scalar value. In the case of gamma, beta, and negative binomial variables, the value entered in variance field is really a dispersion value \(d\). The variance of a gamma distributed variable will be \(d \times mean^2\), for a beta distributed variable will be \(mean \times (1- mean)/(1 + d)\), and for a negative binomial distributed variable, the variance will be \(mean + d*mean^2\).

In the case of the **poisson**, **zero-truncated poisson**, and **binary** distributions, the formula also specifies the mean. The variance is not a valid parameter in these cases, but the `link`

field is. The default link is ‘identity’ but a ‘log’ link is available for the Poisson distributions and a “logit” link is available for the binary outcomes. In this example, `y2`

is defined as Poisson random variable with a mean that is function of `nr`

and `x1`

on the log scale. For binary variables, which take a value of 0 or 1, the formula represents probability (with the ‘identity’ link) or log odds (with the ‘logit’ link) of the variable having a value of 1. In the example, `a1`

has been defined as a binary random variable with a log odds that is a function of `xCat`

.

In the case of the *binomial* distribution, the formula specifies the probability of success \(p\), and the variance field is used to specify the number of trials \(n\). The mean of this distribution is \(n*p\), and the variance is \(n*p*(1-p)\).

Variables defined with a **uniform**, **uniform integer**, **categorical**, or **deterministic/non-random** distribution are specified using the formula only. The `variance`

and `link`

fields are not used in these cases.

For a uniformly distributed variable, The formula is a string with the format “a;b”, where *a* and *b* are scalars or functions of previously defined variables. The uniform distribution has two parameters - the minimum and the maximum. In this case, *a* represents the minimum and *b* represents the maximum.

For a categorical variable with \(k\) categories, the formula is a string of probabilities that sum to 1: “\(p_1 ; p_2 ; ... ; p_k\)”. \(p_1\) is the probability of the random variable falling category 1, \(p_2\) is the probability of category 2, etc. The probabilities can be specified as functions of other variables previously defined. In the example, `xCat`

has three possibilities with probabilities 0.3, 0.2, and 0.5, respectively.

Non-random variables are defined by the formula. Since these variables are deterministic, variance is not relevant. They can be functions of previously defined variables or a scalar, as we see in the sample for variable defined as `nr`

.

After the data set definitions have been created, a new data set with \(n\) observations can be created with a call to function ** genData**. In this example, 1,000 observations are generated using the data set definitions in

`def`

`dt`

```
## idnum nr x1 y1 y2 xnb xCat g1 b1 a1 a2
## 1: 1 7 18.71470 48.13110 25 36 3 882.3611 0.9707256 1 48
## 2: 2 7 12.63977 34.82680 87 97 3 1986.9499 0.8497208 1 54
## 3: 3 7 13.21247 34.96022 80 71 2 1460.2205 0.7136439 0 26
## 4: 4 7 19.21613 38.93975 17 16 1 77.6381 0.9997095 0 14
## 5: 5 7 10.70988 24.16021 148 110 2 696.4741 0.9546023 0 32
## ---
## 996: 996 7 12.69114 34.43474 88 117 1 480.9245 0.3714678 0 10
## 997: 997 7 11.48129 31.34903 108 125 1 235.4808 0.9427933 0 13
## 998: 998 7 16.88184 41.60436 45 50 3 2425.0456 0.9999994 1 51
## 999: 999 7 10.24263 25.36589 151 107 3 2537.5048 0.9978473 0 49
## 1000: 1000 7 12.72076 33.53079 78 70 1 605.9685 0.4719441 0 13
```

New data can be added to an existing data set with a call to function ** addColumns**. The new data definitions are created with a call to

`defData`

`addColumns`

```
addef <- defDataAdd(varname = "zExtra", dist = "normal", formula = "3 + y1",
variance = 2)
dt <- addColumns(addef, dt)
dt
```

```
## idnum nr x1 y1 y2 xnb xCat g1 b1 a1 a2
## 1: 1 7 18.71470 48.13110 25 36 3 882.3611 0.9707256 1 48
## 2: 2 7 12.63977 34.82680 87 97 3 1986.9499 0.8497208 1 54
## 3: 3 7 13.21247 34.96022 80 71 2 1460.2205 0.7136439 0 26
## 4: 4 7 19.21613 38.93975 17 16 1 77.6381 0.9997095 0 14
## 5: 5 7 10.70988 24.16021 148 110 2 696.4741 0.9546023 0 32
## ---
## 996: 996 7 12.69114 34.43474 88 117 1 480.9245 0.3714678 0 10
## 997: 997 7 11.48129 31.34903 108 125 1 235.4808 0.9427933 0 13
## 998: 998 7 16.88184 41.60436 45 50 3 2425.0456 0.9999994 1 51
## 999: 999 7 10.24263 25.36589 151 107 3 2537.5048 0.9978473 0 49
## 1000: 1000 7 12.72076 33.53079 78 70 1 605.9685 0.4719441 0 13
## zExtra
## 1: 53.34158
## 2: 35.55258
## 3: 39.74581
## 4: 38.61562
## 5: 27.66564
## ---
## 996: 39.48614
## 997: 35.11438
## 998: 43.95011
## 999: 27.75513
## 1000: 38.88947
```

Treatment assignment can be accomplished through the original data generation process, using `defData`

and `genData`

. However, the functions `trtAssign`

and `trtObserve`

provide more options to generate treatment assignment.

Treatment assignment can simulate how treatment is made in a randomized study. Assignment to treatment groups can be (close to) balanced (as would occur in a block randomized trial); this balancing can be done without or without strata. Alternatively, the assignment can be left to chance without blocking; in this case, balance across treatment groups is not guaranteed, particularly with small sample sizes.

First, create the data definition:

```
def <- defData(varname = "male", dist = "binary", formula = 0.5, id = "cid")
def <- defData(def, varname = "over65", dist = "binary", formula = "-1.7 + .8*male",
link = "logit")
def <- defData(def, varname = "baseDBP", dist = "normal", formula = 70, variance = 40)
dtstudy <- genData(330, def)
```

*Balanced treatment assignment, stratified by gender and age category (not blood pressure)*

```
study1 <- trtAssign(dtstudy, n = 3, balanced = TRUE, strata = c("male", "over65"),
grpName = "rxGrp")
study1
```

```
## cid rxGrp male over65 baseDBP
## 1: 1 3 1 0 69.71811
## 2: 2 1 0 0 68.21481
## 3: 3 3 1 0 63.64589
## 4: 4 2 0 0 67.40492
## 5: 5 3 0 0 72.96366
## ---
## 326: 326 1 1 1 67.99205
## 327: 327 3 1 0 55.73555
## 328: 328 3 1 0 74.85385
## 329: 329 3 1 0 75.20651
## 330: 330 3 1 0 66.24796
```

*Balanced treatment assignment (without stratification)*

*Random (unbalanced) treatment assignment*

*Comparison of three treatment assignment mechanisms*

If exposure or treatment is observed (rather than randomly assigned), use `trtObserve`

to generate groups. There may be any number of possible exposure or treatment groups, and the probability of exposure to a specific level can depend on covariates already in the data set. In this case, there are three exposure groups that vary by gender and age:

```
formula1 <- c("-2 + 2*male - .5*over65", "-1 + 2*male + .5*over65")
dtExp <- trtObserve(dtstudy, formulas = formula1, logit.link = TRUE, grpName = "exposure")
```

Here are the exposure distributions by gender and age:

Here is a second case of three exposures where the exposure is independent of any covariates. Note that specifying the formula as `c(.35, .45)`

is the same as specifying it is `c(.35, .45, .20)`

. Also, when referring to probabilities, the identity link is used: