logicDT (Logic Decision Trees) is a statistical learning procedure for identifying response-associated interactions between binary predictors. It is aimed at yielding models with a high predictive power while maintaining interpretability. The main idea consists of performing a global search over the set of predictors and Boolean conjunctions of these and using these terms as input variables of decision trees.

You can install the released version of logicDT from CRAN with:

Here is an example of an epidemiological toy data set consisting of some SNPs, an environmental covariable and a target disease status.

```
set.seed(123)
maf <- 0.25 # Minor allele frequency of 25%
n.snps <- 50
N <- 2000 # 2000 samples, the first 1000 for training the remaining 1000 for testing
X <- matrix(sample(0:2, n.snps * N, replace = TRUE,
prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)), ncol = n.snps)
colnames(X) <- paste("SNP", 1:n.snps, sep="")
X <- splitSNPs(X)
Z <- matrix(rnorm(N, 20, 10), ncol = 1)
colnames(Z) <- "E"
Z[Z < 0] <- 0
# Response generated following a SNP exhibiting a strong marginal effect and
# a more complicated gene-environment interaction
y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) +
log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) +
rnorm(N, 0, 1)
```

The first half of the data set is used for training and the second half for testing. The maximum number of variables in the model is set to 3 and the maximum number of terms to 2. Trees shall be fitted until a minimum node size of 5% of the training data is reached.

```
model <- logicDT(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE],
max_vars = 3, max_conj = 2,
search_algo = "sa",
tree_control = tree.control(nodesize = floor(0.05 * nrow(X)/2)),
simplify = "vars",
allow_conj_removal = FALSE, conjsize = floor(0.05 * nrow(X)/2))
```

Here, the NRMSE (normalized root mean squared error) is used for evaluting the model on the test data.

```
calcNRMSE(predict(model, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]),
y[(N/2+1):N])
#> [1] 0.7880122
```

Since a quantitative covariable was included (via 4pL (four parameter logistic) models), the leaves hold continuous function as can be seen below.

For maximizing the predictive performance, the ensembling methods bagging and gradient boosting are implemented as well.

```
model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE],
bagging.iter = 50,
max_vars = 3, max_conj = 3,
search_algo = "greedy",
tree_control = tree.control(nodesize = floor(0.05 * nrow(X)/2)),
simplify = "vars",
conjsize = floor(0.05 * nrow(X)/2))
```

```
calcNRMSE(predict(model.bagged, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]),
y[(N/2+1):N])
#> [1] 0.7913782
```

```
model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE],
boosting.iter = 50, learning.rate = 0.01,
subsample.frac = 0.75, replace = FALSE,
max_vars = 3, max_conj = 3,
search_algo = "greedy",
tree_control = tree.control(nodesize = floor(0.05 * nrow(X)/2)),
simplify = "vars",
conjsize = floor(0.05 * nrow(X)/2))
```

```
calcNRMSE(predict(model.boosted, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]),
y[(N/2+1):N])
#> [1] 0.8761629
```

Several VIMs and adjustment techniques for estimating the importances of interactions are implemented. For estimating VIMs, it is recommended to use bagged logicDT models, since they usually cover a wider range of input terms. Plus, no additional data set for an unbiased estimation is required, since for bagging, the OOB (out of bag) samples can be used.