Caution

Cautionary notes for drake

William Michael Landau

2017-11-05

Drake has edge cases, pitfalls, and weaknesses that may or may not be fixed in future releases. For the most up-to-date information on unhandled edge cases, please visit the issue tracker, where you can submit your own bug reports as well. Be sure to search the closed issues too, especially if you are not using the most up-to-date development version. In this vignette, I will try to address some of the main issues to keep in mind for writing reproducible workflows safely.

1 Workflow plans

1.1 Beware unparsable symbols in your workflow plan.

In your workflow plan, be sure that target names can be parsed as symbols and commands can be parsed as R code. To be safe, use check(my_plan) to screen for illegal symbols and other problem areas.

A common pitfall is using the evaluate() function to expand wildcards after applying single quotes to file targets.

library(magrittr) # for the pipe operator %>%
workplan(
  data = readRDS("data_..datasize...rds")
) %>%
  rbind(drake::workplan(
    file.csv = write.csv(
      data_..datasize.., # nolint
      "file_..datasize...csv"
    ),
    strings_in_dots = "literals",
    file_targets = TRUE
  )) %>%
  evaluate(
    rules = list(..datasize.. = c("small", "large"))
  )
##             target                                 command
## 1       data_small               readRDS('data_small.rds')
## 2       data_large               readRDS('data_large.rds')
## 3 'file.csv'_small write.csv(data_small, "file_small.csv")
## 4 'file.csv'_large write.csv(data_large, "file_large.csv")

The single quotes in the middle of 'file.csv'_small and 'file.csv'_large are illegal, and the target names do not even correspond to the files written. Instead, construct your workflow plan in multiple stages and apply the single quotes at the very end.

rules <- list(..datasize.. = c("small", "large"))
datasets <- workplan(data = readRDS("data_..datasize...rds")) %>%
  evaluate(rules = rules)

Plan the CSV files separately.

files <- workplan(
  file = write.csv(data_..datasize.., "file_..datasize...csv"), # nolint
  strings_in_dots = "literals"
) %>%
  evaluate(rules = rules)

Single-quote the file targets after evaluate().

files$target <- paste0(
  files$target, ".csv"
) %>%
  as_file

Put the workflow plan together.

rbind(datasets, files)
##             target                                 command
## 1       data_small               readRDS('data_small.rds')
## 2       data_large               readRDS('data_large.rds')
## 3 'file_small.csv' write.csv(data_small, "file_small.csv")
## 4 'file_large.csv' write.csv(data_large, "file_large.csv")

For more control over target names in cases like this, you may want to use the wildcard package.

1.2 Commands are NOT perfectly flexible.

In your workflow plan data frame (produced by workplan() and accepted by make()), your commands can usually be flexible R expressions.

workplan(
  target1 = 1 + 1 - sqrt(sqrt(3)),
  target2 = my_function(web_scraped_data) %>% my_tidy
)
##    target                                   command
## 1 target1                     1 + 1 - sqrt(sqrt(3))
## 2 target2 my_function(web_scraped_data) %>% my_tidy

However, please try to avoid formulas and function definitions in your commands. You may be able to get away with workplan(f = function(x){x + 1}) or workplan(f = y ~ x) in some use cases, but be careful. Rather than using commands for this, it is better to define functions and formulas in your workspace before calling make(). (Alternatively, use the envir argument to make() to tightly control which imported functions are available.) Use the check() function to help screen and quality-control your workflow plan data frame, use tracked() to see the items that are reproducibly tracked, and use plot_graph() and build_graph() to see the dependency structure of your project.
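
For example, here is a minimal sketch of the recommended approach, using the hypothetical names my_model_formula and fit_model:

# Define formulas and functions in your workspace before calling make()
# so that drake imports and tracks them.
my_model_formula <- y ~ x
fit_model <- function(data){
  lm(my_model_formula, data = data)
}
my_plan <- workplan(
  dataset = data.frame(x = rnorm(100), y = rnorm(100)),
  model = fit_model(dataset)
)
make(my_plan)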

2 Execution environment and files

2.1 Install drake properly.

You must properly install drake using install.packages(), devtools::install_github(), or similar. It is not enough to use devtools::load_all(), particularly for the parallel computing functionality, in which multiple R sessions initialize and then try to require(drake).

2.2 Install all your packages.

Your workflow may depend on external packages such as ggplot2, dplyr, or MASS. Such packages must be formally installed with install.packages(), devtools::install_github(), devtools::install_local(), or a similar command. If you load uninstalled packages with devtools::load_all(), results may be unpredictable and incorrect.

2.3 Find and diagnose your errors.

When make() fails, use failed() and diagnose() to debug. Try the following yourself.

diagnose()
f <- function(){
  stop("unusual error")
}
bad_plan <- workplan(target = f())
make(bad_plan)
failed() # From the last make() only
diagnose() # From all previous make()s
error <- diagnose(target)
str(error)
error$calls # View the traceback.

2.4 Your workspace is modified by default.

As of version 3.0.0, drake’s execution environment is the user’s workspace by default. As a result, the workspace is vulnerable to side effects of make(). To protect your workspace, you may want to create a custom evaluation environment containing all your imported objects and then pass it to the envir argument of make(). Here is how.

library(drake)
envir <- new.env(parent = globalenv())
eval(expression({
  f <- function(x){
    g(x) + 1
  }
  g <- function(x){
    x + 1
  }
}), envir = envir)
myplan <- workplan(out = f(1:3))
make(myplan, envir = envir)
## cache C:/Users/c240390/AppData/Local/Temp/RtmpSQkVhU/Rbuild2db86da6234/drake/...
## connect 2 imports: f, g
## connect 1 target: out
## check 1 item: g
## import g
## check 1 item: f
## import f
## check 1 item: out
## target out
ls() # Check that your workspace did not change.
## [1] "datasets" "envir"    "files"    "myplan"   "rules"
ls(envir) # Check your evaluation environment.
## [1] "f"   "g"   "out"
envir$out
## [1] 3 4 5
readd(out)
## cache C:/Users/c240390/AppData/Local/Temp/RtmpSQkVhU/Rbuild2db86da6234/drake/...
## [1] 3 4 5

2.5 Minimize the side effects of your commands.

Consider the workflow plan data frame below.

my_plan <- workplan(list = c(a = "x <- 1; return(x)"))
my_plan
##   target           command
## 1      a x <- 1; return(x)
deps(my_plan$command[1])
## [1] "return"

Here, x is a mere side effect of the command, and it will not be reproducibly tracked. Worse, if you add a proper target called x to the workflow plan data frame, the results of your analysis may not be correct. Side effects of commands can be unpredictable, so please try to minimize them. It is good practice to write each of your commands as a function call. Nested function calls are fine.
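
For example, here is a hedged sketch of the preferred style, where the command is a single call to a function defined in your workspace (munge_data() is a hypothetical name):

munge_data <- function(raw){
  x <- raw * 2 # Local variables inside functions are fine.
  x + 1        # Return the value instead of assigning into the workspace.
}
good_plan <- workplan(
  raw = seq_len(10),
  munged = munge_data(raw)
)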

2.6 Do not change your working directory.

While a drake project is running, please do not change your working directory (with setwd(), for example). At the very least, if a command in your workflow plan must change the working directory, restore the original directory before the command finishes, as in the sketch below. Drake relies on a hidden cache (the .drake/ folder) at the root of your project, so navigating to a different folder may confuse it.
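
Here is a hedged sketch of that pattern, with a hypothetical helper process_in_subdir():

process_in_subdir <- function(dir){
  old <- setwd(dir)   # setwd() invisibly returns the previous working directory.
  on.exit(setwd(old)) # Restore it even if an error occurs.
  list.files()        # Placeholder for the real work.
}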

2.7 Take special precautions if your drake project is a package.

Some users like to structure their drake projects as formal R packages. The straightforward way to run such a project is to

  1. Write all your imported functions in *.R files in the package’s R/ folder.
  2. Load the execution environment with devtools::load_all().
  3. Call drake::make().
env <- devtools::load_all("yourProject")$env # Has all your imported functions
drake::make(my_plan, envir = env)            # Run the project normally.

However, the simple strategy above only works for parLapply parallelism with jobs = 1 and for mclapply parallelism. For other kinds of parallelism, you must turn devtools::load_all("yourProject")$env into an ordinary environment that does not look like a package namespace. Thanks to Jasper Clarkberg for the following workaround.

  1. Clone devtools::load_all("yourProject")$env in order to change the binding environment of all your functions.
env <- devtools::load_all("yourProject")$env
env <- list2env(as.list(env), parent = globalenv())
  2. Change the enclosing environment of your functions using an unfortunate hack involving environment<-.
for (name in ls(env)){
  assign(
    x = name,
    envir = env,
    value = `environment<-`(get(name, envir = env), env)
  )
}
  3. Make sure drake does not attach yourProject as an external package.
package_name <- "yourProject" # devtools::as.package(".")$package # nolint
packages_to_load <- setdiff(.packages(), package_name)
  4. Run the project with make().
make(
  my_plan, # Prepared in advance
  envir = env,
  parallelism = "Makefile", # Or "parLapply"
  jobs = 2,
  packages = packages_to_load # Does not include "yourProject"
)

You may need to adapt this last workaround, depending on the structure of the package, yourProject.

2.8 Timeouts may be unreliable.

You can call make(..., timeout = 10) to time out each target after 10 seconds. However, timeouts rely on R.utils::withTimeout(), which in turn relies on setTimeLimit(). These functions are the best that R can offer right now, but they have known issues, and the timeout may fail to take effect in certain environments.
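
As a hedged illustration of the underlying mechanism (this calls R.utils directly, not drake), the timeout fires for interruptible R code but may not interrupt long-running compiled code:

R.utils::withTimeout(
  Sys.sleep(5),         # Interruptible R code: the timeout takes effect here.
  timeout = 1,
  onTimeout = "warning" # Compiled code, by contrast, may run to completion anyway.
)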

3 Dependencies

3.1 Check your dependencies.

As the user, you should take responsibility for how the steps of your workflow are interconnected. This will affect which targets are built and which ones are skipped. There are several ways to explore the dependency relationship.

load_basic_example()
my_plan
##                    target                                      command
## 1             'report.md'             knit('report.Rmd', quiet = TRUE)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4       regression1_small                                  reg1(small)
## 5       regression1_large                                  reg1(large)
## 6       regression2_small                                  reg2(small)
## 7       regression2_large                                  reg2(large)
## 8  summ_regression1_small suppressWarnings(summary(regression1_small))
## 9  summ_regression1_large suppressWarnings(summary(regression1_large))
## 10 summ_regression2_small suppressWarnings(summary(regression2_small))
## 11 summ_regression2_large suppressWarnings(summary(regression2_large))
## 12 coef_regression1_small              coefficients(regression1_small)
## 13 coef_regression1_large              coefficients(regression1_large)
## 14 coef_regression2_small              coefficients(regression2_small)
## 15 coef_regression2_large              coefficients(regression2_large)
# Hover, click, drag, zoom, and pan. See args 'from' and 'to'.
plot_graph(my_plan, width = "100%", height = "500px")

You can also check the dependencies of individual targets.

deps(reg2)
## [1] "lm"
deps(my_plan$command[1]) # File dependencies like report.Rmd are single-quoted.
## [1] "'report.Rmd'"           "coef_regression2_small"
## [3] "knit"                   "large"                 
## [5] "small"
deps(my_plan$command[nrow(my_plan)])
## [1] "coefficients"      "regression2_large"

List all the reproducibly-tracked objects and files, including imports and targets.

tracked(my_plan, targets = "small")
## connect 9 imports: envir, files, datasets, myplan, simulate, rules, reg1, reg...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## [1] "small"        "simulate"     "data.frame"   "rpois"       
## [5] "stats::rnorm"
tracked(my_plan)
## connect 9 imports: envir, files, datasets, myplan, simulate, rules, reg1, reg...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
##  [1] "'report.md'"            "small"                 
##  [3] "large"                  "regression1_small"     
##  [5] "regression1_large"      "regression2_small"     
##  [7] "regression2_large"      "summ_regression1_small"
##  [9] "summ_regression1_large" "summ_regression2_small"
## [11] "summ_regression2_large" "coef_regression1_small"
## [13] "coef_regression1_large" "coef_regression2_small"
## [15] "coef_regression2_large" "simulate"              
## [17] "reg1"                   "reg2"                  
## [19] "'report.Rmd'"           "knit"                  
## [21] "summary"                "suppressWarnings"      
## [23] "coefficients"           "data.frame"            
## [25] "rpois"                  "stats::rnorm"          
## [27] "lm"

3.2 Dependencies are not tracked in some edge cases.

First of all, if you are ever unsure about what exactly is reproducibly tracked, consult the examples in the following documentation.

?deps
?tracked
?plot_graph

Drake can be fooled into skipping objects that should be treated as dependencies. For example:

f <- function(){
  b <- get("x", envir = globalenv()) # x is incorrectly ignored
  file_dependency <- readRDS('input_file.rds') # 'input_file.rds' is incorrectly ignored # nolint
  digest::digest(file_dependency)
}
deps(f)
## [1] "digest::digest" "get"            "globalenv"      "readRDS"
command <- "x <- digest::digest('input_file.rds'); assign(\"x\", 1); x"
deps(command)
## [1] "'input_file.rds'" "assign"           "digest::digest"

3.3 Dynamic reports

In dynamic knitr reports, you are encouraged to load and read cached targets and imports with the loadd() and readd() functions. In your workflow plan, as long as your command has an explicit reference to knit(), drake will automatically look for active code chunks and figure out the targets you are going to load and read. They are treated as dependencies for the final report.

load_basic_example()
my_plan[1, ]
##        target                          command
## 1 'report.md' knit('report.Rmd', quiet = TRUE)

The R Markdown report loads targets ‘small’, ‘large’, and ‘coef_regression2_small’ using code chunks marked for evaluation.

deps("knit('report.Rmd')")
## [1] "'report.Rmd'"           "coef_regression2_small"
## [3] "knit"                   "large"                 
## [5] "small"
deps("'report.Rmd'") # These are actually dependencies of 'report.md' (output)
## [1] "coef_regression2_small" "large"                 
## [3] "small"

However, you must explicitly mention each and every target loaded into a report. The following examples are discouraged in code chunks because they do not reference any particular target directly or literally in a way that static code analysis can detect.

var <- "good_target"
# Works in isolation, but drake sees "var" literally as a dependency,
# not "good_target".
readd(target = var, character_only = TRUE)
loadd(list = var)
# All cached items are loaded, but none are treated as dependencies.
loadd()
loadd(imports_only = TRUE)
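
By contrast, explicit and literal references like the ones below are detected (good_target is a hypothetical target name).

# Encouraged in report code chunks: mention each target literally.
readd(good_target)
loadd(good_target)
loadd(small, large, coef_regression2_small) # As in the basic example's report.Rmd.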

3.4 Functions produced by Vectorize()

With functions produced by Vectorize(), detecting dependencies is especially hard because the body of every such function is

args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
names <- if (is.null(names(args)))
    character(length(args)) else names(args)
dovec <- names %in% vectorize.args
do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
    SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))

Thus, if f <- Vectorize(g, ...) is such a function, drake searches g() for dependencies, not f(). Specifically, if drake sees that environment(f)[["FUN"]] exists and is a function, then environment(f)[["FUN"]] will be searched instead of f().

In addition, if f() is the output of Vectorize(), then drake reacts to changes in environment(f)[["FUN"]], not f(). Thus, if the configuration settings of vectorization change (such as which arguments are vectorized), but the core element-wise functionality remains the same, then make() still thinks everything is up to date. Also, if you hover over the f node in plot_graph(hover = TRUE), then you will see the body of environment(f)[["FUN"]], not the body of f().
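
Here is a small hedged sketch to illustrate, with a hypothetical element-wise function g():

g <- function(x, y){
  x + y
}
f <- Vectorize(g, vectorize.args = "x")
# The element-wise function lives in f()'s closure environment,
# and that is what drake analyzes and reacts to.
identical(environment(f)[["FUN"]], g) # TRUE
deps(f) # Dependencies come from g()'s body, not from the wrapper f().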

3.5 Compiled code is not reproducibly tracked.

Some R functions use .Call() to run compiled code in the backend. The R code in these functions is tracked, but not the compiled code called with .Call().

3.6 Directories (folders) are not reproducibly tracked.

Yes, you can declare a file target or input file by enclosing it in single quotes in your workflow plan data frame. But entire directories (i.e. folders) cannot yet be tracked this way. Tracking directories is a tricky problem, and lots of individual edge cases need to be ironed out before I can deliver a clean, reliable solution. Please see issue 12 for updates and a discussion.

3.7 Packages are not tracked as dependencies.

Drake may import functions from packages, but the packages themselves are not tracked as dependencies. For this, you will need other tools that support reproducibility beyond the scope of drake. Packrat creates a tightly-controlled local library of packages to extend the shelf life of your project. And with Docker, you can execute your project on a virtual machine to ensure platform independence. Together, packrat and Docker can help others reproduce your work even if they have different software and hardware.

4 High-performance computing

4.1 Maximum number of simultaneous jobs

Be mindful of the maximum number of simultaneous parallel jobs you deploy. At best, deploying too many jobs is poor etiquette on a system with many users and limited resources. At worst, too many jobs will crash a system. The jobs argument to make() sets the maximum number of simultaneous jobs in most cases, but not all.

For most of drake’s parallel backends, jobs sets the maximum number of simultaneous parallel jobs. However, there are ways to break the pattern. For example, make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4") uses at most 2 jobs for the imports and at most 4 jobs for the targets. (In make(), args overrides jobs for the targets.) For make(..., parallelism = "future_lapply"), the jobs argument is ignored altogether. Instead, you might limit the maximum number of jobs by setting options(mc.cores = 2) before calling make(). Depending on the future backend you select with backend() or future::plan(), you might make use of one of the other options or environment variables listed in ?future::future.options.
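
For example, here is a hedged sketch for a local multisession setup (your own configuration will differ):

library(future)
options(mc.cores = 2)      # One way to cap the number of workers some backends use.
future::plan(multisession) # Or configure the backend with drake's backend().
make(my_plan, parallelism = "future_lapply") # The jobs argument is ignored here.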

4.2 Parallel computing on Windows

On Windows, do not use make(..., parallelism = "mclapply", jobs = n) with n greater than 1. You could try, but jobs will just be demoted to 1. Instead, please replace "mclapply" with one of the other parallelism_choices() or let drake choose the parallelism backend for you. For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.
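
For example, here is a hedged sketch of a Windows-friendly call:

parallelism_choices() # See which backends are available.
# "parLapply" uses PSOCK clusters rather than forking, so it works on Windows.
make(my_plan, parallelism = "parLapply", jobs = 2)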

4.3 Configuring future/batchtools-based distributed computing on clusters

The "future_lapply" backend unlocks a large array of distributed computing options on serious computing clusters. However, it is your responsibility to configure your workflow for your specific job scheduler. In particular, special batchtools *.tmpl configuration files are required, and the technique is described in the documentation of batchtools. You can find some examples of these files in the inst/templates folders of the batchtools and future.batchtools GitHub repositories. Drake has some built-in prepackaged example workflows. See examples_drake() to view your options, and then example_drake() to write the files for an example.

example_drake("sge")    # Sun/Univa Grid Engine workflow and supporting files
example_drake("slurm")  # SLURM
example_drake("torque") # TORQUE

Unfortunately, there is no one-size-fits-all *.tmpl configuration file for any job scheduler, so we cannot guarantee that the above examples will work for you out of the box. To learn how to configure the files to suit your needs, you should make sure you understand how to use your job scheduler and batchtools.
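
As a rough sketch, assuming a SLURM cluster and a slurm.tmpl template you have already adapted to your scheduler (both names are placeholders):

library(future.batchtools)
# Point the future backend at your scheduler through a batchtools template file.
future::plan(batchtools_slurm, template = "slurm.tmpl")
make(my_plan, parallelism = "future_lapply")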

4.4 Proper Makefiles are not standalone.

The Makefile generated by make(myplan, parallelism = "Makefile") is not standalone. Do not run it outside of drake::make(). Drake uses dummy timestamp files to tell the Makefile what to do, and running make in the terminal will most likely give incorrect results.

4.5 Makefile-level parallelism for imported objects and files

Makefile-level parallelism is only used for targets in your workflow plan data frame, not imports. To process imported objects and files, drake selects the best parallel backend for your system and uses the number of jobs you give to the jobs argument to make(). To use at most 2 jobs for imports and at most 4 jobs for targets, run

make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4")

4.6 Zombie processes

Some parallel backends, particularly mclapply and future::multicore, may create zombie processes. Zombies are not usually harmful, but you may wish to kill them yourself. The following function by Carl Boneri should work on Unix-like systems. For a discussion, see drake issue 116.

fork_kill_zombies <- function(){
  require(inline)
  includes <- "#include <sys/wait.h>"
  code <- "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};"

  wait <- inline::cfunction(
    body = code,
    includes = includes,
    convention = ".C"
  )

  invisible(wait())
}

5 Storage

5.1 Storage customization pitfalls

The storage vignette describes how storage works in drake and opens up options for customization. But please do not try to change the short hash algorithm of an existing cache, and beware in-memory caches for parallel computing and persistent projects. See the storage vignette for details.
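
For instance, rather than altering an existing cache, a safer sketch is to start a fresh one (argument names may differ across drake versions):

# Create a brand new cache with the hash algorithms you want,
# instead of changing the short hash algorithm of an existing cache.
cache <- new_cache(
  path = "new_cache", # Hypothetical location for the new cache.
  short_hash_algo = "murmur32",
  long_hash_algo = "sha256"
)
make(my_plan, cache = cache)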

5.2 Runtime predictions

In predict_runtime() and rate_limiting_times(), drake only accounts for the targets with logged build times. If some targets have not been timed, drake throws a warning and prints the untimed targets.