This vignette addresses drake's known edge cases, pitfalls, and weaknesses that might not be fixed in future releases. For the most up-to-date information on unhandled edge cases, please visit the issue tracker, where you can submit your own bug reports as well. Be sure to search the closed issues too, especially if you are not using the most up-to-date development version of drake. For a guide to debugging and testing drake projects, please refer to the separate “debug” vignette.

Projects built with drake <= 4.4.0 are not compatible with drake > 4.4.0.

Versions of drake after 4.4.0 have different caching internals. If you built your project with drake 4.4.0 or earlier, later versions of drake will refuse to run make() because doing so could mangle your work. To migrate your project to a later version of drake, you have three options (a sketch of options 1 and 3 follows the list).

  1. Revert to a back-compatible version of drake with devtools::install_version("drake", "4.4.0"). Here, the devtools package must be installed (install.packages("devtools")).
  2. Run your project from scratch with make(..., force = TRUE).
  3. Use migrate_drake_project() to convert your project to the new format. The migrate_drake_project() function
    1. copies your old project to a backup folder.
    2. converts your project's cache to a format compatible with drake 5.0.0 and later.
    3. informs you whether the migration succeeded: that is, whether outdated targets remain outdated and up-to-date targets remain up to date after the migration.
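
Here is a minimal sketch of options 1 and 3, assuming your project lives in the current working directory and uses the default .drake cache.

# Option 1: revert to a back-compatible version of drake.
# install.packages("devtools") # if devtools is not already installed
# devtools::install_version("drake", "4.4.0")

# Option 3: back up the old project and convert the cache.
library(drake)
migrate_drake_project()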

Workflow plans

Externalizing commands in R script files

It is common practice to divide the work of a project into multiple R files, but if you do this, you will not get the most out of drake. Please see the best practices vignette for more details.
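
The sketch below shows the general idea (the script path and function names are hypothetical): keep function definitions in scripts, source them into your session, and reserve the workflow plan for the commands themselves.

library(drake)
source("R/my_functions.R") # hypothetical script defining clean_data() and fit_model()
my_plan <- drake_plan(
  cleaned = clean_data(raw_data),
  model = fit_model(cleaned)
)
make(my_plan)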

Commands are NOT perfectly flexible.

In your workflow plan data frame (produced by drake_plan() and accepted by make()), your commands can usually be flexible R expressions.

drake_plan(
  target1 = 1 + 1 - sqrt(sqrt(3)),
  target2 = my_function(web_scraped_data) %>% my_tidy
)
## # A tibble: 2 x 2
##   target  command                                  
##   <chr>   <chr>                                    
## 1 target1 1 + 1 - sqrt(sqrt(3))                    
## 2 target2 my_function(web_scraped_data) %>% my_tidy

However, please try to avoid formulas and function definitions in your commands. You may be able to get away with drake_plan(f = function(x){x + 1}) or drake_plan(f = y ~ x) in some use cases, but be careful. It is generally better to define functions and formulas in your workspace and then let make() import them. (Alternatively, use the envir argument to make() to tightly control which imported functions are available.) Use the check_plan() function to help screen and quality-control your workflow plan data frame, use tracked() to see the items that are reproducibly tracked, and use vis_drake_graph() and build_drake_graph() to see the dependency structure of your project.
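
Here is a quick sketch of the recommended approach (the function name is hypothetical): define the function in your workspace so make() imports it, and keep the command itself simple.

library(drake)
add_one <- function(x){
  x + 1
}
plan <- drake_plan(result = add_one(1))
check_plan(plan) # screen the plan for obvious problems
make(plan)
readd(result)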

Execution

Install drake properly.

You must properly install drake using install.packages(), devtools::install_github(), or a similar approach. Functions like devtools::load_all() are insufficient, particularly for parallel computing functionality in which separate new R sessions try to require(drake).

Install all your packages.

Your workflow may depend on external packages such as ggplot2, dplyr, and MASS. Such packages must be formally installed with install.packages(), devtools::install_github(), devtools::install_local(), or a similar command. If you load uninstalled packages with devtools::load_all(), results may be unpredictable and incorrect.

A note on tidy evaluation

Running commands in your R console is not always exactly like running them with make(). That's because make() uses tidy evaluation as implemented in the rlang package.

# This workflow plan uses rlang's quasiquotation operator `!!`.
my_plan <- drake_plan(list = c(
  little_b = "\"b\"",
  letter = "!!little_b"
))
my_plan
## # A tibble: 2 x 2
##   target   command   
##   <chr>    <chr>     
## 1 little_b "\"b\""   
## 2 letter   !!little_b
make(my_plan)
## target little_b
## target letter
readd(letter)
## [1] "b"

For the commands you specify in the free-form ... argument, drake_plan() also supports tidy evaluation. For example, it supports quasiquotation with the !! operator. Use tidy_evaluation = FALSE or the list argument to suppress this behavior.

my_variable <- 5

drake_plan(
  a = !!my_variable,
  b = !!my_variable + 1,
  list = c(d = "!!my_variable")
)
## # A tibble: 3 x 2
##   target command      
##   <chr>  <chr>        
## 1 a      5            
## 2 b      5 + 1        
## 3 d      !!my_variable

drake_plan(
  a = !!my_variable,
  b = !!my_variable + 1,
  list = c(d = "!!my_variable"),
  tidy_evaluation = FALSE
)
## # A tibble: 3 x 2
##   target command            
##   <chr>  <chr>              
## 1 a      !(!my_variable)    
## 2 b      !(!my_variable + 1)
## 3 d      !!my_variable

For instances of !! that remain in the workflow plan, make() will run these commands in tidy fashion, evaluating the !! operator using the environment you provided.

Find and diagnose your errors.

When make() fails, use failed() and diagnose() to debug. Try the following out yourself.

# Targets with available diagnostic metadata, including errors, warnings, etc.
diagnose()
## [1] "letter"   "little_b"

f <- function(){
  stop("unusual error")
}

bad_plan <- drake_plan(target = f())

withr::with_message_sink(
  stdout(),
  make(bad_plan)
)
## target target
## fail target
## Error: Target `target`` failed. Call `diagnose(target)` for details. Error message:
##   unusual error
## Warning: No message sink to remove.

failed() # From the last make() only
## [1] "target"

error <- diagnose(target)$error # See also warnings and messages.

error$message
## [1] "unusual error"

error$call
## f()

error$calls # View the traceback.
## [[1]]
## local({
##     f()
## })
## 
## [[2]]
## eval.parent(substitute(eval(quote(expr), envir)))
## 
## [[3]]
## eval(expr, p)
## 
## [[4]]
## eval(expr, p)
## 
## [[5]]
## eval(quote({
##     f()
## }), new.env())
## 
## [[6]]
## eval(quote({
##     f()
## }), new.env())
## 
## [[7]]
## f()
## 
## [[8]]
## stop("unusual error")

Your workspace is modified by default.

As of version 3.0.0, drake's execution environment is the user's workspace by default. As a result, the workspace is vulnerable to side effects of make() (which are generally limited to loading and unloading targets). To protect your workspace, you may want to create a custom evaluation environment containing all your imported objects and then pass it to the envir argument of make(). Here is how.

library(drake)
clean(verbose = FALSE)
envir <- new.env(parent = globalenv())
eval(
  expression({
    f <- function(x){
      g(x) + 1
    }
    g <- function(x){
      x + 1
    }
  }
  ),
  envir = envir
)
myplan <- drake_plan(out = f(1:3))

make(myplan, envir = envir)
## target out

ls() # Check that your workspace did not change.
##  [1] "AES"               "AESdecryptECB"     "AESencryptECB"    
##  [4] "AESinit"           "attr_sha1"         "bad_plan"         
##  [7] "cache"             "cranlogs_plan"     "digest"           
## [10] "digest_impl"       "envir"             "error"            
## [13] "f"                 "g"                 "get_logs"         
## [16] "good_plan"         "hard_plan"         "hmac"             
## [19] "letter"            "little_b"          "logs"             
## [22] "makeRaw"           "makeRaw.character" "makeRaw.default"  
## [25] "makeRaw.digest"    "makeRaw.raw"       "modes"            
## [28] "my_plan"           "my_variable"       "myplan"           
## [31] "new_objects"       "num2hex"           "padWithZeros"     
## [34] "plan"              "print.AES"         "query"            
## [37] "random_rows"       "reg1"              "reg2"             
## [40] "rules_grid"        "sha1"              "sha1.Date"        
## [43] "sha1.NULL"         "sha1.POSIXct"      "sha1.POSIXlt"     
## [46] "sha1.anova"        "sha1.array"        "sha1.call"        
## [49] "sha1.character"    "sha1.complex"      "sha1.data.frame"  
## [52] "sha1.default"      "sha1.factor"       "sha1.function"    
## [55] "sha1.integer"      "sha1.list"         "sha1.logical"     
## [58] "sha1.matrix"       "sha1.name"         "sha1.numeric"     
## [61] "sha1.pairlist"     "sha1.raw"          "simulate"         
## [64] "timestamp"         "tmp"               "url"              
## [67] "x"

ls(envir) # Check your evaluation environment.
## [1] "f"   "g"   "out"

envir$out
## [1] 3 4 5

readd(out)
## [1] 3 4 5

Refresh the drake_config() list early and often.

The master configuration list returned by drake_config() is important to drake's internals, and you will need it for functions like outdated() and vis_drake_graph(). The config list corresponds to a single call to make(), and you should not modify it by hand afterwards. For example, modifying the targets element post-hoc will have no effect because the graph element will remain the same. It is best to just call drake_config() again.
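
Here is a sketch of the intended workflow: regenerate the config list whenever the plan or other make() arguments change, rather than editing its elements by hand.

library(drake)
plan <- drake_plan(x = 1)
config <- drake_config(plan)
outdated(config)

plan <- drake_plan(x = 1, y = x + 1) # The plan changed...
config <- drake_config(plan)         # ...so refresh the config list too.
vis_drake_graph(config)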

Workflows as R packages.

The R package structure is a great way to organize the files of your project. Writing your own package to contain your data science workflow is a good idea, but you will need to

  1. Use expose_imports() to properly account for all your nested function dependencies, and
  2. If you load the package with devtools::load_all(), set the prework argument of make(): e.g. make(prework = "devtools::load_all()").

See the best practices vignette and ?expose_imports for detailed explanations. Thanks to Jasper Clarkberg for the workaround.
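
Here is a sketch of both steps, assuming a hypothetical package called yourWorkflowPackage that contains your functions and a plan object named your_plan.

library(drake)
library(yourWorkflowPackage) # hypothetical package with your workflow functions
expose_imports(yourWorkflowPackage)
make(
  your_plan,
  prework = "devtools::load_all()" # only needed if you rely on load_all()
)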

The lazy_load flag does not work with "parLapply" parallelism.

Ordinarily, drake prunes the execution environment at every parallelizable stage. In other words, it loads all the dependencies and unloads anything superfluous for entire batches of targets. This approach may require too much memory for some use cases, so there is an option to delay the loading of dependencies using the lazy_load argument to make() (powered by delayedAssign()). There are two major risks.

  1. make(..., lazy_load = TRUE, parallelism = "parLapply", jobs = 2) does not work. If you want to use local multisession parallelism with multiple jobs and lazy loading, try "future_lapply" parallelism instead.

    library(future)
    future::plan(multisession)
    load_basic_example() # Get the code with drake_example("basic").
    make(my_plan, lazy_load = TRUE, parallelism = "future_lapply")
    
  2. Delayed evaluation may cause the same dependencies to be loaded multiple times, and these duplicated loads could be slow.

Timeouts may be unreliable.

You can call make(..., timeout = 10) to time out each target after 10 seconds. However, timeouts rely on R.utils::withTimeout(), which in turn relies on setTimeLimit(). These functions are the best that R can offer right now, but they have known issues, and timeouts may fail to take effect in certain environments.
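
For example, the sketch below caps each target at 10 seconds, with the caveat above that the limit may not always take effect.

library(drake)
slow_plan <- drake_plan(slow_target = Sys.sleep(30))
make(slow_plan, timeout = 10) # relies on R.utils::withTimeout() internally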

Dependencies

Dependency objects containing functions may change unexpectedly

For example, an R6 class changes whenever a new R6 object of that class is created.

library(digest)
library(R6)
example_class <- R6Class(
  "example_class",
  private = list(data = list()),
  public = list(
    initialize = function(data = list()) {
      private$data = data
    }
  )
)
digest(example_class)
## [1] "d123d15bb7cd05363355cb769afa2e7d"
example_object <- example_class$new(data = 1234)
digest(example_class) # example_class changed
## [1] "9f21372d350984e5e1b0a77e1103f2cd"

Drake detects this kind of strange change and unavoidably rebuilds the affected targets.

plan <- drake_plan(example_target = example_class$new(1234))
make(plan) # `example_class` changes because it is referenced.
make(plan) # Builds `example_target` again because `example_class` changed.

The same behavior affects objects containing functions more broadly, not just R6 classes. For more on this edge case, see issue 345.

The magrittr dot (.) is always ignored.

It is common practice to use a literal dot (.) to carry an object through a magrittr pipeline. Due to some tricky limitations in static code analysis, drake never treats the dot (.) as a dependency, even if you use it as an ordinary variable outside of a tidyverse context.

deps("sqrt(x + y + .)")
## [1] "sqrt" "x"    "y"
deps("dplyr::filter(complete.cases(.))")
## [1] "complete.cases" "dplyr::filter"

Triggers and skipped imports

With alternate triggers and the option to skip imports, you can sacrifice reproducibility to gain speed. However, these options can throw the dependency network out of sync. You should only use them for testing and debugging.
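
As a sketch of that tradeoff (for debugging only), the call below assumes the trigger and skip_imports arguments of make() in your version of drake.

library(drake)
load_basic_example() # Get the code with drake_example("basic").
make(my_plan, trigger = "missing", skip_imports = TRUE)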

Dependencies are not tracked in some edge cases.

You should explicitly check which items in your workflow are reproducibly tracked and what drake considers to be the dependencies of your targets.

?deps
?tracked
?vis_drake_graph

Drake can be fooled into skipping objects that should be treated as dependencies. For example:

f <- function(){
  b <- get("x", envir = globalenv()) # x is incorrectly ignored
  digest::digest(file_dependency)
}

deps(f)
## [1] "digest::digest"  "file_dependency" "get"             "globalenv"

command <- "x <- digest::digest(file_in(\"input_file.rds\")); assign(\"x\", 1); x" # nolint
deps(command)
## [1] "\"input_file.rds\"" "assign"             "digest::digest"

Drake takes special precautions so that a target/import does not depend on itself. For example, deps(f) might return "f" if f() is a recursive function, but make() just ignores this conflict and runs as expected. In other words, make() automatically removes all self-referential loops in the dependency network.

Dependencies of knitr reports

If you have knitr reports, you can declare them with knitr_in() in your commands so that your reports are refreshed every time one of their dependencies changes. See drake_example("basic") and the examples in the ?knitr_in() help file for demonstrations. Dependencies are detected if you call loadd() or readd() in your code chunks. But beware: an empty call to loadd() does not account for any dependencies even though it loads all the available targets into your R session.
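
Here is a sketch of a typical report target, assuming report.Rmd calls loadd() or readd() in its active code chunks.

library(drake)
library(knitr)
report_plan <- drake_plan(
  report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)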

Knitr inputs and file outputs in imported functions.

Similarly, file_out() in a command tells drake that the command produces an output file. Clearly, this does not make sense for imported functions, so do not try to use file_out() or knitr_in() inside your custom imported functions. Drake only pays attention to them in proper commands in your workflow plan data frame. However, file_in() is perfectly fine if your imported function needs a file in order to run. Here are some examples.

# totally_okay() will depend on the imported data.csv file.
# But make sure data.csv is an imported file and not a file target.
totally_okay <- function(x, y, z){
  read.csv(file_in("data.csv"))
}

# file_out() is for file targets, so `drake` will ignore it.
avoid_this <- function(x, y, z){
  read.csv(file_out("data.csv"))
}

# knitr_in() is for knitr files with dependencies
# in their active code chunks (explicitly referenced with loadd() and readd()).
# Drake just treats knitr_in() as an ordinary file input in this case.
# You should really be using file_in() instead.
avoid_this <- function(x, y, z){
  read.csv(knitr_in("report.Rmd"))
}

Functions produced by Vectorize()

With functions produced by Vectorize(), detecting dependencies is especially hard because the body of every such function is

args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
names <- if (is.null(names(args)))
    character(length(args)) else names(args)
dovec <- names %in% vectorize.args
do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
    SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))

Thus, if f is constructed with Vectorize(g, ...), drake searches g() for dependencies, not f(). Specifically, if drake sees that environment(f)[["FUN"]] exists and is a function, then environment(f)[["FUN"]] is analyzed and reproducibly tracked instead of f() itself. So if the configuration of the vectorization changes (such as which arguments are vectorized) but the core element-wise functionality remains the same, make() will not react. Also, if you hover over the f node in vis_drake_graph(hover = TRUE), you will see the body of environment(f)[["FUN"]], not the body of f().
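
The sketch below illustrates where the tracked function lives: the element-wise function is stored in the closure environment of the vectorized wrapper.

g <- function(x){
  sqrt(x) + 1
}
f <- Vectorize(g, "x")
identical(environment(f)[["FUN"]], g) # TRUE: drake analyzes and tracks g(), not f()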

Compiled code is not reproducibly tracked.

Some R functions use .Call() to run compiled code in the backend. The R code in these functions is tracked, but not the compiled object called with .Call(), nor its C/C++/Fortran source.

Directories (folders) are not reproducibly tracked.

In your workflow plan, you can use file_in(), file_out(), and knitr_in() to assert that some targets/imports are external files. However, entire directories (i.e. folders) cannot be reproducibly tracked this way. Please see issue 12 for a discussion.
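
A hedged workaround sketch: declare the individual files you care about with file_in() (the file paths and the process_files() function below are hypothetical).

library(drake)
plan <- drake_plan(
  summary_data = process_files(
    file_in("data/file1.csv"),
    file_in("data/file2.csv")
  )
)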

Packages are not tracked as dependencies.

Drake may import functions from packages, but the packages themselves are not tracked as dependencies. For this, you will need other tools that support reproducibility beyond the scope of drake. Packrat creates a tightly-controlled local library of packages to extend the shelf life of your project. And with Docker, you can execute your project on a virtual machine to ensure platform independence. Together, packrat and Docker can help others reproduce your work even if they have different software and hardware.

High-performance computing

The practical utility of parallel computing

Drake claims that it can

  1. Build and cache your targets in parallel (in stages).
  2. Build and cache your targets in the correct order, finishing dependencies before starting targets that depend on them.
  3. Deploy your targets to the parallel backend of your choice.

However, the practical efficiency of the parallel computing functionality remains to be verified rigorously; serious performance studies are left to future work. In addition, each project has its own best parallel computing setup, and you will need to optimize it on a case-by-case basis. Some general considerations include the following.

Maximum number of simultaneous jobs

Be mindful of the maximum number of simultaneous parallel jobs you deploy. At best, deploying too many jobs is poor etiquette on a system with many users and limited resources. At worst, it can crash the system. The jobs argument to make() sets the maximum number of simultaneous jobs in most cases, but not all.

For most of drake's parallel backends, jobs sets the maximum number of simultaneous parallel jobs. However, there are ways to break the pattern. For example, make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4") uses at most 2 jobs for the imports and at most 4 jobs for the targets. (In make(), args overrides jobs for the targets.) For make(..., parallelism = "future_lapply"), the jobs argument is ignored altogether. Instead, set the workers argument where it is available (for example, future::plan(multisession(workers = 2)) or future::plan(future.batchtools::batchtools_local(workers = 2))) in the preparations before make(). Alternatively, you might limit the maximum number of jobs by setting options(mc.cores = 2) before calling make(). Depending on the future backend you select with future::plan(), you might also make use of one of the other environment variables listed in ?future::future.options.
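
Here is a sketch of capping a "future_lapply" run at 2 simultaneous workers, where the jobs argument would otherwise be ignored.

library(drake)
library(future)
future::plan(multisession, workers = 2) # cap at 2 simultaneous workers
load_basic_example() # Get the code with drake_example("basic").
make(my_plan, parallelism = "future_lapply")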

Parallel computing on Windows

On Windows, do not use make(..., parallelism = "mclapply", jobs = n) with n greater than 1. You could try, but jobs will just be demoted to 1. Instead, please replace "mclapply" with one of the other parallelism_choices() or let drake choose the parallelism backend for you. For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.
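
For example, here is a sketch of a Windows-friendly call.

library(drake)
parallelism_choices() # See which backends are available.
load_basic_example()  # Get the code with drake_example("basic").
make(my_plan, parallelism = "parLapply", jobs = 2)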

Configuring future/batchtools parallelism for clusters

The "future_lapply" backend unlocks a large array of distributed computing options on serious computing clusters. However, it is your responsibility to configure your workflow for your specific job scheduler. In particular, special batchtools *.tmpl configuration files are required, and the technique is described in the documentation of batchtools. You can find some examples of these files in the inst/templates folders of the batchtools and future.batchtools GitHub repositories. Drake has some built-in prepackaged example workflows. See drake_examples() to view your options, and then drake_example() to write the files for an example.

drake_example("sge")    # Sun/Univa Grid Engine workflow and supporting files
drake_example("slurm")  # SLURM
drake_example("torque") # TORQUE

To write just *.tmpl files from these examples, see the drake_batchtools_tmpl_file() function.

Unfortunately, there is no one-size-fits-all *.tmpl configuration file for any job scheduler, so we cannot guarantee that the above examples will work for you out of the box. To learn how to configure the files to suit your needs, you should make sure you understand your job scheduler and batchtools.
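
Here is a hedged sketch for a SLURM cluster, assuming you have already adapted a batchtools template and saved it as slurm_batchtools.tmpl in your working directory.

library(drake)
library(future.batchtools)
# drake_batchtools_tmpl_file("slurm") # writes a starter template to adapt
future::plan(batchtools_slurm, template = "slurm_batchtools.tmpl")
load_basic_example() # Get the code with drake_example("basic").
make(my_plan, parallelism = "future_lapply")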

Proper Makefiles are not standalone.

The Makefile generated by make(myplan, parallelism = "Makefile") is not standalone. Do not run it outside of drake::make(). Drake uses dummy timestamp files to tell the Makefile what to do, and running make in the terminal will most likely give incorrect results.

Makefile-level parallelism for imported objects and files

Makefile-level parallelism is only used for targets in your workflow plan data frame, not imports. To process imported objects and files, drake selects the best local parallel backend for your system and uses the jobs argument to make(). To use at most 2 jobs for imports and at most 4 jobs for targets, run

make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4")

Zombie processes

Some parallel backends, particularly mclapply and future::multicore, may create zombie processes. Zombie child processes are not usually harmful, but you may wish to kill them yourself. The following function by Carl Boneri should work on Unix-like systems. For a discussion, see drake issue 116.

fork_kill_zombies <- function(){
  require(inline)
  includes <- "#include <sys/wait.h>"
  code <- "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};"

  wait <- inline::cfunction(
    body = code,
    includes = includes,
    convention = ".C"
  )

  invisible(wait())
}

Storage

Projects hosted on Dropbox and similar platforms

If you download a drake project from Dropbox, you may get an error like the one in issue 198:


cache pathto/.drake
connect 61 imports: ...
connect 200 targets: ...
Error in rawToChar(as.raw(x)) : 
  embedded nul in string: 'initial_drake_version\0\0\x9a\x9d\xdc\0J\xe9\0\0\0(\x9d\xf9brם\0\xca)\0\0\xb4\xd7\0\0\0\0\xb9'
In addition: Warning message:
In rawToChar(as.raw(x)) :
  out-of-range values treated as 0 in coercion to raw

This is probably because Dropbox generates a bunch of “conflicted copy” files when file transfers do not go smoothly. This confuses storr, drake's caching backend.


keys/config/aG9vaw (Sandy Sum's conflicted copy 2018-01-31)
keys/config/am9icw (Sandy Sum's conflicted copy 2018-01-31)
keys/config/c2VlZA (Sandy Sum's conflicted copy 2018-01-31)

Just remove these files and proceed with your work.

Cache customization is limited

The storage vignette describes how storage works in drake. As explained near the end of that vignette, you can plug custom storr caches into make(). However, non-RDS caches such as storr_dbi() may not work with most forms of parallel computing. The storr::storr_dbi() cache and many others are not thread-safe. Either use no parallel computing at all or set parallelism = "future" with caching = "master". The "future" backend is currently experimental, but it allows the master process to do all the caching in order to avoid race conditions.
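
Here is a sketch of the safe combination, assuming the DBI, RSQLite, and storr packages are installed (the file and table names are arbitrary).

library(DBI)
library(RSQLite)
library(storr)
library(drake)
con <- DBI::dbConnect(RSQLite::SQLite(), "drake-cache.sqlite")
cache <- storr::storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)
load_basic_example() # Get the code with drake_example("basic").
make(my_plan, cache = cache, parallelism = "future", caching = "master")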

Runtime predictions

In predict_runtime() and rate_limiting_times(), drake only accounts for the targets with logged build times. If some targets have not been timed, drake throws a warning and prints the untimed targets.
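
Here is a sketch: build the project once so every target has a logged build time, then predict.

library(drake)
load_basic_example() # Get the code with drake_example("basic").
make(my_plan) # Logs a build time for each target.
config <- drake_config(my_plan)
predict_runtime(config, jobs = 2)
rate_limiting_times(config)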