drake

data frames in R for Make

William Michael Landau

2017-05-09

The issue of reproducibility is at the forefront of the R and Statistics community buzz, but there are glaring holes in the conversation and the landscape. Most current talks and tools focus on scientific replicability, code annotation, and version control, and they miss the most crucial part of reproducible data analysis: the promise that the alleged results really do match the code. Drake helps keep this promise by tracking the relationships among the components of the analysis, a rare and effective approach that also saves time. And with multiple parallel computing options that switch on auto-magically, drake is also a convenient and powerful high-performance computing solution.

Acknowledgements and history

The original idea of a time-saving reproducible build system extends back decades to GNU Make, which today helps data scientists as well as the original user base of compiled-language programmers. More recently, Rich FitzJohn created remake, a breakthrough reimagining of Make for R and the most important inspiration for drake. Drake is a fresh reinterpretation of some of remake’s pioneering fundamental concepts, scaled up for computationally-demanding workflows.

Thanks also to Kirill Müller and Daniel Falster. They contributed code patches and enhancement ideas to my parallelRemake and remakeGenerator packages, which I have now subsumed into drake.

Advantages of drake

Some advantages of drake, especially relative to remake at the time of writing, are

  1. reproducibility. Drake can
    • import and track arbitrary variables from your workspace.
    • track functions from packages.
  2. safety.
    • If you manually break an output file target, drake can fix it with another call to make().
  3. parallel computing. Auto-magically run multiple targets simultaneously.
    • Switch on the mclapply() or parLapply() backend for light parallelism within a single R session.
    • Switch on a Makefile to parallelize over multiple R sessions, which you can distribute across several nodes of a computing cluster or a supercomputer.
  4. convenience.
    • No YAML files!
    • With functions like plan(), datasets(), analyses(), evaluate(), expand(), and gather(), you can minimize typing when you set up a project.
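
To make the idea of tracking workspace variables and functions more concrete, here is a minimal sketch that uses only the codetools package bundled with R. It is an illustration of dependency detection in general, not a description of drake's internals, and the names scale_factor and rescale are hypothetical.

```r
library(codetools)

# Hypothetical workspace objects: a variable and a function that uses it.
scale_factor <- 2
rescale <- function(x) x * scale_factor

# Ask codetools which global names the function body refers to.
# With merge = FALSE the result is split into functions and variables.
deps <- findGlobals(rescale, merge = FALSE)

print(deps$variables) # global variables the function depends on
print(deps$functions) # functions (including operators) it calls
```

A build system that records dependencies like these can tell that a change to scale_factor invalidates anything computed with rescale().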

Windows

Drake presents mclapply() as one of two single-session parallel computing backends. Unfortunately, mclapply() cannot run multiple parallel jobs on Windows, so Windows users should set parallelism = "parLapply" rather than parallelism = "mclapply" inside make(). This is already the default on Windows. For true distributed parallel computing over multiple R sessions, Windows users need to download and install Rtools, because drake runs Makefiles with system2("make", ...).
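
The platform difference can be seen with the base parallel package alone. The sketch below does not call drake; it only assumes that the two backend names correspond to parallel::mclapply() and parallel::parLapply(), and the helper names (backend, square, inputs, has_make) are made up for the example. It also checks whether a make executable is visible, which is what a call like system2("make", ...) would need.

```r
library(parallel)

# Forking (mclapply) cannot run multiple jobs on Windows,
# so choose a socket cluster (parLapply) there instead.
on_windows <- .Platform$OS.type == "windows"
backend <- if (on_windows) "parLapply" else "mclapply"

square <- function(x) x^2
inputs <- 1:4

if (backend == "mclapply") {
  # Fork two worker processes from the current session.
  results <- mclapply(inputs, square, mc.cores = 2)
} else {
  # Start two fresh R worker processes and shut them down afterwards.
  cluster <- makeCluster(2)
  results <- parLapply(cluster, inputs, square)
  stopCluster(cluster)
}

# Makefile-based parallelism needs `make` on the PATH
# (on Windows this typically comes from Rtools).
has_make <- nzchar(unname(Sys.which("make")))
print(backend)
print(has_make)
```

If has_make is FALSE on Windows, installing Rtools and restarting R is the usual remedy before trying the Makefile backend.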

Tutorials

The CRAN page links to multiple tutorials and vignettes. With drake installed, you can load any of the vignettes in an R session.

vignette(package = "drake") # List the vignettes.
vignette("drake") # High-level intro.
vignette("quickstart") # Walk through a simple example.
vignette("caution") # Drake is not perfect. Read this to be safe.

Quickstart examples

Drake has small self-contained built-in examples. To see the names of the available examples, use

examples_drake()
## [1] "basic"

Then use example_drake() to write the files for the example to your working directory.

example_drake("basic")

Step through the code files to get started.

Words of caution

Drake tries to reproducibly track everything and make other obviously good decisions, but there are limitations. For example, in some edge cases, it is possible to trick drake into ignoring dependencies. Please read the “caution” vignette to use drake safely (vignette("caution"), also linked from the CRAN page under “vignettes”).

Help and troubleshooting

For troubleshooting instructions, please refer to TROUBLESHOOTING.md on the GitHub page.