The tidy tools manifesto

Hadley Wickham

2017-01-27

This document lays out the consistent principles that unify the packages in the tidyverse. The goal of these principles is to provide a uniform interface so that tidyverse packages work together naturally, and once you’ve mastered one, you have a head start on mastering the others.

This is my first attempt at writing down these principles. That means that this manifesto is both aspirational and likely to change heavily in the future. Currently no packages precisely meet the design goals, and while the underlying ideas are stable, I expect their expression in prose will change substantially as I struggle to make explicit my process and thinking.

There are many other excellent packages that are not part of the tidyverse, because they are designed with a different set of underlying principles. This doesn’t make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages.

There are four basic principles to a tidy API:

  1. Reuse existing data structures.

  2. Compose simple functions with the pipe.

  3. Embrace functional programming.

  4. Design for humans.

Reuse existing data structures

Where possible, reuse existing data structures rather than creating custom ones for your own package. Generally, I think it’s better to prefer a common existing data structure over a custom one, even if it is slightly ill-fitting.

Many R packages (e.g. ggplot2, dplyr, tidyr) work with rectangular datasets made up of observations and variables. If this is true for your package, work with data in either a data frame or tibble. Assume the data is tidy, with variables in the columns, and observations in the rows (see Tidy Data for more details).
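As a minimal sketch of this principle, the function below takes a tidy data frame and returns another data frame, so its output can be handed straight to other tools. The data and the function name `mean_height_by_year()` are hypothetical, made up for illustration; only base R is used.

```r
# Hypothetical tidy data: one row per observation, one column per variable.
heights <- data.frame(
  name   = c("Ana", "Ben", "Cy"),
  year   = c(2016L, 2016L, 2017L),
  height = c(1.62, 1.80, 1.75),
  stringsAsFactors = FALSE
)

# Hypothetical package function: data frame in, data frame out.
mean_height_by_year <- function(data) {
  stopifnot(is.data.frame(data))
  # aggregate() returns a plain data frame with one row per year.
  aggregate(height ~ year, data = data, FUN = mean)
}

result <- mean_height_by_year(heights)
print(result)
```

Because the result is an ordinary data frame, no conversion step is needed before plotting or further manipulation.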

Some packages work at a lower level, focussing on a single type of variable. For example, stringr for strings, lubridate for date/times, and forcats for factors. Generally prefer existing base R vector types, but when this is not possible, create your own using an S3 class built on top of an atomic vector or list.
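The following is one possible sketch of an S3 class built on top of an atomic vector. The `percent` class and its constructor `new_percent()` are hypothetical, invented for illustration; they are not taken from any of the packages named above.

```r
# Hypothetical constructor: a double vector plus a class attribute.
new_percent <- function(x = double()) {
  stopifnot(is.double(x))
  structure(x, class = "percent")
}

# format() method: show the underlying proportions as percentages.
format.percent <- function(x, ...) {
  paste0(format(unclass(x) * 100, trim = TRUE), "%")
}

# print() method built on the format() method.
print.percent <- function(x, ...) {
  print(format(x), quote = FALSE)
  invisible(x)
}

# Subsetting method so that the class survives `[`.
`[.percent` <- function(x, i) {
  new_percent(unclass(x)[i])
}

p <- new_percent(c(0.25, 0.5))
print(p)
first <- p[1]
```

The underlying data remains an ordinary double vector, so base R functions that work on doubles are still available after calling `unclass()`.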

If you need “non-standard scoping”, where you refer to variables inside a data frame as if they are in the global environment, prefer formulas over non-standard evaluation. See lazyeval for more details.
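To illustrate the idea, here is a base R sketch of how a one-sided formula can carry both an expression and the environment it was written in. The helper `eval_in_data()` is hypothetical and is not lazyeval's API; it only shows why a formula is a convenient container for this kind of scoping.

```r
# Hypothetical helper: evaluate the right-hand side of a one-sided formula,
# looking up names in `data` first and then in the formula's environment.
eval_in_data <- function(f, data) {
  stopifnot(inherits(f, "formula"), length(f) == 2L)
  eval(f[[2]], envir = data, enclos = environment(f))
}

df <- data.frame(x = 1:3, y = c(10, 20, 30))
threshold <- 15

# `y` is found in the data frame; `threshold` is found where the formula was created.
above <- eval_in_data(~ y > threshold, df)
print(above)
```

The caller writes the formula explicitly, so it is visible at the call site that the expression will be evaluated later and in a different context.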

Compose simple functions with the pipe

No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.

— Hal Abelson

A powerful strategy for solving complex problems is to combine many simple pieces. Each piece should be easily understood in isolation, and have a standard way to combine with other pieces. In R, this strategy plays out by composing single functions with the pipe. The pipe, %>%, is a common composition tool that works across all packages.
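A small sketch of this strategy, assuming the magrittr package (which provides `%>%`) is installed. The three helper functions and the data are hypothetical; each helper does one thing, takes the data as its first argument, and returns a data frame.

```r
library(magrittr)  # provides %>%; assumed to be installed

# Hypothetical single-purpose helpers: data first, data frame in and out.
keep_complete <- function(data) {
  data[stats::complete.cases(data), , drop = FALSE]
}
add_bmi <- function(data) {
  data$bmi <- data$weight / data$height^2
  data
}
sort_by <- function(data, column) {
  data[order(data[[column]]), , drop = FALSE]
}

people <- data.frame(
  height = c(1.60, 1.80, NA, 1.70),
  weight = c(64, 72.9, 70, 57.8)
)

# Read left to right: drop incomplete rows, add a column, then sort.
result <- people %>%
  keep_complete() %>%
  add_bmi() %>%
  sort_by("bmi")
print(result)
```

Each helper can be understood and tested on its own, and the pipe is the only glue needed to combine them.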

Some things to bear in mind when writing functions:

An advantage of a pipeable API is that it is not compulsory: if you do not like using the pipe, you can compose functions in whatever way you prefer. Compare this to an operator-overloading approach (such as the + in ggplot2), or an object composition approach (such as the ... approach in httr).
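As a quick illustration that the pipe is optional, the sketch below writes the same computation three ways. It assumes magrittr is installed for the piped version; the other two versions use only base R.

```r
library(magrittr)  # provides %>%; assumed to be installed

x <- c(1, 4, 9, 16)

# With the pipe.
piped <- x %>% sqrt() %>% sum() %>% round(1)

# With nested calls.
nested <- round(sum(sqrt(x)), 1)

# With intermediate variables.
roots <- sqrt(x)
total <- sum(roots)
stepwise <- round(total, 1)

print(c(piped = piped, nested = nested, stepwise = stepwise))
```

All three forms call the same functions with the same arguments, so the choice between them is purely a matter of readability.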

Embrace functional programming

R is a functional programming language; embrace it, don’t fight it. If you’re familiar with an object-oriented language like Python or C#, this is going to take some adjustment. But in the long run you will be much better off working with the language rather than fighting it.
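Two habits that follow from this, sketched in base R with hypothetical data and a hypothetical `add_total()` function: passing a function as an argument instead of writing a loop, and returning a modified copy instead of changing an object in place.

```r
# Functions are values: pass `mean` to vapply() rather than writing a for loop.
columns <- list(a = c(1, 2, 3), b = c(10, 20, 30))
col_means <- vapply(columns, mean, numeric(1))
print(col_means)

# Copy-on-modify: the function changes its own copy and returns it;
# the caller's object is left untouched.
add_total <- function(data) {
  data$total <- data$a + data$b
  data
}

original <- data.frame(a = 1:2, b = 3:4)
updated <- add_total(original)

print(names(original))
print(names(updated))
```

Because `add_total()` has no side effects on `original`, its result depends only on its input, which makes it easy to reason about and to compose with other functions.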

Generally, this means you should favour:

Design for humans

Programs must be written for people to read, and only incidentally for machines to execute.

— Hal Abelson

Design your API primarily so that it is easy for humans to use. Computer efficiency is a secondary concern because the bottleneck in most data analysis is thinking time, not computing time.