Prototypes and sizes

Rather than using class() and length(), vctrs has notions of prototype (vec_ptype()) and size (vec_size()). This vignette motivates why these alternatives are necessary, and connects their definitions to type coercion and the recycling rules.

Size and prototype are motivated by thinking about the optimal behaviour for c() and rbind(), particularly inspired by data frames with columns that are matrices or data frames.

library(vctrs)

Prototype

The idea of a prototype is to capture the metadata associated with a vector, without capturing any data. Unfortunately, the class() of an object is inadequate for this purpose:
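A minimal base R sketch of the problem (the objects below are illustrative, not taken from the original text): two vectors can share a class() and still differ in metadata that matters when they are combined.

```r
# Two factors: same class, different levels
f1 <- factor("a", levels = c("a", "b"))
f2 <- factor("x", levels = c("x", "y", "z"))
same_class_fct <- identical(class(f1), class(f2))    # class() cannot tell them apart
same_levels    <- identical(levels(f1), levels(f2))  # but the levels differ

# Two date-times: same class, different time zones
t1 <- as.POSIXct("2020-01-01 12:00:00", tz = "UTC")
t2 <- as.POSIXct("2020-01-01 12:00:00", tz = "America/Chicago")
same_class_dttm <- identical(class(t1), class(t2))
same_tz         <- identical(attr(t1, "tzone"), attr(t2, "tzone"))

c(same_class_fct, same_levels, same_class_dttm, same_tz)
```

In both pairs the class matches while the levels or time zone do not, so class() alone cannot say what the combined vector should look like.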

Instead, vctrs takes advantage of R’s vectorised nature and uses a prototype: a 0-observation slice of the vector (this is basically x[0], but with some subtleties we’ll come back to later). This is a miniature version of the vector that contains all of the attributes but none of the data.
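As a rough illustration using only base R (so it shows the x[0] approximation, not the vctrs subtleties), a zero-length slice keeps the attributes that describe the vector:

```r
# A factor with an unused level
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
p <- f[0]
length(p)   # no data left
levels(p)   # all three levels are still recorded
class(p)

# A date-time keeps its class and time zone
when <- as.POSIXct("2020-01-01 12:00:00", tz = "UTC")
when0 <- when[0]
length(when0)
attr(when0, "tzone")
```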

Conveniently, you can create many prototypes using existing base functions (e.g. double(), factor(levels = c("a", "b"))). vctrs provides a few helpers (e.g. new_date(), new_datetime(), new_duration()) where the equivalents in base R are missing.

Base prototypes

vec_type() creates a prototype from an existing object. However, many base vectors have uninformative printing methods for 0-length subsets, so vctrs also provides vec_ptype(), which prints the prototype in a friendly way (and returns nothing).

Using vec_ptype() allows us to see the prototypes of base R classes:
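The sketch below assumes a recent vctrs release, where the function this section calls vec_type() is exported as vec_ptype() and the printing helper is exported as vec_ptype_show(). With an older release, substitute the names used in the text.

```r
library(vctrs)

# Assumption: recent vctrs naming (vec_ptype() returns, vec_ptype_show() prints)
p_int  <- vec_ptype(1:10)
p_date <- vec_ptype(Sys.Date())
p_fct  <- vec_ptype(factor(c("a", "b")))
p_df   <- vec_ptype(data.frame(x = 1:3, y = c("a", "b", "c")))

# Friendly printing of the same prototypes
vec_ptype_show(1:10)
vec_ptype_show(Sys.Date())
vec_ptype_show(factor(c("a", "b")))
vec_ptype_show(data.frame(x = 1:3, y = c("a", "b", "c")))
```

Each returned prototype has size zero but keeps its class, levels, or column structure.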

Coercing to common type

It’s often important to combine vectors with multiple types. vctrs provides a consistent set of rules for coercion, via vec_type_common(). vec_type_common() possesses the following invariants: class(vec_type_common(x, y)) equals class(vec_type_common(y, x)); class(vec_type_common(x, vec_type_common(y, z))) equals class(vec_type_common(vec_type_common(x, y), z)); and vec_type_common(x, NULL) equals vec_type(x).

In other words, vec_type_common() is both commutative and associative (with respect to class) and has an identity element, NULL; that is, it forms a commutative monoid. This means the underlying implementation is quite simple: we can find the common type of any number of objects by progressively finding the common type of pairs of objects.
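These properties can be spot-checked on a few inputs. The sketch assumes a recent vctrs release, where vec_type() and vec_type_common() are exported as vec_ptype() and vec_ptype_common(); the values x, y and z are arbitrary examples.

```r
library(vctrs)

x <- TRUE
y <- 1L
z <- 2.5

# Commutative (with respect to class)
commutative <- identical(
  class(vec_ptype_common(x, y)),
  class(vec_ptype_common(y, x))
)

# Associative (with respect to class)
associative <- identical(
  class(vec_ptype_common(x, vec_ptype_common(y, z))),
  class(vec_ptype_common(vec_ptype_common(x, y), z))
)

# NULL is the identity element
has_identity <- identical(vec_ptype_common(x, NULL), vec_ptype(x))

# Folding pairwise gives the same answer as passing everything at once
pairwise    <- Reduce(function(a, b) vec_ptype_common(a, b), list(x, y, z))
all_at_once <- vec_ptype_common(x, y, z)

c(commutative, associative, has_identity, identical(pairwise, all_at_once))
```

This checks only one combination of types; it illustrates the invariants rather than proving them.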

As with vec_type(), the easiest way to explore vec_type_common() is with vec_ptype(): when given multiple inputs, it will print their common prototype. (In other words: program with vec_type_common() but play with vec_ptype().)

Casting to specified type

vec_type_common() finds the common type of a set of vectors. Typically, however, what you want is a set of vectors coerced to that common type. That’s the job of vec_cast_common():
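A short sketch with an integer and a double input (the inputs are illustrative): the result is a list with one element per input, each converted to the common type.

```r
library(vctrs)

# Integer and double inputs share the common type double
out <- vec_cast_common(1:2, 3.5)
str(out)

first_is_double  <- is.double(out[[1]])
second_is_double <- is.double(out[[2]])
```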

Alternatively, you can cast to a specific prototype using vec_cast():

If a cast is possible in general (e.g. double -> integer), but information is lost for a specific input (e.g. 1.5 -> 1), it will generate a warning.
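The following sketch casts between double and integer. Depending on the vctrs version, the lossy case is signalled as a warning (as described here) or as an error, so the example catches either condition rather than assuming one.

```r
library(vctrs)

# Lossless casts succeed quietly
as_int <- vec_cast(c(1, 2), integer())
as_dbl <- vec_cast(1:2, double())

# 1.5 has no exact integer representation, so the cast is lossy.
# Catch a warning or an error, whichever this vctrs version signals.
lossy <- tryCatch(
  vec_cast(1.5, integer()),
  warning = function(w) w,
  error   = function(e) e
)
inherits(lossy, "condition")
```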

The set of casts is more permissive than the set of coercions and is summarised in the diagram below. Coercions are shown by arrows; possible casts are shown with circles.

Summary of vctrs casting rules

Size

vec_size() was motivated by the need to have an invariant that describes the number of “observations” in a data structure. This is particularly important for data frames as it’s useful to have some function such that f(data.frame(x)) equals f(x). No base function has this property:
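A base R sketch of how the obvious candidates behave (x and df are illustrative objects):

```r
x  <- 1:10
df <- data.frame(x = x)

# length() counts columns for a data frame, not observations
len_x  <- length(x)   # 10
len_df <- length(df)  # 1

# nrow() is NULL for a plain vector
nrow_x  <- nrow(x)    # NULL
nrow_df <- nrow(df)   # 10

# NROW() agrees for x and df, but also returns a value for non-vectors
NROW_x    <- NROW(x)     # 10
NROW_df   <- NROW(df)    # 10
NROW_mean <- NROW(mean)  # 1, although a function cannot be a column

# data.frame() itself rejects a function
fails <- inherits(try(data.frame(mean), silent = TRUE), "try-error")
```

NROW() comes closest, but because it returns a value for any object it cannot serve as a definition of "number of observations" for vectors only.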

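We define vec_size() as follows: it is the length of 1d vectors; it is the number of rows of data frames, matrices, and arrays; and it throws an error for non-vectors.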

Given vec_size(), we can give a precise definition of a data frame: a data frame is a list of vectors where every vector has the same size. This has the desirable property of trivially supporting matrix and data frame columns.
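To make this concrete, the sketch below builds a data frame with an ordinary column, a matrix column and a data frame column (all names are illustrative) and checks that every column has the same size as the data frame itself.

```r
library(vctrs)

x <- 1:10
m <- matrix(1:20, nrow = 10)

df <- data.frame(x = x)
df$m <- m                               # matrix column
df$inner <- data.frame(a = x, b = x)    # data frame column

size_df      <- vec_size(df)
column_sizes <- vapply(df, vec_size, integer(1))
column_sizes

# Non-vectors have no size: vec_size() signals an error
non_vector <- tryCatch(vec_size(mean), error = function(e) e)
inherits(non_vector, "error")
```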

Slicing

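vec_slice() is to vec_size() as [ is to length(); i.e. it allows you to select observations, regardless of the dimensionality of the underlying object. vec_slice(x, i) is equivalent to x[i] when x is a vector, x[i, , drop = FALSE] when x is a data frame, and x[i, , , drop = FALSE] when x is a 3d array.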

vec_slice(data.frame(x), i) equals data.frame(vec_slice(x, i)) (modulo variable and row names).
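A brief sketch comparing vec_slice() with the corresponding base R subsetting on a vector, a matrix and a data frame (illustrative objects):

```r
library(vctrs)

x <- c(10, 20, 30, 40)
m <- matrix(1:8, nrow = 4)
i <- 2:3

sliced_x <- vec_slice(x, i)   # selects elements, like x[i]
sliced_m <- vec_slice(m, i)   # selects rows, like m[i, , drop = FALSE]

# Slicing commutes with wrapping in a data frame (ignoring names)
df <- data.frame(x = x)
sliced_df <- vec_slice(df, i)
invariant <- identical(sliced_df$x, sliced_x)
```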

Prototypes are generated with vec_slice(x, 0L); given a prototype, you can generate a vector of given size (filled with NAs) with vec_na().
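A sketch of the round trip from vector to prototype and back. It assumes a recent vctrs release, where the function called vec_na() here is exported as vec_init(); with an older release, use vec_na() instead.

```r
library(vctrs)

f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# Zero-size slice: no data, but the levels survive
p <- vec_slice(f, 0L)
vec_size(p)
levels(p)

# Assumption: vec_init() is the current name for vec_na()
filled <- vec_init(p, 2L)
filled

blank_dbl <- vec_init(double(), 3L)
blank_dbl
```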

Common sizes: recycling rules

Closely related to the definition of size are the recycling rules. The recycling rules determine the size of the output when two vectors of different sizes are combined. In vctrs, the recycling rules are encoded in vec_size_common(), which gives the common size of a set of vectors:

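vctrs obeys a stricter set of recycling rules than base R, only recycling under two circumstances: vectors of size 1 are recycled to the size of any other vector; otherwise, all vectors must have the same size.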

All other size combinations will generate an error. This strictness prevents common mistakes like dest == c("IAH", "HOU"), at the cost of occasionally requiring an explicit call to rep().
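The sketch below exercises vec_size_common() on a few size combinations and contrasts it with the base R mistake mentioned above. The dest vector is made-up data for illustration.

```r
library(vctrs)

size_a <- vec_size_common(1:3, 1)    # size 1 is recycled to 3
size_b <- vec_size_common(1:3, 4:6)  # equal sizes are kept

# Sizes 3 and 2 are incompatible, so this signals an error
err <- tryCatch(vec_size_common(1:3, 1:2), error = function(e) e)

# The base R mistake: == recycles c("IAH", "HOU") silently
dest <- c("HOU", "IAH", "DFW", "HOU")
by_equals <- dest == c("IAH", "HOU")    # position-wise comparison
by_in     <- dest %in% c("IAH", "HOU")  # what was probably intended

# vctrs would refuse to combine sizes 4 and 2
err_dest <- tryCatch(
  vec_size_common(dest, c("IAH", "HOU")),
  error = function(e) e
)
```

Here by_equals and by_in disagree, and base R gives no warning because 4 is a multiple of 2.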

Summary of vctrs recycling rules. X indicates an error

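You can apply the recycling rules in two ways: if you have a vector and a desired size, use vec_recycle(); if you have multiple vectors and want to recycle them to the same size, use vec_recycle_common().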
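A minimal sketch using vec_recycle() for a single vector and a target size, and vec_recycle_common() for several vectors at once (inputs are illustrative):

```r
library(vctrs)

# One vector, one target size
one <- vec_recycle(1, 3L)
one

# Several vectors recycled to their common size; returns a list
both <- vec_recycle_common(1:3, 10)
str(both)

# A size-2 vector cannot be recycled to size 4 under the vctrs rules
err <- tryCatch(vec_recycle(1:2, 4L), error = function(e) e)
inherits(err, "error")
```

Where base R would silently repeat 1:2 twice, the strict rules require an explicit rep(1:2, 2).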

Appendix: recycling in base R

The recycling rules in base R are described in The R Language Definition but are not implemented in a single function, and thus are not applied consistently. Here I give a brief overview of their most common realisation, as well as showing some of the exceptions.

Generally, in base R, when a pair of vectors is not the same length, the shorter vector is recycled to the same length as the longer:

rep(1, 6) + 1
#> [1] 2 2 2 2 2 2
rep(1, 6) + 1:2
#> [1] 2 3 2 3 2 3
rep(1, 6) + 1:3
#> [1] 2 3 4 2 3 4

If the length of the longer vector is not an integer multiple of the length of the shorter, you usually get a warning:

invisible(pmax(1:2, 1:3))
#> Warning in pmax(1:2, 1:3): an argument will be fractionally recycled
invisible(1:2 + 1:3)
#> Warning in 1:2 + 1:3: longer object length is not a multiple of shorter
#> object length
invisible(cbind(1:2, 1:3))
#> Warning in cbind(1:2, 1:3): number of rows of result is not a multiple of
#> vector length (arg 1)

But some functions recycle silently:

length(atan2(1:3, 1:2))
#> [1] 3
length(paste(1:3, 1:2))
#> [1] 3
length(ifelse(1:3, 1:2, 1:2))
#> [1] 3

And data.frame() throws an error:

data.frame(1:2, 1:3)
#> Error in data.frame(1:2, 1:3): arguments imply differing number of rows: 2, 3

The R language definition states that “any arithmetic operation involving a zero-length vector has a zero-length result”. But outside of arithmetic, this rule is not consistently followed:

# length-0 output
1:2 + integer()
#> integer(0)
atan2(1:2, integer())
#> numeric(0)
pmax(1:2, integer())
#> integer(0)

# dropped
cbind(1:2, integer())
#>      [,1]
#> [1,]    1
#> [2,]    2

# recycled to length of first
ifelse(rep(TRUE, 4), integer(), character())
#> [1] NA NA NA NA

# preserved-ish
paste(1:2, integer())
#> [1] "1 " "2 "

# Errors
data.frame(1:2, integer())
#> Error in data.frame(1:2, integer()): arguments imply differing number of rows: 2, 0