Extending tibble

Kirill Müller, Hadley Wickham

2018-01-22

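To extend the tibble package for new types of columnar data, you need to understand how printing works. The presentation of a column in a tibble is powered by four S3 generics:

  * type_sum() determines what goes into the column header.
  * pillar_shaft() determines what goes into the body of the column.
  * is_vector_s3() and obj_sum() are used when rendering list columns.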

If you have written an S3 or S4 class that can be used as a column, you can override these generics to make sure your data prints well in a tibble. To start, you must import the pillar package that powers the printing of tibbles. Either add pillar to the Imports: section of your DESCRIPTION, or simply call:

usethis::use_package("pillar")

This short vignette assumes a package that implements an S3 class "latlon" and uses roxygen2 to create documentation and the NAMESPACE file. For this vignette to work we need to attach pillar:

library(pillar)

Prerequisites

We define a class "latlon" that encodes geographic coordinates in a complex number. For simplicity, the values are printed as degrees and minutes only.

#' @export
latlon <- function(lat, lon) {
  as_latlon(complex(real = lon, imaginary = lat))
}

#' @export
as_latlon <- function(x) {
  structure(x, class = "latlon")
}

#' @export
c.latlon <- function(x, ...) {
  as_latlon(NextMethod())
}

#' @export
`[.latlon` <- function(x, i) {
  as_latlon(NextMethod())
}

#' @export
format.latlon <- function(x, ..., formatter = deg_min) {
  x_valid <- which(!is.na(x))

  lat <- unclass(Im(x[x_valid]))
  lon <- unclass(Re(x[x_valid]))

  ret <- rep("<NA>", length(x))
  ret[x_valid] <- paste(
    formatter(lat, c("N", "S")),
    formatter(lon, c("E", "W"))
  )
  format(ret, justify = "right")
}

deg_min <- function(x, pm) {
  sign <- sign(x)
  x <- abs(x)
  deg <- trunc(x)
  x <- x - deg
  min <- round(x * 60)

  ret <- sprintf("%d°%.2d'%s", deg, min, pm[ifelse(sign >= 0, 1, 2)])
  format(ret, justify = "right")
}

#' @export
print.latlon <- function(x, ...) {
  cat(format(x), sep = "\n")
  invisible(x)
}

latlon(32.7102978, -117.1704058)
## 32°43'N 117°10'W
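To make the storage convention concrete, the following base R sketch decomposes the example coordinate by hand. It is self-contained and does not use the methods defined above; split_deg_min() and the variable names are only illustrative.

```r
# A latlon value stores longitude in the real part and latitude in the
# imaginary part of a complex number.
x <- complex(real = -117.1704058, imaginary = 32.7102978)

lat <- Im(x)
lon <- Re(x)

# Split a coordinate into whole degrees and rounded minutes,
# mirroring the arithmetic in deg_min().
split_deg_min <- function(value) {
  value <- abs(value)
  deg <- trunc(value)
  c(deg = deg, min = round((value - deg) * 60))
}

lat_parts <- split_deg_min(lat)
lon_parts <- split_deg_min(lon)

# The hemisphere letter follows from the sign of the coordinate.
lat_hemi <- if (lat >= 0) "N" else "S"
lon_hemi <- if (lon >= 0) "E" else "W"
```

The parts come out as 32 degrees and 43 minutes north, and 117 degrees and 10 minutes west, which matches the printed value.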

More methods are needed to make this class fully compatible with data frames; see, for example, the hms package for a more complete implementation.

Using in a tibble

Columns of this class can be used in a tibble right away, but the output will be less than ideal:

library(tibble)
data <- tibble(
  venue = "rstudio::conf",
  year  = 2017:2019,
  loc   = latlon(
    c(28.3411783, 32.7102978, NA),
    c(-81.5480348, -117.1704058, NA)
  ),
  paths = list(
    loc[1],
    c(loc[1], loc[2]),
    loc[2]
  )
)

data
## # A tibble: 3 x 4
##   venue          year loc                      paths       
##   <chr>         <int> <S3: latlon>             <list>      
## 1 rstudio::conf  2017 -81.5480348+28.3411783i  <S3: latlon>
## 2 rstudio::conf  2018 -117.1704058+32.7102978i <S3: latlon>
## 3 rstudio::conf  2019 <NA>                     <S3: latlon>

(The paths column is a list that contains arbitrary data, in our case latlon vectors. A list column is a powerful way to attach hierarchical or unstructured data to an observation in a data frame.)

The output has three main problems:

  1. The column type of the loc column is displayed as <S3: latlon>. This default formatting works reasonably well for any kind of object, but the generated output may be too wide and waste precious space when displaying the tibble.
  2. The values in the loc column are formatted as complex numbers (the underlying storage), without using the format() method we have defined. This is by design.
  3. The cells in the paths column are also displayed as <S3: latlon>.

In the remainder of this vignette, I’ll show how to fix these problems, and also how to implement rendering that adapts to the available width.

Fixing the data type

To display <geo> as the data type, we need to provide a type_sum() method for our class. This method should return a string that can be used in a column header. For your own classes, strive for an evocative abbreviation that’s under 6 characters.

#' @importFrom pillar type_sum
#' @export
type_sum.latlon <- function(x) {
  "geo"
}

Because the value shown there doesn’t depend on the data, we just return a constant. (For date-times, the column info will eventually contain information about the timezone, see #53.)

data
## # A tibble: 3 x 4
##   venue          year loc                      paths 
##   <chr>         <int> <geo>                    <list>
## 1 rstudio::conf  2017 -81.5480348+28.3411783i  <geo> 
## 2 rstudio::conf  2018 -117.1704058+32.7102978i <geo> 
## 3 rstudio::conf  2019 <NA>                     <geo>
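The method can also be checked without printing a whole tibble. The sketch below assumes that pillar is installed and that the method is defined at the top level of a script rather than in a package, so the roxygen2 tags are omitted. The coordinate values are arbitrary.

```r
# Guarded so that the snippet does nothing when pillar is not installed.
if (requireNamespace("pillar", quietly = TRUE)) {
  type_sum.latlon <- function(x) "geo"

  x <- structure(complex(real = -81.5, imaginary = 28.3), class = "latlon")

  # Calling the generic directly dispatches on the class attribute.
  header <- pillar::type_sum(x)

  # The abbreviation should stay under 6 characters.
  stopifnot(identical(header, "geo"), nchar(header) < 6)
}
```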

Rendering the value

To use our format method for rendering, we implement the pillar_shaft() method for our class. (A pillar is mainly a shaft (decorated with an ornament), with a capital above and a base below. Multiple pillars form a colonnade, which can be stacked in multiple tiers. This is the motivation behind the names in our API.)

#' @importFrom pillar pillar_shaft
#' @export
pillar_shaft.latlon <- function(x, ...) {
  out <- format(x)
  out[is.na(x)] <- NA
  pillar::new_pillar_shaft_simple(out, align = "right")
}

The simplest variant calls our format() method; everything else is handled by pillar, in particular by the new_pillar_shaft_simple() helper. Note how the align argument affects the alignment of NA values and of the column name and type.

data
## # A tibble: 3 x 4
##   venue          year              loc paths 
##   <chr>         <int>            <geo> <list>
## 1 rstudio::conf  2017 28°20'N  81°33'W <geo> 
## 2 rstudio::conf  2018 32°43'N 117°10'W <geo> 
## 3 rstudio::conf  2019               NA <geo>

We could also use left alignment and indent only the NA values:

#' @importFrom pillar pillar_shaft
#' @export
pillar_shaft.latlon <- function(x, ...) {
  out <- format(x)
  out[is.na(x)] <- NA
  pillar::new_pillar_shaft_simple(out, align = "left", na_indent = 5)
}

data
## # A tibble: 3 x 4
##   venue          year loc              paths 
##   <chr>         <int> <geo>            <list>
## 1 rstudio::conf  2017 28°20'N  81°33'W <geo> 
## 2 rstudio::conf  2018 32°43'N 117°10'W <geo> 
## 3 rstudio::conf  2019      NA          <geo>
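The layout that na_indent produces can be imitated in base R. This is only an illustration of the resulting cell, not how pillar implements it; the width of 16 is the width of the formatted coordinates shown above.

```r
width <- 16
na_indent <- 5

# Left-aligned cell: indent the NA marker, then pad on the right.
na_cell_left <- format(paste0(strrep(" ", na_indent), "NA"), width = width)

# Right-aligned cell for comparison: pad on the left instead.
na_cell_right <- formatC("NA", width = width)
```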

Adaptive rendering

If there is not enough space to render the values, the formatted values are truncated with an ellipsis. This doesn’t currently apply to our class, because we haven’t specified a minimum width for our values:

print(data, width = 35)
## # A tibble: 3 x 4
##   venue      year loc             
##   <chr>     <int> <geo>           
## 1 rstudio:…  2017 28°20'N  81°33'W
## 2 rstudio:…  2018 32°43'N 117°10'W
## 3 rstudio:…  2019      NA         
## # ... with 1 more variable:
## #   paths <list>

If we specify a minimum width when constructing the shaft, the loc column will be truncated:

#' @importFrom pillar pillar_shaft
#' @export
pillar_shaft.latlon <- function(x, ...) {
  out <- format(x)
  out[is.na(x)] <- NA
  pillar::new_pillar_shaft_simple(out, align = "right", min_width = 10)
}

print(data, width = 35)
## # A tibble: 3 x 4
##   venue     year         loc paths
##   <chr>    <int>       <geo> <lis>
## 1 rstudio…  2017 28°20'N  8… <geo>
## 2 rstudio…  2018 32°43'N 11… <geo>
## 3 rstudio…  2019          NA <geo>

This may be useful for character data, but for lat-lon data we may prefer to show full degrees and remove the minutes if the available space is not enough to show accurate values. A more sophisticated implementation of the pillar_shaft() method is required to achieve this:

#' @importFrom pillar pillar_shaft
#' @export
pillar_shaft.latlon <- function(x, ...) {
  deg <- format(x, formatter = deg)
  deg[is.na(x)] <- pillar::style_na("NA")
  deg_min <- format(x)
  deg_min[is.na(x)] <- pillar::style_na("NA")
  pillar::new_pillar_shaft(
    list(deg = deg, deg_min = deg_min),
    width = pillar::get_max_extent(deg_min),
    min_width = pillar::get_max_extent(deg),
    subclass = "pillar_shaft_latlon"
  )
}

Here, pillar_shaft() returns an object of the "pillar_shaft_latlon" class created by the generic new_pillar_shaft() constructor. This object contains the necessary information to render the values, and also minimum and maximum width values. For simplicity, both formattings are pre-rendered, and the minimum and maximum widths are computed from there. Note that we also need to take care of NA values explicitly. (get_max_extent() is a helper that computes the maximum display width occupied by the values in a character vector.)

For completeness, the code that implements the degree-only formatting looks like this:

deg <- function(x, pm) {
  sign <- sign(x)
  x <- abs(x)
  deg <- round(x)

  ret <- sprintf("%d°%s", deg, pm[ifelse(sign >= 0, 1, 2)])
  format(ret, justify = "right")
}
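Note that deg() rounds to the nearest degree, whereas deg_min() truncates to whole degrees and rounds the minutes. A small base R check on the coordinates used in this vignette shows the difference:

```r
lat <- c(28.3411783, 32.7102978)
lon <- abs(c(-81.5480348, -117.1704058))

# deg(): nearest whole degree
lat_rounded <- round(lat)
lon_rounded <- round(lon)

# deg_min(): whole degrees, remainder expressed in minutes
lon_deg <- trunc(lon)
lon_min <- round((lon - lon_deg) * 60)
```

This is why the first longitude appears as 81°33'W in the wide rendering and as 82°W in the narrow one.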

All that’s left to do is to implement a format() method for our new "pillar_shaft_latlon" class. This method will be called with a width argument, which then determines which of the formattings to choose:

#' @export
format.pillar_shaft_latlon <- function(x, width, ...) {
  if (all(crayon::col_nchar(x$deg_min) <= width)) {
    ornament <- x$deg_min
  } else {
    ornament <- x$deg
  }

  pillar::new_ornament(ornament)
}

data
## # A tibble: 3 x 4
##   venue          year loc              paths 
##   <chr>         <int> <geo>            <list>
## 1 rstudio::conf  2017 28°20'N  81°33'W <geo> 
## 2 rstudio::conf  2018 32°43'N 117°10'W <geo> 
## 3 rstudio::conf  2019 NA               <geo>
print(data, width = 35)
## # A tibble: 3 x 4
##   venue     year loc         paths
##   <chr>    <int> <geo>       <lis>
## 1 rstudio…  2017 28°N  82°W  <geo>
## 2 rstudio…  2018 33°N 117°W  <geo>
## 3 rstudio…  2019 NA          <geo>
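The selection logic does not depend on pillar and can be tried in isolation. The sketch below uses plain strings without ANSI styling, so nchar() stands in for crayon::col_nchar(). pick_rendering() is a hypothetical helper, not part of any package.

```r
# "\u00b0" is the degree sign, written as an escape so the snippet
# does not depend on the encoding of the source file.
deg_min <- c("28\u00b020'N  81\u00b033'W", "32\u00b043'N 117\u00b010'W")
deg     <- c("28\u00b0N  82\u00b0W", "33\u00b0N 117\u00b0W")

# Return the detailed rendering if every value fits, else the compact one.
pick_rendering <- function(wide, narrow, width) {
  if (all(nchar(wide, type = "chars") <= width)) wide else narrow
}

natural_width <- max(nchar(deg_min, type = "chars"))
minimum_width <- max(nchar(deg, type = "chars"))

wide_result   <- pick_rendering(deg_min, deg, width = natural_width)
narrow_result <- pick_rendering(deg_min, deg, width = 12)
```

With the full 16 characters available the degree-and-minute strings are returned; at a width of 12 the function falls back to the 10-character degree-only strings.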

Adding color

Both new_pillar_shaft_simple() and new_ornament() accept ANSI escape codes for coloring, emphasis, or other ways of highlighting text on terminals that support it. Some formattings are predefined, e.g. style_subtle() displays text in a light gray. For default data types, this style is used for insignificant digits. We’ll be formatting the degree and minute signs in a subtle style, because they serve only as separators. You can also use the crayon package to add custom formattings to your output.

#' @importFrom pillar pillar_shaft
#' @export
pillar_shaft.latlon <- function(x, ...) {
  out <- format(x, formatter = deg_min_color)
  out[is.na(x)] <- NA
  pillar::new_pillar_shaft_simple(out, align = "left", na_indent = 5)
}

deg_min_color <- function(x, pm) {
  sign <- sign(x)
  x <- abs(x)
  deg <- trunc(x)
  x <- x - deg
  # Minutes: the fractional part of the degrees, scaled to 60ths
  min <- round(x * 60)
  ret <- sprintf(
    "%d%s%.2d%s%s",
    deg,
    pillar::style_subtle("°"),  # separators are shown in a subtle style
    min,
    pillar::style_subtle("'"),
    pm[ifelse(sign >= 0, 1, 2)]
  )
  ret[is.na(x)] <- ""
  format(ret, justify = "right")
}

data
## # A tibble: 3 x 4
##   venue          year loc              paths 
##   <chr>         <int> <geo>            <list>
## 1 rstudio::conf  2017 28°20'N  81°33'W <geo> 
## 2 rstudio::conf  2018 32°43'N 117°10'W <geo> 
## 3 rstudio::conf  2019      NA          <geo>

Currently, ANSI escapes are not rendered in vignettes, so the display here isn’t much different from earlier examples. This may change in the future.
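Styled strings are longer than they look, which is why the format() method for the shaft measures them with crayon::col_nchar() rather than nchar(). The base R sketch below illustrates the difference. subtle() and strip_ansi() are hypothetical stand-ins, and the escape codes are the ones that appear in the recorded test output later in this vignette.

```r
# Hypothetical stand-in for pillar::style_subtle(): wrap text in
# escape codes that switch to a gray foreground and back.
subtle <- function(x) paste0("\033[90m", x, "\033[39m")

# "\u00b0" is the degree sign.
styled <- paste0("28", subtle("\u00b0"), "20", subtle("'"), "N")

# Remove color escape sequences to recover the visible text.
strip_ansi <- function(x) gsub("\033\\[[0-9;]*m", "", x)

plain <- strip_ansi(styled)

raw_length     <- nchar(styled)  # counts the escape codes as well
visible_length <- nchar(plain)   # counts only what the reader sees
```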

Fixing list columns

To tweak the output in the paths column, we simply need to indicate that our class is an S3 vector:

#' @importFrom pillar is_vector_s3
#' @export
is_vector_s3.latlon <- function(x) TRUE

data
## # A tibble: 3 x 4
##   venue          year loc              paths    
##   <chr>         <int> <geo>            <list>   
## 1 rstudio::conf  2017 28°20'N  81°33'W <geo [1]>
## 2 rstudio::conf  2018 32°43'N 117°10'W <geo [2]>
## 3 rstudio::conf  2019      NA          <geo [1]>

This is picked up by the default implementation of obj_sum(), which then shows the type and the length in brackets. If your object is built on top of an atomic vector, the default will be adequate. You will, however, need to provide an obj_sum() method for your class if your object is vectorised and built on top of a list.

An example of an object of this type in base R is POSIXlt: it is a list of date-time components (11 in the output below; the exact number may depend on the R version and platform).

x <- as.POSIXlt(Sys.time() + c(0, 60, 3600)) 
str(unclass(x))
## List of 11
##  $ sec   : num [1:3] 52 52 52
##  $ min   : int [1:3] 58 59 58
##  $ hour  : int [1:3] 0 0 1
##  $ mday  : int [1:3] 22 22 22
##  $ mon   : int [1:3] 0 0 0
##  $ year  : int [1:3] 118 118 118
##  $ wday  : int [1:3] 1 1 1
##  $ yday  : int [1:3] 21 21 21
##  $ isdst : int [1:3] 0 0 0
##  $ zone  : chr [1:3] "CET" "CET" "CET"
##  $ gmtoff: int [1:3] 3600 3600 3600
##  - attr(*, "tzone")= chr [1:3] "" "CET" "CEST"

But it pretends to be a vector with 3 elements:

x
## [1] "2018-01-22 00:58:52 CET" "2018-01-22 00:59:52 CET"
## [3] "2018-01-22 01:58:52 CET"
length(x)
## [1] 3
str(x)
##  POSIXlt[1:3], format: "2018-01-22 00:58:52" "2018-01-22 00:59:52" ...

So we need to define a method that returns a character vector the same length as x:

#' @importFrom pillar obj_sum
#' @export
obj_sum.POSIXlt <- function(x) {
  rep("POSIXlt", length(x))
}
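The mismatch between the list storage and the apparent length can be verified in base R. The sketch below does not call pillar; summarise_posixlt() is a hypothetical stand-in with the same body as the method above, and the fixed UTC timestamp is chosen only to make the example reproducible.

```r
start <- as.POSIXct("2018-01-22 00:58:52", tz = "UTC")
x <- as.POSIXlt(start + c(0, 60, 3600))

# Underneath, the object is a list with one component per field ...
stored_as_list <- is.list(unclass(x))
field_names <- names(unclass(x))

# ... but length() reports the number of date-times.
n <- length(x)

# One label per element, as required for a list-based vector class.
summarise_posixlt <- function(x) rep("POSIXlt", length(x))
labels <- summarise_posixlt(x)
```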

Testing

If you want to test the output of your code, you can compare it with a known state recorded in a text file. For this, pillar offers the expect_known_display() expectation, which requires the testthat package. Make sure that the output is generated only by your package to avoid inconsistencies when external code is updated. Here, this means that you test only the shaft portion of the pillar, and not the entire pillar or even a tibble that contains a column with your data type!

Attach testthat before running the examples:

library(testthat)

The code below will compare the output of pillar_shaft(data$loc) with known output stored in the latlon.txt file. The first run warns because the file doesn’t exist yet.

test_that("latlon pillar matches known output", {
  pillar::expect_known_display(
    pillar_shaft(data$loc),
    file = "latlon.txt"
  )
})

From the second run on, the printing will be compared with the file:

test_that("latlon pillar matches known output", {
  pillar::expect_known_display(
    pillar_shaft(data$loc),
    file = "latlon.txt"
  )
})
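The record-then-compare behaviour can be illustrated with a few lines of base R. This is a simplified sketch of the general idea, not the implementation used by expect_known_display(); compare_with_known() and the temporary file are hypothetical.

```r
# First call: no file yet, so record the output and report that fact.
# Later calls: compare the new output with the recorded lines.
compare_with_known <- function(lines, file) {
  if (!file.exists(file)) {
    writeLines(lines, file)
    return("recorded")
  }
  if (identical(readLines(file), lines)) "unchanged" else "changed"
}

known_file <- tempfile(fileext = ".txt")

# ASCII stand-in for the rendered output of a shaft.
output <- c("28d20'N  81d33'W", "32d43'N 117d10'W")

first  <- compare_with_known(output, known_file)
second <- compare_with_known(output, known_file)
third  <- compare_with_known(toupper(output), known_file)
```

The first call records the output, the second finds it unchanged, and the third detects the modified output.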

However, if we look at the file we’ll notice strange things: The output contains ANSI escapes!

readLines("latlon.txt")
## [1] "28\033[90m°\033[39m20\033[90m'\033[39mN  81\033[90m°\033[39m33\033[90m'\033[39mW"
## [2] "32\033[90m°\033[39m43\033[90m'\033[39mN 117\033[90m°\033[39m10\033[90m'\033[39mW"
## [3] "     \033[31mNA\033[39m         "

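We can turn them off by passing crayon = FALSE to the expectation, but we need to run the test twice again: the first run fails because the recorded file still contains the escapes, and the second run compares against the updated file: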

library(testthat)
test_that("latlon pillar matches known output", {
  pillar::expect_known_display(
    pillar_shaft(data$loc),
    file = "latlon.txt",
    crayon = FALSE
  )
})
## Error: Test failed: 'latlon pillar matches known output'
## * `print(eval_tidy(object))` has changed from known value recorded in 'latlon.txt'.
## 3/3 mismatches
## x[1]: "28°20'N  81°33'W"
## y[1]: "28\033[90m°\033[39m20\033[90m'\033[39mN  81\033[90m°\033[39m33\033[90m'\033[39mW"
## 
## x[2]: "32°43'N 117°10'W"
## y[2]: "32\033[90m°\033[39m43\033[90m'\033[39mN 117\033[90m°\033[39m10\033[90m'\033[39mW"
## 
## x[3]: "     NA         "
## y[3]: "     \033[31mNA\033[39m         "
test_that("latlon pillar matches known output", {
  pillar::expect_known_display(
    pillar_shaft(data$loc),
    file = "latlon.txt",
    crayon = FALSE
  )
})

readLines("latlon.txt")
## [1] "28°20'N  81°33'W" "32°43'N 117°10'W" "     NA         "

You may want to create a series of output files for different scenarios, for example colored and plain output. For this it is helpful to create your own expectation function. Use the tidy evaluation framework to make sure that construction and printing happen at the right time:

expect_known_latlon_display <- function(x, file_base) {
  quo <- rlang::quo(pillar::pillar_shaft(x))
  pillar::expect_known_display(
    !! quo,
    file = paste0(file_base, ".txt")
  )
  pillar::expect_known_display(
    !! quo,
    file = paste0(file_base, "-bw.txt"),
    crayon = FALSE
  )
}
test_that("latlon pillar matches known output", {
  expect_known_latlon_display(data$loc, file_base = "latlon")
})
## Error: Test failed: 'latlon pillar matches known output'
## * `print(eval_tidy(object))` has changed from known value recorded in 'latlon.txt'.
## 3/3 mismatches
## x[1]: "28\033[90m°\033[39m20\033[90m'\033[39mN  81\033[90m°\033[39m33\033[90m'\033[39mW"
## y[1]: "28°20'N  81°33'W"
## 
## x[2]: "32\033[90m°\033[39m43\033[90m'\033[39mN 117\033[90m°\033[39m10\033[90m'\033[39mW"
## y[2]: "32°43'N 117°10'W"
## 
## x[3]: "     \033[31mNA\033[39m         "
## y[3]: "     NA         "
readLines("latlon.txt")
## [1] "28\033[90m°\033[39m20\033[90m'\033[39mN  81\033[90m°\033[39m33\033[90m'\033[39mW"
## [2] "32\033[90m°\033[39m43\033[90m'\033[39mN 117\033[90m°\033[39m10\033[90m'\033[39mW"
## [3] "     \033[31mNA\033[39m         "
readLines("latlon-bw.txt")
## [1] "28°20'N  81°33'W" "32°43'N 117°10'W" "     NA         "

Learn more about the tidyeval framework in the dplyr vignette.