Rolling calculations in tibbletime

Davis Vaughan

2017-10-06

Introducing rollify()

A common task in financial analyses is to perform a rolling calculation. This might be a single value like a rolling mean or standard deviation, or it might be more complicated like a rolling linear regression. To account for this flexibility, tibbletime has the rollify() function. This function allows you to turn any function into a rolling version of itself.

In the tidyverse, this type of function is known as an adverb because it modifies an existing function, which are typically given verb names.

Datasets required

library(tibbletime)
library(dplyr)
library(tidyr)

# Facebook stock prices.
data(FB)

# Only a few columns
FB <- select(FB, symbol, date, open, close, adjusted)

A rolling average

To calculate a rolling average, picture a column in a data frame where you take the average of the values in rows 1-5, then in rows 2-6, then in 3-7, and so on until you reach the end of the dataset. This type of 5-period moving window is a rolling calculation, and is often used to smooth out noise in a dataset.

Let’s see how to do this with rollify().

# The function to use at each step is `mean`.
# The window size is 5
rolling_mean <- rollify(mean, window = 5)

rolling_mean
## function (...) 
## {
##     roller(..., .f = .f, window = window, unlist = unlist, na_value = na_value)
## }
## <environment: 0x7fa8ce6702b0>

We now have a rolling version of the function, mean(). You use it in a similar way to how you might use mean().

mutate(FB, mean_5 = rolling_mean(adjusted))
## # A tibble: 1,008 x 6
##    symbol       date  open close adjusted mean_5
##     <chr>     <date> <dbl> <dbl>    <dbl>  <dbl>
##  1     FB 2013-01-02 27.44 28.00    28.00     NA
##  2     FB 2013-01-03 27.88 27.77    27.77     NA
##  3     FB 2013-01-04 28.01 28.76    28.76     NA
##  4     FB 2013-01-07 28.69 29.42    29.42     NA
##  5     FB 2013-01-08 29.51 29.06    29.06 28.602
##  6     FB 2013-01-09 29.67 30.59    30.59 29.120
##  7     FB 2013-01-10 30.60 31.30    31.30 29.826
##  8     FB 2013-01-11 31.28 31.72    31.72 30.418
##  9     FB 2013-01-14 32.08 30.95    30.95 30.724
## 10     FB 2013-01-15 30.64 30.10    30.10 30.932
## # ... with 998 more rows

You can create multiple versions of the rolling function if you need to calculate the mean at multiple window lengths.

rolling_mean_2 <- rollify(mean, window = 2)
rolling_mean_3 <- rollify(mean, window = 3)
rolling_mean_4 <- rollify(mean, window = 4)

FB %>% mutate(
  rm10 = rolling_mean_2(adjusted),
  rm20 = rolling_mean_3(adjusted),
  rm30 = rolling_mean_4(adjusted)
)
## # A tibble: 1,008 x 8
##    symbol       date  open close adjusted   rm10     rm20    rm30
##     <chr>     <date> <dbl> <dbl>    <dbl>  <dbl>    <dbl>   <dbl>
##  1     FB 2013-01-02 27.44 28.00    28.00     NA       NA      NA
##  2     FB 2013-01-03 27.88 27.77    27.77 27.885       NA      NA
##  3     FB 2013-01-04 28.01 28.76    28.76 28.265 28.17667      NA
##  4     FB 2013-01-07 28.69 29.42    29.42 29.090 28.65000 28.4875
##  5     FB 2013-01-08 29.51 29.06    29.06 29.240 29.08000 28.7525
##  6     FB 2013-01-09 29.67 30.59    30.59 29.825 29.69000 29.4575
##  7     FB 2013-01-10 30.60 31.30    31.30 30.945 30.31667 30.0925
##  8     FB 2013-01-11 31.28 31.72    31.72 31.510 31.20333 30.6675
##  9     FB 2013-01-14 32.08 30.95    30.95 31.335 31.32333 31.1400
## 10     FB 2013-01-15 30.64 30.10    30.10 30.525 30.92333 31.0175
## # ... with 998 more rows

Purrr functional syntax

rollify() is built using pieces from the purrr package. One of those is the ability to accept an anonymous function using the ~ function syntax.

The documentation, ?rollify, gives a thorough walkthrough of the different forms you can pass to rollify(), but let’s see a few more examples.

# Rolling mean, but with function syntax
rolling_mean <- rollify(.f = ~mean(.x), window = 5)

mutate(FB, mean_5 = rolling_mean(adjusted))
## # A tibble: 1,008 x 6
##    symbol       date  open close adjusted mean_5
##     <chr>     <date> <dbl> <dbl>    <dbl>  <dbl>
##  1     FB 2013-01-02 27.44 28.00    28.00     NA
##  2     FB 2013-01-03 27.88 27.77    27.77     NA
##  3     FB 2013-01-04 28.01 28.76    28.76     NA
##  4     FB 2013-01-07 28.69 29.42    29.42     NA
##  5     FB 2013-01-08 29.51 29.06    29.06 28.602
##  6     FB 2013-01-09 29.67 30.59    30.59 29.120
##  7     FB 2013-01-10 30.60 31.30    31.30 29.826
##  8     FB 2013-01-11 31.28 31.72    31.72 30.418
##  9     FB 2013-01-14 32.08 30.95    30.95 30.724
## 10     FB 2013-01-15 30.64 30.10    30.10 30.932
## # ... with 998 more rows

You can create anonymous functions (functions without a name) on the fly.

# 5 period average of 2 columns (open and close)
rolling_avg_sum <- rollify(~ mean(.x + .y), window = 5)

mutate(FB, avg_sum = rolling_avg_sum(open, close))
## # A tibble: 1,008 x 6
##    symbol       date  open close adjusted avg_sum
##     <chr>     <date> <dbl> <dbl>    <dbl>   <dbl>
##  1     FB 2013-01-02 27.44 28.00    28.00      NA
##  2     FB 2013-01-03 27.88 27.77    27.77      NA
##  3     FB 2013-01-04 28.01 28.76    28.76      NA
##  4     FB 2013-01-07 28.69 29.42    29.42      NA
##  5     FB 2013-01-08 29.51 29.06    29.06  56.908
##  6     FB 2013-01-09 29.67 30.59    30.59  57.872
##  7     FB 2013-01-10 30.60 31.30    31.30  59.122
##  8     FB 2013-01-11 31.28 31.72    31.72  60.368
##  9     FB 2013-01-14 32.08 30.95    30.95  61.352
## 10     FB 2013-01-15 30.64 30.10    30.10  61.786
## # ... with 998 more rows

Optional arguments

To pass optional arguments (not .x or .y) to your rolling function, they must be specified in the non-rolling form in the call to rollify().

For instance, say our dataset had NA values, but we still wanted to calculate an average. We need to specify na.rm = TRUE as an argument to mean().

FB$adjusted[1] <- NA

# Do this
rolling_mean_na <- rollify(~mean(.x, na.rm = TRUE), window = 5)

FB %>% mutate(mean_na = rolling_mean_na(adjusted))
## # A tibble: 1,008 x 6
##    symbol       date  open close adjusted mean_na
##     <chr>     <date> <dbl> <dbl>    <dbl>   <dbl>
##  1     FB 2013-01-02 27.44 28.00       NA      NA
##  2     FB 2013-01-03 27.88 27.77    27.77      NA
##  3     FB 2013-01-04 28.01 28.76    28.76      NA
##  4     FB 2013-01-07 28.69 29.42    29.42      NA
##  5     FB 2013-01-08 29.51 29.06    29.06 28.7525
##  6     FB 2013-01-09 29.67 30.59    30.59 29.1200
##  7     FB 2013-01-10 30.60 31.30    31.30 29.8260
##  8     FB 2013-01-11 31.28 31.72    31.72 30.4180
##  9     FB 2013-01-14 32.08 30.95    30.95 30.7240
## 10     FB 2013-01-15 30.64 30.10    30.10 30.9320
## # ... with 998 more rows
# Don't try this!
# rolling_mean_na <- rollify(~mean(.x), window = 5)
# FB %>% mutate(mean_na = rolling_mean_na(adjusted, na.rm = TRUE))

# Reset FB
data(FB)
FB <- select(FB, symbol, date, adjusted)

Returning more than 1 value per call

Say our rolling function returned a call to a custom summary_df() function. This function calculates a 5 number number summary and returns it as a tidy data frame.

We won’t be able to use the rolling version of this out of the box. dplyr::mutate() will complain that an incorrect number of values were returned since rollify() attempts to unlist at each call. Essentially, each call would be returning 5 values instead of 1. What we need is to be able to create a list-column. To do this, specify unlist = FALSE in the call to rollify().

# Our data frame summary
summary_df <- function(x) {
  data.frame(  
    rolled_summary_type = c("mean", "sd",  "min",  "max",  "median"),
    rolled_summary_val  = c(mean(x), sd(x), min(x), max(x), median(x))
  )
}

# A rolling version, with unlist = FALSE
rolling_summary <- rollify(~summary_df(.x), window = 5, 
                           unlist = FALSE)

FB_summarised <- mutate(FB, summary_list_col = rolling_summary(adjusted))
FB_summarised
## # A tibble: 1,008 x 4
##    symbol       date adjusted     summary_list_col
##     <chr>     <date>    <dbl>               <list>
##  1     FB 2013-01-02    28.00            <lgl [1]>
##  2     FB 2013-01-03    27.77            <lgl [1]>
##  3     FB 2013-01-04    28.76            <lgl [1]>
##  4     FB 2013-01-07    29.42            <lgl [1]>
##  5     FB 2013-01-08    29.06 <data.frame [5 x 2]>
##  6     FB 2013-01-09    30.59 <data.frame [5 x 2]>
##  7     FB 2013-01-10    31.30 <data.frame [5 x 2]>
##  8     FB 2013-01-11    31.72 <data.frame [5 x 2]>
##  9     FB 2013-01-14    30.95 <data.frame [5 x 2]>
## 10     FB 2013-01-15    30.10 <data.frame [5 x 2]>
## # ... with 998 more rows

The neat thing is that after removing the NA values at the beginning, the list-column can be unnested using tidyr::unnest() giving us a nice tidy 5-period rolling summary.

FB_summarised %>% 
  filter(!is.na(summary_list_col)) %>%
  unnest()
## # A tibble: 5,020 x 5
##    symbol       date adjusted rolled_summary_type rolled_summary_val
##     <chr>     <date>    <dbl>              <fctr>              <dbl>
##  1     FB 2013-01-08    29.06                mean         28.6019998
##  2     FB 2013-01-08    29.06                  sd          0.6997284
##  3     FB 2013-01-08    29.06                 min         27.7700000
##  4     FB 2013-01-08    29.06                 max         29.4200000
##  5     FB 2013-01-08    29.06              median         28.7600000
##  6     FB 2013-01-09    30.59                mean         29.1199998
##  7     FB 2013-01-09    30.59                  sd          1.0254999
##  8     FB 2013-01-09    30.59                 min         27.7700000
##  9     FB 2013-01-09    30.59                 max         30.5900000
## 10     FB 2013-01-09    30.59              median         29.0599990
## # ... with 5,010 more rows

Custom missing values

The last example was a little clunky because to unnest we had to remove the first few missing rows manually. If those missing values were empty data frames then unnest() would have known how to handle them. Luckily, the na_value argument will allow us to specify a value to fill the NA spots at the beginning of the roll.

rolling_summary <- rollify(~summary_df(.x), window = 5, 
                           unlist = FALSE, na_value = data.frame())

FB_summarised <- mutate(FB, summary_list_col = rolling_summary(adjusted))
FB_summarised
## # A tibble: 1,008 x 4
##    symbol       date adjusted     summary_list_col
##     <chr>     <date>    <dbl>               <list>
##  1     FB 2013-01-02    28.00 <data.frame [0 x 0]>
##  2     FB 2013-01-03    27.77 <data.frame [0 x 0]>
##  3     FB 2013-01-04    28.76 <data.frame [0 x 0]>
##  4     FB 2013-01-07    29.42 <data.frame [0 x 0]>
##  5     FB 2013-01-08    29.06 <data.frame [5 x 2]>
##  6     FB 2013-01-09    30.59 <data.frame [5 x 2]>
##  7     FB 2013-01-10    31.30 <data.frame [5 x 2]>
##  8     FB 2013-01-11    31.72 <data.frame [5 x 2]>
##  9     FB 2013-01-14    30.95 <data.frame [5 x 2]>
## 10     FB 2013-01-15    30.10 <data.frame [5 x 2]>
## # ... with 998 more rows

Now unnesting directly:

FB_summarised %>% 
  unnest()
## # A tibble: 5,020 x 5
##    symbol       date adjusted rolled_summary_type rolled_summary_val
##     <chr>     <date>    <dbl>              <fctr>              <dbl>
##  1     FB 2013-01-08    29.06                mean         28.6019998
##  2     FB 2013-01-08    29.06                  sd          0.6997284
##  3     FB 2013-01-08    29.06                 min         27.7700000
##  4     FB 2013-01-08    29.06                 max         29.4200000
##  5     FB 2013-01-08    29.06              median         28.7600000
##  6     FB 2013-01-09    30.59                mean         29.1199998
##  7     FB 2013-01-09    30.59                  sd          1.0254999
##  8     FB 2013-01-09    30.59                 min         27.7700000
##  9     FB 2013-01-09    30.59                 max         30.5900000
## 10     FB 2013-01-09    30.59              median         29.0599990
## # ... with 5,010 more rows

Finally, if you want to actually keep those first few NA rows in the unnest, you can pass a data frame that is initialized with the same column names as the rest of the values.

rolling_summary <- rollify(~summary_df(.x), window = 5, 
                           unlist = FALSE, 
                           na_value = data.frame(rolled_summary_type = NA,
                                                 rolled_summary_val  = NA))

FB_summarised <- mutate(FB, summary_list_col = rolling_summary(adjusted))
FB_summarised %>% unnest()
## # A tibble: 5,024 x 5
##    symbol       date adjusted rolled_summary_type rolled_summary_val
##     <chr>     <date>    <dbl>              <fctr>              <dbl>
##  1     FB 2013-01-02    28.00                <NA>                 NA
##  2     FB 2013-01-03    27.77                <NA>                 NA
##  3     FB 2013-01-04    28.76                <NA>                 NA
##  4     FB 2013-01-07    29.42                <NA>                 NA
##  5     FB 2013-01-08    29.06                mean         28.6019998
##  6     FB 2013-01-08    29.06                  sd          0.6997284
##  7     FB 2013-01-08    29.06                 min         27.7700000
##  8     FB 2013-01-08    29.06                 max         29.4200000
##  9     FB 2013-01-08    29.06              median         28.7600000
## 10     FB 2013-01-09    30.59                mean         29.1199998
## # ... with 5,014 more rows

Rolling regressions

A final use of this flexible function is to calculate rolling regressions.

A very ficticious example is to perform a rolling regression on the FB dataset of the form close ~ high + low + volume. Notice that we have 4 columns to pass here. This is more complicated than a .x and .y example, but have no fear. The arguments can be specified in order as ..1, ..2, … for as far as is required, or you can pass a freshly created anonymous function. The latter is what we will do so we can preserve the names of the variables in the regression.

Again, since this returns a linear model object, we will specify unlist = FALSE. Unfortunately there is no easy default NA value to pass here.

# Reset FB
data(FB)

rolling_lm <- rollify(.f = function(close, high, low, volume) {
                              lm(close ~ high + low + volume)
                           }, 
                      window = 5, 
                      unlist = FALSE)

FB_reg <- mutate(FB, roll_lm = rolling_lm(close, high, low, volume))
FB_reg
## # A tibble: 1,008 x 9
##    symbol       date  open  high   low close    volume adjusted   roll_lm
##     <chr>     <date> <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>    <list>
##  1     FB 2013-01-02 27.44 28.18 27.42 28.00  69846400    28.00 <lgl [1]>
##  2     FB 2013-01-03 27.88 28.47 27.59 27.77  63140600    27.77 <lgl [1]>
##  3     FB 2013-01-04 28.01 28.93 27.83 28.76  72715400    28.76 <lgl [1]>
##  4     FB 2013-01-07 28.69 29.79 28.65 29.42  83781800    29.42 <lgl [1]>
##  5     FB 2013-01-08 29.51 29.60 28.86 29.06  45871300    29.06  <S3: lm>
##  6     FB 2013-01-09 29.67 30.60 29.49 30.59 104787700    30.59  <S3: lm>
##  7     FB 2013-01-10 30.60 31.45 30.28 31.30  95316400    31.30  <S3: lm>
##  8     FB 2013-01-11 31.28 31.96 31.10 31.72  89598000    31.72  <S3: lm>
##  9     FB 2013-01-14 32.08 32.21 30.62 30.95  98892800    30.95  <S3: lm>
## 10     FB 2013-01-15 30.64 31.71 29.88 30.10 173242600    30.10  <S3: lm>
## # ... with 998 more rows

To get some useful information about the regressions, we will use broom::tidy() and apply it to each regression using a mutate() + map() combination.

FB_reg %>%
  filter(!is.na(roll_lm)) %>%
  mutate(tidied = purrr::map(roll_lm, broom::tidy)) %>%
  unnest(tidied) %>%
  select(symbol, date, term, estimate, std.error, statistic, p.value)
## # A tibble: 4,016 x 7
##    symbol       date        term      estimate    std.error    statistic
##     <chr>     <date>       <chr>         <dbl>        <dbl>        <dbl>
##  1     FB 2013-01-08 (Intercept) -2.841750e-01 1.019828e+01 -0.027865000
##  2     FB 2013-01-08        high  7.089105e-01 1.949575e+00  0.363623083
##  3     FB 2013-01-08         low  2.700745e-01 2.159749e+00  0.125049045
##  4     FB 2013-01-08      volume  1.119755e-08 2.656115e-08  0.421576293
##  5     FB 2013-01-09 (Intercept) -5.950325e+00 7.478017e+00 -0.795709000
##  6     FB 2013-01-09        high  2.078924e+00 1.881536e+00  1.104907791
##  7     FB 2013-01-09         low -9.198664e-01 1.749988e+00 -0.525641586
##  8     FB 2013-01-09      volume -1.448797e-10 1.682099e-08 -0.008613033
##  9     FB 2013-01-10 (Intercept)  9.547323e-01 4.458054e+00  0.214158976
## 10     FB 2013-01-10        high  7.165645e-01 1.296827e+00  0.552552059
## # ... with 4,006 more rows, and 1 more variables: p.value <dbl>