A2. Appendix: Calculations

Chris Bailiss

2019-11-01

In This Vignette

Introduction

This vignette describes how to construct more advanced calculations, such as percentages of totals, cumulative sums, running differences, rolling averages, and ratios/multiples.

Before reading this vignette, it may be helpful to read the Calculations and Cell Context vignettes.

Many of the examples in this vignette leave working columns (i.e. intermediate values not of interest) visible in the pivot tables. This is done deliberately to aid understanding. These calculations can be made invisible by specifying visible=FALSE as an additional parameter as part of pt$defineCalculation(...).

Basic Principles

A filter is simply the name of a variable and one or more allowed values of that variable (e.g. TrainCategory=“Express Passenger”).

Every cell in a pivot table has a set of associated filters that specify what data is included in the calculation of the value of that cell, as explained in the Cell Context vignette.

By changing these filters it is possible to perform different sorts of calculations.

The filters associated with each cell can be made visible in the pivot table by specifying pt$renderPivot(includeWorkingData=TRUE) as shown in the example below.

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot(includeWorkingData=TRUE)

Modifying filters

There are two possible ways to change the filters that are used when calculating values.

Method 1: Further restricting the data using a PivotFilters object

When defining a calculation, it is possible to specify a PivotFilters object as part of the calculation. These additional filters are applied on top of the filters already related to that cell, effectively further restricting the subset of data that is covered by the calculation.

This approach has been supported since v0.3.0 of the pivottabler package. A weakness of this method is that it can only be used to further restrict the data - it cannot be used to refer to data outside the scope of the current cell.

As an example of this method, the “Weekend Trains” calculation below specifies an extra filter that means only rows where the WeekdayOrWeekend variable equals “Weekend” are included in the “Weekend Trains” count.

Method 2: Flexible filter changes using a PivotFilterOverrides object

Newly added in v1.0.0 of the pivottabler package is the ability to specify a PivotFilterOverrides object as part of a calculation definition. This provides a more flexible and more powerful way of controlling the filters used when calculating a cell value. In particular, using a PivotFilterOverrides object it is possible to add new filters or change/remove existing filters - this enables many new calculation possibilities. Many of the examples below use this approach.

A few different options can be specified when creating a PivotFilterOverrides object:

When we talk about combining filters, we mean combining the lists of allowed values in each filter. Three different methods of combining filter criteria are supported:

The tables below provide some examples of filter overrides applied to hypothetical GbttDate and TOC variables:

Existing Criteria Override Criteria Combine Resulting Criteria
Gbtt = 1st, 2nd, 3rd May 2018 Gbtt = 2nd May 2018 intersect Gbtt = 2nd May 2018
Gbtt = 1st, 2nd, 3rd May 2018 Gbtt = 2nd May 2018 replace Gbtt = 2nd May 2018
Gbtt = 1st, 2nd, 3rd May 2018 Gbtt = 2nd May 2018 union Gbtt = 1st, 2nd, 3rd May 2018
Gbtt = 1st, 2nd, 3rd May 2018 Gbtt = 9th May 2018 intersect NONE (i.e. no matching data)
Gbtt = 1st, 2nd, 3rd May 2018 Gbtt = 9th May 2018 replace Gbtt = 9th May 2018
Gbtt = 1st, 2nd, 3rd May 2018 Gbtt = 9th May 2018 union Gbtt = 1st, 2nd, 3rd, 9th May 2018

Filters for different variables are combined completely independently. Where no filter criteria exists for a particular variable, then there is an implicit filter that allows ALL values, i.e. in the tables above and below, in the Existing Criteria column, there is an implicit “TOC = ALL”.

Existing Criteria Override Combine Resulting Criteria
Gbtt = 1st, 2nd, 3rd May 2018 TOC = c2c intersect Gbtt = 1st, 2nd, 3rd May 2018, TOC=c2c
Gbtt = 1st, 2nd, 3rd May 2018 TOC = c2c replace Gbtt = 1st, 2nd, 3rd May 2018, TOC=c2c
Gbtt = 1st, 2nd, 3rd May 2018 TOC = c2c union Gbtt = 1st, 2nd, 3rd May 2018, TOC=ALL

Examples of Modifying Filters

The first example immediately below provides more explanation of the approach of overriding the filters. The other examples have less explanation, but the same approach of displaying the working data using pt$renderPivot(includeWorkingData=TRUE) can be used to display helpful filter details with these examples too.

% of Row Total, % of Column Total

Objective:

To calculate the percentage of each train operating company’s (TOC’s) trains that are either of category express passenger or ordinary passenger:

  1. Define the basic calculation (e.g. a count of trains) as normal.
  2. Define a second calculation that will return the row total for every combination in the row. For example, if the row is about a specific TOC, then define a set of filter override that removes all filters except TOC for this new calculation. Everywhere in the row will then have the total for the TOC.
  3. Define a third calculation that calculates the percentage based on the above two calculations.

In the example above, the “TOC Total” value is the same everywhere in each row. This is because the filter criteria has been overriden to remove the “TrainCategory” filter for this calculation. This can be seen by displaying the context:

% of Grand Total

Objective:

The approach is similar to the previous example, except the filter override removes all of the existing filters:

Ratios/Multiples

Objective:

We wish to find the multiples of Cross Country Express Passenger:

A “CrossCountryExpress” calculation is defined with a set of filters on TrainCategory=“Express Passenger” and TOC=“CrossCountry” that replace the existing filters on these variables:

Subsets of Data

Objective:

We wish to find the percentage of trains with PowerType=“DMU”.

A “CountDMU” calculation is defined that applies an additional filter on PowerType=“DMU”, i.e. that is combined using action=intersect:

More filter combination examples

While the above examples have practical use, the following couple of examples are more for illustrative purposes to show what happens when the filter overrides are specified using the same variable as in either the row or column headings. Again, these examples follow the rules described in the table above.

In the following examples, the TOC variable makes up the row headings and a filter override is specified where TOC="London Midland". This filter override is combined in three different ways with the existing filters from the row headings.

intersect:

In the above example the filter override in the “Test Count” calculation intersects the TOC criteria in every row with TOC="London Midland". The intersection of TOC="London Midland" with anything other than TOC="London Midland" results in an empty set, so only the “London Midland” row has any matching data.

Another way of looking at this is to say that it is impossible for any row in the source data frame to simultaneously have two different values for TOC, e.g. in a single data frame row, TOC cannot equal both “Arriva Trains Wales” and “London Midland”. Therefore, the only row where the “Test Count” calculation has a value is the “London Midland” row.

replace:

In the above example the filter override in the “Test Count” calculation replaces the TOC criteria in every row with TOC="London Midland".

union:

In the above example the filter override in the “Test Count” calculation unions the TOC criteria in every row with TOC="London Midland". So the London Midland count is included in every row for “Test Count”, e.g. the value of “Test Count” for “Arriva Trains Wales” includes the values for both “Arriva Trains Wales” and “London Midland”, i.e. 3079 + 14487 = 17566.

Examples using a custom filter override function

The above examples provide different basic ways of modifying filters. For relative / more dynamic filters, a custom filter override function can be used. The package will invoke this function once per cell. Within the function it is possible to examine the existing filters and selectively change them on a cell by cell basis.

When invoked the custom function is provided with a reference to the existing filters and a reference to the current cell. In the majority of cases, working with the filters only will be sufficient. The reference to the cell is provided only for those rare cases where additional information is needed about the cell (e.g. to determine the exact cell location, whether it is a total cell, etc). No changes should be made to the cell, only to the filters.

Running Differences/Changes

Objective:

We wish to show the change in the number of trains compared to the previous day for the first seven days of 2017.

A custom filter override function is used to change the filter from the current date (i.e. the date in the current row) to the previous date:

library(dplyr)
trains <- bhmtrains %>%
  mutate(GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
         GbttDate=as.Date(GbttDateTime))
januaryDates <- seq(as.Date("2017-01-01"), as.Date("2017-01-07"), by="days")

# comparison to yesterday
# date filter function to return yesterday
getYesterdayDateFilter <- function(pt, filters, cell) {
  # get the date filter
  filter <- filters$getFilter("GbttDate")
  if(is.null(filter)||(filter$type=="ALL")||(length(filter$values)>1)) {
    # there is no filter on GbttDate in this cell
    # i.e. we are in one of the total cells that covers all dates,
    # so the concept of yesterday has no meaning, so block all dates
    newFilter <- PivotFilter$new(pt, variableName="GbttDate", type="NONE")
    filters$setFilter(newFilter, action="replace")
  }
  else {
    # get the date value and subtract one day
    date <- filter$values
    date <- date - 1
    filter$values <- date
  }
}
# build the pivot
library(pivottabler)
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("GbttDate", fromData=FALSE, 
                    explicitListOfValues=as.list(januaryDates), visualTotals=TRUE)
pt$defineCalculation(calculationName="CountTrains", summariseExpression="n()", 
                     caption="Current Day Count")
filterOverrides <- PivotFilterOverrides$new(pt, overrideFunction=getYesterdayDateFilter)
pt$defineCalculation(calculationName="CountPreviousDayTrains", filters=filterOverrides, 
                     summariseExpression="n()", caption="Previous Day Count")
pt$defineCalculation(calculationName="Daily Change", type="calculation", 
                     basedOn=c("CountTrains", "CountPreviousDayTrains"),
                     calculationExpression="values$CountTrains-values$CountPreviousDayTrains", 
                     caption="Daily Change")
pt$renderPivot()

The above example uses visualTotals=TRUE, since the data frame contains other dates outside of the range 1st to 7th January. If visual totals is not enabled, then the column totals include data for all dates, i.e. the column totals in the pivot table would be greater than the sum of the values in the pivot table column.

Rolling Average

Objective:

We wish to show a three-day rolling average of train count for the first seven days of 2017.

A custom filter override function is used to change the filter to include the previous, current and following date:

library(dplyr)
trains <- bhmtrains %>%
  mutate(GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
         GbttDate=as.Date(GbttDateTime))
januaryDates <- seq(as.Date("2017-01-01"), as.Date("2017-01-07"), by="days")

# three-day rolling average
# date filter function to a three day range of dates
getThreeDayFilter <- function(pt, filters, cell) {
  # get the date filter
  filter <- filters$getFilter("GbttDate")
  if(is.null(filter)||(filter$type=="ALL")||(length(filter$values)>1)) {
    # there is no filter on GbttDate in this cell
    # i.e. we are in one of the total cells that covers all dates,
    # so the concept of previous/next day has no meaning, so block all dates
    newFilter <- PivotFilter$new(pt, variableName="GbttDate", type="NONE")
    filters$setFilter(newFilter, action="replace")
  }
  else {
    # get the date value and create three day filter
    date <- filter$values
    newDates <- seq(date-1, date+1, by="days")
    filter$values <- newDates
  }
}
# build the pivot
library(pivottabler)
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("GbttDate", fromData=FALSE, 
                    explicitListOfValues=as.list(januaryDates), visualTotals=TRUE)
pt$defineCalculation(calculationName="CountTrains", summariseExpression="n()", 
                     caption="Current Day Count")
filterOverrides <- PivotFilterOverrides$new(pt, overrideFunction=getThreeDayFilter)
pt$defineCalculation(calculationName="ThreeDayCount", filters=filterOverrides, 
                     summariseExpression="n()", caption="Three Day Total")
pt$defineCalculation(calculationName="ThreeDayAverage", type="calculation", 
                     basedOn="ThreeDayCount",
                     calculationExpression="values$ThreeDayCount/3", 
                     format="%.1f", caption="Three Day Rolling Average")
pt$renderPivot()

The above example also uses visualTotals=TRUE for the same reasons as described in the previous example.

Another variation of the above that is useful when full data for the rolling average is not available for the first and last days:

library(dplyr)
library(dplyr)

# here the trains data frame does not contain data for 31st Dec and 8th Jan,
# so the rolling average for 1st Jan and 7th Jan will be incomplete.
trains <- bhmtrains %>%
  mutate(GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
         GbttDate=as.Date(GbttDateTime)) %>%
  filter((as.Date("2017-01-01") <= GbttDate) & (GbttDate <= as.Date("2017-01-07")))

# three-day rolling average
# date filter function to a three day range of dates
getThreeDayFilter <- function(pt, filters, cell) {
  # get the date filter
  filter <- filters$getFilter("GbttDate")
  if(is.null(filter)||(filter$type=="ALL")||(length(filter$values)>1)) {
    # there is no filter on GbttDate in this cell
    # i.e. we are in one of the total cells that covers all dates,
    # so the concept of previous/next day has no meaning, so block all dates
    newFilter <- PivotFilter$new(pt, variableName="GbttDate", type="NONE")
    filters$setFilter(newFilter, action="replace")
  }
  else {
    # get the date value and create three day filter
    date <- filter$values
    newDates <- seq(date-1, date+1, by="days")
    filter$values <- newDates
  }
}
# build the pivot
library(pivottabler)
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("GbttDate")
pt$defineCalculation(calculationName="CountTrains", summariseExpression="n()", 
                     caption="Current Day Count")
filterOverrides <- PivotFilterOverrides$new(pt, overrideFunction=getThreeDayFilter)
pt$defineCalculation(calculationName="DaysWithDataCount", filters=filterOverrides, 
                     summariseExpression="n_distinct(GbttDate)", caption="Days With Data")
pt$defineCalculation(calculationName="ThreeDayCount", filters=filterOverrides, 
                     summariseExpression="n()", caption="Three Day Total")
pt$defineCalculation(calculationName="ThreeDayAverage", type="calculation", 
     basedOn=c("DaysWithDataCount", "ThreeDayCount"),
     calculationExpression="ifelse(values$DaysWithDataCount==3, values$ThreeDayCount/3, NA)", 
     format="%.1f", caption="Three Day Rolling Average")
pt$renderPivot()

Cumulative Sum

Objective:

We wish to show a cumulative sum of the number of trains for the first seven days of 2017.

A custom filter override function is used to change the filter to include dates between the 1st January 2017 and the current date:

Further Reading

The full set of vignettes is:

  1. Introduction
  2. Data Groups
  3. Calculations
  4. Outputs
  5. Latex Output
  6. Styling
  7. Finding and Formatting
  8. Cell Context
  9. Irregular Layout
  10. Performance
  11. Shiny
  12. Excel Export
  13. Appendix: Details
  14. Appendix: Calculations
  15. Appendix: Class Overview