Pivot Table

David Gerbing

library("lessR")
library("knitr")

Aggregation is the process forming groups of data based on the levels of categorical variables, then computing some statistical value, such as a mean, for each group on a numeric variable. For example, compute the mean Salary for each combination of the levels of Gender and Department. Refer to each group, the cross-classification of all the specified categorical variables, as a cell.

The following table lists many of the available statistical functions by which to aggregate over the cells.

Statistic Meaning
sum sum
mean arithmetic mean
median median
min minimum
max maximum
sd standard deviation
var variance
IQR inter-quartile range
mad mean absolute deviation
tabulate count of each cell only

Some statistical functions are available that return multiple values.

Statistic Meaning
range minimum, maximum
quantile range + quartiles
summary quantile + mean

These later three functions output their values as an R matrix, which then replaces the values of the variable that is aggregated in the resulting output data frame.

Parameters

The following pivot() parameters specify the data, the statistic for the aggregation, the value(s) over which to aggregate, and the corresponding cells that contain the aggregated values. The first four parameter values listed below are required.

  1. data: The data frame that includes the variables of interest.
  2. compute: The function for which to perform the aggregation.
  3. value: The variable(s) for which to summarize, i.e., aggregate.
  4. rows_by: The categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values.
  5. cols_by: The optional categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table.

Multiple values of parameters value, rows_by, and up to two cols_by variables may be specified. Express the multiple categorical variables over which to pivot as a vector, such as with the c() function. If cols_by is not specified, the result is a linear table as a long-form data frame that can be input into other data analysis procedures.

By default, missing values are eliminated from the aggregation. If there are no values for a cell for which to perform the aggregation, and it is desired to have the aggregated value listed as missing, then specify na_value as TRUE. If any of the levels of the by variables are missing, to report those missing cells, specify na_by as TRUE. Set na_remove to TRUE to have any aggregated value defined as missing if any individual data cell for value is missing.

Examples

To illustrate, use the Employee data set included in lessR, here read into the d data frame. Begin with no missing data.

d <- Read("Employee")
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------

Output as a long-form data frame

Two categorical variables in the d data frame are Dept and Salary. A continuous variable is Salary. Create the long-form pivot table as a data frame that expresses the mean of Salary for all combinations of Dept and Salary.

This example includes the parameter names for emphasis, but if the parameters are entered in this order, the names are not necessary.

pivot(data=d, compute=mean, value=Salary, rows_by=c(Dept, Gender))
##    Dept Gender  n miss   Salary
## 1  ACCT      F  3    0 63237.16
## 2  ADMN      F  4    0 81434.00
## 3  FINC      F  1    0 57139.90
## 4  MKTG      F  5    0 64496.02
## 5  SALE      F  5    0 64188.25
## 6  ACCT      M  2    0 59626.20
## 7  ADMN      M  2    0 80963.35
## 8  FINC      M  3    0 72967.60
## 9  MKTG      M  1    0 99062.66
## 10 SALE      M 10    0 86150.97

The output of pivot() is a data frame of the aggregated variables. This can be saved for further analysis. Here, perform the same analysis, but list the required parameter values in order without the parameter names.

a <- pivot(d, mean, Salary, c(Dept, Gender))

Because the output of pivot() with no cols_by variables is a standard R data frame, typical operations such as sorting can be applied, here using the lessR function Sort().

Sort(a, by=Dept)
## 
## Sort Specification
##   1 -->  ascending
##    Dept Gender  n miss   Salary
## 1  ACCT      F  3    0 63237.16
## 6  ACCT      M  2    0 59626.20
## 2  ADMN      F  4    0 81434.00
## 7  ADMN      M  2    0 80963.35
## 3  FINC      F  1    0 57139.90
## 8  FINC      M  3    0 72967.60
## 4  MKTG      F  5    0 64496.02
## 9  MKTG      M  1    0 99062.66
## 5  SALE      F  5    0 64188.25
## 10 SALE      M 10    0 86150.97

Multiple value variables for which to aggregate over can also be specified. Round the numerical aggregated results to two decimal digits.

pivot(d, mean, c(Years, Salary), c(Dept, Gender), digits_d=2)
##    Dept Gender  n miss Years   Salary
## 1  ACCT      F  3    0  4.67 63237.16
## 2  ADMN      F  4    0  7.50 81434.00
## 3  FINC      F  1    0  7.00 57139.90
## 4  MKTG      F  5    0  8.20 64496.02
## 5  SALE      F  5    0  6.60 64188.25
## 6  ACCT      M  2    0  7.00 59626.20
## 7  ADMN      M  2    0 15.50 80963.34
## 8  FINC      M  3    0 11.33 72967.60
## 9  MKTG      M  1    0 18.00 99062.66
## 10 SALE      M 10    0 12.33 86150.97

The following illustrates the pipe operator from the magrittr package with pivot(). The package is not included with lessR, so separately load with library(), either by itself or as part of the tidyverse package. As such, the following code is not run.

Perform the pivot computations on the d data frame. Then output the aggregated results to the a data frame.

# d %>% pivot(mean, Salary, c(Dept, Gender)) %>% -> a

Output as a 2-d Table

Specify up to two cols_by categorical variables, to create a two-dimensional table with the specified columns. First, one cols_by variable, Gender. Specifying one or two categorical variables as cols_by variables moves them from their default position in the rows to the columns, which changes the output structure from a long-form data frame to a cross-classification table with categorical variables in the rows and columns.

pivot(d, mean, Salary, Dept, Gender)
Table: mean of Salary 

 Gender  F          M               
 Dept                        
-------  ---------  ---------
   ACCT   63237.16   59626.20
   ADMN   81434.00   80963.34
   FINC   57139.90   72967.60
   MKTG   64496.02   99062.66
   SALE   64188.25   86150.97

Here two cols_by variables, specified as a vector.

pivot(d, mean, Salary, Dept, c(Gender, Plan))
Table: mean of Salary 

 Gender  F                                M                                      
 Plan    1          2          3          1          2           3             
 Dept                                                                     
-------  ---------  ---------  ---------  ---------  ----------  ---------
   MKTG   56772.95   66426.79         NA         NA          NA   99062.66
   SALE   60941.54   66352.73         NA   89393.40    82442.74   80131.91
   ACCT         NA   63237.16         NA   69547.60          NA   49704.79
   ADMN         NA   67724.21   122563.4   53788.26   108138.43         NA
   FINC         NA   57139.90         NA   61937.62          NA   95027.55

Tabulation

Tabulation is counting. With tabulation there is no value variable per se. Instead, the number of data values in each cell are tabulated, that is, counted.

pivot(d, tabulate, c(Dept, Gender))
##    Dept Gender  n
## 1  ACCT      F  3
## 2  ADMN      F  4
## 3  FINC      F  1
## 4  MKTG      F  5
## 5  SALE      F  5
## 6  ACCT      M  2
## 7  ADMN      M  2
## 8  FINC      M  3
## 9  MKTG      M  1
## 10 SALE      M 10

Missing Data

The pivot table follows with the missing data analysis when there is only one numerical value over which to aggregate. The variable Years has one missing value.

pivot(d, mean, Years, Dept)
##     by  n miss     Years
## 1 ACCT  5    0  5.600000
## 2 ADMN  6    0 10.166667
## 3 FINC  4    0 10.250000
## 4 MKTG  6    0  9.833333
## 5 SALE 14    1 10.285714

Set na_remove to FALSE to not remove any missing data in a cell with values to be aggregated. The resulting aggregated value will be missing if any of the constituent data values are missing. The corresponding level does not appear because cells with aggregated missing values are not shown.

pivot(d, mean, Years, Dept, na_remove=FALSE)
##     by n miss     Years
## 1 ACCT 5    0  5.600000
## 2 ADMN 6    0 10.166667
## 3 FINC 4    0 10.250000
## 4 MKTG 6    0  9.833333

The na_remove parameter specifies the value of the base R parameter na.rm for computations such as for the mean. See ?mean for its definition.

Here include all the cells, even those with a missing aggregated value.

pivot(d, mean, Years, Dept, na_remove=FALSE, na_value=TRUE)
##     by  n miss     Years
## 1 ACCT  5    0  5.600000
## 2 ADMN  6    0 10.166667
## 3 FINC  4    0 10.250000
## 4 MKTG  6    0  9.833333
## 5 SALE 14    1        NA

To account for missing values of the categorical by variables, set na_by to TRUE.

pivot(d, mean, Years, c(Dept, Gender), na_by=TRUE)
##    Dept Gender n miss     Years
## 1  ACCT      F 3    0  4.666667
## 2  ADMN      F 4    0  7.500000
## 3  FINC      F 1    0  7.000000
## 4  MKTG      F 5    0  8.200000
## 5  SALE      F 5    0  6.600000
## 6  <NA>      F 1    0  5.000000
## 7  ACCT      M 2    0  7.000000
## 8  ADMN      M 2    0 15.500000
## 9  FINC      M 3    0 11.333333
## 10 MKTG      M 1    0 18.000000
## 11 SALE      M 9    1 12.333333

Specify the count for levels that include missing cells. Invoke na_by for tabulation.

pivot(d, tabulate, c(Dept, Gender), na_by=TRUE)
##    Dept Gender  n
## 1  ACCT      F  3
## 2  ADMN      F  4
## 3  FINC      F  1
## 4  MKTG      F  5
## 5  SALE      F  5
## 6  <NA>      F  1
## 7  ACCT      M  2
## 8  ADMN      M  2
## 9  FINC      M  3
## 10 MKTG      M  1
## 11 SALE      M 10
## 12 <NA>      M  0

lessR pivot() vs Base R aggregate()

The lessR pivot() function relies upon the base R function aggregate() for aggregation. By default, pivot() generates a long-form data frame pivot table (Excel terminology), which can then be directly input into analysis functions as a standard data frame. The levels for all the grouping variables are listed in the rows, but if there are specified column grouping variables, pivot() relies upon base R reshape() to form a 2-d table for direct viewing instead of a data table to input into further analysis functions.

pivot() provides additional features than aggregate() provides.

  1. If only one value over which to aggregate, the sample size for each cell is provided with the number of missing values.
  2. Missing data analysis by cell or by value aggregated.
  3. The aggregated computations can be displayed as a 2-d table, not just a long-form data frame.
  4. The data parameter is listed first in the parameter list, which facilitates the use of the pipe operator from the magrittr package.

Also, there is a different interface as the variables, for rows and columns, are each specified as a vector.