Aggregation is the process forming groups of data based on the levels of categorical variables, then computing some statistical value, such as a mean, for each group on a numeric variable. For example, compute the mean Salary for each combination of the levels of Gender and Department. Refer to each group, the cross-classification of all the specified categorical variables, as a *cell*.

The following table lists many of the available statistical functions by which to aggregate over the cells.

Statistic | Meaning |
---|---|

`sum` |
sum |

`mean` |
arithmetic mean |

`median` |
median |

`min` |
minimum |

`max` |
maximum |

`sd` |
standard deviation |

`var` |
variance |

`IQR` |
inter-quartile range |

`mad` |
mean absolute deviation |

`tabulate` |
count of each cell only |

Some statistical functions are available that return multiple values.

Statistic | Meaning |
---|---|

`range` |
minimum, maximum |

`quantile` |
range + quartiles |

`summary` |
quantile + mean |

These later three functions output their values as an R matrix, which then replaces the values of the variable that is aggregated in the resulting output data frame.

The following `pivot()`

parameters specify the data, the statistic for the aggregation, the value(s) over which to aggregate, and the corresponding cells that contain the aggregated values. The first four parameter values listed below are required.

`data`

: The data frame that includes the variables of interest.`compute`

: The function for which to perform the aggregation.`value`

: The variable(s) for which to summarize, i.e., aggregate.`rows_by`

: The categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values.`cols_by`

: The optional categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table.

Multiple values of parameters `value`

, `rows_by`

, and up to two `cols_by`

variables may be specified. Express the multiple categorical variables over which to pivot as a vector, such as with the `c()`

function. If `cols_by`

is not specified, the result is a linear table as a long-form data frame that can be input into other data analysis procedures.

By default, missing values are eliminated from the aggregation. If there are no values for a cell for which to perform the aggregation, and it is desired to have the aggregated value listed as missing, then specify `na_value`

as `TRUE`

. If any of the levels of the `by`

variables are missing, to report those missing cells, specify `na_by`

as `TRUE`

. Set `na_remove`

to `TRUE`

to have any aggregated value defined as missing if any individual data cell for `value`

is missing.

To illustrate, use the Employee data set included in **lessR**, here read into the *d* data frame. Begin with no missing data.

```
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 15 ... 1 2 10
## 2 Gender character 37 0 2 M M M ... F F M
## 3 Dept character 36 1 5 ADMN SALE SALE ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low low ... high low high
## 6 Plan integer 37 0 3 1 1 3 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 96 ... 83 59 80
## 8 Post integer 37 0 22 92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
```

Two categorical variables in the *d* data frame are *Dept* and *Salary*. A continuous variable is *Salary*. Create the long-form pivot table as a data frame that expresses the mean of *Salary* for all combinations of *Dept* and *Salary*.

This example includes the parameter names for emphasis, but if the parameters are entered in this order, the names are not necessary.

```
## Dept Gender n miss Salary
## 1 ACCT F 3 0 63237.16
## 2 ADMN F 4 0 81434.00
## 3 FINC F 1 0 57139.90
## 4 MKTG F 5 0 64496.02
## 5 SALE F 5 0 64188.25
## 6 ACCT M 2 0 59626.20
## 7 ADMN M 2 0 80963.35
## 8 FINC M 3 0 72967.60
## 9 MKTG M 1 0 99062.66
## 10 SALE M 10 0 86150.97
```

The output of `pivot()`

is a data frame of the aggregated variables. This can be saved for further analysis. Here, perform the same analysis, but list the required parameter values in order without the parameter names.

Because the output of `pivot()`

with no `cols_by`

variables is a standard R data frame, typical operations such as sorting can be applied, here using the **lessR** function `Sort()`

.

```
##
## Sort Specification
## 1 --> ascending
```

```
## Dept Gender n miss Salary
## 1 ACCT F 3 0 63237.16
## 6 ACCT M 2 0 59626.20
## 2 ADMN F 4 0 81434.00
## 7 ADMN M 2 0 80963.35
## 3 FINC F 1 0 57139.90
## 8 FINC M 3 0 72967.60
## 4 MKTG F 5 0 64496.02
## 9 MKTG M 1 0 99062.66
## 5 SALE F 5 0 64188.25
## 10 SALE M 10 0 86150.97
```

Multiple `value`

variables for which to aggregate over can also be specified. Round the numerical aggregated results to two decimal digits.

```
## Dept Gender n miss Years Salary
## 1 ACCT F 3 0 4.67 63237.16
## 2 ADMN F 4 0 7.50 81434.00
## 3 FINC F 1 0 7.00 57139.90
## 4 MKTG F 5 0 8.20 64496.02
## 5 SALE F 5 0 6.60 64188.25
## 6 ACCT M 2 0 7.00 59626.20
## 7 ADMN M 2 0 15.50 80963.34
## 8 FINC M 3 0 11.33 72967.60
## 9 MKTG M 1 0 18.00 99062.66
## 10 SALE M 10 0 12.33 86150.97
```

The following illustrates the pipe operator from the **magrittr** package with `pivot()`

. The package is not included with **lessR**, so separately load with `library()`

, either by itself or as part of the **tidyverse** package. As such, the following code is not run.

Perform the pivot computations on the *d* data frame. Then output the aggregated results to the *a* data frame.

Specify up to two `cols_by`

categorical variables, to create a two-dimensional table with the specified columns. First, one `cols_by`

variable, Gender. Specifying one or two categorical variables as `cols_by`

variables moves them from their default position in the rows to the columns, which changes the output structure from a long-form data frame to a cross-classification table with categorical variables in the rows and columns.

```
Table: mean of Salary
Gender F M
Dept
------- --------- ---------
ACCT 63237.16 59626.20
ADMN 81434.00 80963.34
FINC 57139.90 72967.60
MKTG 64496.02 99062.66
SALE 64188.25 86150.97
```

Here two `cols_by`

variables, specified as a vector.

```
Table: mean of Salary
Gender F M
Plan 1 2 3 1 2 3
Dept
------- --------- --------- --------- --------- ---------- ---------
MKTG 56772.95 66426.79 NA NA NA 99062.66
SALE 60941.54 66352.73 NA 89393.40 82442.74 80131.91
ACCT NA 63237.16 NA 69547.60 NA 49704.79
ADMN NA 67724.21 122563.4 53788.26 108138.43 NA
FINC NA 57139.90 NA 61937.62 NA 95027.55
```

Tabulation is counting. With tabulation there is no `value`

variable per se. Instead, the number of data values in each cell are tabulated, that is, counted.

```
## Dept Gender n
## 1 ACCT F 3
## 2 ADMN F 4
## 3 FINC F 1
## 4 MKTG F 5
## 5 SALE F 5
## 6 ACCT M 2
## 7 ADMN M 2
## 8 FINC M 3
## 9 MKTG M 1
## 10 SALE M 10
```

The pivot table follows with the missing data analysis when there is only one numerical value over which to aggregate. The variable *Years* has one missing value.

```
## by n miss Years
## 1 ACCT 5 0 5.600000
## 2 ADMN 6 0 10.166667
## 3 FINC 4 0 10.250000
## 4 MKTG 6 0 9.833333
## 5 SALE 14 1 10.285714
```

Set `na_remove`

to `FALSE`

to *not* remove any missing data in a cell with values to be aggregated. The resulting aggregated value will be missing if any of the constituent data values are missing. The corresponding level does not appear because cells with aggregated missing values are not shown.

```
## by n miss Years
## 1 ACCT 5 0 5.600000
## 2 ADMN 6 0 10.166667
## 3 FINC 4 0 10.250000
## 4 MKTG 6 0 9.833333
```

The `na_remove`

parameter specifies the value of the base R parameter `na.rm`

for computations such as for the `mean`

. See `?mean`

for its definition.

Here include all the cells, even those with a missing aggregated value.

```
## by n miss Years
## 1 ACCT 5 0 5.600000
## 2 ADMN 6 0 10.166667
## 3 FINC 4 0 10.250000
## 4 MKTG 6 0 9.833333
## 5 SALE 14 1 NA
```

To account for missing values of the categorical `by`

variables, set `na_by`

to `TRUE`

.

```
## Dept Gender n miss Years
## 1 ACCT F 3 0 4.666667
## 2 ADMN F 4 0 7.500000
## 3 FINC F 1 0 7.000000
## 4 MKTG F 5 0 8.200000
## 5 SALE F 5 0 6.600000
## 6 <NA> F 1 0 5.000000
## 7 ACCT M 2 0 7.000000
## 8 ADMN M 2 0 15.500000
## 9 FINC M 3 0 11.333333
## 10 MKTG M 1 0 18.000000
## 11 SALE M 9 1 12.333333
```

Specify the count for levels that include missing cells. Invoke `na_by`

for tabulation.

```
## Dept Gender n
## 1 ACCT F 3
## 2 ADMN F 4
## 3 FINC F 1
## 4 MKTG F 5
## 5 SALE F 5
## 6 <NA> F 1
## 7 ACCT M 2
## 8 ADMN M 2
## 9 FINC M 3
## 10 MKTG M 1
## 11 SALE M 10
## 12 <NA> M 0
```

`pivot()`

vs `aggregate()`

The **lessR** `pivot()`

function relies upon the **base R** function `aggregate()`

for aggregation. By default, `pivot()`

generates a long-form data frame pivot table (Excel terminology), which can then be directly input into analysis functions as a standard data frame. The levels for all the grouping variables are listed in the rows, but if there are specified column grouping variables, `pivot()`

relies upon base R `reshape()`

to form a 2-d table for direct viewing instead of a data table to input into further analysis functions.

`pivot()`

provides additional features than `aggregate()`

provides.

- If only one
`value`

over which to aggregate, the sample size for each cell is provided with the number of missing values. - Missing data analysis by cell or by
`value`

aggregated. - The aggregated computations can be displayed as a 2-d table, not just a long-form data frame.
- The data parameter is listed first in the parameter list, which facilitates the use of the pipe operator from the
**magrittr**package.

Also, there is a different interface as the variables, for rows and columns, are each specified as a vector.