Retrieving multi-sheet XLS/XLSX resources

Excel files (XLS and XLSX) are a common form of data on the City of Toronto Open Data Portal.

In cases where the file only contains one sheet, the resource is returned as a tibble. For example, this data set on TTC Ridership Analysis from 1985 to 2018:

library(opendatatoronto)

list_package_resources("https://open.toronto.ca/dataset/ttc-ridership-analysis/") %>%
  get_resource()
#> # A tibble: 64 x 36
#>    `TORONTO TRANSI… ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
#>    <chr>            <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 ANALYSIS OF RID… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  2 1985 TO 2018 AC… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  3 <NA>             <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  4 <NA>             FARE… 2018  2017  2016  2015… 2014  2013  2012  2011 
#>  5 WHO              ADULT <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  6 <NA>             TOKE… 46974 76106 1020… 1109… 1111… 1123… 1179… 1247…
#>  7 <NA>             TICK… N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A  
#>  8 <NA>             TWO-… N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A  
#>  9 <NA>             PRES… 1168… 67829 27397 13323 9862  8194  4399  1139 
#> 10 <NA>             PRES… 1752  N/A   N/A   N/A   N/A   N/A   N/A   N/A  
#> # … with 54 more rows, and 26 more variables: ...11 <chr>, ...12 <chr>,
#> #   ...13 <chr>, ...14 <chr>, ...15 <chr>, ...16 <chr>, ...17 <chr>,
#> #   ...18 <chr>, ...19 <chr>, ...20 <chr>, ...21 <chr>, ...22 <chr>,
#> #   ...23 <chr>, ...24 <chr>, ...25 <chr>, ...26 <chr>, ...27 <chr>,
#> #   ...28 <chr>, ...29 <chr>, ...30 <chr>, ...31 <chr>, ...32 <chr>,
#> #   ...33 <chr>, ...34 <chr>, ...35 <chr>, ...36 <chr>

When the file contains multiple sheets, the resource is returned as a named list, where the names are the names of the sheets, as in the dataset on Wellbeing Toronto Demographics:

library(dplyr)

wellbeing_toronto_demographics <- list_package_resources("https://open.toronto.ca/dataset/wellbeing-toronto-demographics/") %>%
  filter(name == "wellbeing-toronto-demographics") %>%
  get_resource()

str(wellbeing_toronto_demographics, max.level = 1)
#> List of 3
#>  $ IndicatorMetaData      :Classes 'tbl_df', 'tbl' and 'data.frame': 87 obs. of  8 variables:
#>  $ RawData-Ref Period 2008:Classes 'tbl_df', 'tbl' and 'data.frame': 141 obs. of  85 variables:
#>  $ RawData-Ref Period 2011:Classes 'tbl_df', 'tbl' and 'data.frame': 141 obs. of  39 variables:

To access the relevant sheet, pull out the list element:

wellbeing_toronto_demographics[["IndicatorMetaData"]]
#> # A tibble: 87 x 8
#>    PROVENANCE SHORT_NAME LONG_NAME DESCRIPTION URL   CURRENCY           
#>    <chr>      <chr>      <chr>     <chr>       <chr> <dttm>             
#>  1 Statistic… Total Pop… Total Po… For Refere… http… 2011-04-25 00:00:00
#>  2 Statistic… Pop - Mal… Total Po… For Refere… http… 2011-04-25 00:00:00
#>  3 Statistic… Pop - Fem… Total Po… For Refere… http… 2011-04-25 00:00:00
#>  4 Statistic… Pop 0 - 4… Total Po… For Refere… http… 2011-04-25 00:00:00
#>  5 Statistic… Pop 5 - 9… Total Po… For Refere… http… 2011-04-25 00:00:00
#>  6 Statistic… Pop 6-12 … Total Po… For Refere… http… 2006-05-01 00:00:00
#>  7 Statistic… Pop 10 - … Total Po… For Refere… http… 2011-04-25 00:00:00
#>  8 Statistic… Pop 15 -1… Total Po… For Refere… http… 2011-04-25 00:00:00
#>  9 Statistic… Pop 20 - … Total Po… For Refere… http… 2011-04-25 00:00:00
#> 10 Statistic… Pop  25 -… Total Po… For Refere… http… 2011-04-25 00:00:00
#> # … with 77 more rows, and 2 more variables: DATE_UPDATED <dttm>,
#> #   DOMAIN <chr>

There are also cases where the file contains multiple sheets and it would be helpful to have them all together as a single data set. For example, the 2019 TTC Bus Delay Data:

ttc_bus_delays_2019 <- search_packages("TTC Bus Delay Data") %>%
  list_package_resources() %>%
  filter(name == "ttc-bus-delay-data-2019") %>%
  get_resource()

The result of is a list with an element for every month of data, each of which is a tibble:

str(ttc_bus_delays_2019, max.level = 1)
#> List of 10
#>  $ Jan 2019  :Classes 'tbl_df', 'tbl' and 'data.frame':  6743 obs. of  10 variables:
#>  $ Feb 2019  :Classes 'tbl_df', 'tbl' and 'data.frame':  6958 obs. of  10 variables:
#>  $ Mar 2019  :Classes 'tbl_df', 'tbl' and 'data.frame':  5712 obs. of  10 variables:
#>  $ Apr 2019  :Classes 'tbl_df', 'tbl' and 'data.frame':  5144 obs. of  11 variables:
#>  $ May 2019  :Classes 'tbl_df', 'tbl' and 'data.frame':  5023 obs. of  10 variables:
#>  $ June 2019 :Classes 'tbl_df', 'tbl' and 'data.frame':  5232 obs. of  10 variables:
#>  $ July 2019 :Classes 'tbl_df', 'tbl' and 'data.frame':  5113 obs. of  10 variables:
#>  $ Aug, 2019 :Classes 'tbl_df', 'tbl' and 'data.frame':  4354 obs. of  10 variables:
#>  $ Sept 2019 :Classes 'tbl_df', 'tbl' and 'data.frame':  3894 obs. of  10 variables:
#>  $ Oct 2019  :Classes 'tbl_df', 'tbl' and 'data.frame':  4283 obs. of  10 variables:

Note that the data for for the element Apr 2019 has one more variable than the rest (11 versus 10):

sapply(ttc_bus_delays_2019, colnames)
#> $`Jan 2019`
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"    
#> 
#> $`Feb 2019 `
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"    
#> 
#> $`Mar 2019 `
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"    
#> 
#> $`Apr 2019`
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident ID" "Incident"    "Delay"       "Gap"         "Direction"  
#> [11] "Vehicle"    
#> 
#> $`May 2019 `
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"    
#> 
#> $`June 2019`
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Delay"       "Gap"         "Direction"   "Vehicle"    
#> 
#> $`July 2019`
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"    
#> 
#> $`Aug, 2019 `
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"    
#> 
#> $`Sept 2019`
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"    
#> 
#> $`Oct 2019 `
#>  [1] "Report Date" "Route"       "Time"        "Day"         "Location"   
#>  [6] "Incident"    "Min Delay"   "Min Gap"     "Direction"   "Vehicle"

It seems that the Apr 2019 data has gained a variable Incident ID, and that the variables Min Gap and Min Delay, present in all the other months, have been renamed to Gap and Delay, respectively.

We can rename these two variables:

ttc_bus_delays_2019[["Apr 2019"]] <- ttc_bus_delays_2019[["Apr 2019"]] %>%
  rename(`Min Gap` = Gap, `Min Delay` = Delay)

and combine all of the elements into a single tibble using dplyr::bind_rows():

ttc_bus_delays_2019_combined <- bind_rows(ttc_bus_delays_2019)

ttc_bus_delays_2019_combined
#> # A tibble: 52,456 x 13
#>    `Report Date`       Route Time                Day   Location Incident
#>    <dttm>              <dbl> <dttm>              <chr> <chr>    <chr>   
#>  1 2019-01-01 00:00:00    39 1899-12-31 00:13:00 Tues… NECR     Mechani…
#>  2 2019-01-01 00:00:00   111 1899-12-31 00:15:00 Tues… Eglingt… Mechani…
#>  3 2019-01-01 00:00:00    35 1899-12-31 00:18:00 Tues… Finch    Mechani…
#>  4 2019-01-01 00:00:00    25 1899-12-31 00:30:00 Tues… Don Mil… Mechani…
#>  5 2019-01-01 00:00:00    36 1899-12-31 00:40:00 Tues… Humberw… Investi…
#>  6 2019-01-01 00:00:00    45 1899-12-31 00:51:00 Tues… Kipling… Utilize…
#>  7 2019-01-01 00:00:00    32 1899-12-31 01:55:00 Tues… Royal Y… Mechani…
#>  8 2019-01-01 00:00:00    53 1899-12-31 02:19:00 Tues… FSTN     Utilize…
#>  9 2019-01-01 00:00:00   112 1899-12-31 02:33:00 Tues… Kipling… Mechani…
#> 10 2019-01-01 00:00:00    85 1899-12-31 02:57:00 Tues… DONS     Mechani…
#> # … with 52,446 more rows, and 7 more variables: `Min Delay` <dbl>, `Min
#> #   Gap` <dbl>, Direction <chr>, Vehicle <dbl>, `Incident ID` <dbl>,
#> #   Delay <dbl>, Gap <dbl>

Unfortunately, it looks like the Time variable got Excel™ed, and will need some data cleaning.

For interests sake, it appears that Incident ID is a lookup ID for the type of incident – only present in the Apr 2019 data, but interesting nonetheless!

ttc_bus_delays_2019_combined %>%
  filter(!is.na(`Incident ID`)) %>%
  distinct(`Incident ID`, Incident)
#> # A tibble: 11 x 2
#>    `Incident ID` Incident                        
#>            <dbl> <chr>                           
#>  1             5 Investigation                   
#>  2             1 Mechanical                      
#>  3             4 Utilized Off Route              
#>  4             3 Diversion                       
#>  5             9 <NA>                            
#>  6             8 General Delay                   
#>  7             6 Emergency Services              
#>  8            10 Late Leaving Garage - Operator  
#>  9            11 Late Leaving Garage - Mechanical
#> 10             7 Vision                          
#> 11            12 Late Leaving Garage - Management