Using readabs

Matt Cowgill

When working with time series data from the Australian Bureau of Statistics (ABS), you must:

  1. Download the data;
  2. Read the data into R; and
  3. Tidy the data.

The readabs package provides functions to help you with each of those steps. One key function will help streamline the process of analysing ABS time series data:

A second function, read_abs_local() is useful if you have already downloaded ABS time series spreadsheet to disk; it imports and tidies the spreadsheets.

The messiness of ABS time series

If you want to visualise or analyse data in R, you will often need to tidy it first. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

ABS time series data is not tidy. Tidying it requires a bit of work. This screenshot of an ABS time series spreadsheet shows some of the problems, namely:

readabs does a lot of the work of tidying these spreadsheets for you, so you can get to your analysis more quickly.

readabs only works with time series

The spreadsheets on the ABS website are divided into one of two categories: time series spreadsheets and data cubes. For example, the main Labour Force release contains both:

The readabs package can download and tidy data contained in ABS time series spreadsheets. It can’t download or tidy any spreadsheet the ABS describes as a ‘data cube’.

How to use read_abs() to get a whole catalogue number

The main function in the package is read_abs(). If you give it an ABS catalogue number, it will download, import and tidy all the time series spreadsheets from that catalogue number. Easy!

For example, to get all the spreadsheets from the Wage Price Index, catalogue number 6345.0, we’d do:

library(readabs)

wpi <- read_abs("6345.0")
#> Finding filenames for tables from ABS catalogue 6345.0
#> Attempting to download files from cat. no. 6345.0, Wage Price Index, Australia
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets

Cool! Now we’ve got a data frame (a tibble, to be precise) that contains all the time series from the Wage Price Index, converted to long and stacked on top of each other. Here’s what it looks like:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

glimpse(wpi)
#> Observations: 54,261
#> Variables: 12
#> $ table_no         <chr> "634501", "634501", "634501", "634501", "634501…
#> $ sheet_no         <chr> "Data1", "Data1", "Data1", "Data1", "Data1", "D…
#> $ table_title      <chr> "Table 1. Total Hourly Rates of Pay Excluding B…
#> $ date             <date> 1997-09-01, 1997-12-01, 1998-03-01, 1998-06-01…
#> $ series           <chr> "Quarterly Index ;  Total hourly rates of pay e…
#> $ value            <dbl> 67.4, 67.9, 68.5, 68.8, 69.6, 70.0, 70.4, 70.8,…
#> $ series_type      <chr> "Original", "Original", "Original", "Original",…
#> $ data_type        <chr> "INDEX", "INDEX", "INDEX", "INDEX", "INDEX", "I…
#> $ collection_month <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3…
#> $ frequency        <chr> "Quarter", "Quarter", "Quarter", "Quarter", "Qu…
#> $ series_id        <chr> "A2603039T", "A2603039T", "A2603039T", "A260303…
#> $ unit             <chr> "Index Numbers", "Index Numbers", "Index Number…

It’s over 54 000 rows long, and 12 variables wide. Some catalogue numbers are much bigger - for example, if you get the entire monthly Labour Force release (catalogue number 6202.0), you’ll have a data frame with over 2.1 million rows.

All the metadata from the time series spreadsheets is included in the data frame:

The table_no and sheet_no columns will help you if you need to cross-check information on the ABS spreadsheet - table_no matches the filename of the spreadsheet (eg. ‘634501.xls’) and sheet_no is the name of the Excel worksheet within the file that contains the time series.

To omit the metadata from your dataframe, you can run:

wpi_nometadata <- read_abs("6345.0", metadata = FALSE)

If you specify metadata = FALSE, you’ll get a data frame that contains only 6 columns: table_no, sheet_no, table_title, date, series_id, and value.

How to use read_abs() to get individual table(s)

Unless you tell it otherwise, read_abs() will get all the time series spreadsheets from a given catalogue number.

Quite often this will be overkill. Maybe you don’t want all 2.1 million rows of Labour Force data; perhaps you know that the time series you need is in table 1. In that case you can use the tables argument to read_abs() to specify the table(s) you want:


lfs_1 <- read_abs("6202.0", tables = 1)
#> Finding filenames for tables from ABS catalogue 6202.0
#> Attempting to download files from cat. no. 6202.0, Labour Force, Australia
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets

glimpse(lfs_1)
#> Observations: 55,974
#> Variables: 12
#> $ table_no         <chr> "6202001", "6202001", "6202001", "6202001", "62…
#> $ sheet_no         <chr> "Data1", "Data1", "Data1", "Data1", "Data1", "D…
#> $ table_title      <chr> "Table 1. Labour force status by Sex, Australia…
#> $ date             <date> 1978-02-01, 1978-03-01, 1978-04-01, 1978-05-01…
#> $ series           <chr> "Employed total ;  Persons ;", "Employed total …
#> $ value            <dbl> 6008.345, 6015.288, 6021.676, 6027.323, 6031.31…
#> $ series_type      <chr> "Trend", "Trend", "Trend", "Trend", "Trend", "T…
#> $ data_type        <chr> "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "S…
#> $ collection_month <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ frequency        <chr> "Month", "Month", "Month", "Month", "Month", "M…
#> $ series_id        <chr> "A84423127L", "A84423127L", "A84423127L", "A844…
#> $ unit             <chr> "000", "000", "000", "000", "000", "000", "000"…

If you want more than one table, but not the whole catalogue number, you can specify multiple tables:


lfs_1_5 <- read_abs("6202.0", tables = c(1, 5))
#> Finding filenames for tables from ABS catalogue 6202.0
#> Attempting to download files from cat. no. 6202.0, Labour Force, Australia
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets

glimpse(lfs_1_5)
#> Observations: 97,218
#> Variables: 12
#> $ table_no         <chr> "6202001", "6202001", "6202001", "6202001", "62…
#> $ sheet_no         <chr> "Data1", "Data1", "Data1", "Data1", "Data1", "D…
#> $ table_title      <chr> "Table 1. Labour force status by Sex, Australia…
#> $ date             <date> 1978-02-01, 1978-03-01, 1978-04-01, 1978-05-01…
#> $ series           <chr> "Employed total ;  Persons ;", "Employed total …
#> $ value            <dbl> 6008.345, 6015.288, 6021.676, 6027.323, 6031.31…
#> $ series_type      <chr> "Trend", "Trend", "Trend", "Trend", "Trend", "T…
#> $ data_type        <chr> "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "S…
#> $ collection_month <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ frequency        <chr> "Month", "Month", "Month", "Month", "Month", "M…
#> $ series_id        <chr> "A84423127L", "A84423127L", "A84423127L", "A844…
#> $ unit             <chr> "000", "000", "000", "000", "000", "000", "000"…

The tables argument can be either a numeric vector (eg. c(1, 5)) or a character vector (eg. c(“1”, “5a”)).

I’ve imported the data… now what?

Because read_abs() does the work for you of getting your data in a tidy (long) format, it’s easy to filter to the data you’re interested in and generate output (like graphs) using the Tidyverse packages, such as dplyr and ggplot2.

In this example, we’ll work with data from the Labour Force survey that we downloaded and tidied earlier using read_abs(). First, load the packages you need:

Now let’s have a look at the time series from table 1 of the Labour Force survey:

unique(lfs_1$series)
#>  [1] "Employed total ;  Persons ;"                                      
#>  [2] "Employed total ;  > Males ;"                                      
#>  [3] "Employed total ;  > Females ;"                                    
#>  [4] "> Employed full-time ;  Persons ;"                                
#>  [5] "> Employed full-time ;  > Males ;"                                
#>  [6] "> Employed full-time ;  > Females ;"                              
#>  [7] "> Employed part-time ;  Persons ;"                                
#>  [8] "> Employed part-time ;  > Males ;"                                
#>  [9] "> Employed part-time ;  > Females ;"                              
#> [10] "Employment to population ratio ;  Persons ;"                      
#> [11] "Employment to population ratio ;  > Males ;"                      
#> [12] "Employment to population ratio ;  > Females ;"                    
#> [13] "Unemployed total ;  Persons ;"                                    
#> [14] "Unemployed total ;  > Males ;"                                    
#> [15] "Unemployed total ;  > Females ;"                                  
#> [16] "> Unemployed looked for full-time work ;  Persons ;"              
#> [17] "> Unemployed looked for full-time work ;  > Males ;"              
#> [18] "> Unemployed looked for full-time work ;  > Females ;"            
#> [19] "> Unemployed looked for only part-time work ;  Persons ;"         
#> [20] "> Unemployed looked for only part-time work ;  > Males ;"         
#> [21] "> Unemployed looked for only part-time work ;  > Females ;"       
#> [22] "Unemployment rate ;  Persons ;"                                   
#> [23] "Unemployment rate ;  > Males ;"                                   
#> [24] "Unemployment rate ;  > Females ;"                                 
#> [25] "> Unemployment rate looked for full-time work ;  Persons ;"       
#> [26] "> Unemployment rate looked for full-time work ;  > Males ;"       
#> [27] "> Unemployment rate looked for full-time work ;  > Females ;"     
#> [28] "> Unemployment rate looked for only part-time work ;  Persons ;"  
#> [29] "> Unemployment rate looked for only part-time work ;  > Males ;"  
#> [30] "> Unemployment rate looked for only part-time work ;  > Females ;"
#> [31] "Labour force total ;  Persons ;"                                  
#> [32] "Labour force total ;  > Males ;"                                  
#> [33] "Labour force total ;  > Females ;"                                
#> [34] "Participation rate ;  Persons ;"                                  
#> [35] "Participation rate ;  > Males ;"                                  
#> [36] "Participation rate ;  > Females ;"                                
#> [37] "Not in the labour force (NILF) ;  Persons ;"                      
#> [38] "Not in the labour force (NILF) ;  > Males ;"                      
#> [39] "Not in the labour force (NILF) ;  > Females ;"                    
#> [40] "Civilian population aged 15 years and over ;  Persons ;"          
#> [41] "Civilian population aged 15 years and over ;  > Males ;"          
#> [42] "Civilian population aged 15 years and over ;  > Females ;"

OK! There’s a bunch of data in here. Let’s make a data frame that just contains the male and female unemployment rates over time, using the seasonally adjusted time series. I’ll use the grepl() function from base R to help filter the data frame so it only contains rows I’m interested in.


unemp <- lfs_1 %>%
  filter(grepl("Unemployment rate", series))

unique(unemp$series)
#> [1] "Unemployment rate ;  Persons ;"                                   
#> [2] "Unemployment rate ;  > Males ;"                                   
#> [3] "Unemployment rate ;  > Females ;"                                 
#> [4] "> Unemployment rate looked for full-time work ;  Persons ;"       
#> [5] "> Unemployment rate looked for full-time work ;  > Males ;"       
#> [6] "> Unemployment rate looked for full-time work ;  > Females ;"     
#> [7] "> Unemployment rate looked for only part-time work ;  Persons ;"  
#> [8] "> Unemployment rate looked for only part-time work ;  > Males ;"  
#> [9] "> Unemployment rate looked for only part-time work ;  > Females ;"

Now we have a data frame, unemp, that contains various unemployment rate series. Let’s drop the ones that refer to all persons, or refer to people looking for full-time or part-time work:


unemp <- unemp %>%
  filter(grepl("Males", series) | grepl("Females", series)) %>%
  filter(!grepl("looked for", series)) 

unique(unemp$series)
#> [1] "Unemployment rate ;  > Males ;"   "Unemployment rate ;  > Females ;"

Now our data frame only contains the male and female unemployment rates, which is what we want. Let’s graph it, filtering once more to show only the seasonally adjusted series and adding a ‘sex’ column:

library(ggplot2)

unemp %>%
  filter(series_type == "Seasonally Adjusted") %>%
  mutate(sex = if_else(grepl("Males", series), "Males", "Females")) %>%
  ggplot(aes(x = date, y = value, col = sex)) +
  geom_line() +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.title = element_blank(),
        legend.title = element_blank(),
        text = element_text(size = 5)) +
  labs(title = "The male and female unemployment rates have converged",
       subtitle = "Unemployment rates for Australian men and women (aged 15+), 1978-2018 (per cent)",
       caption = "Source: ABS 6202.0")

Ta-da! Now we’ve got a nice little ggplot2 graph - and you didn’t need to go to the ABS website or click around in Excel.

Why is my hard drive filling up with spreadsheets?

The read_abs() function downloads spreadsheets from the ABS website, then loads them into R, then tidies them. By default, the spreadsheets will be saved in a data/ABS subdirectory of your working directory. You can change this location using the path argument to read_abs().

What if I’ve already downloaded spreadsheets?

If you already have ABS time series spreadsheets saved locally that you want to read, the read_abs_local() function is what you want.

If you don’t just run read_abs_local() without any arguments, it will look in the data/ABS subdirectory of your working directory and attempt to read any .xls files located there. If you want to read all the files from a different directory, specify it using the path argument.

If you want to read a particular table, or tables, specify them using the filenames argument, like this:


lfs_local <- read_abs_local(c("6202001.xls", "6202005.xls"))
#> Extracting data from locally-saved spreadsheets
#> Tidying data from imported ABS spreadsheets

The data frame you’ll get will look the same as if you’d used read_abs() to get the spreadsheet(s) from the ABS website.