Using the censusxy Package

Branson Fox and Christopher Prener, Ph.D.

2019-08-09

Overview

The censusxy package is designed to provide easy and efficient access to the U.S. Census Bureau Batch Geocoder in R.

Motivation

The censusxy package has been developed specifically with large data sets in mind. Other R implementations for accessing the Census Bureau’s geocoding API (e.g., the censusr package) require iteration to geocode multiple addresses. censusxy, on the other hand, is designed to operate on a column of addresses in a data frame or tibble object. Additionally, the Census Bureau caps the number of addresses that can be sent to the API in a single call at 1,000. If a data set exceeds 1,000 unique addresses, censusxy automatically splits it into appropriately sized API calls, geocodes each one, and puts the results back together so that a single object is returned. The package therefore provides an efficient solution to batch geocoding via the Census Bureau’s services.
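The split-geocode-recombine idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual implementation: mock_geocode() and geocode_in_batches() are hypothetical names, and the mock performs no API call so that the example runs offline.

```r
# Hypothetical stand-in for a single API call; it only tags each address
# so the example runs offline. It is not part of censusxy.
mock_geocode <- function(addresses) {
  data.frame(address = addresses, matched = TRUE, stringsAsFactors = FALSE)
}

# Hypothetical helper illustrating the batching logic
geocode_in_batches <- function(addresses, batch_size = 1000) {
  # Geocode each distinct address only once
  unique_addresses <- unique(addresses)

  # Assign every unique address to a batch of at most `batch_size`
  batch_id <- ceiling(seq_along(unique_addresses) / batch_size)
  batches <- split(unique_addresses, batch_id)

  # One call per batch, then stack the results into a single data frame
  results <- do.call(rbind, lapply(batches, mock_geocode))
  rownames(results) <- NULL

  # Return one row per input address, in the original order
  results[match(addresses, results$address), , drop = FALSE]
}

# 2,500 unique addresses plus 10 repeats: three batches, 2,510 rows returned
addresses <- paste(seq_len(2500), "Main St")
geocoded <- geocode_in_batches(c(addresses, addresses[1:10]))
nrow(geocoded)
```

Deduplicating before batching keeps repeated addresses from counting against the per-call cap, and the final match() restores the original row order.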

Responsible Use

The U.S. Census Bureau makes its geocoding API available without an API key, and this package allows for virtually unlimited batch geocoding. Please use this package responsibly, because other researchers depend on the same API for their work.

Installation

We recommend that users install sf before proceeding with the installation of censusxy. Windows users should be able to install sf without significant issues, but macOS and Linux users will need to install several open source spatial libraries to get sf itself up and running. The easiest approach for macOS users is to install the GDAL 2.0 Complete framework from KyngChaos.

For Linux users, steps will vary based on the flavor being used. Our configuration file for Travis CI and its associated bash script should be useful in determining the necessary components to install.

Once sf is installed, the easiest way to get censusxy is to install it from CRAN:

install.packages("censusxy")

The development version of censusxy can be accessed from GitHub with remotes:

# install.packages("remotes")
remotes::install_github("slu-openGIS/censusxy")

Tips

The key function, cxy_geocode, supports non-standard evaluation, meaning you can use either quoted or unquoted inputs for arguments that refer to variable names.
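As a rough illustration of what accepting both quoted and unquoted column names involves, the base R sketch below resolves either form to the same string. The helper column_name() is hypothetical and is not how censusxy is necessarily implemented.

```r
# Hypothetical helper, not part of censusxy: resolves a column reference
# that may be supplied either as a bare name or as a string.
column_name <- function(column) {
  # Capture the argument as written, without evaluating it
  expr <- substitute(column)
  # A quoted input is already a string; a bare name is deparsed into one
  if (is.character(expr)) expr else deparse(expr)
}

column_name(street_address)
column_name("street_address")
```

Both calls resolve to the same column name, which is why address = street_address and address = "street_address" can be treated as equivalent.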

Workflow

Data

This implementation assumes that your data are contained in a data.frame or tibble, and that address data are split into a number of component variables: street address, city, state, and five-digit ZIP code. If your data are not split into components, we recommend the postmastr package for street address parsing. Not all components are required. For example, the sample homicide data included in the package lack ZIP code data. However, the more components you have, the better your results will be. Both sample data objects in this package present data as they should be formatted for geocoding.
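A minimal sketch of this layout follows. The addresses are hypothetical and exist only to show the shape of the data; the column names mirror those used in the calls in this document.

```r
# Hypothetical addresses, for illustration only
addresses <- data.frame(
  id = 1:3,
  street_address = c("100 Example Ave", "200 Sample St", "300 Test Blvd"),
  city = c("St. Louis", "St. Louis", "St. Louis"),
  state = c("MO", "MO", "MO"),
  # Stored as character so leading zeros survive; a component may be missing
  postal_code = c("63101", NA, "63103"),
  stringsAsFactors = FALSE
)

# Count how many address components are present in each row
components <- c("street_address", "city", "state", "postal_code")
addresses$n_components <- rowSums(!is.na(addresses[components]))
addresses
```

A quick count like this makes it easy to see which rows are missing components before they are sent for geocoding.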

Usage

This package contains a single exported function, cxy_geocode(). The only required arguments are .data for the data.frame or tibble containing address data, and address specifying the column name containing street addresses. The function supports non-standard evaluation, meaning you do not need to quote arguments for column names.

results <- cxy_geocode(stl_homicides, address = street_address)

However, it is highly recommended that you include city, state and zip code as well. Doing so will increase speed and accuracy significantly. The homicide data contain city and state data as well, so the preferred call for these data would be:

results <- cxy_geocode(stl_homicides, address = street_address, city = city, state = state)

Finally, two output types are supported. By default, a tibble is returned (output = "tibble") with a minimal set of variables that describe the accuracy of a given observation’s geocoding (style = "minimal"). A complete set of values returned by the API for each observation can be obtained by using style = "full". Alternatively, an sf object can be returned with the geocoded coordinates referenced to the WGS 1984 geographic coordinate system:

homicide_sf <- cxy_geocode(stl_homicides, id, street_address, city, state, postal_code, output = "sf")

Note, however, that output = "sf" returns only matched addresses, including those approximated by street length. Any unmatched addresses are dropped from the output. Use output = "tibble" to return all addresses, including those that are unmatched.
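One way to see which observations were dropped is to compare identifiers between the input and the geocoded output. The sketch below uses hypothetical data and assumes the output retains an id column from the input; it does not call cxy_geocode().

```r
# Hypothetical input; only the shared `id` column matters here
input <- data.frame(
  id = 1:5,
  street_address = paste(1:5, "Example St"),
  stringsAsFactors = FALSE
)

# Pretend rows 3 and 5 failed to match and were dropped from the output
geocoded <- input[c(1, 2, 4), , drop = FALSE]

# Rows present in the input but absent from the output were unmatched
dropped <- input[!input$id %in% geocoded$id, , drop = FALSE]
dropped$id
```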

Output returned as an sf object can be previewed with a package like mapview:

mapview::mapview(homicide_sf)

Timeout

The function has a timeout argument, which specifies how many minutes may pass before an API query ends with an error. In this implementation, the timeout applies to each batch of 1,000 addresses, not to the data set as a whole. The default is 30 minutes, which should be appropriate for most internet speeds.

If a batch times out, the function moves on and attempts the next 1,000 addresses.
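The continue-after-failure behavior can be sketched with tryCatch() in base R. The functions mock_batch_call() and run_batches() are hypothetical, the simulated failure stands in for a real timeout, and the package's own error handling may differ.

```r
# Hypothetical stand-in for one API call: the second batch "times out"
mock_batch_call <- function(batch_number) {
  if (batch_number == 2) stop("timeout reached")
  data.frame(batch = batch_number, status = "returned",
             stringsAsFactors = FALSE)
}

# Hypothetical helper: run every batch, skipping any that fail
run_batches <- function(batch_numbers) {
  results <- lapply(batch_numbers, function(i) {
    tryCatch(
      mock_batch_call(i),
      error = function(e) {
        # Report the failure and return NULL so the loop moves on
        message("Batch ", i, " failed: ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop the failed batches, then stack what was returned
  do.call(rbind, Filter(Negate(is.null), results))
}

completed <- run_batches(1:3)
completed$batch
```

Here the first and third batches are returned even though the second one fails, so a single timeout does not discard the rest of the job.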

Be aware that your computer may go to sleep while a long-running batch is in progress, which can prevent the batch from ever returning. macOS users may find the Caffeine app useful.

Getting Help