zipcodeR is an all-in-one toolkit of functions and data for working with ZIP codes in R.
This document will introduce the tools provided by zipcodeR for improving your workflow when working with ZIP code-level data. The goal of these examples is to help you quickly get up and running with zipcodeR using real-world examples.
First things first: zipcodeR's data and basic search functions are a core component of the package. We'll cover these before showing how to apply the package to a real-world example.
The package ships with an offline database containing 24 columns of data for each ZIP code. You can either keep all 24 variables or filter to just one of these depending on what data you need.
The columns of data provided are: zipcode, zipcode_type, major_city, post_office_city, common_city_list, county, state, lat, lng, timezone, radius_in_miles, area_code_list, population, population_density, land_area_in_sqmi, water_area_in_sqmi, housing_units, occupied_housing_units, median_home_value, median_household_income, bounds_west, bounds_east, bounds_north, bounds_south
Let's begin by using zipcodeR to find all ZIP codes within a given state.
Getting all ZIP codes for a single state is simple: pass the state's two-letter abbreviation to get a tibble of all ZIP codes in that state. Let's start by finding all of the ZIP codes in New York:
search_state('NY')
## 2208 ZIP codes found for state: "NY"
## # A tibble: 2,208 x 24
## zipcode zipcode_type major_city post_office_city common_city_list county
## <chr> <chr> <chr> <chr> <list<raw>> <chr>
## 1 00501 Unique Holtsville <NA> [22] Suffo~
## 2 00544 Unique Holtsville <NA> [22] Suffo~
## 3 06390 PO Box Fishers I~ Fishers Island,~ [32] Suffo~
## 4 10001 Standard New York New York, NY [20] New Y~
## 5 10002 Standard New York New York, NY [34] New Y~
## 6 10003 Standard New York New York, NY [20] New Y~
## 7 10004 Standard New York New York, NY [37] New Y~
## 8 10005 Standard New York New York, NY [35] New Y~
## 9 10006 Standard New York New York, NY [31] New Y~
## 10 10007 Standard New York New York, NY [20] New Y~
## # ... with 2,198 more rows, and 18 more variables: state <chr>, lat <dbl>,
## # lng <dbl>, timezone <chr>, radius_in_miles <dbl>,
## # area_code_list <list<raw>>, population <int>, population_density <dbl>,
## # land_area_in_sqmi <dbl>, water_area_in_sqmi <dbl>, housing_units <int>,
## # occupied_housing_units <int>, median_home_value <int>,
## # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
## # bounds_north <dbl>, bounds_south <dbl>
What if you only wanted the actual ZIP codes and no other variables? You can use R's dollar sign operator to select one column at a time from the output of zipcodeR's search functions:
nyzip <- search_state('NY')$zipcode
## 2208 ZIP codes found for state: "NY"
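The dollar sign operator returns a single column as a vector. To keep a handful of columns as a tibble instead, standard bracket indexing works on the search output. The following is a minimal sketch; the variable names `ny` and `ny_subset` are illustrative and not part of the package.

```r
library(zipcodeR)

# Retrieve every ZIP code in New York with all 24 columns
ny <- search_state("NY")

# Keep only three of the columns listed earlier, selected by name
ny_subset <- ny[, c("zipcode", "major_city", "county")]

head(ny_subset)
```

Selecting by name rather than by position keeps the code readable and guards against the column order changing in a later release.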
You can also search for ZIP codes in multiple states at once by passing a vector of state abbreviations to the search_state() function like so:
states <- c('NY','NJ','CT')
search_state(states)
## 3378 ZIP codes found for states: "NY", "NJ", "CT"
## # A tibble: 3,378 x 24
## zipcode zipcode_type major_city post_office_city common_city_list county
## <chr> <chr> <chr> <chr> <list<raw>> <chr>
## 1 06001 Standard Avon Avon, CT [16] Hartf~
## 2 06002 Standard Bloomfield Bloomfield, CT [22] Hartf~
## 3 06006 Unique Windsor <NA> [19] Hartf~
## 4 06010 Standard Bristol Bristol, CT [19] Hartf~
## 5 06011 PO Box Bristol <NA> [19] Hartf~
## 6 06013 Standard Burlington Burlington, CT [36] Hartf~
## 7 06016 Standard Broad Bro~ Broad Brook, CT [46] Hartf~
## 8 06018 Standard Canaan Canaan, CT [18] Litch~
## 9 06019 Standard Canton Canton, CT [34] Hartf~
## 10 06020 Standard Canton Ce~ Canton Center, ~ [25] Hartf~
## # ... with 3,368 more rows, and 18 more variables: state <chr>, lat <dbl>,
## # lng <dbl>, timezone <chr>, radius_in_miles <dbl>,
## # area_code_list <list<raw>>, population <int>, population_density <dbl>,
## # land_area_in_sqmi <dbl>, water_area_in_sqmi <dbl>, housing_units <int>,
## # occupied_housing_units <int>, median_home_value <int>,
## # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
## # bounds_north <dbl>, bounds_south <dbl>
This results in a tibble containing all ZIP codes for the states passed to the search_state() function.
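Because the combined tibble keeps the state column, it is straightforward to check how many ZIP codes each state contributed. A short sketch using base R's table(); the names `tri_state` and `zip_counts` are illustrative.

```r
library(zipcodeR)

# Search three states at once by passing a character vector
tri_state <- search_state(c("NY", "NJ", "CT"))

# Count the number of ZIP codes returned for each state
zip_counts <- table(tri_state$state)

print(zip_counts)
```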
It is also possible to search for ZIP codes located in a particular county within a state.
Let's find all of the ZIP codes located within Ocean County, New Jersey:
search_county('Ocean','NJ')
## [1] "32 ZIP codes found for Ocean County , NJ"
## # A tibble: 32 x 24
## zipcode zipcode_type major_city post_office_city common_city_list county
## <chr> <chr> <chr> <chr> <list<raw>> <chr>
## 1 08005 Standard Barnegat Barnegat, NJ [20] Ocean~
## 2 08006 PO Box Barnegat ~ Barnegat Light,~ [33] Ocean~
## 3 08008 Standard Beach Hav~ Beach Haven, NJ [61] Ocean~
## 4 08050 Standard Manahawkin Manahawkin, NJ [47] Ocean~
## 5 08087 Standard Tuckerton Tuckerton, NJ [51] Ocean~
## 6 08092 Standard West Creek West Creek, NJ [22] Ocean~
## 7 08527 Standard Jackson Jackson, NJ [19] Ocean~
## 8 08533 Standard New Egypt New Egypt, NJ [21] Ocean~
## 9 08701 Standard Lakewood Lakewood, NJ [20] Ocean~
## 10 08721 Standard Bayville Bayville, NJ [20] Ocean~
## # ... with 22 more rows, and 18 more variables: state <chr>, lat <dbl>,
## # lng <dbl>, timezone <chr>, radius_in_miles <dbl>,
## # area_code_list <list<raw>>, population <int>, population_density <dbl>,
## # land_area_in_sqmi <dbl>, water_area_in_sqmi <dbl>, housing_units <int>,
## # occupied_housing_units <int>, median_home_value <int>,
## # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
## # bounds_north <dbl>, bounds_south <dbl>
County names can be messy, and a name may not match our database exactly. For these cases, the search_county() function can be configured to use base R's agrep() function via an optional parameter.
One example where this feature is useful comes from the state of Louisiana. Since Louisiana has parishes, their county names don't line up exactly with how other states name their counties.
This example uses approximate matching to retrieve all ZIP codes for St. Bernard Parish in Louisiana:
search_county("ST BERNARD","LA", similar = TRUE)$zipcode
## [1] "6 ZIP codes found for St. Bernard Parish , LA or St Bernard Parish , LA"
## [1] "70032" "70043" "70044" "70075" "70085" "70092"
Try running the above code with the similar parameter set to FALSE or not present and you'll receive an error.
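To see why approximate matching helps here, the sketch below calls base R's agrep() directly on a small, made-up vector of parish names. It only illustrates the general idea; how search_county() uses agrep() internally is not shown here, and `county_names` is a hypothetical stand-in for the package's lookup data.

```r
# Hypothetical county names as they might appear in a lookup table
county_names <- c("St. Bernard Parish", "Orleans Parish", "Jefferson Parish")

# An exact, case-insensitive comparison finds nothing because the stored
# name includes punctuation and the "Parish" suffix
exact_hits <- county_names[toupper(county_names) == "ST BERNARD"]

# agrep() allows a small edit distance between the pattern and the text
# (its default max.distance is 0.1, a fraction of the pattern length)
fuzzy_hits <- agrep("ST BERNARD", county_names, ignore.case = TRUE, value = TRUE)

print(exact_hits)
print(fuzzy_hits)
```

The missing period in "ST BERNARD" counts as a single edit, which is within agrep()'s default tolerance for a pattern of this length.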
What if you already have a dataset containing ZIP codes and want to find out more about that particular area?
Using the reverse_zipcode() function, we can get up to 24 more columns of data when given a ZIP code.
To explore how zipcodeR can enhance your data & workflow, we will use a public dataset from the National Association of Realtors containing data about housing market trends in the United States.
This dataset, which is updated monthly, contains 11,102 observations of current housing market data and is hosted on Amazon S3.
This is what the data we will be working with looks like:
head(real_estate_data)
## # A tibble: 6 x 40
## month_date_yyyy~ postal_code zip_name flag median_listing_~ median_listing_~
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 202011 11766 mount s~ * 584550. 0.063
## 2 202011 16316 conneau~ * 239950 0.372
## 3 202011 96064 montagu~ <NA> 417050 -0.240
## 4 202011 30176 tallapo~ * 349950 0.211
## 5 202011 11798 wyandan~ <NA> 402050 0.072
## 6 202011 75709 tyler, ~ * 363050 -0.081
## # ... with 34 more variables: median_listing_price_yy <dbl>,
## # active_listing_count <dbl>, active_listing_count_mm <dbl>,
## # active_listing_count_yy <dbl>, median_days_on_market <dbl>,
## # median_days_on_market_mm <dbl>, median_days_on_market_yy <dbl>,
## # new_listing_count <dbl>, new_listing_count_mm <dbl>,
## # new_listing_count_yy <dbl>, price_increased_count <dbl>,
## # price_increased_count_mm <dbl>, price_increased_count_yy <dbl>,
## # price_reduced_count <dbl>, price_reduced_count_mm <dbl>,
## # price_reduced_count_yy <dbl>, pending_listing_count <dbl>,
## # pending_listing_count_mm <dbl>, pending_listing_count_yy <dbl>,
## # median_listing_price_per_square_foot <dbl>,
## # median_listing_price_per_square_foot_mm <dbl>,
## # median_listing_price_per_square_foot_yy <dbl>, median_square_feet <dbl>,
## # median_square_feet_mm <dbl>, median_square_feet_yy <dbl>,
## # average_listing_price <dbl>, average_listing_price_mm <dbl>,
## # average_listing_price_yy <dbl>, total_listing_count <dbl>,
## # total_listing_count_mm <dbl>, total_listing_count_yy <dbl>,
## # pending_ratio <dbl>, pending_ratio_mm <dbl>, pending_ratio_yy <dbl>
Note: The data used in this vignette was filtered to only include valid 5-digit ZIP codes as zipcodeR does not yet have a function for normalizing ZIP codes. The full Realtor dataset will have a different number of rows.
We'll focus on the first row for now, which represents the town of Mount Sinai, NY.
real_estate_data[1,]
## # A tibble: 1 x 40
## month_date_yyyy~ postal_code zip_name flag median_listing_~ median_listing_~
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 202011 11766 mount s~ * 584550. 0.063
## # ... with 34 more variables: median_listing_price_yy <dbl>,
## # active_listing_count <dbl>, active_listing_count_mm <dbl>,
## # active_listing_count_yy <dbl>, median_days_on_market <dbl>,
## # median_days_on_market_mm <dbl>, median_days_on_market_yy <dbl>,
## # new_listing_count <dbl>, new_listing_count_mm <dbl>,
## # new_listing_count_yy <dbl>, price_increased_count <dbl>,
## # price_increased_count_mm <dbl>, price_increased_count_yy <dbl>,
## # price_reduced_count <dbl>, price_reduced_count_mm <dbl>,
## # price_reduced_count_yy <dbl>, pending_listing_count <dbl>,
## # pending_listing_count_mm <dbl>, pending_listing_count_yy <dbl>,
## # median_listing_price_per_square_foot <dbl>,
## # median_listing_price_per_square_foot_mm <dbl>,
## # median_listing_price_per_square_foot_yy <dbl>, median_square_feet <dbl>,
## # median_square_feet_mm <dbl>, median_square_feet_yy <dbl>,
## # average_listing_price <dbl>, average_listing_price_mm <dbl>,
## # average_listing_price_yy <dbl>, total_listing_count <dbl>,
## # total_listing_count_mm <dbl>, total_listing_count_yy <dbl>,
## # pending_ratio <dbl>, pending_ratio_mm <dbl>, pending_ratio_yy <dbl>
The Realtor dataset contains a column named postal_code containing the ZIP code that identifies the town. We'll use this to find out more about Mount Sinai than what is provided in the housing market data.
So far we've covered the functions provided by zipcodeR for searching ZIP codes across multiple geographies. The package also provides a function for going in reverse when given a 5-digit ZIP code. Introducing reverse_zipcode():
# Get the ZIP code of the first row of data
zip_code <- real_estate_data[1,]$postal_code
# Pass the ZIP code to the reverse_zipcode() function
reverse_zipcode(zip_code)
## 1 row of data found for ZIP code: "11766"
## # A tibble: 1 x 24
## zipcode zipcode_type major_city post_office_city common_city_list county state
## <chr> <chr> <chr> <chr> <list<raw>> <chr> <chr>
## 1 11766 Standard Mount Sin~ Mount Sinai, NY [23] Suffo~ NY
## # ... with 17 more variables: lat <dbl>, lng <dbl>, timezone <chr>,
## # radius_in_miles <dbl>, area_code_list <list<raw>>, population <int>,
## # population_density <dbl>, land_area_in_sqmi <dbl>,
## # water_area_in_sqmi <dbl>, housing_units <int>,
## # occupied_housing_units <int>, median_home_value <int>,
## # median_household_income <int>, bounds_west <dbl>, bounds_east <dbl>,
## # bounds_north <dbl>, bounds_south <dbl>
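The same lookup can be applied to a whole dataset by joining on the ZIP code column. The sketch below assumes the package's bundled data frame is available as `zip_code_db`; the `listings` data frame is a small stand-in for the Realtor data, and `listing_price` is a hypothetical column name used only for illustration.

```r
library(zipcodeR)

# Stand-in for real_estate_data; `listing_price` is a hypothetical column name
listings <- data.frame(
  postal_code = c("11766", "16316", "96064"),
  listing_price = c(584550, 239950, 417050),
  stringsAsFactors = FALSE
)

# Keep only the lookup columns needed for the join
# (assumes the bundled dataset is exported as zip_code_db)
lookup <- as.data.frame(
  zipcodeR::zip_code_db[, c("zipcode", "county", "state", "median_household_income")]
)

# Left join: every listing is kept, even if its ZIP code has no match
enriched <- merge(listings, lookup,
                  by.x = "postal_code", by.y = "zipcode", all.x = TRUE)

print(enriched)
```

Storing ZIP codes as character strings on both sides of the join preserves leading zeroes, which would be lost if the column were numeric.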
You may also be interested in relating data at the ZIP code level to Census data. zipcodeR currently provides a function for getting all Census tracts when provided with a 5-digit ZIP code.
Let's find out how many Census tracts are in the ZIP code from the previous example.
get_tracts(zip_code)
## 9 Census tracts found for ZIP code 11766
## # A tibble: 9 x 3
## ZCTA5 TRACT GEOID
## <chr> <chr> <dbl>
## 1 11766 158202 36103158202
## 2 11766 158205 36103158205
## 3 11766 158304 36103158304
## 4 11766 158306 36103158306
## 5 11766 158308 36103158308
## 6 11766 158309 36103158309
## 7 11766 158320 36103158320
## 8 11766 158322 36103158322
## 9 11766 158323 36103158323
Now that you have all of the tracts for this ZIP code, it would be very easy to join this with other Census data, such as that which is available from the American Community Survey and other sources.
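A join of this kind might look like the following sketch. The first data frame reproduces three rows of the get_tracts() output above; the second, `acs`, is a hypothetical tract-level table whose values are placeholders rather than real survey estimates.

```r
# First three tracts copied from the get_tracts() output above
tracts <- data.frame(
  ZCTA5 = "11766",
  TRACT = c("158202", "158205", "158304"),
  GEOID = c(36103158202, 36103158205, 36103158304),
  stringsAsFactors = FALSE
)

# Hypothetical tract-level table; the values are placeholders,
# not real ACS estimates
acs <- data.frame(
  GEOID = c(36103158202, 36103158205, 36103158304),
  placeholder_estimate = c(1, 2, 3)
)

# Join the two tables on the shared tract identifier
tract_data <- merge(tracts, acs, by = "GEOID")

print(tract_data)
```

GEOID is the natural join key because it combines the state, county, and tract codes into one identifier.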
But ZIP codes alone are not terribly useful for social science research since they are only meant to represent USPS service areas. The Census Bureau has established ZIP code tabulation areas (ZCTAs) that provide a representation of ZIP codes and can be used for joining with Census data. But not every ZIP code is also a ZCTA.
zipcodeR provides a function for testing whether a given ZIP code is also a ZIP code tabulation area. When provided with a vector of 5-digit ZIP codes, the function returns TRUE or FALSE for each one, based on whether it is also a ZCTA.
is_zcta(zip_code)
## [1] TRUE
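Since the function accepts a vector, its result can serve as a filter before joining with Census data. A minimal sketch, assuming is_zcta() returns one logical value per input ZIP code as described above; the example ZIP codes come from earlier output in this document.

```r
library(zipcodeR)

# Two ZIP codes that appear earlier in this document
zips <- c("11766", "10001")

# One logical value per ZIP code
zcta_flags <- is_zcta(zips)

# Keep only the ZIP codes that are also ZCTAs
zips_with_zcta <- zips[zcta_flags]

print(zips_with_zcta)
```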