Introduction

This data set was presented first in Symons, Grimson, and Yuan (1983), analysed with reference to the spatial nature of the data in Cressie and Read (1985), expanded in Cressie and Chan (1989), and used in detail in Cressie (1991). It is for the 100 counties of North Carolina, and includes counts of numbers of live births (also non-white live births) and numbers of sudden infant deaths, for the July 1, 1974 to June 30, 1978 and July 1, 1979 to June 30, 1984 periods. In Cressie and Read (1985), a listing of county neighbours based on shared boundaries (contiguity) is given, and in Cressie and Chan (1989), and in Cressie (1991, 386–89), a different listing based on the criterion of distance between county seats, with a cutoff at 30 miles. The county seat location coordinates are given in miles in a local (unknown) coordinate reference system. The data are also used to exemplify a range of functions in the spatial statistics module user’s manual (Kaluzny et al. 1996).

Getting the data into R

We will be using the spdep and spreg packages, here version: spdep, version 1.3-1, 2023-11-03, the sf package and the tmap package. The data from the sources referred to above is documented in the help page for the nc.sids data set in spData. The actual data, included in a shapefile of the county boundaries for North Carolina were made available in the maptools package 1. These data are known to be geographical coordinates (longitude-latitude in decimal degrees) and are assumed to use the NAD27 datum.

library(spdep)
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
row.names(nc) <- as.character(nc$FIPSNO)

The shapefile format presupposes that you have three files with extensions .shp, .shx, and .dbf, where the first contains the geometry data, the second the spatial index, and the third the attribute data. They are required to have the same name apart from the extension, and are read here using sf::st_read() into the sf object nc; the class is defined in sf. The centroids of the largest polygon in each county are available using the st_centroid method from sf as an sfc POINT object, and can be used to place labels after the extraction of the coordinate matrix:

sf_use_s2(FALSE)
plot(st_geometry(nc), axes=TRUE)
text(st_coordinates(st_centroid(st_geometry(nc), of_largest_polygon=TRUE)), label=nc$FIPSNO, cex=0.5)

We can examine the names of the columns of the data frame to see what it contains — in fact some of the same columns that we will be examining below, and some others which will be useful in cleaning the data set.

names(nc)
##  [1] "CNTY_ID"   "AREA"      "PERIMETER" "CNTY_"     "NAME"      "FIPS"      "FIPSNO"   
##  [8] "CRESS_ID"  "BIR74"     "SID74"     "NWBIR74"   "BIR79"     "SID79"     "NWBIR79"  
## [15] "east"      "north"     "x"         "y"         "lon"       "lat"       "L_id"     
## [22] "M_id"      "geometry"
summary(nc)
##     CNTY_ID          AREA          PERIMETER         CNTY_          NAME          
##  Min.   :1825   Min.   :0.0420   Min.   :0.999   Min.   :1825   Length:100        
##  1st Qu.:1902   1st Qu.:0.0910   1st Qu.:1.324   1st Qu.:1902   Class :character  
##  Median :1982   Median :0.1205   Median :1.609   Median :1982   Mode  :character  
##  Mean   :1986   Mean   :0.1263   Mean   :1.673   Mean   :1986                     
##  3rd Qu.:2067   3rd Qu.:0.1542   3rd Qu.:1.859   3rd Qu.:2067                     
##  Max.   :2241   Max.   :0.2410   Max.   :3.640   Max.   :2241                     
##      FIPS               FIPSNO         CRESS_ID          BIR74           SID74      
##  Length:100         Min.   :37001   Min.   :  1.00   Min.   :  248   Min.   : 0.00  
##  Class :character   1st Qu.:37050   1st Qu.: 25.75   1st Qu.: 1077   1st Qu.: 2.00  
##  Mode  :character   Median :37100   Median : 50.50   Median : 2180   Median : 4.00  
##                     Mean   :37100   Mean   : 50.50   Mean   : 3300   Mean   : 6.67  
##                     3rd Qu.:37150   3rd Qu.: 75.25   3rd Qu.: 3936   3rd Qu.: 8.25  
##                     Max.   :37199   Max.   :100.00   Max.   :21588   Max.   :44.00  
##     NWBIR74           BIR79           SID79          NWBIR79             east      
##  Min.   :   1.0   Min.   :  319   Min.   : 0.00   Min.   :    3.0   Min.   : 19.0  
##  1st Qu.: 190.0   1st Qu.: 1336   1st Qu.: 2.00   1st Qu.:  250.5   1st Qu.:178.8  
##  Median : 697.5   Median : 2636   Median : 5.00   Median :  874.5   Median :285.0  
##  Mean   :1051.0   Mean   : 4224   Mean   : 8.36   Mean   : 1352.8   Mean   :271.3  
##  3rd Qu.:1168.5   3rd Qu.: 4889   3rd Qu.:10.25   3rd Qu.: 1406.8   3rd Qu.:361.2  
##  Max.   :8027.0   Max.   :30757   Max.   :57.00   Max.   :11631.0   Max.   :482.0  
##      north             x                 y             lon              lat       
##  Min.   :  6.0   Min.   :-328.04   Min.   :3757   Min.   :-84.08   Min.   :33.92  
##  1st Qu.: 97.0   1st Qu.: -60.55   1st Qu.:3920   1st Qu.:-81.20   1st Qu.:35.26  
##  Median :125.5   Median : 114.38   Median :3963   Median :-79.26   Median :35.68  
##  Mean   :122.1   Mean   :  91.46   Mean   :3953   Mean   :-79.51   Mean   :35.62  
##  3rd Qu.:151.5   3rd Qu.: 240.03   3rd Qu.:4000   3rd Qu.:-77.87   3rd Qu.:36.05  
##  Max.   :182.0   Max.   : 439.65   Max.   :4060   Max.   :-75.67   Max.   :36.52  
##       L_id           M_id               geometry  
##  Min.   :1.00   Min.   :1.00   MULTIPOLYGON :100  
##  1st Qu.:1.00   1st Qu.:2.00   epsg:NA      :  0  
##  Median :2.00   Median :3.00   +proj=long...:  0  
##  Mean   :2.12   Mean   :2.67                      
##  3rd Qu.:3.00   3rd Qu.:3.25                      
##  Max.   :4.00   Max.   :4.00

Let’s check the different versions of the data against each other - sf and spData have NC SIDS files, as does GeoDa Center in two forms:

library(sf)
nc_sf <- st_read(system.file("shape/nc.shp", package="sf"),
                 quiet=TRUE)
st_crs(nc_sf)
## Coordinate Reference System:
##   User input: NAD27 
##   wkt:
## GEOGCRS["NAD27",
##     DATUM["North American Datum 1927",
##         ELLIPSOID["Clarke 1866",6378206.4,294.978698213898,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4267]]
nc <- st_read(system.file("shapes/sids.shp",
                 package="spData"), quiet=TRUE)
st_crs(nc)
## Coordinate Reference System: NA

As the actual CRS is unknown, spData reports missing, although it may very well be +proj=longlat +datum=NAD27

st_crs(nc) <- "+proj=longlat +datum=NAD27"

Next, are the geometries the same? sf::st_equals returns a logical matrix, so we’ll check that the diagonal values are all TRUE, and that only those values are TRUE by summing and recalling that n is 100:

suppressWarnings(st_crs(nc_sf) <- st_crs(nc))
xx <- st_equals(nc, nc_sf, sparse=FALSE)
all(diag(xx)) && sum(xx) == 100L
## [1] TRUE

Next, let’s download the GeoDa files and repeat the comparisons:

td <- tempdir()
#download.file("https://geodacenter.github.io/data-and-lab//data/sids.zip", file.path(td, "sids.zip"), quiet=TRUE) 
# local copy (2020-10-22) as repository sometimes offline
file.copy(system.file("etc/misc/sids.zip", package="spdep"), td)
## [1] TRUE
unzip(file.path(td, "sids.zip"), c("sids/sids.dbf", "sids/sids.prj", "sids/sids.shp", "sids/sids.shx"), exdir=td)
sids_sf <- st_read(file.path(td, "sids/sids.shp"), quiet=TRUE)
#download.file("https://geodacenter.github.io/data-and-lab//data/sids2.zip", file.path(td, "sids2.zip"), quiet=TRUE)
file.copy(system.file("etc/misc/sids2.zip", package="spdep"), td)
## [1] TRUE
unzip(file.path(td, "sids2.zip"), c("sids2/sids2.dbf", "sids2/sids2.prj", "sids2/sids2.shp", "sids2/sids2.shx"), exdir=td)
sids2_sf <- st_read(file.path(td, "sids2/sids2.shp"), quiet=TRUE)
st_crs(sids_sf)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]
st_crs(sids2_sf)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]

It looks as though the external files are assuming WGS84/NAD83 for the datum, but also contain the same geometries.

suppressWarnings(st_crs(sids_sf) <- st_crs(nc_sf))
xx <- st_equals(sids_sf, nc_sf, sparse=FALSE)
all(diag(xx)) && sum(xx) == 100L
## [1] FALSE
suppressWarnings(st_crs(sids2_sf) <- st_crs(nc_sf))
xx <- st_equals(sids2_sf, nc_sf, sparse=FALSE)
all(diag(xx)) && sum(xx) == 100L
## [1] FALSE

Now for the contents of the files - sids2 also contains rates, while the file in spData contains the coordinates as given in Cressie (1991), and the parcels of contiguous counties on p. 554, and the aggregations used for median polishing.

all.equal(as.data.frame(nc_sf)[,1:14], as.data.frame(sids_sf)[,1:14])
##  [1] "Names: 12 string mismatches"                                  
##  [2] "Component 4: Modes: numeric, character"                       
##  [3] "Component 4: target is numeric, current is character"         
##  [4] "Component 5: 100 string mismatches"                           
##  [5] "Component 6: Modes: character, numeric"                       
##  [6] "Component 6: target is character, current is numeric"         
##  [7] "Component 7: Mean relative difference: 0.9986388"             
##  [8] "Component 8: Mean relative difference: 64.33901"              
##  [9] "Component 9: Mean relative difference: 0.9979786"             
## [10] "Component 10: Mean relative difference: 156.5427"             
## [11] "Component 11: Mean relative difference: 3.01968"              
## [12] "Component 12: Mean relative difference: 0.9980208"            
## [13] "Component 13: Mean relative difference: 160.8194"             
## [14] "Component 14: Modes: numeric, list"                           
## [15] "Component 14: Attributes: < target is NULL, current is list >"
## [16] "Component 14: target is numeric, current is sfc_MULTIPOLYGON"
all.equal(as.data.frame(nc_sf)[,1:14], as.data.frame(sids2_sf)[,1:14])
##  [1] "Names: 12 string mismatches"                         
##  [2] "Component 4: Modes: numeric, character"              
##  [3] "Component 4: target is numeric, current is character"
##  [4] "Component 5: 100 string mismatches"                  
##  [5] "Component 6: Modes: character, numeric"              
##  [6] "Component 6: target is character, current is numeric"
##  [7] "Component 7: Mean relative difference: 0.9986388"    
##  [8] "Component 8: Mean relative difference: 64.33901"     
##  [9] "Component 9: Mean relative difference: 0.9979786"    
## [10] "Component 10: Mean relative difference: 156.5427"    
## [11] "Component 11: Mean relative difference: 3.01968"     
## [12] "Component 12: Mean relative difference: 0.9980208"   
## [13] "Component 13: Mean relative difference: 160.8194"    
## [14] "Component 14: Mean relative difference: 0.9984879"

The spData data set has some columns reordered and a surprise:

all.equal(as.data.frame(nc_sf)[,1:14], as.data.frame(nc)[,c(2,3,4,1,5:14)])
## [1] "Component \"NWBIR74\": Mean relative difference: 0.04891304"

so a difference in NWBIR74:

which(!(nc_sf$NWBIR74 == nc$NWBIR74))
## [1] 21
c(nc$NWBIR74[21], nc_sf$NWBIR74[21])
## [1] 386 368

where spData follows Cressie (1991) and sf and Geoda follow Cressie and Chan (1989) for NWBIR74 in Chowan county.

We will now examine the data set reproduced from Cressie and collaborators, included in spData (formerly in spdep), and add the neighbour relationships used in Cressie and Chan (1989) to the background map as a graph shown in Figure \(\ref{plot-CC89.nb}\):

gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- read.gal(gal_file, region.id=nc$FIPSNO)
ncCR85
## Neighbour list object:
## Number of regions: 100 
## Number of nonzero links: 492 
## Percentage nonzero weights: 4.92 
## Average number of links: 4.92
gal_file <- system.file("weights/ncCC89.gal", package="spData")[1]
ncCC89 <- read.gal(gal_file, region.id=nc$FIPSNO)
ncCC89
## Neighbour list object:
## Number of regions: 100 
## Number of nonzero links: 394 
## Percentage nonzero weights: 3.94 
## Average number of links: 3.94 
## 2 regions with no links:
## 37055 37095
## 3 disjoint connected subgraphs
plot(st_geometry(nc), border="grey")
plot(ncCC89, st_centroid(st_geometry(nc), of_largest_polygon), add=TRUE, col="blue")