## An R Package for Geostatistical Sampling Designs with OSM data

### Henry Crosby, Godwin Yeobah, J. Porto De Albuquerque

Abstract: We introduce a new R package, OSMGeosample, for constructing geostatistical sampling designs with OSM data. The new package implements random and spatially-regulated sampling designs. We illustrate the use of the package using a boundary dataset and OSM building data for a slum in Idikan. In addition, we provide the ability to sample on user-defined locations which can be joined to OSM features for reproducibility and the extension of information. The user can sample OSM features of one or more data type (point, line, polygon etc) and one or more value and key pairs.

Keywords: inhibitory sampling designs, Geostatistics, R

Address: Institute for Global Sustainable Development, University of Warwick, Coventry, CV4 7AL, UK

## Introduction

We provide spatial sampling designs for investigating unobserved spatial phenomena, i.e., identifying properties to select for a survey. We create a randomized inhibitory spatial sampling functions to address the problem of spatial prediction (Chipeta et al, 2016). Two functions run inhibitory designs plus close pairs' on discrete and continuous data. In an inhibitory design, any pair of sample locations must be separated by at least an inhibition distance 𝛿. In addition, a third function runs random sampling for comparison.

The motivation for creating the methods in this package is undertake valid spatially-regulated sampling and to conduct a survey in irregular (urban) environments. In fact, in our example our sampling frame is the complete set of geo-located household locations in a slum in Idikan (Watson et al, 2019).

Unlike the Geosample package, the user is able to define a boundary to extract OSM features or use their own data to spatially join it to OSM features. In addition, the output table provides a list of the original and OSM ids with the (centroid) longitudinal and latitudinal values. Furthermore, the user is provided with an interactive map powered by the mapview package which shows the full dataset (in red) alongside the sampled dataset (in yellow) and the defined boundary or features.

## Installing OSM Geosample

If you update the sf package and then try to install osmgeosample it will error “Error: sf was built under R version …”. If you don’t have the latest version of R, updating the version will fix this issue. Alternatively You can get around this by setting

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")


An example script of how to install the package from GITHub.

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")
setwd("/PATH/TO/PACKAGE/")
devtools::install_local("osmgeosample-master.zip")


## OSM Random Sampling (osm_random_sample)

The osm_random_sample function creates a spatially random sample from either (1) a discrete set of OSM features defined in the function parameters or (2) a continuous surface defined by a user specified geographical region. The OSM features can be a subset of all features based on OSM key and value pairs (i.e., buildings, cafes, restaurants etc). The features can also be a subset of all data based on the feature type (i.e., point, polygon etc). An interactive map and data.frame are returned, both showing the full dataset, geographical locations and sampled data.

### Parameters:

bounding_geom: an sf or sp object (with N \geq size) where each line corresponds to one spatial location. It should contain values of 2D coordinates, data and, optionally, covariate(s) value(s) at the locations. This argument must be provided when sampling from a 'discrete' set of points, see 'type' below for details. This should only be provided when boundary_or_feature = 'boundary'.

dis_or_cont: a string of either 'discrete', from a set of N potential sampling points or 'continuum' from independent, 'completely' random points.

sample_size: a non-negative integer giving the total number of locations to be sampled.

plotit: a 'logical' specifying if graphical output is required. Default is plotit = TRUE.

plotit_leaflet: a 'logical' specifying if leaflet (html) graphical output is required. This is prioritised over plotit if both are selected. Default is plotit_leaflet = TRUE.

boundary: a categorical variable to determine whether the exact boundary (boundary = 0), bounding box (boundary = 1) or a buffer around the boundary boundary = 2) is used for sampling. Default is boundary = 0.

buff_dist: if boundary = 2 then this value determines the size of the buffer by distance. The default is buff_dist is NULL).

buff_epsg: if boundary = 2 then this value determines the local geographic grid reference so that the buffer can be calculated in meters. The default is buff_epsg = 4326) which will use decimal degrees instead of meters. As an example, 27700 relates to the British National Grid.

join_type: a text value to determine how to spatially join all features with the boundary. The options are 'within' or 'intersect'.

key: A feature key as defined in OSM. An example is 'building'.

value: a value for a feature key ('key'); can be negated with an initial exclamation mark, value = '!this', and can also be a vector, value = c ('this', 'that'). An example is 'cafe'.

data_return: specifies what data types (as specified in OSM) you want returned. More than one can be selected. The options are 'osm_polygons', 'osm_points', 'osm_multipolygons','osm_multilines','osm_lines'.

boundary_or_feature: specifies whether the user inputs a boundary or a set of user-inputted features. For example if the user selects “boundary”, they can provide a spatial data frame or OSM locality which will query the osm features within that boundary or locality. If the user selects “feature” then they can provide a data frame of features that they want to sample.

feature_geom: is a user inputted data frame of features that are required to be sampled. This should only be provided when boundary_or_feature = 'feature'.

join_features_to_osm: is a TRUE or FALSE variable which allows the user to specify whether they want their feature geom to be spatially joined to OSM features. The output sampling data frame will have an additional column showing the joined 'OSM_id'. The interactive map will also plot the OSM features.

### Returns

(1) A data.frame object named 'results' of length n containing the final sampled ids (osm, original or both), centroid longitude and latitude and a 0/1 value determining whether the instance is in the selected sample (named inSample), if sampling from a 'discrete' set of points. A data.frame object of dimension n by 3 containing the serial id and centroid locations for all sample instances,if sampling from a 'continuum'.

(2) An interactive map powered by mapview showing the boundary, full dataset and sampled data.

### Example

library(sp)
bounding_geom<-
SpatialPolygonsDataFrame(
SpatialPolygons(list(Polygons(list(Polygon(
cbind(
c(3.888959,3.888744,3.888585,3.888355,3.887893,3.887504,3.886955,3.886565,3.886303,3.886159,
3.885650,3.885650,3.885595,3.885404,3.885444,3.885897,3.886692,3.887241,3.888068,3.888323,
3.888697,3.889150,3.889548,3.889890,3.890184,3.890828,3.891258,3.891807,3.892061,3.892292,
3.892689,3.893294,3.893008,3.893676,3.888959),
c(7.379483,7.379785,7.380024,7.380294,7.380629,7.380986,7.381448,7.381861,7.382243,7.382474,
7.383277,7.383468,7.383890,7.384263,7.384669,7.385258,7.385313,7.385194,7.384868,7.384900,
7.385051,7.385067,7.384955,7.384749,7.384526,7.384120,7.384009,7.384080,7.384430,7.384478,
7.384629,7.384772,7.383269,7.380963,7.379483)))), ID=1))),
data.frame( ID=1))
proj4string(bounding_geom) <- CRS('+proj=longlat +datum=WGS84')

set.seed(15892)
xy.sample <- osm_random_sample(buff_dist=NULL, bounding_geom = bounding_geom,
key= 'building', value = NULL, boundary = 0,
buff_epsg = NULL, join_type = 'intersect',
dis_or_cont = 'discrete', sample_size = 70,
plotit = TRUE, plotit_leaflet = TRUE,
data_return= c('osm_polygons'))


## OSM Inhibitory Sampling with Discrete Data (osm_discrete_inhibit_sample)

osm_discrete_inhibit_sample creates a spatially discrete sample from a specified set of OSM spatial locations within a polygonal sampling region using 'inhibitory plus close pairs'. The OSM features can be a subset all features based on OSM key and value pairs (i.e., buildings, cafes, restaurants etc). The features can also be a of all data based on the feature type, i.e., point, polygon etc. An interactive map and data.frame are both returned, both showing the full dataset, geographical locations and sampled data.

### Parameters

bounding_geom: an sf or sp object (with N \geq size) where each line corresponds to one spatial location. It should contain values of 2D coordinates, data and, optionally, covariate(s) value(s) at the locations. This argument must be provided when sampling from a 'discrete' set of points, see 'type' below for details. This should only be provided when boundary_or_feature = 'boundary'.

delta: The minimum permissible distance between any two locations in preliminary sample. This can be allowed to vary with number of {'close pairs'} if a \bold{simple inhibitory} design is compared to one of the \bold{inhibitory plus close pairs} design.

delta.fix: A 'logical' input which specifies whether {'delta'} is fixed or allowed to vary with number of close pairs {k}. Default is {delta.fix = FALSE}.

k: The number of close-pair locations in the sample. It must be an integer between 0 and {size}/2.

cp.criterion: The criterion for choosing close pairs {k}. The {'cp.zeta'} criterion chooses locations not included in the initial sample, from the uniform distribution of a disk with radius {'zeta'} (NB: {zeta} argument must be provided for this criterion). The {'cp.neighb'} criterion chooses nearest neighbours amongst locations not included in the initial sample ({'zeta'} becomes trivial for {'cp.neighb'} criterion).

zeta: The maximum permissible distance (radius of a disk with center {x{*}_{j}, j = 1, \ldots, k}) within which a close-pair point is placed. See \bold{Details}.

ntries: The number of rejected proposals after which the algorithm terminates.

sample_size: a non-negative integer giving the total number of locations to be sampled.

plotit: a 'logical' specifying if graphical output is required. Default is plotit = TRUE.

plotit_leaflet: a 'logical' specifying if leaflet (html) graphical output is required. This is prioritised over plotit if both are selected. Default is plotit_leaflet = TRUE.

boundary: a categorical variable to determine whether the exact boundary (boundary = 0), bounding box (boundary = 1) or a buffer around the boundary boundary = 2) is used for sampling. Default is boundary = 0.

buff_dist: if boundary = 2 then this value determines the size of the buffer by distance. The default is buff_dist is NULL).

buff_epsg: if boundary = 2 then this value determines the local geographic grid reference so that the buffer can be calculated in meters. The default is buff_epsg = 4326) which will use decimal degrees instead of meters. As an example, 27700 relates to the British National Grid.

join_type: a text value to determine how to spatially join all features with the boundary. The options are 'within' or 'intersect'.

key: A feature key as defined in OSM. An example is 'building'.

value: a value for a feature key ('key'); can be negated with an initial exclamation mark, value = '!this', and can also be a vector, value = c ('this', 'that'). An example is 'cafe'.

data_return: specifies what data types (as specified in OSM) you want returned. More than one can be selected. The options are 'osm_polygons', 'osm_points', 'osm_multipolygons','osm_multilines','osm_lines'.

boundary_or_feature: specifies whether the user inputs a boundary or a set of user-inputted features. For example if the user selects “boundary”, they can provide a spatial data frame or OSM locality which will query the osm features within that boundary or locality. If the user selects “feature” then they can provide a data frame of features that they want to sample.

feature_geom: is a user inputted data frame of features that are required to be sampled. This should only be provided when boundary_or_feature = 'feature'.

join_features_to_osm: is a TRUE or FALSE variable which allows the user to specify whether they want their feature geom to be spatially joined to OSM features. The output sampling data frame will have an additional column showing the joined 'OSM_id'. The interactive map will also plot the OSM features.

### Returns

(1) A data.frame object named 'results' of length n containing the final sampled ids (osm, original or both), centroid longitude and latitude and a 0/1 value determining whether the instance is in the selected sample (named inSample)..

(2) An interactive map powered by mapview showing the boundary, full dataset and sampled data.

### Details (Chipeta, 2019)

To draw a sample of size n from a population of spatial locations X_{i} : i = 1,…,N, with the property that the distance between any two sampled locations is at least delta, the function implements the following algorithm.

Step 1. Draw an initial sample of size n completely at random and call this x_i : i = 1,…,n.

Step 2. Set i = 1.

Step 3. Calculate the smallest distance, d_min, from x_i to all other x_j in the initial sample.

Step 4. If d_min >= delta}, increase i by 1 and return to step 2 if {i <= n}, otherwise stop.

Step 5. If d_min < delta, draw an integer j at random from 1, 2,..,N, set x_i = X_j and return to step 3.

Samples generated in this way exhibit more regular spatial arrangements than random samples of the same size. The degree of regularity achievable will be influenced by the spatial arrangement of the population (X_i : i = 1,…,N), the specified value of delta and the sample size, n. For any given population, if n and/or delta is too large, a sample of the required size with the distance between any two sampled locations at least delta will not be achievable; the algorithm will then find n_s < n points that can be placed for the given parameters.

Sampling close pairs of points.

For some purposes, typically when using the same sample for parameter estimation and spatial prediction, it is desirable that a spatial sampling scheme include pairs of closely spaced points (x). The OSMGeosample offers two ways of specifying close pairs, either as the closest available unsampled point to an existing sampled point (cp.critetrion = cp.neighb), or as a random choice from all available unsampled points within distance zeta of an existing sampled point (cp.criterion = cp.zeta). The algorithm proceeds as follows.

Let k be the required number of close pairs.

Step 1. Construct a simple inhibitory design SI(n - k, \delta).

Step 2. Sample k from x1,…, x{n-k} without replacement and call this set x_j : j = 1,…, k}.

Step 3. For each xj: j = 1,…, k}, select a close pair x{n-k+j} according to the specified criterion.

### Notes

(1) Depending on the spatial configuration of potential sampling locations and, when the selection criterion (cp.criterion = cp.zeta), the specified value of zeta, it is possible that one or more of the selected points x_j in Step 2 will not have an eligible “close pair''. In this case, the algorithm will try find an alternative x_j and report a warning if it fails to do so.

(2) If 'delta' is set to 0, a completely random sample is generated. In this case, 'close pairs' are not permitted and 'zeta' becomes trivial.

### Example

 @examples
library(sp)
bounding_geom<-
SpatialPolygonsDataFrame(
SpatialPolygons(list(Polygons(list(Polygon(
cbind(
c(3.888959,3.888744,3.888585,3.888355,3.887893,3.887504,3.886955,3.886565,3.886303,3.886159,
3.885650,3.885650,3.885595,3.885404,3.885444,3.885897,3.886692,3.887241,3.888068,3.888323,
3.888697,3.889150,3.889548,3.889890,3.890184,3.890828,3.891258,3.891807,3.892061,3.892292,
3.892689,3.893294,3.893008,3.893676,3.888959),
c(7.379483,7.379785,7.380024,7.380294,7.380629,7.380986,7.381448,7.381861,7.382243,7.382474,
7.383277,7.383468,7.383890,7.384263,7.384669,7.385258,7.385313,7.385194,7.384868,7.384900,
7.385051,7.385067,7.384955,7.384749,7.384526,7.384120,7.384009,7.384080,7.384430,7.384478,
7.384629,7.384772,7.383269,7.380963,7.379483)))), ID=1))),
data.frame( ID=1))
proj4string(bounding_geom) <- CRS('+proj=longlat +datum=WGS84')

set.seed(15892)
xy.sample <- osm_discrete_inhibit_sample(bounding_geom=bounding_geom,
data_return=c('osm_polygons'),boundary=0, buff_dist=NULL, buff_epsg=NULL,
join_type='within', sample_size=70, plotit=TRUE, plotit_leaflet = TRUE,
delta = 5, key ='building', value=NULL, delta.fix = TRUE, k = 0,
cp.criterion = 'cp.neighb', zeta = 0.025, ntries = 5)


## OSM Inhibitory Sampling with Continuous Data (osm_contin_inhibit_sample)

osm_contin_inhibit_sample creates a spatially continous sample of locations within a polygonal sampling using inhibitory plus close pairs' (Chipeta, 2019). The region can be defined using OSM data or a user defined polygon or features. The OSM features can be a subset all features based on OSM key and value pairs (i.e., buildings, cafes, restaurants etc). The features can also be a of all data based on the feature type, i.e., point, polygon etc. An interactive map and data.frame are both returned, both showing the full dataset, geographical locations and sampled data.

### Parameters

bounding_geom: a {sf} or {sp} object (with {N \geq {size}}) where each line corresponds to one spatial location. It should contain values of 2D coordinates, data and, optionally, covariate(s) value(s) at the locations. This argument must be provided when sampling from a {'discrete'} set of points, see {'type'} below for details.

sample_size: a non-negative integer giving the total number of locations to be sampled.

plotit: a 'logical' specifying if graphical output is required. Default is plotit = TRUE.

plotit_leaflet: a 'logical' specifying if leaflet (html) graphical output is required. This is prioritised over plotit if both are selected. Default is plotit_leaflet = TRUE.

boundary: a categorical variable to determine whether the exact boundary (boundary = 0), bounding box (boundary = 1) or a buffer around the boundary boundary = 2) is used for sampling. Default is boundary = 0.

buff_dist: if boundary = 2 then this value determines the size of the buffer by distance. The default is buff_dist is NULL).

buff_epsg: if boundary = 2 then this value determines the local geographic grid reference so that the buffer can be calculated in meters. The default is buff_epsg = 4326) which will use decimal degrees instead of meters. As an example, 27700 relates to the British National Grid.

join_type: a text value to determine how to spatially join all features with the boundary. The options are 'within' or 'intersect'.

key: A feature key as defined in OSM. An example is 'building'.

value: a value for a feature key ('key'); can be negated with an initial exclamation mark, value = '!this', and can also be a vector, value = c ('this', 'that'). An example is 'cafe'.

data_return: specifies what data types (as specified in OSM) you want returned. More than one can be selected. The options are 'osm_polygons', 'osm_points', 'osm_multipolygons','osm_multilines','osm_lines'.

boundary_or_feature: specifies whether the user inputs a boundary or a set of user-inputted features. For example if the user selects "boundary”, they can provide a spatial data frame or OSM locality which will query the osm features within that boundary or locality. If the user selects “feature” then they can provide a data frame of features that they want to sample.

feature_geom: is a user inputted data frame of features that are required to be sampled. This should only be provided when boundary_or_feature = 'feature'.

join_features_to_osm: is a TRUE or FALSE variable which allows the user to specify whether they want their feature geom to be spatially joined to OSM features. The output sampling data frame will have an additional column showing the joined 'OSM_id'. The interactive map will also plot the OSM features.

delta: minimum permissible distance between any two locations in preliminary sample. This can be allowed to vary with the number of {'close pairs'} if a \bold{simple inhibitory} design is compared to one of the \bold{inhibitory plus close pairs} design.

delta.fix: 'logical' specifies whether {delta} is fixed or allowed to vary with number of close pairs {k}. Default is {delta.fix = FALSE}.

k: number of locations in preliminary sample to be replaced by near neighbours of other preliminary sample locations to form {close pairs} (integer between 0 and {size/2}). A \bold{simple inhibitory} deisgn is generated when {k = 0}.

rho: maximum distance between the two locations in a {'close-pair'}.

ntries: number of rejected proposals after which the algorithm will terminate.

### Details (Chipeta, 2019)

To draw a simple inhibitory (SI) sample of size n from a spatially continuous region (A), with the property that the distance between any two sampled locations is at least delta, the following algorithm is used.

Step 1. Set i = 1 and generate a point x_1 uniformly distributed on D.

Step 2. Generate a point x uniformly distributed on D and calculate the minimum, d_min, of the distances from x_i to all x_j: j <= i.

Step 3. If d_min >= delta, increase i by 1, set x_i = x and return to step 2 if i <= n}, otherwise stop;

Step 4. If d_min < delta, return to step 2 without increasing i.

\bold{Sampling close pairs of points.}

For some purposes, it is desirable that a spatial sampling scheme include pairs of closely spaced points, resulting in an inhibitory plus close pairs (ICP) design. In this case, the above algorithm requires the following additional steps to be taken. Let k be the required number of close pairs. Choose a value rho such that a close pair of points will be a pair of points separated by a distance of at most {rho}.

Step 5. Set j = 1 and draw a random sample of size 2 from integers 1, 2,…, n, say (i_1, i_2);

Step 6. Replace x_i_1 by x_i_2 + u, where u is uniformly distributed on the disc with centre x_i_2 and radius (rho), increase i by 1 and return to step 5 if i <= k, otherwise stop.

When comparing a SI design to one of the ICP designs, the inhibitory components should have the same degree of spatial regularity. This requires delta to become a function of k namely delta_k = delta_0*(sqrt{n/(n - k)}) with delta_0 held.

### Returns

(1) A data.frame object named 'results' of length n containing the final sampled osmgeosample_ids, centroid longitude and latitude.

(2) An interactive map powered by mapview showing the boundary, full dataset and sampled data.

### Notes

If 'delta' is set to 0, a completely random sample is generated. In this case, 'close pairs' are not permitted and rho is irrelevant.

### Example:

 library(sp)
bounding_geom<-
SpatialPolygonsDataFrame(
SpatialPolygons(list(Polygons(list(Polygon(
cbind(
c(3.888959,3.888744,3.888585,3.888355,3.887893,3.887504,3.886955,3.886565,3.886303,3.886159,
3.885650,3.885650,3.885595,3.885404,3.885444,3.885897,3.886692,3.887241,3.888068,3.888323,
3.888697,3.889150,3.889548,3.889890,3.890184,3.890828,3.891258,3.891807,3.892061,3.892292,
3.892689,3.893294,3.893008,3.893676,3.888959),
c(7.379483,7.379785,7.380024,7.380294,7.380629,7.380986,7.381448,7.381861,7.382243,7.382474,
7.383277,7.383468,7.383890,7.384263,7.384669,7.385258,7.385313,7.385194,7.384868,7.384900,
7.385051,7.385067,7.384955,7.384749,7.384526,7.384120,7.384009,7.384080,7.384430,7.384478,
7.384629,7.384772,7.383269,7.380963,7.379483)))), ID=1))),
ID=1))),
data.frame( ID=1))
proj4string(bounding_geom) <- CRS('+proj=longlat +datum=WGS84')

set.seed(15892)
osm_contin_inhibit_sample(bounding_geom = bounding_geom, boundary = 0, buff_dist=NULL,
buff_epsg = NULL, sample_size = 50, plotit = TRUE, plotit_leaflet = TRUE,
delta=50, delta.fix = FALSE,k=7,rho=1, ntries = 10)


## Related Packages

osmgeosample builds on Chipeta and Diggle's (2019) R package named geosample, which is used for the construction of Geostatistical sampling designs. Note that the geosample package is not present on CRAN, and can not be used for accessing OSM data. osmgeosample utilises a popular OSM mapping r package named osmdata which is authored and maintained by Mark Padgham and Robin Lovelace. In addition, the Mapview R package allows us display our results on an interactive map (Appelhans et al, 2020).

## Acknowledgements

This research was funded by the National Institute for Health Research (NIHR). Global Health Research Group on Improving Health in Slums.

In addition, we would like to show our gratitude to all of the Volunteers from the OpenStreetMap community, HOT and missing maps.

Finally, our collaborators and funders are: the National Institute for Health Research, The University of Warwick, the African Population and Health Research Center, The Aga Khan University, The Humanitarian OpenStreetMap Team, Missing Maps, HeiGIT.

## References

Eugster, Manuel J a, and Thomas Schlesinger. 2012. “Osmar: OpenStreetMap and R.” The R Journal 5 (1): 53–64.

Johnson, Peter A. 2017. “Models of Direct Editing of Government Spatial Data: Challenges and Constraints to the Acceptance of Contributed Data.” Cartography and Geographic Information Science 44 (2): 128–38. https://doi.org/10.1080/15230406.2016.1176536.

Lovelace, Robin. 2014. “Harnessing Open Street Map Data with R and QGIS.” EloGeo.

OpenStreetMap contributors. 2017. https://www.openstreetmap.org.

Padgham, Mark. 2016. Osmplotr: Customisable Images of Openstreetmap Data. https://cran.r-project.org/package=osmplotr.

Rowlingson, B. and Diggle, P. 1993 Splancs: spatial point pattern analysis code in S-Plus. Computers and Geosciences, 19, 627-655

Chipeta M G, Terlouw D J, Phiri K S and Diggle P J. (2016b). Inhibitory geostatistical designs for spatial prediction taking account of uncertain covariance structure, Enviromentrics, pp. 1-11. https://wiki.openstreetmap.org/wiki/Map_Features

Improving Health in Slums Collaborative. “A protocol for a multi-site, spatially-referenced household survey in slum settings: methods for access, sampling frame construction, sampling, and field data collection.” BMC medical research methodology vol. 19,1 109. 30 May. 2019, doi:10.1186/s12874-019-0732-x

Porto De Albuquerque, Joao & Yeboah, Godwin & Pitidis, Vangelis & Ulbrich, Philipp. (2018). Towards a participatory methodology for community data generation to analyse urban health inequalities: a multi-country case study. 10.13140/RG.2.2.13710.20804.

Yeboah G, Troilo R, Pitidis V, Porto de Albuquerque, J. Analysis of OpenStreetMap data quality at different stages of a participatory mapping process: Evidence from informal urban settings.State of the Map. Heidelberg. Sep 2019.