Prevalence mapping for DHS indicators using the surveyPrev package

Qianyu Dong, Zehang Richard Li, Yunhan Wu, Andrea Boskovic, Jon Wakefield

2024-04-07

1 Overview

In this vignette, we describe how prevalence mapping using DHS data can be carried out. Prevalence mapping is Small Area Estimation (SAE) for proportions of some indicator of interest. The vignette is organized as follows. In Section 2 we first describe the data preparation steps to obtain and process the relevant DHS survey data in R. We then describe three classes of models

  1. Direct estimates in Section 3.
  2. Area-level Fay-Herriot model in Section 4.
  3. Cluster-level model in Section 5.

Then in Section 6 we show several tools for visualization of different estimates, and in Section 7 we discuss how to aggregate the result for a given model to produce estimates at higher admin levels.

2 Data Preparation

2.1 Prerequisites

The following pieces of data are needed to produce SAE for DHS surveys:

2.2 Install surveyPrev

The latest surveyPrev package can be installed from CRAN with

install.packages(surveyPrev)

The development version of the surveyPrev package can be installed from GitHub with

library(devtools)
install_github("richardli/surveyPrev")

After installation, the package can be loaded with

library(surveyPrev)

One of the key required packages, INLA is not available on CRAN or GitHub, and needs to be installed with the following code (Rue, Martino, and Chopin 2009).

install.packages("INLA", repos=c(getOption("repos"), 
                  INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(INLA)

The following packages are needed to run the example analysis described in this vignette.

library(geodata)
library(sf)
library(SUMMER)
library(rdhs)
library(ggplot2)
library(patchwork)
library(dplyr)
library(tidyr)
library(kableExtra)

2.3 Built-in indicators

Currently, the surveyPrev package supports 22 indicators. The list of indicators and their IDs can be found in the surveyPrevIndicators dataset. The ID column correspond to the indicator IDs in the DHS API and are more standardized. The alternative ID column provides convenient and more readable indicator names for only a subset of indicators. Both can be used to retrieve the indicator in the functions described later. The full list of indicators and their IDs are summarized in the Table below.

data(surveyPrevIndicators)
head(surveyPrevIndicators)
Table 2.1: List of built-in indicators in the surveyPrev package.
ID alternative.ID Description Topic
AN_ANEM_W_ANY womananemia Percentage of women aged 15-49 classified as having any anemia Nutrition
AN_NUTS_W_THN Underweight (Prevalence of underweight (BMI < 18.5) women of reproductive age) Nutrition
CH_DIAT_C_ORT Diarrhea treatment (Children under five with diarrhea treated with either ORS or RHF) Child Health
CH_VACC_C_BAS Children age 12-23 months with all 8 basic vaccinations Child Health
CH_VACC_C_DP1 Percentage of children 12-23 months who had received DPT 1 vaccination Child Health
CH_VACC_C_DP3 DPT3 Percentage of children 12-23 months who had received DPT 3 vaccination Child Health
CH_VACC_C_MSL Percentage of children 12-23 months who had received MCV 1 (Measles containing vaccine) Child Health
CH_VACC_C_NON Children 12-23 months with no vaccinations Child Health
CM_ECMR_C_NNR nmr Probability of dying in the first month of life in the five or ten years preceding the survey Infant and Child Mortality
CN_ANMC_C_ANY Children under five with any anemia Nutrition
CN_BRFS_C_EXB Children exclusively breastfed (Prevalence of exclusive breastfeeding of children under six months of age) Nutrition
CN_NUTS_C_HA2 stunting Stunting rate (Prevalence of stunted (HAZ < -2) children under five (0-59 months)) Nutrition
CN_NUTS_C_WH2 wasting Wasting rate (Prevalence of wasted (HAZ < -2) children under five (0-59 months)) Nutrition
FP_CUSA_W_MOD Modern contraceptive prevalence rate (Married women currently using any modern method of contraception) Family Planning
FP_NADA_W_UNT unmet_family Percentage of women with an unmet need for family planning Family Planning
HA_HIVP_B_HIV HIV prevalence HIV prevalence
ML_NETP_H_IT2 Households with at least one insecticide-treated mosquito net (ITN) for every two persons who stayed in the household the previous night Malaria
RH_ANCN_W_N4P ancvisit4+ Antenatal visits for pregnancy: 4+ visits Maternal Healthcare
RH_DELA_C_SKP Assistance during delivery from a skilled provide Maternal Healthcare
WS_SRCE_P_BAS Population using a basic water source Water, Sanitation and Hygiene
WS_TLET_H_IMP sanitation Percentage of households using an improved sanitation facility Water, Sanitation and Hygiene
WS_TLET_P_BAS Population with access to a basic sanitation service Water, Sanitation and Hygiene

In this vignette, we consider estimating the percentage of women who had a live birth in the five years before the survey who had four or more antenatal care visits. This indicator can be extracted with indicator = "ancvisit4+" or indicator = "RH_ANCN_W_N4P".

2.4 Customized indicators

Details on more common DHS indicators can be found in the Guide to DHS Statistics at https://dhsprogram.com/data/Guide-to-DHS-Statistics/. User-specified function can also be used to process raw DHS data into new indicators. Please refer to the vignette on Creating Customized Indicators for surveyPrev for more details.

2.5 DHS survey data

Processing the binary indicator from the raw DHS data consist of two steps: 1. download the relevant DHS data recode using getDHSdata(), or manually from the DHS website, 2. create a data frame with the binary indicator and relevant information using getDHSindicator().

The getDHSdata() function downloads the relevant DHS data directly from the DHS website using the rdhs package. This step requires users to

  1. register with DHS to gain data access,
  2. set up the DHS account login details in R with rdhs::set_rdhs_config().
rdhs::set_rdhs_config(email = "your_email",
                project = "your_registered_DHS_project_title")

After setting up the DHS account information, getDHSdata() can be used to download the relevant survey files directly into R. If data download using the API fails, or if it is preferred to work with downloaded data files offline, you may also manually download the file from the DHS website and read into R. The getDHSdata() function returns a message specifying which file is used (e.g., Individual Record file for the ANC visit example).

indicator <- "ancvisit4+"
year <- 2018
country <- "Zambia"
dhsData <- getDHSdata(country = country, indicator = indicator, year = year)

We then use getDHSindicator() to process the raw survey data into a data frame, where the column value is the indicator of interest. The data frame also contains information specifying survey designs in the survey package, including cluster ID, household ID, survey weight and strata information.

data <- getDHSindicator(dhsData, indicator = indicator)
head(data)
##   cluster householdID    v024  weight strata value
## 1       1           1 eastern 1892890  rural     1
## 2       1           2 eastern 1892890  rural     1
## 3       1           3 eastern 1892890  rural     1
## 4       1           4 eastern 1892890  rural     0
## 5       1           7 eastern 1892890  rural     1
## 6       1           9 eastern 1892890  rural    NA

2.6 Spatial information

The getDHSgeo()function downloads the GPS data for cluster locations, also through the rdhs package. This step loads in the GPS location as a SpatialPointsDataFrame, but it can also be performed manually by downloading the GPS file from the DHS website. The GPS location of the clusters are used to match clusters to regions at different admin levels.

geo <- getDHSgeo(country = country, year = year)

In the case of Zambia, the maps are already included in the package as example datasets (data(ZambiaAdm1) and data(ZambiaAdm2)) so we load them directly within R. For other countries, the Admin 1 and Admin 2 boundary shapefiles can be downloaded from the GADM site3 and read into R manually. Another alternative is to use the geodata and sf package to download the maps from the GADM site and load them in R directly, in the following steps:

poly.adm1 <- geodata::gadm(country="ZMB", level=1, path=tempdir())
poly.adm1 <- sf::st_as_sf(poly.adm1)
poly.adm2 <- geodata::gadm(country="ZMB", level=2, path=tempdir())
poly.adm2 <- sf::st_as_sf(poly.adm2)

The clusterInfo() function extracts Admin 1 and Admin 2 area information for each cluster. To avoid confusion in countries that have duplicated region names, we create new labels for Admin 2 regions using the variable admin2.name.full. The clusters with invalid GPS locations are saved in the wrong.points object and not used in any models.

cluster.info <- clusterInfo(geo=geo, poly.adm1=poly.adm1, poly.adm2=poly.adm2, 
                            by.adm1 = "NAME_1",by.adm2 = "NAME_2")
head(cluster.info$data)
##   cluster  LONGNUM     LATNUM                   geometry admin1.name
## 1       1 32.06631 -13.895685 POINT (32.06631 -13.89568)     Eastern
## 2       2 28.73625 -13.419209 POINT (28.73625 -13.41921)  Copperbelt
## 3       3 31.13254  -8.754055 POINT (31.13254 -8.754055)    Northern
## 4       4 28.44930 -14.419112  POINT (28.4493 -14.41911)     Central
## 5       5 29.96905 -12.677275 POINT (29.96905 -12.67727)     Central
## 6       6 32.74745  -9.319054 POINT (32.74745 -9.319054)    Muchinga
##   admin2.name   admin2.name.full
## 1      Katete     Eastern_Katete
## 2     Masaiti Copperbelt_Masaiti
## 3    Mpulungu  Northern_Mpulungu
## 4       Kabwe      Central_Kabwe
## 5    Chitambo   Central_Chitambo
## 6     Nakonde   Muchinga_Nakonde
cluster.info$wrong.points
##  [1]  19  33  65 179 236 281 293 360 361 469

The adminInfo() function computes the spatial the adjacency matrix, and if supplied with additional information on population, combines needed information for each administrative area, including the name, population, and urban population fractions of each administrative area. The admin.mat object stores the adjacency matrix for spatial models. We will ignore the population information for now and focus on only information from the DHS data, and will return to the population information in Section 7.

admin.info1 <- adminInfo(poly.adm = poly.adm1, admin = 1,by.adm = "NAME_1")
admin.info2 <- adminInfo(poly.adm = poly.adm2, admin = 2,by.adm = "NAME_2",by.adm.upper = "NAME_1")
head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full population urban
## 1     Central      Chibombo      Central_Chibombo         NA    NA
## 2     Central      Chisamba      Central_Chisamba         NA    NA
## 3     Central      Chitambo      Central_Chitambo         NA    NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi         NA    NA
## 5     Central         Kabwe         Central_Kabwe         NA    NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi         NA    NA

3 Direct estimates

3.1 National and Admin 1 direct estimates

For \(y_j\) be the binary outcome of interest for the \(j^{\text{th}}\) individual in the survey and \(w_j\) be the design weight associated with this individual. For a given area denoted as \(i\), we have the weighted estimator:

\[ \hat p^{W}_{i}=\frac {\sum_{j \in S_i} y_j w_j}{\sum_{j \in S_i} w_j}\]

where \(S_i\) is the set of individual index within the \(i\)-th region. The directEST() function calculates direct estimates at different Admin levels using the SUMMER::smoothSurvey() function in the SUMMER package internally (Li et al. 2020). The output of the function includes the direct estimates and their variance, standard error, and confidence intervals based on the specified confidence level. The confidence intervals are computed on the logit scale, i.e., if we use \(D_i\) to denote the design-based variance of \(\hat p^{W}_i\), then the asymptotic design-based variance of \(\text{logit}(\hat p^{W}_i)\) is \[ V_i = \frac{D_i}{(\hat p^{W}_i)^2(1-\hat p^{W}_i)^2} \] and we compute the confidence interval on the probability scale to be \[(\quad \text{expit}[\text{logit}(\hat p^{W}_i) - z_{\alpha/2}V_i^{1/2}], \;\; \text{expit}[\text{logit}(\hat p^{W}_i) + z_{\alpha/2}V_i^{1/2}]\quad). \]

Currently the package defaults to a two-stage stratified cluster sampling design, with the sampling clusters (enumeration areas) being stratified by Admin 1 areas and urban/rural strata, which is the most common sampling design in the DHS.

res_ad1 <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1)
head(res_ad1$res.admin1)
##   admin1.name direct.est   direct.var direct.logit.est direct.logit.var
## 1     Central  0.5845443 0.0007002450        0.3414564      0.011873143
## 2  Copperbelt  0.6023371 0.0006960677        0.4152124      0.012132270
## 3     Eastern  0.6505890 0.0003224625        0.6216294      0.006240117
## 4     Luapula  0.6453174 0.0007216008        0.5985190      0.013774331
## 5      Lusaka  0.5817949 0.0004319779        0.3301459      0.007296978
## 6    Muchinga  0.6884903 0.0004137610        0.7930706      0.008995197
##   direct.logit.prec  direct.se direct.lower direct.upper         cv
## 1          84.22370 0.02646214    0.5319292    0.6352999 0.04526970
## 2          82.42480 0.02638310    0.5496679    0.6527379 0.04380122
## 3         160.25341 0.01795724    0.6146268    0.6849157 0.02760151
## 4          72.59881 0.02686263    0.5910940    0.6960479 0.04162700
## 5         137.04303 0.02078408    0.5405908    0.6218883 0.03572406
## 6         111.17044 0.02034112    0.6472976    0.7269017 0.02954452

For national level (admin = 0), the function can additionally produce urban- and rural-specific direct estimates by specifying argument strata.

res_ad0 <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="all")
res_ad0_urban <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="urban")
res_ad0_rural <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="rural")
list(all = res_ad0$res.admin0, 
      urban = res_ad0_urban$res.admin0, 
      rural = res_ad0_rural$res.admin0)
## $all
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6344011 7.614786e-05        0.5511447      0.001415533          706.4475
##     direct.se direct.lower direct.upper
## 1 0.008726274    0.6171347    0.6513289
## 
## $urban
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6066641 0.0002399573        0.4333115      0.004214152          237.2957
##    direct.se direct.lower direct.upper
## 1 0.01549055    0.5759275    0.6365788
## 
## $rural
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6519019 0.0001103517        0.6274099      0.002142946          466.6473
##    direct.se direct.lower direct.upper
## 1 0.01050484    0.6310396    0.6721974

3.2 Admin 2 direct estimates

In general, direct estimates at Admin 2 are unstable. For countries with a large number of Admin 2 areas, some of the direct estimates can be undefined if there is no cluster in the area.

res_ad2 <- directEST(data = data,
                     cluster.info = cluster.info,
                     admin = 2,
                     aggregation = FALSE)

4 Area-level Fay-Herriot model

4.1 Admin 1 Fay-Herriot estimates

Fay-Herriot models provides smoothed estimates at the areal level using direct estimate \(\hat p^{W}_{i}\) as input. The direct estimates are modeled as a noisy observation of the true prevalence, with the variance of noise determined by the design-based variance. We consider the spatial Fay-Herriot model for the logit transformed direct estimates, which is defined as follows:

\[\text{logit}(\hat p^{W}_{i})|\lambda_{i} \sim\textrm{Normal}(\lambda_{i}, V_{i}^{HT}),\] \[\lambda_{i}= \alpha + e_i+S_i.\] Here \(\text{expit}(\lambda_{i})\) is the latent true prevalence, and \(e_i\) and \(S_i\) are unstructured and structured spatial random effects. Inference is carried out using Bayesian methods and so the model specific is completed by priors on \(\alpha\), \(e\) and \(S\), and their hyperpriors. More details of the Bayesian model setup can be found in Wakefield, Okonek, and Pedersen (2020). Area level Fay-Herriot model are viewed as the most reliable model choice, since they acknowledge the design through the sue of a weighted estimate and its associated variance. See chapters 4 to 6 of Rao and Molina (2015).

As of now the package allows only an overall intercept \(\alpha\), but future versions of the package will allow area level covariates to be included. The default prior for the intercept is \(N(0, 1000)\). The structured and non-structured random effects are implemented used Besag-York-Mollié (BYM) via BYM2 parameterization, with default PC priors such that the marginal standard deviation has a prior such that \(Prob(\sigma > 1) = 0.01\) and the proportion of variation explained by the spatial effect, \(\phi\) has a prior such that \(Prob(\phi > 0.5) = 2/3\) (Riebler et al. 2016; Simpson et al. 2017).

The fhModel() calculates spatial Fay-Herriot estimates at different administrative levels using SUMMER::smoothSurvey(). Users can specify either only i.i.d. random effects (i.e., no \(S_i\) term) or the BYM2 model by setting model = "iid" or model = "bym2". The function returns model results at the specified admin level with aggregation = FALSE. Aggregated to higher admin levels is possible with aggregation = TRUE when additional population information is provided. The aggregation steps are described in more details in Section 7.

smth_res_ad1_bym2 <- fhModel(data,
                        cluster.info = cluster.info,
                        admin.info = admin.info1,
                        admin = 1,
                        model = "bym2",
                        aggregation =FALSE)

smth_res_ad1_iid <- fhModel(data, 
                        cluster.info = cluster.info, 
                        admin.info = admin.info1,
                        admin = 1, 
                        model = "iid",
                        aggregation =FALSE)
head(smth_res_ad1_bym2$res.admin1) 
##   admin1.name      mean          var    median     lower     upper logit.mean
## 1     Central 0.6028279 0.0005262850 0.6035275 0.5559112 0.6455631  0.4181968
## 2  Copperbelt 0.6130262 0.0005303788 0.6135925 0.5663871 0.6567400  0.4611124
## 3     Eastern 0.6479354 0.0002699419 0.6480306 0.6153815 0.6799110  0.6107456
## 4     Luapula 0.6468364 0.0005282296 0.6471917 0.6007430 0.6910090  0.6066520
## 5      Lusaka 0.5941311 0.0003830601 0.5944710 0.5549919 0.6312606  0.3816880
## 6    Muchinga 0.6768713 0.0003558819 0.6769527 0.6397106 0.7137660  0.7407531
##     logit.var logit.median logit.lower logit.upper         sd         cv
## 1 0.009195237    0.4201849   0.2245841   0.2245841 0.02294090 0.03805548
## 2 0.009451857    0.4624387   0.2671258   0.2671258 0.02302995 0.03756765
## 3 0.005201276    0.6103938   0.4699907   0.4699907 0.01642991 0.02535732
## 4 0.010167730    0.6067175   0.4085620   0.4085620 0.02298325 0.03553177
## 5 0.006601776    0.3824795   0.2208608   0.2208608 0.01957192 0.03294209
## 6 0.007476746    0.7398026   0.5741084   0.5741084 0.01886483 0.02787064

Columns 2-6 represent the posterior mean, posterior variance, posterior median and 2.5%, and 97.5% posterior quantiles, respectively. The posterior variance can be viewed as an analogue of the mean squared error (MSE) summary.

4.2 Admin 2 Fay-Herriot estimates

A Fay-Herriot model at Admin 2 level can be fitted by treating areas without direct estimates as missing data, though it is usually not recommended due to data sparsity. In addition, numerical issue arise when design-based variance of direct estimate is close to zero and logit precision become too large. A potential fix for this issue is to identify regions with too small design variance(< 1e-30), and delete the clusters in these regions, before fitting the model. However, this step creates additional bias in the results if the number of clusters deleted is large.

bad_admin2 <- subset(res_ad2$res.admin2, direct.var < 1e-30)$admin2.name.full
# identify clusters in these regions
bad_clusters <- subset(cluster.info$data, admin2.name.full %in% bad_admin2)$cluster
# Run FH model without these clusters
smth_res_ad2 <- fhModel(subset(data, !cluster %in% bad_clusters),
                    cluster.info = cluster.info,
                    admin.info = admin.info2,
                    admin = 2,
                    model = "bym2",
                    aggregation = FALSE)
head(smth_res_ad2$res.admin2)
##        admin2.name.full      mean          var    median     lower     upper
## 1      Central_Chibombo 0.6535171 0.0026360727 0.6550209 0.5480457 0.7496094
## 2      Central_Chisamba 0.5579803 0.0006107532 0.5581483 0.5089874 0.6060565
## 3      Central_Chitambo 0.6217739 0.0101383288 0.6276832 0.4110805 0.8000109
## 4  Central_Itezhi-tezhi 0.6436590 0.0101398292 0.6505298 0.4284150 0.8197984
## 5         Central_Kabwe 0.5213792 0.0011020175 0.5213857 0.4560794 0.5861891
## 6 Central_Kapiri Mposhi 0.5106414 0.0015870676 0.5106247 0.4323244 0.5880890
##   logit.mean  logit.var logit.median logit.lower logit.upper         sd
## 1 0.64252506 0.05257552   0.64118322  0.19277749  0.19277749 0.05134270
## 2 0.23355308 0.01008966   0.23365025  0.03595353  0.03595353 0.02471342
## 3 0.52003748 0.19864247   0.52229030 -0.35950027 -0.35950027 0.10068927
## 4 0.61982125 0.20921818   0.62136870 -0.28832074 -0.28832074 0.10069672
## 5 0.08594926 0.01785317   0.08559503 -0.17613618 -0.17613618 0.03319665
## 6 0.04284320 0.02574167   0.04250537 -0.27237407 -0.27237407 0.03983802
##           cv
## 1 0.07856367
## 2 0.04429085
## 3 0.16193872
## 4 0.15644419
## 5 0.06367084
## 6 0.07801564

5 Cluster-level model

5.1 Unstratified model

Cluster-level models assume smoothing models for counts of events in each cluster (Wakefield, Okonek, and Pedersen 2020; Li et al. 2020). In terms of traditional SAE literature, cluster-level models are a type of unit-level model. We start with describing an unstratified model without taking into account the urban/rural stratification in the sampling design.

Let \(Y_c\) be the number of events in cluster \(c\), and \(n_c\) be the number of individuals at risk, where \(c= 1,\dots,C\). The unstratified model assumes the hierarchical structure:

\[Y_c \mid p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\]

where \(\alpha\) is the intercept, and \(i[s_c]\) indexes the area within which the cluster \(s_c\) resides. Similar to the area-level model, \(e_i\) and \(S_i\) are unstructured and structured spatial random effects with the same prior as before. The Beta-binomial distribution arise from a hierarchical model where the probability follows a \(\text{Beta}(a, b)\) prior. The overdispersion parameter, \(d=\frac{1}{\alpha+\beta+1}\), is between 0 and 1 and represent the the intracluster correlation between Bernoulli draws within cluster. The default prior for \(d\) is \(\text{logit}(d) \sim \text{Normal}(0,0.4)\).

The clusterModel() function fits the cluster-level model above. Unstratified model can be specified by setting stratification = FALSE, and either i.i.d. or the BYM2 model by setting model = "bym2" or model = "iid". For the the overdispersion parameter \(d\), users can change the prior mean and precision through overdisp.mean and overdisp.prec.

We fit the unstratified model in both Admin 1 and Admin 2 below. These models differ only in the admin level at which the spatial random effects enter, with the cluster-level model being the same in each case.

cl_res_ad1 <- clusterModel(data=data,
                   cluster.info=cluster.info,
                   admin.info  = admin.info1,
                   stratification = FALSE,
                   model = "bym2",
                   admin = 1,
                   aggregation =FALSE,
                   CI = 0.95)

cl_res_ad2 <- clusterModel(data=data,
                   cluster.info= cluster.info,
                   admin.info = admin.info2,
                   model = "bym2",
                   stratification =FALSE,
                   admin = 2,
                   aggregation =FALSE,
                   CI = 0.95)
head(cl_res_ad2$res.admin2)
##        admin2.name.full      mean    median         sd         var     lower
## 1      Central_Chibombo 0.6448760 0.6447917 0.04581087 0.002098636 0.5518984
## 2      Central_Chisamba 0.5963255 0.5973862 0.05510070 0.003036088 0.4839398
## 3      Central_Chitambo 0.5798794 0.5841081 0.06852435 0.004695587 0.4385194
## 4  Central_Itezhi-tezhi 0.6458915 0.6466465 0.07320677 0.005359231 0.4920623
## 5         Central_Kabwe 0.5445200 0.5434148 0.04622923 0.002137142 0.4522349
## 6 Central_Kapiri Mposhi 0.5598836 0.5622164 0.04545403 0.002066069 0.4685788
##       upper         cv admin1.name   admin2.name population urban
## 1 0.7313468 0.07103827     Central      Chibombo         NA    NA
## 2 0.6996858 0.09240039     Central      Chisamba         NA    NA
## 3 0.7063556 0.11817001     Central      Chitambo         NA    NA
## 4 0.7798355 0.11334222     Central  Itezhi-tezhi         NA    NA
## 5 0.6331086 0.08489905     Central         Kabwe         NA    NA
## 6 0.6465233 0.08118479     Central Kapiri Mposhi         NA    NA

5.2 Stratified model

Cluster-level models will produce biased estimates if the specified model does not hold for all units in the population. In the DHS, urban clusters are often oversampled. If such oversampling occurs and the outcome associated with urban/rural, then bias will result when the model does not allow for differences between urban/rural. To emphasize, bias will only arise when oversampling of urban (or rural) areas occurs, and the indicator has an association with urban/rural. One may check whether one needs to include urban/rural terms in the model, and then aggregate by urban/rural.

A fully adjusted model would include interaction terms representing the crossing of Admin 1 areas and urban/rural status. However, for reasons of parsimony we include area-level random effects and a single urban/rural fixed effect. The sampling model becomes

\[Y_c|p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+\gamma\times I(s_c \in \textrm{urban})+x_c\beta+e_{i[s_c]}+S_{i[s_c]}).\]

The area-level risk is defined as \[p_i=q_i\times \text{expit}(\alpha+\gamma+e_{i[s_c]}+S_{i[s_c]})+(1-q_i)\times \text{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\] where \(q_i\) is the proportion of urban population in area \(i\) where “urban” is defined by the survey design sampling frame.

Stratified cluster-level model can be fit by setting stratification = TRUE. The returned results then consist of urban-specific, rural-specific, and the overall area-level prevalences. In order to facilitate aggregation for cluster-level model with urban/rural effects, we need to know the fraction of urban population by each administrative region. This computation can be performed with the getUR() function (see example for details), but it requires external information (i.e., the urban/rural definition by census), so we omit this step from this document for now and assume such information is already obtained from a difference source. he urban proportions for women of age 15 to 49 in Zambia in 2018 has been pre-computed and saved in ZambiaPopWomen$admin1_urban and ZambiaPopWomen$admin2_urban.

head(ZambiaPopWomen$admin2_urban)
##     admin1.name admin2.name       urban
## 1       Eastern     Chadiza 0.049373174
## 2      Muchinga       Chama 0.054972234
## 3       Eastern     Chasefu 0.001718483
## 4 North-Western     Chavuma 0.325830536
## 5       Luapula      Chembe 0.000000000
## 6       Central    Chibombo 0.577152577

Given the urban/rural proportions, we first add the information into admin.info2 using adminInfo().T

admin.info2 <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,
                         proportion = ZambiaPopWomen$admin2_urban,
                         by.adm="NAME_2",by.adm.upper="NAME_1")
head(admin.info2$data)
##   admin1.name   admin2.name      admin2.name.full      urban population
## 1     Central      Chibombo      Central_Chibombo 0.57715258         NA
## 2     Central      Chisamba      Central_Chisamba 0.03825222         NA
## 3     Central      Chitambo      Central_Chitambo 0.00000000         NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi 0.00000000         NA
## 5     Central         Kabwe         Central_Kabwe 0.68604463         NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326         NA

We can then fit the stratified cluster-level model by specifying stratification = TRUE.

cl_res_ad2_T <- clusterModel(data=data,
                   cluster.info=cluster.info,
                   admin.info = admin.info2,
                   model = "bym2",
                   stratification = TRUE,
                   admin = 2, 
                   CI = 0.95)

6 Visualizing prevalence estimates

We can use mapPlot() function from the SUMMER package to visualize estimates and standard errors.

# Arrange all estimates into a long-format data frame
out1 <- res_ad1$res.admin1[, c("admin1.name", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad1_bym2$res.admin1[, c("admin1.name", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad1$res.admin1[, c("admin1.name", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"

g1 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "mean", legend.label = "Mean")

g2 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "cv", legend.label = "CV")
g1 / g2
Comparing Admin 1 direct estimates, Fay–Herriot model, and unstratified cluster-level posterior mean estimates

Figure 6.1: Comparing Admin 1 direct estimates, Fay–Herriot model, and unstratified cluster-level posterior mean estimates

We note that for Admin 2 estimates, surveyPrev uses a recreated region name in the form of [admin1_name]_[admin2_name] to avoid duplicated Admin 2 names. This new region identifier is assigned a column name of admin2.name.full in the output. Thus when plotting, we need to manually create the admin2.name.full in the spatial polygon object.

poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)

# Arrange all estimates into a long-format data frame
out1<- res_ad2$res.admin2[, c("admin2.name.full", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"
out4 <- cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full", c("admin2.name.full", "mean", "cv")]
out4$model <- "Stratified Cluster-level Model"

g1 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "mean", 
              legend.label = "Mean", ncol = 4)

g2 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "cv", 
              legend.label = "CV", ncol = 4)
g1 / g2 
Comparing Admin 2 direct estimates, Fay–Herriot model, unstratified and stratified cluster-level posterior mean estimates

Figure 6.2: Comparing Admin 2 direct estimates, Fay–Herriot model, unstratified and stratified cluster-level posterior mean estimates

The scatterPlot() function takes results from two model outputs, and produce a scatter plot of the selected variable. For example, we can compare the direct estimates with the Fay-Herriot model (top row) and Cluster-level model (bottom row) in terms of their point estimates and standard deviations. In the two plots on the left, we see the expected shrinkage (attenuation) of the spatial Fay–Herriot posterior mean estimates as compared to the direct estimates. In the plots on the right we see the reduction in uncertainty which arises from using all of the data in a single Model.

s1 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct estimate",
          label1="Direct Est",
          label2="Cluster-level model")
s4 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct SE",
          label1="Direct Est",
          label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)
Comparing  direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row).

Figure 6.3: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row).

Similar smoothing patterns can be observed when comparing estimates at Admin 2 level.

s1 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.est",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.se",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.est",
        value2="mean",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct estimate",
        label1="Direct Est",
        label2="Cluster-level model")
s4 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.se",
        value2="sd",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct SE",
        label1="Direct Est",
        label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)
Comparing  direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row). Red triangles corresponding to areas where direct estimates are not available.

Figure 6.4: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row). Red triangles corresponding to areas where direct estimates are not available.

Finally, we compare the unstratified and stratified cluster-level model at Admin 2 level. We can observe mild differences but the estimates are mostly similar between the two models.

s1 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="mean",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model estimate",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
s2 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="sd",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model SD",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
(s1 + s2 )  
Comparing unstratified and stratified cluster-level model at Admin 2 level.

Figure 6.5: Comparing unstratified and stratified cluster-level model at Admin 2 level.

Another visualization function intervalPlot() can provide interval plots at different admin levels for model comparison with compare=T. Input should be a list of model outputs by surveyPrev with self defined model name, (e.g.model=list("name 1"= fit1,"name 2"= fit2)).

intervalPlot(admin = 1, compare = TRUE, model = list(
      "Direct estimate model"= res_ad1,
      "Fay-Herriot model"= smth_res_ad1_bym2,
      "Unstratified Cluster-level model"= cl_res_ad1))