Generic small area estimation

Peter Gao and Jon Wakefield


In this vignette, we illustrate the use of the smoothArea and smoothUnit functions for generic small area estimation of outcomes other than under-five mortality rates.

Load packages

First, we load the necessary packages and the necessary data. The required package INLA is not in a standard repository, so we install if it is not available. The survey package will be used to generate direct estimates, while dplyr and tidyr will be used to facilitate data manipulation.

if (!isTRUE(requireNamespace("INLA", quietly = TRUE))) {
  install.packages("INLA", repos=c(getOption("repos"), 
                  INLA=""), dep=TRUE)

Area level models

Income data

First, we compare our results with those obtained using the sae package using examples presented in the sae vignette. Molina and Marhuenda (2015) generate an artificial dataset on income and other related variables to illustrate the use of area level models in the sae package. In their simulated example, the objective is to estimate prevalence of poverty in Spanish counties.

povertyline <- 0.6 * median(incomedata$income)  # 6557.143
incomedata$poor <- as.integer(incomedata$income < povertyline)

The incomedata data frame contains information on 17199 observations of individuals in 52 Spanish provinces. Income values and sampling weights are provided for each individual along with covariate information including age group and education level. Molina and Marhuenda (2015) define the poverty line z, and calculate an indicator variable with value 1 if the corresponding income value is below the poverty line and 0 otherwise.

Direct estimation with sae

The sae::direct function computes the Horvitz-Thompson estimator of domain means, given by,

\[\widehat{\overline{Y}}_{d}^{\text{DIR}}=\frac{1}{N_d}\sum_{k\in S_d}w_{k}y_{k}\]

where \(N_d\) is the population size of domain \(d\), \(S_d\) is the set of sampled observations in domain \(d\), \(w_{k}\) is the sampling weight for unit \(k\), and \(y_{k}\) is the observation for unit \(k\), for all \(k \in S_d\). The sae::direct function also estimates standard deviation and coefficient of variation for each domain.

Popn <- sizeprov[, c("provlab", "Nd")]
sae.DIR <- sae::direct(y = incomedata$poor, dom = incomedata$provlab,
    sweight = incomedata$weight, domsize = Popn)
knitr::kable(head(sae.DIR), digits = 3)
Domain SampSize Direct SD CV
Alava 96 0.255 0.048 19
Albacete 173 0.141 0.030 22
Alicante 539 0.205 0.022 11
Almeria 198 0.265 0.041 15
Avila 58 0.055 0.026 46
Badajoz 494 0.210 0.023 11

Direct estimation with survey

We can similarly use the survey::svyby function to compute the Horvitz-Thompson estimates:

incomedata$pop <- sum(sizeprov$Nd[match(incomedata$provlab, sizeprov$provlab)])
design <- survey::svydesign(ids = ~1, weights = ~weight, data = incomedata,
    fpc = ~pop)

# estimate area totals
svy.DIR <- survey::svyby(~poor, ~provlab, design, svytotal)

# calculate corresponding area mean estimates
svy.DIR$prov_pop <- sizeprov$Nd[match(svy.DIR$provlab, sizeprov$provlab)]
svy.DIR$poor_mean <- svy.DIR$poor/svy.DIR$prov_pop
svy.DIR$poor_mean_se <- svy.DIR$se/svy.DIR$prov_pop
knitr::kable(head(svy.DIR) %>%
    select(provlab, poor_mean, poor_mean_se), digits = 3)
provlab poor_mean poor_mean_se
Alava Alava 0.255 0.048
Albacete Albacete 0.141 0.030
Alicante Alicante 0.205 0.022
Almeria Almeria 0.265 0.041
Avila Avila 0.055 0.026
Badajoz Badajoz 0.210 0.023

Fay-Herriot model

The Fay-Herriot model is a basic area level model used to obtain small area estimators Fay and Herriot (1979). The Fay-Herriot model is often specified as a two-stage model; the first stage is the sampling model and the second stage is the linking model.

First stage:

Let \(\widehat\theta^{\text{DIR}}_d\) be a direct estimator of \(\theta_d\): \[\widehat\theta^{\text{DIR}}_d=\theta_d+\epsilon_d;\hspace{2em}\epsilon_d\sim_{ind}N(0,V_d),\hspace{2em}d=1,\ldots, D,\]

where \(V_d\) is the known sampling variance of the direct estimator \(\widehat{\theta}^{\text{DIR}}_d\).

Second stage:

\[\theta_d = \mathbf{x}_d^T\boldsymbol\beta+u_d,\hspace{2em}u_d\sim_{ind}N(0,\sigma_u^2)\hspace{2em}d=1,\ldots, D,\]

where \(\sigma_u^2\) (between-area residual variance) is estimated. In this basic Fay-Herriot model, the area-specific random effects \(u_d\) are assumed to be independent between areas.

Implementing Fay-Herriot model with sae

Below, we provide a quantile-quantile plot comparing the direct estimates to a Gaussian distribution. Here the observed quantiles align well with those from a Gaussian distribution, which lends some support to the basic IID model.

par(pty = "s")
muD <- mean(sae.DIR$Direct)
sdD <- sd(sae.DIR$Direct)
qqnorm((sae.DIR$Direct - muD)/sdD, main = "")
abline(0, 1, col = "red")

The sae::mseFH function calculates the empirical best linear unbiased predictors (EBLUP) for all domain means as well as their estimated MSEs.

sae.FH <- sae::mseFH(sae.DIR$Direct ~ 1, sae.DIR$SD^2)

Implementing Fay-Herriot model in SUMMER

The smoothArea function compute direct estimates and then produces smoothed estimates using a Bayesian Fay-Herriot model. The main arguments of interest are:

  • formula: describing the response variable and any area-level covariates
  • domain a one-sided formula with the variable containing domain labels on the right. The domain labels variable should be contained in the dataset used to generate the design.
  • design: a object containing survey data and specifying the survey design.

In addition, other commonly used optional arguments include:

  • Amat: Optional adjacency matrix if a spatial smoothing model is desired.
  • responseType: either "gaussian" or "binary" depending on the type of response variable. If "binary" is specified, a logit transform will be applied to the direct estimates before smoothing.
  • direct.est: If desired, the direct estimates may be specified and smoothing will be applied directly to the direct estimates.
  • X.area: Areal covariates data frame.
  • domain.size: Domain size data frame used for computing direct estimates if domain sizes are known.

Other optional arguments can be specified to change the priors and are described further in the documentation.

We obtain direct estimates and smoothed estimates below, fitting the Fay-Herriot model to the untransformed direct estimates.

# fit the model
domain.size <- sizeprov[, c("provlab", "Nd")]
colnames(domain.size)[2] <- "size"

summer.FH <- smoothArea(formula = poor ~ 1, domain = ~provlab,
    design = design, domain.size = domain.size)

The fitted parameters from sae (obtained via likelihood-based methods) and estimated parameter posterior distribution from SUMMER (obtained from Bayesian methods, implemented via INLA) are in reasonable agreement. The estimated intercept \(\beta_0\) from sae is 0.2 ; the posterior median of \(\beta_0\) from SUMMER is 0.2. In the absence of strong priors, fixed effects are usually in close agreement, with the posterior being symmetric. The estimated precision \(1/\sigma_u^2\) from sae is 281.35 , while the posterior median of \(1/\sigma_u^2\) from SUMMER is 273.4. The differences are larger here, but the posterior for the variance is skewed, and we would expect the posterior median to be smaller than the REML estimate. The area estimates and measures of uncertainty are in close agreement, however.

We first illustrate the shrinkage of the EBLUP estimates, and the reduced uncertainty:

par(mfrow = c(1, 2))
plot(sae.DIR$Direct, sae.FH$est$eblup,
     xlab = "Direct estimates",ylab = "sae package",
     xlim=c(min(sae.DIR$Direct, sae.FH$est$eblup),max(sae.DIR$Direct, sae.FH$est$eblup)),
     ylim=c(min(sae.DIR$Direct, sae.FH$est$eblup),max(sae.DIR$Direct, sae.FH$est$eblup)),
     main = "Small area estimates")
plot(sae.DIR$SD^2, sae.FH$mse,
     xlab = "Direct estimates",ylab = "sae package",
     xlim=c(min(sae.DIR$SD^2, sae.FH$mse),max(sae.DIR$SD^2, sae.FH$mse)),
     ylim=c(min(sae.DIR$SD^2, sae.FH$mse),max(sae.DIR$SD^2, sae.FH$mse)),
     main = "Estimates of uncertainty")

Now compare EBLUP and HB, using posterior variance from SUMMER and estimated MSE from sae to measure uncertainty:

par(mfrow = c(1, 2))
plot(sae.FH$est$eblup, summer.FH$s.dir.iid.est$median,
     xlab = "SAE package",ylab = "SUMMER package",
     main = "Small area estimates")
     xlab = "sae package mse",
     ylab = "SUMMER package model variance",
     main = "Estimates of mse/variance")

Milk expenditures

Molina and Marhuenda (2015) also use a real dataset on milk expenditures Arora and Lahiri (1997) to illustrate the implementation of area level models in the sae package. The milk dataset contains information on direct estimators for area level expenditures on milk based on data from the dairy survey component of the Consumer Expenditure Survey conducted by the U.S. Census Bureau. Weights are based on inverse sampling probabilities and adjusted for non-response/post-stratification.

milk$Var <- milk$SD^2
SmallArea ni yi SD CV MajorArea Var
1 191 1.10 0.16 0.15 1 0.03
2 633 1.07 0.08 0.07 1 0.01
3 597 1.10 0.08 0.07 1 0.01
4 221 0.63 0.11 0.17 1 0.01
5 195 0.75 0.12 0.16 1 0.01
6 191 0.98 0.14 0.14 1 0.02

Here, SmallArea denotes areas of inferential interest, yi gives the direct estimates, ni the sample sizes, SD and CV give the estimated standard deviation and coefficient of variation, and MajorArea corresponds to major areas (US regions defined by You and Chapman (2006)).

Results using sae

sae.FH.milk <- mseFH(yi ~ as.factor(MajorArea), SD^2)

Results using SUMMER

Since the milk dataset only provides area level direct estimates and not unit level data, we use the direct.est argument as input.

# Major Area fixed effects
Xmat <- milk[, c("SmallArea", "MajorArea")]
Xmat$MajorArea <- as.factor(Xmat$MajorArea)

# format direct estimates for SUMMER
milk.dir <- milk[, c("SmallArea", "yi", "Var")]
# Fit the model with Major Area intercepts
summer.FH.milk<- smoothArea(formula = yi~MajorArea, 
                            direct.est = milk.dir, 
                            X.area = Xmat,
                            domain = ~SmallArea)

Again note that the posterior median for \(\sigma_u^2\) is larger than the REML estimate.

# random effects precision
1 / sae.FH.milk$est$fit$refvar 
## [1] 54
## [1] 54

# estimated fixed effects
##                         beta std.error tvalue  pvalue
## X(Intercept)            0.97     0.069   14.0 2.8e-44
## Xas.factor(MajorArea)2  0.13     0.103    1.3 2.0e-01
## Xas.factor(MajorArea)3  0.23     0.092    2.5 1.4e-02
## Xas.factor(MajorArea)4 -0.24     0.082   -3.0 3.1e-03
##              mean    sd 0.025quant 0.5quant 0.97quant
## (Intercept)  0.97 0.071      0.830     0.97     1.102
## MajorArea2   0.13 0.105     -0.072     0.13     0.333
## MajorArea3   0.23 0.094      0.041     0.23     0.404
## MajorArea4  -0.24 0.083     -0.405    -0.24    -0.084

Again, we observe good agreement.

par(mfrow = c(1, 2))
     xlab = "sae package",
     ylab = "SUMMER package",
     main = "Small area estimates")
     xlab = "sae package mse",
     ylab = "SUMMER package model variance",
     main = "Estimates of mse/variance")

Spatial Fay-Herriot

The sae package also provides tools for implementing a spatial version of the Fay-Herriot model which assumes that the vector of area specific effects follows a first order simultaneous autoregressive, or SAR(1), process: \[\mathbf{u}=\rho_1\mathbf{Wu}+\boldsymbol\epsilon,\hspace{1em}\boldsymbol\epsilon\sim N(\mathbf{0}_D,\sigma_I^2\mathbf{I}_D),\] where \(\mathbf{I}_D\) is the identity matrix for the \(D\) areas and \(\mathbf{0}_D\) is a vector of zeroes of size \(D\). Additionally, \(\rho_1\in(-1,1)\) is an autoregression parameter and \(\mathbf{W}\) is an adjacency matrix (with rows standardized to sum to 1).

The sae::mseSFH function estimates the unknown variance parameters, the resulting EBLUP small area estimators, and then uses bootstrap methods to estimate the MSE of the estimators.

To illustrate the use of this function, Molina and Marhuenda (2015) consider a synthetic dataset concerning grape production surface area for 274 Italian municipalities. Below we load the relevant objects from the sae package. The grapes dataset containes direct estimators of the mean surface area in hectares for grape production in each municipality (grapehect), the sampling variance of these direct estimators (var), and relevant covariates including number of working dats and overall agrarian surface area. The grapesprox object contains the relevant adjacency matrix representing the municipalities’ neighborhood structure.


Results using sae

sae.FH.grapes <- sae::mseSFH(grapehect ~ area + workdays - 1, var, grapesprox, data = grapes)

results <- data.frame(DIR = grapes$grapehect,
                      eblup.SFH = sae.FH.grapes$est$eblup,
                      mse = sae.FH.grapes$mse)
# reorder results for comparison later
results$area_name <- paste0('area_', rownames(results))

Results using SUMMER

The smoothSurvey function also allows the use of a model with spatially correlated area effects, but the default implementation assumes a BYM2 model for \(\mathbf{u}\) rather than a simultaneous autoregressive model as in the SFH model implemented in sae.

# create area_name as SUMMER requires rownames of A_mat to match area variable
grapes$area_name <- paste0('area_', rownames(grapes))
Amat_grapes <- as.matrix(grapesprox)
rownames(Amat_grapes) <- colnames(Amat_grapes) <- grapes$area_name
X_grapes <- grapes[,c('area_name', 'area', 'workdays')]

# format direct estimates for SUMMER
grapes.dir <- grapes[, c(5, 1, 4)]
# scale direct estimates for use with INLA
grapes.dir$grapehect <- grapes.dir$grapehect / 10
grapes.dir$var <- grapes.dir$var/ 100
summer.FH.grapes<- smoothArea(formula = grapehect~area + workdays, 
                              direct.est = grapes.dir, X.area = X_grapes,
                              domain = ~area_name, Amat = Amat_grapes)

Despite the differing models, we again observe good agreement with the estimates, though less so with the estimates of uncertainty, which is interesting.

par(mfrow = c(1, 2))
     summer.FH.grapes$s.dir.sp.est$median * 10,  
     xlab = "sae package",
     ylab = "SUMMER package",
     main = "Small area estimates")
abline(0,1, col = 'red')
     summer.FH.grapes$s.dir.sp.est$var * 100,  
     xlab = "sae package mse",
     ylab = "SUMMER package model variance",
     main = "Estimates of mse/variance")
abline(0,1, col = 'red')

Unit Level Models

The nested error model, introduced by Battese, Harter, and Fuller (1988), uses auxiliary data at the unit level.

Nested error model: \[y_{dk}=\mathbf{x}_{dk}^T\boldsymbol\beta+u_d+\epsilon_{dk},\hspace{1em}u_d\sim_{ind}N(0,\sigma_u^2),\hspace{1em}\epsilon_{dk}\sim_{ind}N(0,\sigma_\epsilon^2)\]

Here \(u_d\) are area random effects and \(\epsilon_{dk}\) are unit level errors.

This model assumes there is no sample selection bias.

The sae package conducts estimation of domain means by first estimating variance parameters \(\sigma^2_u\) and \(\sigma^2_\epsilon\). Next, given known variance parameters, domain means \(\theta_d\) are predicted by calculating the EBLUPs.

The area fitted values are: \[\widehat{y}_d^{\text{EBLUP}} = f_d \overline{y}_{dS} + (\overline{X}_d-f_d \overline{x}_{dS})\widehat{\beta} + (1-f_d)\widehat{u}_d,\] where

Corn and Soy Production

The cornsoybean and cornsoybeanmeans datasets contain info on corn and soy beans production in 12 Iowa counties Battese, Harter, and Fuller (1988). The objective here is use satellite imagery of the number of pixels assigned to corn and soy to estimate the hectares grown of corn.

The variables MeanCornPixPerSeg and MeanSoyBeansPixPerSeg provide the known county means of the auxiliary variables.

We load the sample data:

##    County CornHec SoyBeansHec CornPix SoyBeansPix
## 1       1     166         8.1     374          55
## 2       2      96       106.0     209         218
## 3       3      76       103.6     253         250
## 4       4     185         6.5     432          96
## 5       4     116        63.8     367         178
## 6       5     162        43.5     361         137
## 7       5     152        71.4     288         206
## 8       5     162        42.5     369         165
## 9       6      93       105.3     206         218
## 10      6     150        76.5     316         221

Next, we load the population auxiliary information:

Xmean <-
  data.frame(cornsoybeanmeans[, c("CountyIndex",
Popn <-
  data.frame(cornsoybeanmeans[, c("CountyIndex",
##   CountyIndex MeanCornPixPerSeg MeanSoyBeansPixPerSeg
## 1           1               295                   190
## 2           2               300                   197
## 3           3               290                   205
## 4           4               291                   220
## 5           5               318                   188
## 6           6               257                   247

The sae::pbmseBHF function obtains EBLUPs under the nested error model and then uses a parametric bootstrap approach to estimate MSEs.

cornsoybean <- cornsoybean[-33, ] # remove outlier
sae.bhf <- 
  pbmseBHF(CornHec ~ CornPix + SoyBeansPix,
           dom = County, meanxpop = Xmean,
           popnsize = Popn, B = 200, 
           data = cornsoybean)

SUMMER::smoothSurvey provides the ability to fit unit level models with unit level covariates for Gaussian response variables. Below we use the is.unit argument to specify a unit level model and then provide the column names of the unit level covariates in X.unit. Finally, the X argument provides the area level means of each covariate for use when generating predictions. Note that in order to align the SUMMER estimates with those from the sae package, we specify a relatively flat prior on the variance of the area-specific random effect (pc.u = 100, pc.alpha = 0.01 specifies a penalized complexity prior such that \(P(\sigma_u > 100)=0.01\) where \(\sigma_u\) is the standard deviation of the area-specific random effects).

cornsoybean$id <- 1:dim(cornsoybean)[1]
Xsummer <- Xmean
colnames(Xsummer) = c("County", "CornPix", "SoyBeansPix")
des0 <- svydesign(ids = ~1, data = cornsoybean)
summer.bhf.unit <- smoothUnit(formula = CornHec ~ CornPix + SoyBeansPix,
                              family = "gaussian",
                              domain = ~County,
                              design = des0, X.pop = Xsummer,
                              pc.u = 1000, pc.alpha = 0.01, CI = 0.95)

Below, we plot comparisons of the sae and SUMMER results.

par(mfrow = c(1, 2))
range1 <- range(c(sae.bhf$est$eblup$eblup,summer.bhf.unit$median))
     xlab = "sae package",
     ylab = "SUMMER unit-level model package",
     main = "Small area estimates",
range2 <- range(c(sae.bhf$mse$mse, summer.bhf.unit$var))
plot(sae.bhf$mse$mse, summer.bhf.unit$model.est$var,  
     xlab = "sae package mse",
     ylab = "SUMMER unit-level model package model variance",
     main = "Estimates of mse/variance",


Below, we provide an example comparing spatial models from sae and SUMMER using data from the Behavioral Risk Factor Surveillance System (BRFSS).

BRFSS <- subset(BRFSS, !$diab2))
BRFSS <- subset(BRFSS, !$hracode))
mat <- getAmat(KingCounty, KingCounty$HRA2010v2_)
design <- svydesign(ids = ~1, weights = ~rwt_llcp,
                    strata = ~strata, data = BRFSS)
direct <- svyby(~diab2, ~hracode, design, svymean)

Results using sae

Below, we use sae to smooth the logit-transformed direct estimates.

direct$var <- direct$se ^ 2
direct$logit.diab2 <- SUMMER::logit(direct$diab2)
direct$logit.var <- direct$var / (direct$diab2 ^ 2 * (1 - direct$diab2) ^ 2)
SFH.brfss <- sae::mseSFH(logit.diab2 ~ 1, logit.var, mat, data = direct)

results <- data.frame(region = direct$hracode,
                      eblup.SFH = SUMMER::expit(SFH.brfss$est$eblup), 
                      mse = SFH.brfss$mse)

Results using SUMMER

Below, we fit two versions of the spatial area levelmodel in SUMMER. If we change pc.u and pc.alpha from the default value \(u=1,\alpha=0.01\) to \(u=0.1,\alpha=0.01\), we assign more prior mass on smaller variance of the random effects, inducing more smoothing.

summer.brfss <- smoothArea(diab2~1, domain= ~hracode,
                           design = design,
                           responseType = "binary",
                           Amat = mat, CI = 0.95)
summer.brfss.alt <- smoothArea(diab2~1, domain= ~hracode,
                           design = design,
                           responseType = "binary",
                           Amat = mat, CI = 0.95,
                           pc.u = 0.1, pc.alpha = 0.01)

Finally, we use SUMMER::mapPlot to compare median estimates and uncertainty estimates obtained via sae and SUMMER.

toplot <-  summer.brfss$s.dir.sp.est
toplot$logit.var <- toplot$var / 
  (summer.brfss$s.dir.sp.est$median ^ 2 * 
     (1 - summer.brfss$s.dir.sp.est$median) ^ 2)
toplot$median.alt <-  summer.brfss.alt$s.dir.sp.est$median
toplot$logit.var.alt <-  summer.brfss.alt$s.dir.sp.est$var / 
  (summer.brfss.alt$s.dir.sp.est$median ^ 2 *
     (1 - summer.brfss.alt$s.dir.sp.est$median) ^ 2)
toplot$median.sae <- results$eblup.SFH
toplot$mse.sae <- results$mse
variables <- c("median", "median.alt",  "median.sae",
               "logit.var", "logit.var.alt", "mse.sae")
names <- c("Median (default prior)", "Median (new prior)",  "EBLUP (sae)",
           "Variance (default prior)", "Variance (new prior)", "MSE (sae)")
g1 <- mapPlot(data = toplot, geo = KingCounty,
              labels = names[1:3], = "domain",
              by.geo = "HRA2010v2_", size = 0.1) 
g2 <- mapPlot(data = toplot, geo = KingCounty,
              variables=variables[4:6], labels = names[4:6],
     = "domain", by.geo = "HRA2010v2_", size = 0.1)  
g1 / g2

par(mfrow = c(1, 2))
range1 <- range(c(direct$diab2,toplot$median.alt))
     xlab = "direct estimates",
     ylab = "model-based estimates",
     main = "Small area estimates", col = 'red', pch = 16,
points(direct$diab2,toplot$median.sae,  col = 'blue', pch = 16)
points(direct$diab2,toplot$median.alt,  col = 'cyan', pch = 16)
legend('topleft', pch = 16, col = c('red', 'cyan', 'blue'), 
       legend = c("SUMMER",'SUMMER (new prior)', "sae"),bty="n")
range2 <- range(c(direct$logit.var,toplot$mse.sae,toplot$logit.var.alt))
     xlab = "direct estimate var.",
     ylab = "model-based uncertainty",
     main = "Small area estimates", col = 'red', pch = 16, 
points(direct$logit.var,toplot$mse.sae,  col = 'blue', pch = 16)
points(direct$logit.var,toplot$logit.var.alt,  col = 'cyan', pch = 16)
legend('topleft', pch = 16, col = c('red', 'cyan','blue'),
       legend = c("SUMMER var.", 'SUMMER var. (new prior)',"sae mse"),bty="n")

Arora, Vipin, and P. Lahiri. 1997. “On the SUPERIORITY OF THE BAYESIAN METHOD OVER THE BLUP IN SMALL AREA ESTIMATION PROBLEMS.” Statistica Sinica 7: 1053–63.
Battese, George E, Rachel M Harter, and Wayne A Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83: 28–36.
Fay, R. E., and R. A. Herriot. 1979. “Estimates of Income for Small Places: An Application of James–Stein Procedure to Census Data.” Journal of the American Statistical Association 74: 269–77.
Molina, Isabel, and Yolanda Marhuenda. 2015. “Sae: An R Package for Small Area Estimation.” The R Journal 7: 81–98.
You, Yong, and Beatrice Chapman. 2006. “Small Area Estimation Using Area Level Models and Estimated Sampling Variances.” Survey Methodology 32 (1): 97–103.