Optimization of sampling strata with the SamplingStrata package

Giulio Barcaroli

2019-04-21

Introduction

Let us suppose we need to design a sample survey, having a complete frame containing information on the target population (identifiers plus auxiliary information). If our sample design is a stratified one, we need to choose how to form strata in the population, in order to get the maximum advantage by the available auxiliary information. In other words, we have to decide in which way to combine the values of the auxiliary variables (from now on, the ‘X’ variables) in order to determine a new variable, called ‘stratum’. To do so, we have to take into consideration the target variables of our sample survey (from now on, the ‘Y’ variables): if, to form strata, we choose the X variables most correlated to the Y’s, the efficiency of the samples drawn by the resulting stratified frame may be greatly increased. In order to handle the whole auxiliary information in a homogenous way, we have to reduce continuous data to categorical (by mean of a k-means clustering technique, for example). Then, for every set of candidate auxiliary variables X’s, we have to decide (i) what variables to consider as active variables in strata determination, and (ii) for each active variable, what set of values (in general, what aggregation of atomic values) have to be considered. Every combination of values of each active variable determine a particular stratification of the target population, i.e. a possible solution to the problem of ‘best’ stratification. Here, by best stratification, we mean the stratification that ensures the minimum sample cost, sufficient to satisfy a set of precision constraints, set on the accuracy of the estimates of the survey target variables Y’s (constraints expressed as maximum allowable sampling variance on estimates in different domains of interest). When the cost of data collection is uniform over the strata, then the total cost is directly proportional to the overall sample size, and the convenience of a particular stratification can be measured by the associated size of the sample, whose estimates are expected to satisfy given accuracy levels. This minimum size can be determined by applying the Bethel algorithm, with its Chromy variant. In general, the number of possible alternative stratifications for a given population may be very high, depending on the number of variables and on the number of their values, and in these cases it is not possible to enumerate them in order to assess the best one. A very convenient solution to this, is the adoption of the evolutionary approach, consisting in applying a genetic algorithm that may converge towards a near-optimal solution after a finite number of iterations. The methodology is fully described in M. Ballin and Barcaroli (2013), and a complete illustration of the package, together with a comparison with the stratification package, is in Barcaroli (2014). Also a complete application in a case of network data is reported in Marco Ballin and Barcaroli (2016). The implementation of the genetic algorithm is based on a modification of the functions in the genalg package (see Willighagen 2005). In particular, the crossover operator ha been modified on the basis of the indications given by O’Luing, Prestwich, and Tarim (2017).

Procedural steps

The optimization of the sampling design starts by making the sampling frame available, defining the target estimates of the survey and establishing the precision constraints on them. It is then possible to determine the best stratification and the optimal allocation. Finally, we proceed with the selection of the sample. Formalizing, these are the required steps:

In the following, we will illustrate each step starting from a real sampling frame, the one that comes with the R package sampling (the dataframe swissmunicipalities).

Analysis of the frame data and manipulation of auxiliary information

As a first step, we have to define a frame dataframe containing the following information:

By typing the following statements in the R environment:

library(SamplingStrata)
#> Loading required package: memoise
#> Loading required package: doParallel
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
#> Loading required package: pbapply
#> Loading required package: formattable
data(swissmunicipalities)

we get the swissmunicipalities dataframe, that contains 2896 observations (each observation refers to a Swiss municipality). Among the others, there are the following variables (data are referred to 2003):

Let us suppose we want to plan a survey whose target estimates are the totals of population by age class in each Swiss region. In this case, our Y variables will be:

As for the auxiliary variables (X’s), we can use all of those characterising the area use (wood, mountain or pasture, cultivated, industrial, with buildings).

Finally, we want to produce estimates not only for the whole country, but also for each one of the seven different regions.

Function buildFrameDF permits to organize data in a suitable mode for next processing:

swissmunicipalities$id <- c(1:nrow(swissmunicipalities))
swissframe <- buildFrameDF(df = swissmunicipalities,
                           id = "id",
                           X = c("POPTOT",
                                 "Surfacesbois",
                                 "Surfacescult",
                                 "Alp",
                                 "Airbat",
                                 "Airind"),
                           Y = c("Pop020",
                                 "Pop2040",
                                 "Pop4065",
                                 "Pop65P"),
                           domainvalue = "REG")
str(swissframe)
#> 'data.frame':    2896 obs. of  12 variables:
#>  $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ X1         : int  363273 177964 166558 128634 124914 90483 72626 59496 48655 40377 ...
#>  $ X2         : int  2326 67 97 1726 1635 2807 1139 408 976 425 ...
#>  $ X3         : int  967 31 93 1041 714 1827 1222 183 196 694 ...
#>  $ X4         : int  0 0 0 0 0 0 0 0 18 0 ...
#>  $ X5         : int  2884 773 1023 1070 856 972 812 524 463 523 ...
#>  $ X6         : int  260 60 213 212 64 238 134 27 108 137 ...
#>  $ Y1         : int  57324 32429 28161 19399 24291 18942 14337 9533 9127 8128 ...
#>  $ Y2         : int  131422 60074 50349 44263 44202 28958 24309 18843 14825 11265 ...
#>  $ Y3         : int  108178 57063 53734 39397 35421 27696 21334 18177 15140 13301 ...
#>  $ Y4         : int  66349 28398 34314 25575 21000 14887 12646 12943 9563 7683 ...
#>  $ domainvalue: int  4 1 3 2 1 4 5 6 2 2 ...

As the X variables are of the continuous type, first we have to reduce them in a categorical (ordinal) form.

A suitable way to do so, is to apply a k-means clustering method (see Hartigan and Wong 1979) by using the function var.bin:

library(SamplingStrata)
swissframe$X1 <- var.bin(swissmunicipalities$POPTOT, bins=18)
swissframe$X2 <- var.bin(swissmunicipalities$Surfacesbois, bins=3)
swissframe$X3 <- var.bin(swissmunicipalities$Surfacescult, bins=3)
swissframe$X4 <- var.bin(swissmunicipalities$Alp, bins=3)
swissframe$X5 <- var.bin(swissmunicipalities$Airbat, bins=3)
swissframe$X6 <- var.bin(swissmunicipalities$Airind, bins=3)

Now, we have six different auxiliary variables of the categorical type, the first with 18 different modalities, the others with 3 modalities.

In any case, this dataframe comes with the package SamplingStrata: it can be made available by executing:

library(SamplingStrata)
data(swissframe)
head(swissframe)
#>   progr REG X1 X2 X3 X4 X5 X6         id    Y1     Y2     Y3    Y4
#> 1     1   4 18  3  2  1  3  3     Zurich 57324 131422 108178 66349
#> 2     2   1 17  1  1  1  3  2     Geneve 32429  60074  57063 28398
#> 3     3   3 17  1  1  1  3  3      Basel 28161  50349  53734 34314
#> 4     4   2 17  2  3  1  3  3       Bern 19399  44263  39397 25575
#> 5     5   1 17  2  2  1  3  2   Lausanne 24291  44202  35421 21000
#> 6     6   4 16  3  3  1  3  3 Winterthur 18942  28958  27696 14887
#>   domainvalue
#> 1           4
#> 2           1
#> 3           3
#> 4           2
#> 5           1
#> 6           4

We could also not indicate substantive X variables, if we want that each unit in the sampling frame be considered as an atomic stratum, and let to the optimization step to aggregate them on the basis of the values of the Y variable. In any case, as we have to indicate at least one X variable, we can use to this purpose the unique identifier in the frame:

swissmunicipalities$id <- c(1:nrow(swissframe))
newframe <- buildFrameDF(df = swissmunicipalities,
                         id = "id",
                         X = "id",
                         Y = c("Pop020",
                               "Pop2040",
                               "Pop4065",
                               "Pop65P"),
                         domainvalue = "REG")
str(newframe)
#> 'data.frame':    2896 obs. of  7 variables:
#>  $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ X1         : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Y1         : int  57324 32429 28161 19399 24291 18942 14337 9533 9127 8128 ...
#>  $ Y2         : int  131422 60074 50349 44263 44202 28958 24309 18843 14825 11265 ...
#>  $ Y3         : int  108178 57063 53734 39397 35421 27696 21334 18177 15140 13301 ...
#>  $ Y4         : int  66349 28398 34314 25575 21000 14887 12646 12943 9563 7683 ...
#>  $ domainvalue: int  4 1 3 2 1 4 5 6 2 2 ...

Choice of the precision constraints for each target estimate

The errors dataframe contains the accuracy constraints that are set on target estimates. This means to define a maximum coefficient of variation for each variable and for each domain value. Each row of this frame is related to accuracy constraints in a particular subdomain of interest, identified by the domainvalue value. In the case of the Swiss municipalities, we have chosen to define the following constraints:

data(swisserrors)
swisserrors
#>    DOM  CV1  CV2  CV3  CV4 domainvalue
#> 1 DOM1 0.08 0.12 0.08 0.12           1
#> 2 DOM1 0.08 0.12 0.08 0.12           2
#> 3 DOM1 0.08 0.12 0.08 0.12           3
#> 4 DOM1 0.08 0.12 0.08 0.12           4
#> 5 DOM1 0.08 0.12 0.08 0.12           5
#> 6 DOM1 0.08 0.12 0.08 0.12           6
#> 7 DOM1 0.08 0.12 0.08 0.12           7

This example reports accuracy constraints on variables Y1, Y2, Y3 and Y4 that are the same for all the 7 different subdomains (Swiss regions) of domain level DOM1. Of course we can differentiate the precision constraints region by region. It is important to underline that the values of ‘domainvalue’ are the same than those in the frame dataframe, and correspond to the values of variable ‘DOM1’ in the strata dataframe. Once having defined dataframes containing frame data, strata information and precision constraints, it is worth while to check their internal and reciprocal coherence. It is possible to do that by using the function checkInput:

checkInput(errors = checkInput(errors = swisserrors, 
                               strata = swissstrata, 
                               sampframe = swissframe))
#> 
#> Input data have been checked and are compliant with requirements
#> 
#> No input data indicated

For instance, this function controls that the number of auxiliary variables is the same in the frame and in the strata dataframes; that the number of target variables indicated in the frame dataframe is the same than the number of means and standard deviations in the strata dataframe, and the same than the number of coefficient of variations indicated in the errors dataframe.

If we try to determine the total size of the sample required to satisfy these precision constraints, considering the current stratification of the frame (the 641 atomic strata), we can do it by simply using the function bethel. This function requires a slightly different specification of the constraints dataframe:

cv <- swisserrors[1,]
cv
#>    DOM  CV1  CV2  CV3  CV4 domainvalue
#> 1 DOM1 0.08 0.12 0.08 0.12           1

because the bethel function does not permit to differentiate precision constraints by subdomain. In any case, the result of the application of the Bethel algorithm [see bethel:1989] is:

allocation <- bethel(swissstrata,cv)
sum(allocation)
#> [1] 893

That is, the required amount of units to be selected, with no optimization of sampling strata. In general, after the optimization, this number is sensibly reduced.

Optimization of frame stratification

Once the strata and the constraints dataframes have been prepared, it is possible to apply the function that optimises the stratification of the frame, that is optimizeStrata. This function operates on all subdomains, identifying the best solution for each one of them. The fundamental parameters to be passed to optimizeStrata are:

In the case of the Swiss municipalities, we can use almost all of default values for parameters with the exception of the errors and strata dataframes, and for the option ‘writeFiles’:

solution1 <- optimizeStrata(
    errors = swisserrors, 
    strata = swissstrata,
    parallel = TRUE,
    writeFiles = FALSE,
    showPlot = FALSE)

Note that by so doing the initialStrata parameter is set equal to the number of atomic strata in each domain . Another possibility is to set a pre-determined value for each domain, for instance equal in each domain, as c(5,5,5,5,5,5,5,5).

The execution of optimizeStrata produces the solution of 7 different optimization problems, one for each domain.

The execution of optimizeStrata produces the solution of 7 different optimization problems, one for each domain.

The graphs (we report here the one related to domain 3) illustrate the convergence of the solution to the final one starting from the initial one (i.e. the one related to the atomic strata). Along the x-axis are reported the executed iterations, from 1 to the maximum, while on the y-axis are reported the size of the sample required to satisfy precision constraints. The upper (red) line represent the average sample size for each iteration, while the lower (black) line represents the best solution found until the i-th iteration.

The results of the execution are contained in the list ‘solution’, composed by two elements:

  1. solution1$indices: the vector of the indices that indicates to what aggregated stratum each atomic stratum belongs;
  2. solution1$aggr_strata: the dataframe containing information on the optimal aggregated strata.

We can calculate the expected CV’s by executing the function:

expected_CV(solution1$aggr_strata)
#>      cv(Y1) cv(Y2) cv(Y3) cv(Y4)
#> DOM1  0.077  0.077  0.073  0.075
#> DOM2  0.070  0.074  0.076  0.086
#> DOM3  0.078  0.079  0.078  0.084
#> DOM4  0.078  0.075  0.077  0.072
#> DOM5  0.078  0.078  0.078  0.078
#> DOM6  0.080  0.079  0.080  0.073
#> DOM7  0.074  0.078  0.072  0.085

and compare them to the set of precision constraints in order to verify the compliance:

swisserrors
#>    DOM  CV1  CV2  CV3  CV4 domainvalue
#> 1 DOM1 0.08 0.12 0.08 0.12           1
#> 2 DOM1 0.08 0.12 0.08 0.12           2
#> 3 DOM1 0.08 0.12 0.08 0.12           3
#> 4 DOM1 0.08 0.12 0.08 0.12           4
#> 5 DOM1 0.08 0.12 0.08 0.12           5
#> 6 DOM1 0.08 0.12 0.08 0.12           6
#> 7 DOM1 0.08 0.12 0.08 0.12           7

Initial solution with kmeans clustering of atomic strata

In order to speed up the convergence towards the optimal solution, an initial one can be given as a “suggestion” to ‘optimizeStrata’ function. The function KmeansSolution produces this initial solution by clustering atomic strata considering the values of the means of all the target variables Y.

Also, the optimal number of clusters is determined inside each domain. If the default value for nstrata is used, then the number of aggregate strata is optimized by varying the number of cluster from 2 to number of atomic strata in each domain, divided by 2. Otherwise, it is possible to indicate a fixed number of aggregate strata to be obtained.

Other parameters are:

For any given number of clusters, the correspondent aggregation of atomic strata is considered as input to the function ‘bethel’. The number of clusters for which the value of the sample size necessary to fulfil precision constraints is the minimum one, is retained as the optimal one.

The overall solution is obtained by concatenating optimal clusters obtained in domains. The result is a dataframe with two columns: the first indicates the clusters, the second the domains:

solutionKmeans1 <- KmeansSolution(swissstrata,
                                   swisserrors,
                                   nstrata=NA,
                                   minnumstrat=2,
                                   maxclusters=NA,
                                   showPlot=FALSE)
#> 
#> -----------------
#>  Kmeans solution 
#> -----------------
#>  *** Domain:  1  ***
#>  Number of strata:  9
#>  Sample size     :  18
#>  *** Domain:  2  ***
#>  Number of strata:  8
#>  Sample size     :  18
#>  *** Domain:  3  ***
#>  Number of strata:  8
#>  Sample size     :  15
#>  *** Domain:  4  ***
#>  Number of strata:  6
#>  Sample size     :  10
#>  *** Domain:  5  ***
#>  Number of strata:  7
#>  Sample size     :  13
#>  *** Domain:  6  ***
#>  Number of strata:  7
#>  Sample size     :  13
#>  *** Domain:  7  ***
#>  Number of strata:  8
#>  Sample size     :  16
head(solutionKmeans1)
#>   suggestions domainvalue
#> 1           8           1
#> 2           8           1
#> 3           8           1
#> 4           8           1
#> 5           8           1
#> 6           8           1

This solution can be given as argument to the parameter suggestion in the optimizeStrata function:

solution_with_kmeans <- optimizeStrata(
    errors = swisserrors,
    strata = swissstrata,
    suggestions = solutionKmeans1,
    parallel = TRUE,
    writeFiles = TRUE,
    showPlot = FALSE)
sum(ceiling(solution_with_kmeans$aggr_strata$SOLUZ))
#> [1] 101

thus obtaining a much more convenient solution than the one without the kmeans suggestion, with the same number of iterations.

Adjustment of the final sampling size

After the optimization step, the final sample size is the result of the allocation of units in final strata. This allocation is such that the precision constraints are expected to be satisfied. Actually, three possible situations may occur:

In the first case, no action is required. In the second case, it is necessary to reduce the number of units, by equally applying the same reduction rate in each stratum. In the third case, we could either to set more tight precision constraints, or proceed to increase the sample size by applying the same increase rate in each stratum. This increase/reduction process is iterative, as by applying the same rate we could find that in some strata there are not enough units to increase or to reduce. The function adjustSize permits to obtain the desired final sample size. Let us suppose that the obtained sample size is not affordable. We can reduce it by executing the following code:

adjustedStrata <- adjustSize(size=280,strata=solution1$aggr_strata,cens=NULL)
#> 
#>  298
#>  298
#>  Final adjusted size:  298
sum(adjustedStrata$SOLUZ)
#> [1] 298

Instead, if we want to increase the size because the budget allows to do this, then this is the code:

adjustedStrata <- adjustSize(size=400,strata=solution1$aggr_strata,cens=NULL)
#> 
#>  377
#>  385
#>  386
#>  387
#>  388
#>  388
#>  Final adjusted size:  388
sum(adjustedStrata$SOLUZ)
#> [1] 388

The difference between the desired sample size and the actual adjusted size depends on the number of strata in the optimized solution. Consider that the adjustment is performed in each stratum by taking into account the relative difference between the current sample size and the desired one: this produces an allocation that is expressed by a real number, that must be rounded, while taking into account the requirement of the minimum number of units in the strata. The higher the number of strata, the higher the impact on the final adjusted sample size.

Analysis of results

This function has two purposes:

  1. instrumental to the processing of the sampling frame (attribution of the labels of the optimized strata to the population units);
  2. analysis of the aggregation of the atomic strata obtained in the optimized solution.

The function updateStrata assigns the labels of the new strata to the initial one in the dataframe strata, and produces:

The function is invoked in this way:

newstrata <- updateStrata(swissstrata, 
                          solution1, 
                          writeFiles = TRUE)

Now, the atomic strata are associated to the aggregate strata defined in the optimal solution, by means of the variable LABEL. If we want to analyse in detail the new structure of the stratification, we can look at the strata_aggregation.txt file:

strata_aggregation <- read.delim("strata_aggregation.txt")
head(strata_aggregation)
#>   DOM1 AGGR_STRATUM X1 X2 X3 X4 X5 X6
#> 1    1            1  1  1  1  1  1  1
#> 2    1            1  1  1  1  1  1  2
#> 3    1            1  1  3  1  2  1  1
#> 4    1            1  3  1  1  1  1  2
#> 5    1            2  1  1  1  2  1  1
#> 6    1            2  3  2  1  1  1  1

In this structure, for each aggregate stratum the values of the X’s variables in each contributing atomic stratum are reported. It is then possible to understand the meaning of each aggregate stratum produced by the optimization.

Updating the frame and selecting the sample

Once the optimal stratification has been obtained, to be operational we need to accomplish the following two steps:

As for the first, we execute the following command:

framenew <- updateFrame(swissframe, newstrata, writeFiles=FALSE)

The function updateFrame receives as arguments the indication of the dataframe in which the frame information is memorised, and of the dataframe produced by the execution of the updateStrata function. The execution of this function produces a dataframe framenew, and also a file (named framenew.txt) with the labels of the new strata produced by the optimisation step. The allocation of units is contained in the SOLUZ column of the dataset outstrata.txt. At this point it is possible to select the sample from the new version of the frame:

sample <- selectSample(framenew, solution1$aggr_strata, writeFiles=FALSE)
#> 
#> *** Sample has been drawn successfully ***
#>  332  units have been selected from  100  strata
#> 
#> ==> There have been  20  take-all strata 
#> from which have been selected  58 units

that produces two .csv files:

The selectSample operates by drawing a simple random sampling in each stratum.

A variant of this function is selectSampleSystematic. The only difference is in the method used for selecting units in each strata, that is by executing the following steps:

  1. a selection interval is determined by considering the inverse of the sampling rate in the stratum;
  2. a starting point is determined by selecting a value in this interval;
  3. the selection proceeds by selecting as first unit the one corresponding to the above value, and then selecting all the units individuated by adding the selection interval.

This selection method can be useful if associated to a particular ordering of the selection frame, where the ordering variable(s) can be considered as additional stratum variable(s). For instance, we could decide that it could be important to consider the overall population in municipalities when selecting units. Here is the code:

# adding POPTOT to framenew
data("swissmunicipalities")
framenew <- merge(framenew,swissmunicipalities[,c("REG","Nom","POPTOT")],
                  by.x=c("REG","ID"),by.y=c("REG","Nom"))
# selection of sample with systematic method
sample <- selectSampleSystematic(frame=framenew,
                                 outstrata=solution1$aggr_strata,
                                 sortvariable = c("POPTOT"))
#> 
#> *** Sample has been drawn successfully ***
#>  332  units have been selected from  100  strata
#> 
#> ==> There have been  20  take-all strata 
#> from which have been selected  58 units
head(sample,3)
#>   DOMAINVALUE STRATO REG          ID      STRATUM PROGR X1 X2 X3 X4 X5 X6
#> 1           1      1   1 Constantine  1*1*1*1*1*1  2302  1  1  1  1  1  1
#> 2           1      1   1       Syens  1*1*1*1*1*1  2698  1  1  1  1  1  1
#> 3           1     10   1     Monthey 12*2*1*1*2*3    75 12  2  1  1  2  3
#>     Y1   Y2   Y3   Y4 LABEL POPTOT WEIGHTS        FPC
#> 1   70   75   87   48     1    280    93.5 0.01069519
#> 2   24   34   44   15     1    117    93.5 0.01069519
#> 3 3359 4170 4509 1895    10  13933     1.0 1.00000000

Evaluation of the found solution

In order to be confident about the quality of the found solution, the function evalSolution allows to run a simulation, based on the selection of a desired number of samples from the frame to which the stratification, identified as the best, has been applied. The user can invoke this function also indicating the number of samples to be drawn:

eval <- evalSolution(framenew, 
                     solution1$aggr_strata, 
                     nsampl=100, 
                     writeFiles=FALSE,
                     progress=FALSE) 

For each drawn sample, the estimates related to the Y’s are calculated. Their mean and standard deviation are also computed, in order to produce the CV related to each variable in every domain. These CV’s can be inspected and compared to the constraints:

eval$coeff_var
domain cv(Y1) cv(Y2) cv(Y3) cv(Y4)
1 0.0800 0.0778 0.0747 0.0755
2 0.0671 0.0733 0.0721 0.0794
3 0.0764 0.0824 0.0749 0.0759
4 0.0824 0.0756 0.0809 0.0746
5 0.0846 0.0873 0.0894 0.0836
6 0.0734 0.0730 0.0719 0.0679
7 0.0843 0.0938 0.0904 0.1101
swisserrors
#>    DOM  CV1  CV2  CV3  CV4 domainvalue
#> 1 DOM1 0.08 0.12 0.08 0.12           1
#> 2 DOM1 0.08 0.12 0.08 0.12           2
#> 3 DOM1 0.08 0.12 0.08 0.12           3
#> 4 DOM1 0.08 0.12 0.08 0.12           4
#> 5 DOM1 0.08 0.12 0.08 0.12           5
#> 6 DOM1 0.08 0.12 0.08 0.12           6
#> 7 DOM1 0.08 0.12 0.08 0.12           7

These values are on average compliant with the precision constraints set.

We can also inspect the relative bias:

eval$rel_bias
domain bias(Y1) bias(Y2) bias(Y3) bias(Y4)
1 -0.0102 -0.0127 -0.0124 -0.0114
2 -0.0012 -0.0010 -0.0027 -0.0040
3 -0.0085 -0.0116 -0.0071 -0.0049
4 -0.0041 -0.0044 -0.0046 -0.0025
5 0.0005 -0.0009 -0.0008 -0.0003
6 -0.0110 -0.0096 -0.0081 -0.0089
7 -0.0059 -0.0058 -0.0055 -0.0007

It is also possible to analyse the sampling distribution of the estimates for each variable of interest in a selected domain:

dom = 1
hist(eval$est$Y1[eval$est$dom == dom], col = "grey", border = "white",
     xlab = expression(hat(Y)[1]),
     freq = FALSE,
     main = paste("Variable Y1 Domain ",dom,sep=""))
abline(v = mean(eval$est$Y1[eval$est$dom == dom]), col = "blue", lwd = 2)
abline(v = mean(swissframe$Y1[swissframe$domainvalue==dom]), col = "red")
legend("topright", c("distribution mean", "true value"),
       lty = 1, col = c("blue", "red"), box.col = NA, cex = 0.8)

Handling ‘take-all’ strata in the optimization step

As input to the optimization step, together with proper sampling strata, it is also possible to provide take-all strata. These strata will not be subject to optimisation as the proper strata, but they will contribute to the determination of the best stratification, as their presence in a given domain will permit to satisfy precision constraint with a lower number of units belonging to proper sampling strata.

In order to correctly execute the optimization and further steps, it is necessary to perform a pre-processing of the overall input. The first step to be executed consists in the bi-partition of units to be censused and of units to be sampled, in order to build two different frames. As an example, we take the units with the highest values of the auxiliary variables as the ones to be selected in any case:

data(swisserrors)
data(swissstrata)
data(swissframe)
#----Selection of units to be censused from the frame
ind_framecens <- which(swissframe$X1 > 12 |
                       swissframe$X2 > 2 | 
                       swissframe$X3 > 2 |
                       swissframe$X4 > 2 |
                       swissframe$X5 > 2 | 
                       swissframe$X6 > 2 )
framecens <- swissframe[ind_framecens,]
nrow(framecens)
#> [1] 302
#----Selection of units to be sampled from the frame
# (complement to the previous)
framesamp <- swissframe[-ind_framecens,]
nrow(framesamp)
#> [1] 2594

In this way, we have included all units to be surely selected in ‘framecens’, and the remaining in ‘framesamp’. At the end of the process, the sample will be selected only from ‘framesamp’, while the units in ‘framecens’ will be simply added to the sample.

We can obtain census strata and sampling strata by applying buildStrataDF respectively to framecens and framesamp:

# Build strata to be censused and sampled
cens <- buildStrataDF(framecens,progress = FALSE)
#> 
#> Computations are being done on population data
#> 
#> Number of strata:  231
#> ... of which with only one unit:  187
sum(cens$N)
#> [1] 302
strata <- buildStrataDF(framesamp,progress = FALSE)
#> 
#> Computations are being done on population data
#> 
#> Number of strata:  410
#> ... of which with only one unit:  202
sum(strata$N)
#> [1] 2594

and

sum(cens$N)
#> [1] 302
sum(strata$N)
#> [1] 2594

Now we have all required inputs to run ‘optimizeStrata’ in presence of the ‘take-all’ strata:

solution2 <- optimizeStrata(
    errors = swisserrors, 
    strata = strata, 
    cens = cens, 
    strcens = TRUE,
    parallel = TRUE,
    writeFiles = TRUE,
    showPlot = FALSE
)

Once the optimized solution has been produced, the next steps are executed by considering only the sampling part of the frame:

newstrata <- updateStrata(strata, solution2)
# updating sampling frame with new strata labels
framenew <- updateFrame(frame=framesamp,newstrata=newstrata)
# selection of sample from sampling strata
sample <- selectSample(frame=framenew,outstrata=solution2$aggr_strata)
#> 
#> *** Sample has been drawn successfully ***
#>  182  units have been selected from  65  strata
#> 
#> ==> There have been  6  take-all strata 
#> from which have been selected  7 units

Finally, the units in the ‘take-all’ strata can be added to sampled ones. First, the census frame needs to be made homogeneous to the sample frame in order to permit the ‘rbind’ step:

# addition of necessary variables to 
colnames(framesamp) <- toupper(colnames(framesamp))
colnames(framecens) <- toupper(colnames(framecens))
framecens$WEIGHTS <- rep(1,nrow(framecens))
framecens$FPC <- rep(1,nrow(framecens))
framecens$LABEL <- rep("999999",nrow(framecens))
framecens$STRATUM <- rep("999999",nrow(framecens))
framecens$STRATO <- rep("999999",nrow(framecens))

The overall set of units to be surveyed is obtainable in this way:

survey <- rbind(sample,framecens)

and this is the proportion of sampling and censused units:

survey$cens <- ifelse(survey$LABEL == "999999",1,0)
table(survey$cens)
#> 
#>   0   1 
#> 182 302

In order to verify compliance to the precision constraint, we perform the following:

cens2 <- cens[,-c(14:19)]
cens2$SOLUZ <- cens2$N
stratatot <- rbind(solution2$aggr_strata,cens2)
expected_CV(stratatot)
#>      cv(Y1) cv(Y2) cv(Y3) cv(Y4)
#> DOM1  0.078  0.073  0.076  0.080
#> DOM2  0.075  0.074  0.076  0.079
#> DOM3  0.078  0.078  0.076  0.080
#> DOM4  0.080  0.074  0.080  0.087
#> DOM5  0.075  0.079  0.077  0.075
#> DOM6  0.073  0.078  0.080  0.084
#> DOM7  0.078  0.078  0.075  0.085

Handling Anticipated Variance

In the previous sections it has been assumed that, when optimizing the stratification of a sampling frame, values of the target variables Y’s are available for the generality of the units in the frame, or at least for a sample of them by means of which it is possible to estimate means and standard deviation of Y’s in atomic strata. Of course, this assumption is seldom expected to hold. The situation in which some proxy variables are available in the frame is much more likely to happen. In these situations, instead of directly indicating the real target variables, proxy ones are named as Y’s. By so doing, there is no guarantee that the final stratification and allocation can ensure the compliance to the set of precision constraints.
In order to take into account this problem, and to limit the risk of overestimating the expected precision levels of the optimized solution, it is possible to carry out the optimization by considering, instead of the expected coefficients of variation related to proxy variables, the anticipated coefficients of variation (ACV) that depend on the model that is possile to fit on couples of real target variables and proxy ones. In the current implementation, only models linking continuous variables can be considered. The definition and the use of these models is the same that has been implemented in the package stratification (see Baillargeon and Rivest 2014). In particular, the reference here is to two different models, the linear model with heteroscedasticity:

\[Y=beta\times X + epsilon\]

where

\[epsilon \sim N(0,sig2 X^{gamma})\]

(in case gamma = 0, then the model is homoscedastic)

and the loglinear model:

\[Y= \exp (beta \times log(X) + epsilon)\]

where

\[epsilon \sim N(0,sig2)\]

In order to make evident the importance of the above, consider the following example, based on the dataset nations available in the package.

data(nations)
head(nations)
#>          Country  TFR contraception infant.mortality   GDP  region
#> 1    Afghanistan 6.90            63              154  2848    Asia
#> 2        Albania 2.60            47               32   863  Europe
#> 3        Algeria 3.81            52               44  1531  Africa
#> 4 American-Samoa 1.35            71               11  2433 Oceania
#> 5        Andorra 1.61            71                7 19121  Europe
#> 6         Angola 6.69            19              124   355  Africa
#>   Continent
#> 1         2
#> 2         1
#> 3         4
#> 4         5
#> 5         1
#> 6         4

Let us assume that in the sampling frame only variable GDP (Gross Domestic Product) is available for all countries, while contraception rates and infant mortality rates are available only on a subset of countries (about one third).

set.seed(1234)
nations_sample <- nations[sample(c(1:207),70),]

In this subset we can fit models between GDP and the two variables that we assume are the target of our survey.

One model for infant mortality and GDP:

mod_logGDP_INFMORT <- lm(log(nations_sample$infant.mortality) ~ log(nations_sample$GDP))
summary(mod_logGDP_INFMORT)
#> 
#> Call:
#> lm(formula = log(nations_sample$infant.mortality) ~ log(nations_sample$GDP))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.1292 -0.3765 -0.1455  0.3316  2.6345 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value             Pr(>|t|)
#> (Intercept)              6.86295    0.33620   20.41 < 0.0000000000000002
#> log(nations_sample$GDP) -0.46580    0.04389  -10.61 0.000000000000000452
#>                            
#> (Intercept)             ***
#> log(nations_sample$GDP) ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.6158 on 68 degrees of freedom
#> Multiple R-squared:  0.6236, Adjusted R-squared:  0.6181 
#> F-statistic: 112.7 on 1 and 68 DF,  p-value: 0.0000000000000004523

and one model for contraception and GDP:

mod_logGDP_CONTRA <- lm(log(nations_sample$contraception) ~ log(nations_sample$GDP))
summary(mod_logGDP_CONTRA)
#> 
#> Call:
#> lm(formula = log(nations_sample$contraception) ~ log(nations_sample$GDP))
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.96139 -0.27360 -0.01435  0.45058  1.25143 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value         Pr(>|t|)    
#> (Intercept)              0.98318    0.30538   3.220          0.00197 ** 
#> log(nations_sample$GDP)  0.34649    0.03986   8.692 0.00000000000122 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5593 on 68 degrees of freedom
#> Multiple R-squared:  0.5263, Adjusted R-squared:  0.5193 
#> F-statistic: 75.55 on 1 and 68 DF,  p-value: 0.000000000001217

We define the sampling frame in this way:

nations$progr <- c(1:nrow(nations))
nations$dom <- 1
frame <- buildFrameDF(nations,
                      id="Country",
                      X="progr",
                      Y=c("GDP","GDP"),
                      domainvalue = "dom")

that is, we replicate twice the variable GDP because it will be used once for infant mortality and once for contraception.

We set 10% and 5% precision constraints on these two variables:

cv <- as.data.frame(list(DOM=rep("DOM1",1),
                         CV1=rep(0.10,1),
                         CV2=rep(0.05,1),
                         domainvalue=c(1:1)
                    ))
cv
#>    DOM CV1  CV2 domainvalue
#> 1 DOM1 0.1 0.05           1

We build the strata without any assumption on the variability of the two target variables, and proceed in the optimization:

strata1 <- buildStrataDF(frame, progress = FALSE)
#> 
#> Computations are being done on population data
#> 
#> Number of strata:  207
#> ... of which with only one unit:  207
solution3 <- optimizeStrata(cv,
                           strata1,
                           iter = 50,
                           pops = 20,
                           parallel = FALSE,
                           suggestions = KmeansSolution(strata1,cv),
                           writeFiles = FALSE,
                           showPlot = FALSE)

#> 
#> -----------------
#>  Kmeans solution 
#> -----------------
#>  *** Domain:  1  ***
#>  Number of strata:  7
#>  Sample size     :  17
#>  *** Domain :  1   1
#>  Number of strata :  207
#> ---------------------------------------------
#> Optimal stratification with Genetic Algorithm
#> ---------------------------------------------
#>  *** Parameters ***
#> ---------------------------
#> Domain:  1
#> Maximum number of strata:  207
#> Minimum number of units per stratum:  2
#> Take-all strata (TRUE/FALSE):  FALSE
#> number of sampling strata :  207
#> Number of target variables:  2
#> Number of domains:  1
#> Number of GA iterations:  50
#> Dimension of GA population:  20
#> Mutation chance in GA generation:  NA
#> Elitism rate in GA generation:  0.2
#> Chance to add strata to maximum:  0
#> Allocation with real numbers instead of integers:  TRUE
#> Suggestion:  5 3 3 6 6 7 5 6 7 4 2 7 2 2 2 7 6 6 5 7 6 1 7 6 1 6 5 5 7 6 2 6 7 6 1 5 7 6 2 1 3 3 6 6 6 1 2 2 4 6 4 5 7 7 5 7 6 5 1 1 1 1 7 5 6 4 5 4 7 3 6 4 6 4 6 2 5 1 7 6 6 5 1 6 5 6 7 7 2 2 2 5 6 6 5 6 7 5 7 5 5 6 3 6 6 3 3 2 4 6 7 6 6 5 6 5 5 6 7 6 6 6 6 6 5 5 3 7 6 6 7 6 6 3 7 6 6 4 6 7 6 5 7 4 5 5 7 1 5 6 2 6 3 5 7 3 7 7 7 6 6 5 6 6 5 3 2 3 3 4 6 7 6 2 4 6 1 5 1 3 7 6 6 6 6 4 6 3 5 2 6 7 7 1 3 3 6 3 6 7 2 7 7 6 6 6 1
#>  *** Sample cost:  16.68519
#>  *** Number of strata:  7

#> 
#>  *** Sample size :  17
#>  *** Number of strata :  7
#> ---------------------------
sum(solution3$aggr_strata$SOLUZ)
#> [1] 16.68519

Then, we evaluate the expected CV’s on the three variables:

newstrata <- updateStrata(strata1,solution3)
framenew1 <- updateFrame(frame,newstrata)
framenew1 <- framenew1[order(framenew1$ID),]
framenew1$Y2 <- nations$infant.mortality
framenew1$Y3 <- nations$contraception
results1 <- evalSolution(framenew1, solution3$aggr_strata, 50, progress = FALSE)
results1$coeff_var
domain cv(Y1) cv(Y2) cv(Y3)
1 0.0463 0.2289 0.1123

Clearly, the CV’s on infant mortality and contraception are not compliant with the corresponding precision constraints.

We now proceed in building the strata dataframe using the models:

model <- NULL
model$beta[1] <- mod_logGDP_INFMORT$coefficients[2]
model$sig2[1] <- summary(mod_logGDP_INFMORT)$sigma
model$type[1] <- "loglinear"
model$gamma[1] <- 0
model$beta[2] <- mod_logGDP_CONTRA$coefficients[2]
model$sig2[2] <- summary(mod_logGDP_CONTRA)$sigma
model$type[2] <- "loglinear"
model$gamma[2] <- 0
model <- as.data.frame(model)
model
#>         beta      sig2      type gamma
#> 1 -0.4658038 0.6157600 loglinear     0
#> 2  0.3464857 0.5593031 loglinear     0
strata2 <- buildStrataDF(frame, model = model, progress = FALSE)
#> 
#> Computations are being done on population data
#> 
#> Number of strata:  207
#> ... of which with only one unit:  207
head(strata2)
#>     STRATO N          M1        M2          S1        S2 COST CENS DOM1
#> 1        1 1 0.024595860 15.737965 0.022690436 13.624506    1    0    1
#> 10      10 1 0.009910578 30.945032 0.009142812 26.789408    1    0    1
#> 100    100 1 0.011086660 28.468488 0.010227784 24.645441    1    0    1
#> 101    101 1 0.067027720  7.465936 0.061835128  6.463332    1    0    1
#> 102    102 1 0.064539738  7.678981 0.059539888  6.647767    1    0    1
#> 103    103 1 0.030744575 13.331076 0.028362814 11.540840    1    0    1
#>      X1
#> 1     1
#> 10   10
#> 100 100
#> 101 101
#> 102 102
#> 103 103

We proceed with the optimization

strata2 <- buildStrataDF(frame, model = model, progress = FALSE)
#> 
#> Computations are being done on population data
#> 
#> Number of strata:  207
#> ... of which with only one unit:  207
solution4 <-
  optimizeStrata(
    errors = cv , 
    strata = strata2, 
    iter = 50, 
    pops = 20, 
    parallel = FALSE,
    suggestions = KmeansSolution(strata2,cv),
    showPlot = FALSE,
    writeFiles = FALSE)

#> 
#> -----------------
#>  Kmeans solution 
#> -----------------
#>  *** Domain:  1  ***
#>  Number of strata:  11
#>  Sample size     :  113
#>  *** Domain :  1   1
#>  Number of strata :  207
#> ---------------------------------------------
#> Optimal stratification with Genetic Algorithm
#> ---------------------------------------------
#>  *** Parameters ***
#> ---------------------------
#> Domain:  1
#> Maximum number of strata:  207
#> Minimum number of units per stratum:  2
#> Take-all strata (TRUE/FALSE):  FALSE
#> number of sampling strata :  207
#> Number of target variables:  2
#> Number of domains:  1
#> Number of GA iterations:  50
#> Dimension of GA population:  20
#> Mutation chance in GA generation:  NA
#> Elitism rate in GA generation:  0.2
#> Chance to add strata to maximum:  0
#> Allocation with real numbers instead of integers:  TRUE
#> Suggestion:  7 8 8 2 2 6 7 10 9 4 11 6 3 11 3 6 5 5 4 9 2 1 6 2 1 2 7 7 6 2 3 2 9 5 1 7 6 5 3 1 8 8 2 5 10 1 3 11 4 10 4 7 9 6 7 9 2 7 1 1 8 1 6 7 2 4 7 4 9 8 5 4 10 4 2 3 7 1 9 10 5 7 8 10 7 5 9 6 3 3 11 7 5 5 7 2 6 4 6 7 7 2 8 2 10 8 8 3 4 2 9 9 2 7 2 7 7 10 6 2 10 5 9 2 7 4 8 6 5 5 6 5 10 8 9 2 5 4 10 7 2 7 9 4 7 4 6 1 4 5 11 9 8 7 6 8 6 9 6 2 5 7 5 2 7 3 3 3 8 4 2 6 2 3 4 2 1 7 1 8 6 2 5 10 2 1 10 3 4 3 2 9 6 1 8 8 2 8 10 6 11 9 9 2 10 2 1
#>  *** Sample cost:  108.8512
#>  *** Number of strata:  11

#> 
#>  *** Sample size :  109
#>  *** Number of strata :  11
#> ---------------------------

This time the sample size is much higher.

What about the expected CV’s?

newstrata <- updateStrata(strata2,solution4)
framenew2 <- updateFrame(frame,newstrata)
framenew2 <- framenew2[order(framenew2$ID),]
framenew2$Y2 <- nations$infant.mortality
framenew2$Y3 <- nations$contraception
results2 <- evalSolution(framenew2, solution4$aggr_strata, 50, progress = FALSE)
results2$coeff_var
domain cv(Y1) cv(Y2) cv(Y3)
1 0.0062 0.0458 0.0277

This time the expected CV’s of all variables are more than compliant with the precision constraints.

Optimization variant with the optimizeStrata2 function

Function optimizeStrata2 performs the same task than optimizeStrata, but with a different Genetic Algorithm, operating on a genome represented by vector containing real values, instead of integers. This permits to operate directly on the boundaries of the strata, instead of aggregating the initial atomic strata. In some situations (not exceedingly too big size of the sampling frame) this new function is much more efficient. Furthermore, resulting strata are guaranteed to not overlap with respect to the different stratification variables. A major limitation is in the nature of the stratification variables, that are required to be all continuous (though categorical ordinal can be handled). Another one is in the fact that it is necessary to choose the number of strata (it is not optimally determined during the evolution process as in the case of optimizeStrata).

To operate with optimizeStrata2 it is no more necessary to produce the strata dataframe.

Let us consider the following example:

data("swissmunicipalities")
swissmunicipalities$id <- c(1:nrow(swissmunicipalities))
swissmunicipalities$dom <- 1
frame <- buildFrameDF(swissmunicipalities,
                      id = "id",
                      domainvalue = "REG",
                      X = c("Surfacesbois","Surfacescult"),
                      Y = c("Pop020", "Pop2040")
)
# choice of units to be selected in any case (census units)
framecens <- frame[frame$X1 > 2500 
                   | frame$X2 > 1200,]
# remaining units 
framesamp <- frame[!(frame$id %in% framecens$id),]
# precision constraints
errors <- NULL
errors$DOM <- "DOM1"
errors$CV1 <- 0.1
errors$CV2 <- 0.1
errors <- as.data.frame(errors)
errors <- errors[rep(row.names(errors),7),]
errors$domainvalue <- c(1:7)
errors
#>      DOM CV1 CV2 domainvalue
#> 1   DOM1 0.1 0.1           1
#> 1.1 DOM1 0.1 0.1           2
#> 1.2 DOM1 0.1 0.1           3
#> 1.3 DOM1 0.1 0.1           4
#> 1.4 DOM1 0.1 0.1           5
#> 1.5 DOM1 0.1 0.1           6
#> 1.6 DOM1 0.1 0.1           7

By so doing, we have chosen two stratification variables (Surfacesbois and Surfacescult) and two target variables (Pop020 and Pop2040), on both of which we have set the same precision constraint (a maximum CV equal to 10%).

Now the execution of the optimization step (for the domain 4) using optimizeStrata2 is straightforward, as we do not need to categorize stratification variables and to generate the atomic strata:

solution5 <- optimizeStrata2 (
  errors, 
  framesamp = framesamp,
  framecens = framecens, 
  strcens = TRUE, 
  alldomains = FALSE,
  dom = 4,
  iter = 50,
  pops = 20,
  nStrata = 5,
  writeFiles = FALSE,
  showPlot = FALSE,
  parallel = FALSE
)
sum(round(solution5$aggr_strata$SOLUZ))
#> [1] 35
expected_CV(solution5$aggr_strata)
#>      cv(Y1) cv(Y2)
#> DOM1    0.1  0.098

This function also outputs the new version of the overall frame (sampling plus census), already provided with the labels indicating to which stratum each unit belongs:

framenew <- solution5$framenew
table(framenew$LABEL)
#> 
#>  1  2  3  4  5  6 
#> 27 52 32 49  5  6

The first 5 strata are the ones pertaining to sampling units, the 6th is the one with censused units.

To inspect the structure of the strata, the function summaryStrata is available:

strataStructure <- summaryStrata(solution5$framenew,solution5$aggr_strata,progress=FALSE)
strataStructure
#>   Domain Stratum Population Allocation SamplingRate Lower_X1 Upper_X1
#> 1      4       1         27          4     0.131674        5      189
#> 2      4       2         52          5     0.093340       21      215
#> 3      4       3         32          7     0.212847      153      297
#> 4      4       4         49          8     0.172122      125     1126
#> 5      4       5          5          5     1.000000      168     2326
#> 6      4       6          6          6     1.000000      261     2807
#>   Lower_X2 Upper_X2
#> 1       40      167
#> 2      156      665
#> 3      160      738
#> 4      239      890
#> 5      853     1142
#> 6     1204     1827

It is also possible to investigate the distribution of population units in the strata by using the function plotStrata2d:

outstrata <- plotStrata2d(
                  solution5$framenew, 
                  solution5$aggr_strata,
                  domain = 4, 
                  vars = c("X1","X2"),
                  labels =     c("Surfacesbois","Surfacescult")
                  )

Together with the plot, also a tabular format of the optimized strata is produced by plotStrata2d:

outstrata
Stratum Population Allocation SamplingRate Bounds Surfacesbois Bounds Surfacescult
1 27 4 0.13167379 5-189 40-167
2 52 5 0.09334017 21-215 156-665
3 32 7 0.21284696 153-297 160-738
4 49 8 0.17212183 125-1126 239-890
5 5 5 1.00000000 168-2326 853-1142
6 6 6 1.00000000 261-2807 1204-1827

(The 6th ‘take-all’ stratum is not graphically represented because of its overlapping with the other strata.)

Finally, the selection of the sample is performed by using the same function selectSample, but this time is no more necessary to handle separately units to be samples and units to be censused:

samp <- selectSample(solution5$framenew,solution5$aggr_strata)
#> 
#> *** Sample has been drawn successfully ***
#>  35  units have been selected from  6  strata
#> 
#> ==> There have been  2  take-all strata 
#> from which have been selected  11 units

Appendix - Methodological approach

In a stratified sampling design with one or more stages, a sample is selected from a frame containing the units of the population of interest, stratified according to the values of one or more auxiliary variables (X) available for all units in the population.

For a given stratification, the overall size of the sample and the allocation in the different strata can be determined on the basis of constraints placed on the expected accuracy of the various estimates regarding the survey variables (Y).

If the target survey variables are more than one the optimization problem is said to be multivariate; otherwise it is univariate.

For a given stratification, in the univariate case the optimization of the allocation is in general based on the Neyman allocation. In the univariate case it is possible to make use of the Bethel algorithm.

The criteria according to which stratification is defined are crucial for the efficiency of the sample.

With the same precision constraints, the overall size of the sample required to satisfy them may be significantly affected by the particular stratification chosen for the population of interest.

Given G survey variables, their sampling variance is:

\[Var(\hat{Y_{g}})=\sum_{h=1}^{H}N_{h}^{2} (1- \frac{ n_{h}} {N_{h}}) \frac{ S_{h,g}^{2}} {n_{h}} \;\;\; g=1,...,G\]

If we introduce the following cost function:

\[C(n_{1},...,n_{H})=C_{0}+\sum_{h=1}^{H}C_{h}n_{h} \]

the optimization problem can be formalized in this way:

\[min= C_{0}+\sum_{h=1}^{H}C_{h}n_{h}\\ \] under the constraints \[ \begin{cases} CV(\hat{Y_{1}}) < U_{1}\\ CV(\hat{Y_{2}}) < U_{2}\\ ...\\ CV(\hat{Y_{G}}) < U_{G}\\ \end{cases} \] where \[ CV(\hat{Y_{g}}) = \frac{\sqrt{Var(\hat{Y_{g}})} } {mean(\hat{Y_{g}})}\]

Given a population frame with m auxiliary variables \(X_{1},..., X_{M}\) we define as atomic stratification the one that can be obtained considering the cartesian product of the definition domains of the m variables. \[L=\{(l_{1}),(l_{2}),...,(l_{k})\}\] Starting from the atomic stratification, it is possible to generate all the different stratifications that belong to the universe of stratifications. For example:

\[ \begin{align*} &P_{1}=\{(l_{1},l_{2},l_{3})\} & P_{2}=\{(l_{1}),(l_{2},l_{3})\} \\ &P_{2}=\{(l_{2}),(l_{1},l_{3})\} & P_{4}=\{(l_{31}),(l_{1},l_{2})\} \\ &P_{5}=\{(l_{1}),(l_{2}),(l_{k})\} \end{align*} \]

The number of feasible stratifications is exponential with respect to the number of initial atomic strata:

\[ \begin{align*} & B_{4}=15 & B_{10}=115975 & & B_{100}\approx 4.76 \times 10^{115} \end{align*} \]

In concrete cases, it is therefore impossible to examine all the different possible alternative stratifications. The Genetic Algorithm allows to explore the universe of stratification in a very efficient way in order to find the optimal (or close to optimal) solution.

In planning a strafied sampling for a given survey, proceed as follows:

The application of the genetic algorithm is based on the following steps:

References

Baillargeon, Sophie, and Louis-Paul Rivest. 2014. Stratification: Univariate Stratification of Survey Populations. https://CRAN.R-project.org/package=stratification.

Ballin, M., and G. Barcaroli. 2013. “Joint Determination of Optimal Stratification and Sample Allocation Using Genetic Algorithm.” Survey Methodology 39: 369–93.

Ballin, Marco, and Giulio Barcaroli. 2016. “Optimization of Stratified Sampling with the R Package SamplingStrata: Applications to Network Data.” Computational Network Analysis with R: Applications in Biology, Medicine and Chemistry. John Wiley & Sons.

Barcaroli, G. 2014. “SamplingStrata: An R Package for the Optimization of Stratified Sampling.” Journal of Statistical Software 61 (4): 1–24. https://www.jstatsoft.org/article/view/v061i04.

Hartigan, J. A., and M. A. Wong. 1979. “A K-Means Clustering Algorithm.” Applied Statistics 28: 100–108.

O’Luing, M., S. Prestwich, and S. Armagan Tarim. 2017. “Constructing Strata to Solve Sample Allocation Problems by Grouping Genetic Algorithm.” arXiv:1709.03076.

Willighagen, E. 2005. Genalg: R Based Genetic Algorithm. https://CRAN.R-project.org/package=genalg.