Convert Survey Design to Generalized Bootstrap Replicate Design

Converts a survey design object to a replicate design object with replicate weights formed using the generalized bootstrap method. The generalized survey bootstrap is a method for forming bootstrap replicate weights from a textbook variance estimator, provided that the variance estimator can be represented as a quadratic form whose matrix is positive semidefinite (this covers a large class of variance estimators).

Usage

as_gen_boot_design(
  design,
  variance_estimator = NULL,
  aux_var_names = NULL,
  replicates = 500,
  tau = 1,
  exact_vcov = FALSE,
  psd_option = "warn",
  mse = getOption("survey.replicates.mse"),
  compress = TRUE
)

Arguments

design

A survey design object created using the 'survey' (or 'srvyr') package, with class 'survey.design' or 'svyimputationList'.

variance_estimator

The name of the variance estimator whose quadratic form matrix should be created. See variance-estimators for a detailed description of each variance estimator. Options include:

"Yates-Grundy":
The Yates-Grundy variance estimator based on first-order and second-order inclusion probabilities.
"Horvitz-Thompson":
The Horvitz-Thompson variance estimator based on first-order and second-order inclusion probabilities.
"Poisson Horvitz-Thompson":
The Horvitz-Thompson variance estimator based on assuming Poisson sampling, with first-order inclusion probabilities inferred from the sampling probabilities of the survey design object.
"Stratified Multistage SRS":
The usual stratified multistage variance estimator based on estimating the variance of cluster totals within strata at each stage.
"Ultimate Cluster":
The usual variance estimator based on estimating the variance of first-stage cluster totals within first-stage strata.
"Deville-1":
A variance estimator for unequal-probability sampling without replacement, described in Matei and Tillé (2005) as "Deville 1".
"Deville-2":
A variance estimator for unequal-probability sampling without replacement, described in Matei and Tillé (2005) as "Deville 2".
"Deville-Tille":
A variance estimator useful for balanced sampling designs, proposed by Deville and Tillé (2005).
"SD1":
The non-circular successive-differences variance estimator described by Ash (2014), sometimes used for variance estimation for systematic sampling.
"SD2":
The circular successive-differences variance estimator described by Ash (2014). This estimator is the basis of the "successive-differences replication" estimator commonly used for variance estimation for systematic sampling.
"BOSB":
The kernel-based variance estimator proposed by Breidt, Opsomer, and Sanchez-Borrego (2016) for use with systematic samples or other finely stratified designs. Uses the Epanechnikov kernel with the bandwidth automatically chosen to result in the smallest possible nonempty kernel window.
"Beaumont-Emond":
The variance estimator of Beaumont and Emond (2022) for multistage unequal-probability sampling without replacement.

aux_var_names

(Only used if variance_estimator = "Deville-Tille"). A vector of the names of auxiliary variables used in sampling.

replicates

Number of bootstrap replicates (should be as large as possible, given computer memory/storage limitations). A commonly-recommended default is 500.

tau

Either "auto", or a single number; the default value is 1. This is the rescaling constant used to avoid negative weights through the transformation $\frac{w + \tau - 1}{\tau}$, where $w$ is the original weight and $\tau$ is the rescaling constant tau.
If tau="auto", the rescaling factor is determined automatically as follows: if all of the adjustment factors are nonnegative, then tau is set equal to 1; otherwise, tau is set to the smallest value needed to rescale the adjustment factors such that they are all at least 0.01. Instead of using tau="auto", the user can instead use the function rescale_replicates() to rescale the replicates later.

exact_vcov

If exact_vcov=TRUE, the replicate factors will be generated such that variance estimates for totals exactly match the results from the target variance estimator. This requires that num_replicates exceeds the rank of Sigma. The replicate factors are generated by applying PCA-whitening to a collection of draws from a multivariate Normal distribution, then applying a coloring transformation to the whitened collection of draws.

psd_option

Either "warn" (the default) or "error". This option specifies what will happen if the target variance estimator has a quadratic form matrix which is not positive semidefinite. This can occasionally happen, particularly for two-phase designs.
If psd_option="error", then an error message will be displayed.
If psd_option="warn", then a warning message will be displayed, and the quadratic form matrix will be approximated by the most similar positive semidefinite matrix. This approximation was suggested by Beaumont and Patak (2012), who note that this is conservative in the sense of producing overestimates of variance. Beaumont and Patak (2012) argue that this overestimation is expected to be small in magnitude. See get_nearest_psd_matrix for details of the approximation.

mse

If TRUE, compute variances from sums of squares around the point estimate from the full-sample weights. If FALSE, compute variances from sums of squares around the mean estimate from the replicate weights.

compress

This reduces the computer memory required to represent the replicate weights and has no impact on estimates.

Value

A replicate design object, with class svyrep.design, which can be used with the usual functions, such as svymean() or svyglm().

Use weights(..., type = 'analysis') to extract the matrix of replicate weights.

Use as_data_frame_with_weights() to convert the design object to a data frame with columns for the full-sample and replicate weights.

Statistical Details

Let $v( \hat{T_y})$ be the textbook variance estimator for an estimated population total $\hat{T}_y$ of some variable $y$. The base weight for case $i$ in our sample is $w_i$, and we let $\breve{y}_i$ denote the weighted value $w_iy_i$. Suppose we can represent our textbook variance estimator as a quadratic form: $v(\hat{T}_y) = \breve{y}\Sigma\breve{y}^T$, for some $n \times n$ matrix $\Sigma$. The only constraint on $\Sigma$ is that, for our sample, it must be symmetric and positive semidefinite.

The bootstrapping process creates $B$ sets of replicate weights, where the $b$-th set of replicate weights is a vector of length $n$ denoted $\mathbf{a}^{(b)}$, whose $k$-th value is denoted $a_k^{(b)}$. This yields $B$ replicate estimates of the population total, $\hat{T}_y^{*(b)}=\sum_{k \in s} a_k^{(b)} \breve{y}_k$, for $b=1, \ldots B$, which can be used to estimate sampling variance.

$$ v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B} $$

This bootstrap variance estimator can be written as a quadratic form:

$$ v_B\left(\hat{T}_y\right) =\mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}} $$ where $$ \boldsymbol{\Sigma}_B = \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B} $$

Note that if the vector of adjustment factors $\mathbf{a}^{(b)}$ has expectation $\mathbf{1}_n$ and variance-covariance matrix $\boldsymbol{\Sigma}$, then we have the bootstrap expectation $E_{*}\left( \boldsymbol{\Sigma}_B \right) = \boldsymbol{\Sigma}$. Since the bootstrap process takes the sample values $\breve{y}$ as fixed, the bootstrap expectation of the variance estimator is $E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right)= \mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}$. Thus, we can produce a bootstrap variance estimator with the same expectation as the textbook variance estimator simply by randomly generating $\mathbf{a}^{(b)}$ from a distribution with the following two conditions:
Condition 1: $\quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n$
Condition 2: $\quad \mathbf{E}_*\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}=\mathbf{\Sigma}$

While there are multiple ways to generate adjustment factors satisfying these conditions, the simplest general method is to simulate from a multivariate normal distribution: $\mathbf{a} \sim MVN(\mathbf{1}_n, \boldsymbol{\Sigma})$. This is the method used by this function.

Details on Rescaling to Avoid Negative Adjustment Factors

Let $\mathbf{A} = \left[ \mathbf{a}^{(1)} \cdots \mathbf{a}^{(b)} \cdots \mathbf{a}^{(B)} \right]$ denote the $(n \times B)$ matrix of bootstrap adjustment factors. To eliminate negative adjustment factors, Beaumont and Patak (2012) propose forming a rescaled matrix of nonnegative replicate factors $\mathbf{A}^S$ by rescaling each adjustment factor $a_k^{(b)}$ as follows: $$ a_k^{S,(b)} = \frac{a_k^{(b)} + \tau - 1}{\tau} $$ where $\tau \geq 1 - a_k^{(b)} \geq 1$ for all $k$ in $\left\{ 1,\ldots,n \right\}$ and all $b$ in $\left\{1, \ldots, B\right\}$.

The value of $\tau$ can be set based on the realized adjustment factor matrix $\mathbf{A}$ or by choosing $\tau$ prior to generating the adjustment factor matrix $\mathbf{A}$ so that $\tau$ is likely to be large enough to prevent negative bootstrap weights.

If the adjustment factors are rescaled in this manner, it is important to adjust the scale factor used in estimating the variance with the bootstrap replicates, which becomes $\frac{\tau^2}{B}$ instead of $\frac{1}{B}$. $$ \textbf{Prior to rescaling: } v_B\left(\hat{T}_y\right) = \frac{1}{B}\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2 $$ $$ \textbf{After rescaling: } v_B\left(\hat{T}_y\right) = \frac{\tau^2}{B}\sum_{b=1}^B\left(\hat{T}_y^{S*(b)}-\hat{T}_y\right)^2 $$ When sharing a dataset that uses rescaled weights from a generalized survey bootstrap, the documentation for the dataset should instruct the user to use replication scale factor $\frac{\tau^2}{B}$ rather than $\frac{1}{B}$ when estimating sampling variances. This rescaling method does not affect variance estimates for linear statistics, but its impact on non-smooth statistics such as quantiles is unclear.

Two-Phase Designs

For a two-phase design, variance_estimator should be a list of variance estimators' names, with two elements, such as list('Ultimate Cluster', 'Poisson Horvitz-Thompson'). In two-phase designs, only the following estimators may be used for the second phase:

"Ultimate Cluster"
"Stratified Multistage SRS"
"Poisson Horvitz-Thompson"

For statistical details on the handling of two-phase designs, see the documentation for make_twophase_quad_form.

References

The generalized survey bootstrap was first proposed by Bertail and Combris (1997). See Beaumont and Patak (2012) for a clear overview of the generalized survey bootstrap. The generalized survey bootstrap represents one strategy for forming replication variance estimators in the general framework proposed by Fay (1984) and Dippo, Fay, and Morganstein (1984).

- Ash, S. (2014). "Using successive difference replication for estimating variances." Survey Methodology, Statistics Canada, 40(1), 47-59.

- Bellhouse, D.R. (1985). "Computing Methods for Variance Estimation in Complex Surveys." Journal of Official Statistics, Vol.1, No.3.

- Breidt, F. J., Opsomer, J. D., & Sanchez-Borrego, I. (2016). "Nonparametric Variance Estimation Under Fine Stratification: An Alternative to Collapsed Strata." Journal of the American Statistical Association, 111(514), 822-833. https://doi.org/10.1080/01621459.2015.1058264

- Beaumont, Jean-François, and Zdenek Patak. 2012. "On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling: Generalized Bootstrap for Sample Surveys." International Statistical Review 80 (1): 127-48. https://doi.org/10.1111/j.1751-5823.2011.00166.x.

- Bertail, and Combris. 1997. "Bootstrap Généralisé d'un Sondage." Annales d'Économie Et de Statistique, no. 46: 49. https://doi.org/10.2307/20076068.

- Deville, J. C., and Tillé, Y. (2005). "Variance approximation under balanced sampling." Journal of Statistical Planning and Inference, 128, 569-591.

- Dippo, Cathryn, Robert Fay, and David Morganstein. 1984. "Computing Variances from Complex Samples with Replicate Weights." In, 489-94. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_094.pdf.

- Fay, Robert. 1984. "Some Properties of Estimates of Variance Based on Replication Methods." In, 495-500. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_095.pdf.

- Matei, Alina, and Yves Tillé. (2005). "Evaluation of Variance Approximations and Estimators in Maximum Entropy Sampling with Unequal Probability and Fixed Sample Size." Journal of Official Statistics, 21(4):543-70.

Examples

# \donttest{
library(survey)

# Example 1: Bootstrap based on the Yates-Grundy estimator ----
   set.seed(2014)

   data('election', package = 'survey')

   ## Create survey design object
   pps_design_yg <- svydesign(
     data = election_pps,
     id = ~1, fpc = ~p,
     pps = ppsmat(election_jointprob),
     variance = "YG"
   )

   ## Convert to generalized bootstrap replicate design
   gen_boot_design_yg <- pps_design_yg |>
     as_gen_boot_design(variance_estimator = "Yates-Grundy",
                        replicates = 1000, tau = "auto")

   svytotal(x = ~ Bush + Kerry, design = pps_design_yg)
#>          total      SE
#> Bush  64518472 2406526
#> Kerry 51202102 2408091
   svytotal(x = ~ Bush + Kerry, design = gen_boot_design_yg)
#>          total      SE
#> Bush  64518472 2422125
#> Kerry 51202102 2424414

# Example 2: Bootstrap based on the successive-difference estimator ----

   data('library_stsys_sample', package = 'svrep')

   ## First, ensure data are sorted in same order as was used in sampling
   library_stsys_sample <- library_stsys_sample |>
     sort_by(~ SAMPLING_SORT_ORDER)

   ## Create a survey design object
   design_obj <- svydesign(
     data = library_stsys_sample,
     strata = ~ SAMPLING_STRATUM,
     ids = ~ 1,
     fpc = ~ STRATUM_POP_SIZE
   )

   ## Convert to generalized bootstrap replicate design
   gen_boot_design_sd2 <- as_gen_boot_design(
     design = design_obj,
     variance_estimator = "SD2",
     replicates = 2000
   )
#> For `variance_estimator='SD2', assumes rows of data are sorted in the same order used in sampling.

   ## Estimate sampling variances
   svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = gen_boot_design_sd2)
#>           total    SE
#> TOTSTAFF 180977 38367
   svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = design_obj)
#>           total    SE
#> TOTSTAFF 180977 39029

# Example 3: Two-phase sample ----
# -- First stage is stratified systematic sampling,
# -- second stage is response/nonresponse modeled as Poisson sampling

  nonresponse_model <- glm(
    data = library_stsys_sample,
    family = quasibinomial('logit'),
    formula = I(RESPONSE_STATUS == "Survey Respondent") ~ 1,
    weights = 1/library_stsys_sample$SAMPLING_PROB
  )

  library_stsys_sample[['RESPONSE_PROPENSITY']] <- predict(
    nonresponse_model,
    newdata = library_stsys_sample,
    type = "response"
  )

  twophase_design <- twophase(
    data = library_stsys_sample,
    # Identify cases included in second phase sample
    subset = ~ I(RESPONSE_STATUS == "Survey Respondent"),
    strata = list(~ SAMPLING_STRATUM, NULL),
    id = list(~ 1, ~ 1),
    probs = list(NULL, ~ RESPONSE_PROPENSITY),
    fpc = list(~ STRATUM_POP_SIZE, NULL),
  )

  twophase_boot_design <- as_gen_boot_design(
    design = twophase_design,
    variance_estimator = list(
      "SD2", "Poisson Horvitz-Thompson"
    )
  )
#> For `variance_estimator='SD2', assumes rows of data are sorted in the same order used in sampling.

  svytotal(x = ~ LIBRARIA, design = twophase_boot_design)
#>          total    SE
#> LIBRARIA 67983 14731

# }