Convert a survey design object to a generalized bootstrap replicate design
Source:R/generalized_bootstrap.R
as_gen_boot_design.Rd
Converts a survey design object to a replicate design object with replicate weights formed using the generalized bootstrap method. The generalized survey bootstrap is a method for forming bootstrap replicate weights from a textbook variance estimator, provided that the variance estimator can be represented as a quadratic form whose matrix is positive semidefinite (this covers a large class of variance estimators).
Usage
as_gen_boot_design(
design,
variance_estimator = NULL,
aux_var_names = NULL,
replicates = 500,
tau = "auto",
exact_vcov = FALSE,
psd_option = "warn",
mse = getOption("survey.replicates.mse"),
compress = TRUE
)
Arguments
- design
A survey design object created using the 'survey' (or 'srvyr') package, with class
'survey.design'
or'svyimputationList'
.- variance_estimator
The name of the variance estimator whose quadratic form matrix should be created. See variance-estimators for a detailed description of each variance estimator. Options include:
"Yates-Grundy":
The Yates-Grundy variance estimator based on first-order and second-order inclusion probabilities."Horvitz-Thompson":
The Horvitz-Thompson variance estimator based on first-order and second-order inclusion probabilities."Poisson Horvitz-Thompson":
The Horvitz-Thompson variance estimator based on assuming Poisson sampling, with first-order inclusion probabilities inferred from the sampling probabilities of the survey design object."Stratified Multistage SRS":
The usual stratified multistage variance estimator based on estimating the variance of cluster totals within strata at each stage."Ultimate Cluster":
The usual variance estimator based on estimating the variance of first-stage cluster totals within first-stage strata."Deville-1":
A variance estimator for unequal-probability sampling without replacement, described in Matei and Tillé (2005) as "Deville 1"."Deville-2":
A variance estimator for unequal-probability sampling without replacement, described in Matei and Tillé (2005) as "Deville 2"."Deville-Tille":
A variance estimator useful for balanced sampling designs, proposed by Deville and Tillé (2005)."SD1":
The non-circular successive-differences variance estimator described by Ash (2014), sometimes used for variance estimation for systematic sampling."SD2":
The circular successive-differences variance estimator described by Ash (2014). This estimator is the basis of the "successive-differences replication" estimator commonly used for variance estimation for systematic sampling."BOSB":
The kernel-based variance estimator proposed by Breidt, Opsomer, and Sanchez-Borrego (2016) for use with systematic samples or other finely stratified designs. Uses the Epanechnikov kernel with the bandwidth automatically chosen to result in the smallest possible nonempty kernel window."Beaumont-Emond":
The variance estimator of Beaumont and Emond (2022) for multistage unequal-probability sampling without replacement.
- aux_var_names
(Only used if
variance_estimator = "Deville-Tille")
. A vector of the names of auxiliary variables used in sampling.- replicates
Number of bootstrap replicates (should be as large as possible, given computer memory/storage limitations). A commonly-recommended default is 500.
- tau
Either
"auto"
, or a single number. This is the rescaling constant used to avoid negative weights through the transformation \(\frac{w + \tau - 1}{\tau}\), where \(w\) is the original weight and \(\tau\) is the rescaling constanttau
.
Iftau="auto"
, the rescaling factor is determined automatically as follows: if all of the adjustment factors are nonnegative, thentau
is set equal to 1; otherwise,tau
is set to the smallest value needed to rescale the adjustment factors such that they are all at least0.01
.- exact_vcov
If
exact_vcov=TRUE
, the replicate factors will be generated such that variance estimates for totals exactly match the results from the target variance estimator. This requires thatnum_replicates
exceeds the rank ofSigma
. The replicate factors are generated by applying PCA-whitening to a collection of draws from a multivariate Normal distribution, then applying a coloring transformation to the whitened collection of draws.- psd_option
Either
"warn"
(the default) or"error"
. This option specifies what will happen if the target variance estimator has a quadratic form matrix which is not positive semidefinite. This can occasionally happen, particularly for two-phase designs.
Ifpsd_option="error"
, then an error message will be displayed.
Ifpsd_option="warn"
, then a warning message will be displayed, and the quadratic form matrix will be approximated by the most similar positive semidefinite matrix. This approximation was suggested by Beaumont and Patak (2012), who note that this is conservative in the sense of producing overestimates of variance. Beaumont and Patak (2012) argue that this overestimation is expected to be small in magnitude. Seeget_nearest_psd_matrix
for details of the approximation.- mse
If
TRUE
, compute variances from sums of squares around the point estimate from the full-sample weights. IfFALSE
, compute variances from sums of squares around the mean estimate from the replicate weights.- compress
This reduces the computer memory required to represent the replicate weights and has no impact on estimates.
Value
A replicate design object, with class svyrep.design
, which can be used with the usual functions,
such as svymean()
or svyglm()
.
Use weights(..., type = 'analysis')
to extract the matrix of replicate weights.
Use as_data_frame_with_weights()
to convert the design object to a data frame with columns
for the full-sample and replicate weights.
Statistical Details
Let \(v( \hat{T_y})\) be the textbook variance estimator for an estimated population total \(\hat{T}_y\) of some variable \(y\). The base weight for case \(i\) in our sample is \(w_i\), and we let \(\breve{y}_i\) denote the weighted value \(w_iy_i\). Suppose we can represent our textbook variance estimator as a quadratic form: \(v(\hat{T}_y) = \breve{y}\Sigma\breve{y}^T\), for some \(n \times n\) matrix \(\Sigma\). The only constraint on \(\Sigma\) is that, for our sample, it must be symmetric and positive semidefinite.
The bootstrapping process creates \(B\) sets of replicate weights, where the \(b\)-th set of replicate weights is a vector of length \(n\) denoted \(\mathbf{a}^{(b)}\), whose \(k\)-th value is denoted \(a_k^{(b)}\). This yields \(B\) replicate estimates of the population total, \(\hat{T}_y^{*(b)}=\sum_{k \in s} a_k^{(b)} \breve{y}_k\), for \(b=1, \ldots B\), which can be used to estimate sampling variance.
$$ v_B\left(\hat{T}_y\right)=\frac{\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2}{B} $$
This bootstrap variance estimator can be written as a quadratic form:
$$ v_B\left(\hat{T}_y\right) =\mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}} $$ where $$ \boldsymbol{\Sigma}_B = \frac{\sum_{b=1}^B\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)\left(\mathbf{a}^{(b)}-\mathbf{1}_n\right)^{\prime}}{B} $$
Note that if the vector of adjustment factors \(\mathbf{a}^{(b)}\) has expectation \(\mathbf{1}_n\) and variance-covariance matrix \(\boldsymbol{\Sigma}\),
then we have the bootstrap expectation \(E_{*}\left( \boldsymbol{\Sigma}_B \right) = \boldsymbol{\Sigma}\). Since the bootstrap process takes the sample values \(\breve{y}\) as fixed, the bootstrap expectation of the variance estimator is \(E_{*} \left( \mathbf{\breve{y}}^{\prime}\Sigma_B \mathbf{\breve{y}}\right)= \mathbf{\breve{y}}^{\prime}\Sigma \mathbf{\breve{y}}\).
Thus, we can produce a bootstrap variance estimator with the same expectation as the textbook variance estimator simply by randomly generating \(\mathbf{a}^{(b)}\) from a distribution with the following two conditions:
Condition 1: \(\quad \mathbf{E}_*(\mathbf{a})=\mathbf{1}_n\)
Condition 2: \(\quad \mathbf{E}_*\left(\mathbf{a}-\mathbf{1}_n\right)\left(\mathbf{a}-\mathbf{1}_n\right)^{\prime}=\mathbf{\Sigma}\)
While there are multiple ways to generate adjustment factors satisfying these conditions,
the simplest general method is to simulate from a multivariate normal distribution: \(\mathbf{a} \sim MVN(\mathbf{1}_n, \boldsymbol{\Sigma})\).
This is the method used by this function.
Details on Rescaling to Avoid Negative Adjustment Factors
Let \(\mathbf{A} = \left[ \mathbf{a}^{(1)} \cdots \mathbf{a}^{(b)} \cdots \mathbf{a}^{(B)} \right]\) denote the \((n \times B)\) matrix of bootstrap adjustment factors. To eliminate negative adjustment factors, Beaumont and Patak (2012) propose forming a rescaled matrix of nonnegative replicate factors \(\mathbf{A}^S\) by rescaling each adjustment factor \(a_k^{(b)}\) as follows: $$ a_k^{S,(b)} = \frac{a_k^{(b)} + \tau - 1}{\tau} $$ where \(\tau \geq 1 - a_k^{(b)} \geq 1\) for all \(k\) in \(\left\{ 1,\ldots,n \right\}\) and all \(b\) in \(\left\{1, \ldots, B\right\}\).
The value of \(\tau\) can be set based on the realized adjustment factor matrix \(\mathbf{A}\) or by choosing \(\tau\) prior to generating the adjustment factor matrix \(\mathbf{A}\) so that \(\tau\) is likely to be large enough to prevent negative bootstrap weights.
If the adjustment factors are rescaled in this manner, it is important to adjust the scale factor used in estimating the variance with the bootstrap replicates, which becomes \(\frac{\tau^2}{B}\) instead of \(\frac{1}{B}\). $$ \textbf{Prior to rescaling: } v_B\left(\hat{T}_y\right) = \frac{1}{B}\sum_{b=1}^B\left(\hat{T}_y^{*(b)}-\hat{T}_y\right)^2 $$ $$ \textbf{After rescaling: } v_B\left(\hat{T}_y\right) = \frac{\tau^2}{B}\sum_{b=1}^B\left(\hat{T}_y^{S*(b)}-\hat{T}_y\right)^2 $$ When sharing a dataset that uses rescaled weights from a generalized survey bootstrap, the documentation for the dataset should instruct the user to use replication scale factor \(\frac{\tau^2}{B}\) rather than \(\frac{1}{B}\) when estimating sampling variances.
Two-Phase Designs
For a two-phase design, variance_estimator
should be a list of variance estimators' names,
with two elements, such as list('Ultimate Cluster', 'Poisson Horvitz-Thompson')
.
In two-phase designs, only the following estimators may be used for the second phase:
"Ultimate Cluster"
"Stratified Multistage SRS"
"Poisson Horvitz-Thompson"
For statistical details on the handling of two-phase designs, see the documentation for make_twophase_quad_form.
References
The generalized survey bootstrap was first proposed by Bertail and Combris (1997).
See Beaumont and Patak (2012) for a clear overview of the generalized survey bootstrap.
The generalized survey bootstrap represents one strategy for forming replication variance estimators
in the general framework proposed by Fay (1984) and Dippo, Fay, and Morganstein (1984).
- Ash, S. (2014). "Using successive difference replication for estimating variances."
Survey Methodology, Statistics Canada, 40(1), 47–59.
- Bellhouse, D.R. (1985). "Computing Methods for Variance Estimation in Complex Surveys."
Journal of Official Statistics, Vol.1, No.3.
- Breidt, F. J., Opsomer, J. D., & Sanchez-Borrego, I. (2016).
"Nonparametric Variance Estimation Under Fine Stratification: An Alternative to Collapsed Strata."
Journal of the American Statistical Association, 111(514), 822–833. https://doi.org/10.1080/01621459.2015.1058264
- Beaumont, Jean-François, and Zdenek Patak. 2012. “On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling: Generalized Bootstrap for Sample Surveys.” International Statistical Review 80 (1): 127–48. https://doi.org/10.1111/j.1751-5823.2011.00166.x.
- Bertail, and Combris. 1997. “Bootstrap Généralisé d’un Sondage.” Annales d’Économie Et de Statistique, no. 46: 49. https://doi.org/10.2307/20076068.
- Deville, J.‐C., and Tillé, Y. (2005). "Variance approximation under balanced sampling."
Journal of Statistical Planning and Inference, 128, 569–591.
- Dippo, Cathryn, Robert Fay, and David Morganstein. 1984. “Computing Variances from Complex Samples with Replicate Weights.” In, 489–94. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_094.pdf.
- Fay, Robert. 1984. “Some Properties of Estimates of Variance Based on Replication Methods.” In, 495–500. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/papers/1984_095.pdf.
- Matei, Alina, and Yves Tillé. (2005).
“Evaluation of Variance Approximations and Estimators
in Maximum Entropy Sampling with Unequal Probability and Fixed Sample Size.”
Journal of Official Statistics, 21(4):543–70.
See also
Use estimate_boot_reps_for_target_cv
to help choose the number of bootstrap replicates.
For greater customization of the method, make_quad_form_matrix
can be used to
represent several common variance estimators as a quadratic form's matrix,
which can then be used as an input to make_gen_boot_factors
.
The function rescale_reps
is used to implement
the rescaling of the bootstrap adjustment factors.
See variance-estimators for a description of each variance estimator.
Examples
if (FALSE) { # \dontrun{
library(survey)
# Example 1: Bootstrap based on the Yates-Grundy estimator ----
set.seed(2014)
data('election', package = 'survey')
## Create survey design object
pps_design_yg <- svydesign(
data = election_pps,
id = ~1, fpc = ~p,
pps = ppsmat(election_jointprob),
variance = "YG"
)
## Convert to generalized bootstrap replicate design
gen_boot_design_yg <- pps_design_yg |>
as_gen_boot_design(variance_estimator = "Yates-Grundy",
replicates = 1000, tau = "auto")
svytotal(x = ~ Bush + Kerry, design = pps_design_yg)
svytotal(x = ~ Bush + Kerry, design = gen_boot_design_yg)
# Example 2: Bootstrap based on the successive-difference estimator ----
data('library_stsys_sample', package = 'svrep')
## First, ensure data are sorted in same order as was used in sampling
library_stsys_sample <- library_stsys_sample |>
sort_by(~ SAMPLING_SORT_ORDER)
## Create a survey design object
design_obj <- svydesign(
data = library_stsys_sample,
strata = ~ SAMPLING_STRATUM,
ids = ~ 1,
fpc = ~ STRATUM_POP_SIZE
)
## Convert to generalized bootstrap replicate design
gen_boot_design_sd2 <- as_gen_boot_design(
design = design_obj,
variance_estimator = "SD2",
replicates = 2000
)
## Estimate sampling variances
svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = gen_boot_design_sd2)
svytotal(x = ~ TOTSTAFF, na.rm = TRUE, design = design_obj)
# Example 3: Two-phase sample ----
# -- First stage is stratified systematic sampling,
# -- second stage is response/nonresponse modeled as Poisson sampling
nonresponse_model <- glm(
data = library_stsys_sample,
family = quasibinomial('logit'),
formula = I(RESPONSE_STATUS == "Survey Respondent") ~ 1,
weights = 1/library_stsys_sample$SAMPLING_PROB
)
library_stsys_sample[['RESPONSE_PROPENSITY']] <- predict(
nonresponse_model,
newdata = library_stsys_sample,
type = "response"
)
twophase_design <- twophase(
data = library_stsys_sample,
# Identify cases included in second phase sample
subset = ~ I(RESPONSE_STATUS == "Survey Respondent"),
strata = list(~ SAMPLING_STRATUM, NULL),
id = list(~ 1, ~ 1),
probs = list(NULL, ~ RESPONSE_PROPENSITY),
fpc = list(~ STRATUM_POP_SIZE, NULL),
)
twophase_boot_design <- as_gen_boot_design(
design = twophase_design,
variance_estimator = list(
"SD2", "Poisson Horvitz-Thompson"
)
)
svytotal(x = ~ LIBRARIA, design = twophase_boot_design)
} # }