In survey sampling, a population is a well-defined set of identifiable and observable units relevant to a survey’s objective. Information about its parameters can be obtained through a Census (complete enumeration) or by selecting a representative sample using sampling methods. Populations may be finite or infinite, and sampling is performed on defined units listed in a sampling frame.
Simple Random Sampling Without Replacement (SRSWOR) is a fundamental sampling design, where every subset of size \(n\) from a population of size \(N\) has equal chance of being selected, with no repeated selections.
Two primary paradigms in inference are:
In both, auxiliary information can improve estimation accuracy. Common estimators leveraging this include:
This study uses Monte Carlo simulation to compare their performance in estimating the population total under SRSWOR sampling design.
Let:
The Horvitz-Thompson (HT) estimator for a finite population total is given by:
\[ \hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i} \]
where
Under SRSWOR, formula can be written as
\[ \hat{Y}_{HT} = N \cdot \bar{y}_s \]
This estimator is unbiased under SRSWOR.
\[ \hat{Y}_{R} = X \cdot \frac{\bar{y}_s}{\bar{x}_s} \]
Efficient when \(y\) and \(x\) are positively correlated and the relationship passes through the origin.
\[ \hat{Y}_{reg} = N \left[\bar{y}_s + b(\bar{X} - \bar{x}_s)\right] \]
where
\[ b = \frac{\sum (x_i - \bar{x}_s)(y_i - \bar{y}_s)}{\sum (x_i - \bar{x}_s)^2} \]
Performs well under linear relationship between \(x\) and \(y\).
Metric | Description |
---|---|
Est_Total |
Average estimated total over all simulations |
RB |
Relative Bias (%): \(\frac{\text{Mean Estimate} - \text{True Total}}{\text{True Total}} \times 100\) |
RRMSE |
Relative Root Mean Square Error (%): \(\frac{\sqrt{\text{MSE}}}{\text{True Total}} \times 100\) |
Skewness |
Asymmetry of estimator distribution |
Kurtosis |
Peakedness of estimator distribution |
CR |
Coverage Rate: Proportion of times the 95% CI includes the true total |
For_var |
Theoretical (design-based) variance |
Emp_var |
Empirical variance from simulation |
Est_var |
Mean of estimated variances across simulations |
perc_CV |
Coefficient of Variation: \(\frac{SD}{Mean} \times 100\) |
A population is synthetically generated with size \(N = 1000\). An auxiliary variable \(x\) is drawn from a normal distribution, and the study variable \(y\) is linearly related to \(x\) with added noise.
A sample of size \(n = 100\) is drawn without replacement from the population, and using this sample, an estimate finite population total of all estimators has been calculated. This procedure was repeated independently, say 1000 times (simulation number).Measures like percentage relative bias (%RB) and percentage relative root mean square error (%RRMSE) have been used to compare the performance of the estimators like HT, Ratio and Regression estimators.
survey_sim_est
This function performs a simulation study to evaluate the performance of three survey estimators: Horvitz-Thompson (HT), Ratio, and Regression estimators. It generates statistics like Relative Bias (RB), Relative Root Mean Square Error (RRMSE), Coefficient of Variation (CV), Skewness, Kurtosis, and Coverage Probability over a given number of simulations.
survey_sim_est <- function(Y, X, n = 40, SimNo = 500, seed = 123) {
if (length(Y) != length(X)) stop("Y and X must be of the same length.")
set.seed(seed)
N <- length(Y)
Y.total <- sum(Y)
X.total <- sum(X)
True.Total <- rep(Y.total, SimNo)
data <- data.frame(ID = 1:N, Y = Y, X = X)
est1 <- est2 <- est3 <- var_est1 <- var_est2 <- var_est3 <- numeric(SimNo)
for (h in 1:SimNo) {
set.seed(h)
sa <- sample(1:N, n, replace = FALSE)
samp <- data[sa, ]
di <- rep(N / n, n)
y_samp <- samp$Y
x_samp <- samp$X
s2_y <- var(y_samp)
s2_x <- var(x_samp)
rho_hat <- cor(y_samp, x_samp)
R_hat <- mean(y_samp) / mean(x_samp)
# HT Estimator
est1[h] <- sum(di * y_samp)
var_est1[h] <- N^2 * ((1 / n) - (1 / N)) * s2_y
# Ratio Estimator
ratio1 <- di * y_samp
ratio2 <- di * x_samp
est2[h] <- (sum(ratio1) / sum(ratio2)) * X.total
var_est2[h] <- N^2 * ((1 / n) - (1 / N)) * (s2_y + (R_hat^2 * s2_x) - (2 * R_hat * rho_hat * sqrt(s2_y * s2_x)))
# Regression Estimator
b_hat <- rho_hat * sqrt(s2_y) / sqrt(s2_x)
est3[h] <- est1[h] + b_hat * (X.total - sum(ratio2))
var_est3[h] <- N^2 * ((1 / n) - (1 / N)) * s2_y * (1 - rho_hat^2)
}
# Population-level parameters
pop_s2_y <- var(Y)
pop_s2_x <- var(X)
R_pop <- mean(Y) / mean(X)
rho_pop <- cor(Y, X)
var_HT <- N^2 * ((1 / n) - (1 / N)) * pop_s2_y
var_ratio <- N^2 * ((1 / n) - (1 / N)) * (pop_s2_y + R_pop^2 * pop_s2_x - 2 * R_pop * rho_pop * sqrt(pop_s2_y * pop_s2_x))
var_reg <- N^2 * ((1 / n) - (1 / N)) * pop_s2_y * (1 - rho_pop^2)
# Evaluation metrics
RB <- 100 * colMeans(cbind(est1, est2, est3) - True.Total) / mean(True.Total)
RRMSE <- 100 * sqrt(colMeans((cbind(est1, est2, est3) - True.Total)^2)) / mean(True.Total)
CV <- 100 * apply(cbind(est1, est2, est3), 2, sd) / colMeans(cbind(est1, est2, est3))
skew <- apply(cbind(est1, est2, est3), 2, moments::skewness)
kurt <- apply(cbind(est1, est2, est3), 2, moments::kurtosis)
emp_var <- apply(cbind(est1, est2, est3), 2, var)
est_var <- c(mean(var_est1), mean(var_est2), mean(var_est3))
coverage <- colMeans(abs(cbind(est1, est2, est3) - True.Total) <= 1.96 * sqrt(cbind(var_est1, var_est2, var_est3)))
result <- data.frame(
Est_Total = c(mean(est1), mean(est2), mean(est3)),
RB = RB,
RRMSE = RRMSE,
Skewness = skew,
Kurtosis = kurt,
Coverage = coverage,
PopVar = c(var_HT, var_ratio, var_reg),
EmpVar = emp_var,
EstVar = est_var,
CV = CV
)
rownames(result) <- c("HT", "Ratio", "Regression")
return(round(result, 3))
}
set.seed(123)
N <- 1000
X <- rnorm(N, mean = 50, sd = 10)
Y <- 3 + 1.5 * X + rnorm(N, mean = 0, sd = 10)
result <- survey_sim_est(Y = Y, X = X, n = 50, SimNo = 100)
print(result)
## Est_Total RB RRMSE Skewness Kurtosis Coverage PopVar EmpVar
## HT 78520.44 -0.186 3.344 0.000 3.750 0.93 6634795 6966400
## Ratio 78656.15 -0.013 1.980 -0.113 3.350 0.95 1923185 2449511
## Regression 78673.27 0.009 1.997 -0.120 3.347 0.95 1922454 2492887
## EstVar CV
## HT 6631499 3.361
## Ratio 1903436 1.990
## Regression 1867867 2.007
Coverage Rate near 95% indicates good interval
performance, while low RB
and RRMSE
signify
estimators unbiasedness and precision.
This simulation illustrates the benefits of using auxiliary information in improving the estimation of finite population totals. While the Horvitz-Thompson estimator offers unbiasedness under design-based framework, the ratio and regression estimators leverage the auxiliary variable to reduce variance significantly when assumptions are met.