nonprobsvy: an R package for modern statistical inference methods based on non-probability samples
The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or a probability sample is available.
The package allows for:
- variable selection (via the ncvreg, Rcpp and RcppArmadillo packages),
- integration with the survey and srvyr packages when a probability sample is available (Lumley 2004, 2023; Freedman Ellis and Schneider 2024),
- different links for the selection (logit, probit and cloglog) and outcome (gaussian, binomial and poisson) variables.

Details on the use of the package can be found:
You can install the latest version of the nonprobsvy package from the main branch on GitHub with:
remotes::install_github("ncn-foreigners/nonprobsvy")
or install the stable version from CRAN:
install.packages("nonprobsvy")
or install the development version from the dev branch:
remotes::install_github("ncn-foreigners/nonprobsvy@dev")
Consider the following setting, where two samples are available: a non-probability sample (denoted \(S_A\)) and a probability sample (denoted \(S_B\)). A set of auxiliary variables (denoted \(\boldsymbol{X}\)) is available in both sources, while \(Y\) is observed only in the non-probability sample and the design weights \(\boldsymbol{d}\) (or calibrated weights \(\boldsymbol{w}\)) are present only in the probability sample.
Sample | Unit | Auxiliary variables \(\boldsymbol{X}\) | Target variable \(Y\) | Design (\(\boldsymbol{d}\)) or calibrated (\(\boldsymbol{w}\)) weights |
---|---|---|---|---|
\(S_A\) (non-probability) | 1 | \(\checkmark\) | \(\checkmark\) | ? |
\(S_A\) (non-probability) | … | \(\checkmark\) | \(\checkmark\) | ? |
\(S_A\) (non-probability) | \(n_A\) | \(\checkmark\) | \(\checkmark\) | ? |
\(S_B\) (probability) | \(n_A+1\) | \(\checkmark\) | ? | \(\checkmark\) |
\(S_B\) (probability) | … | \(\checkmark\) | ? | \(\checkmark\) |
\(S_B\) (probability) | \(n_A+n_B\) | \(\checkmark\) | ? | \(\checkmark\) |
Suppose \(Y\) is the target variable, \(\boldsymbol{X}\) is a matrix of auxiliary variables, and \(R\) is the inclusion indicator. If we are interested in estimating the mean \(\bar{\tau}_Y\) or the sum \(\tau_Y\) of the target variable given the observed data set \((y_k, \boldsymbol{x}_k, R_k)\), we can approach the problem under the following scenarios:
Estimator | Example code |
---|---|
Mass imputation based on regression imputation | |
Inverse probability weighting | |
Inverse probability weighting with calibration constraint | |
Doubly robust estimator | |

Estimator | Example code |
---|---|
Mass imputation based on regression imputation | |
Mass imputation based on nearest neighbour imputation | |
Mass imputation based on predictive mean matching | |
Mass imputation based on regression imputation with variable selection (LASSO) | |
Inverse probability weighting | |
Inverse probability weighting with calibration constraint | |
Inverse probability weighting with calibration constraint with variable selection (SCAD) | |
Doubly robust estimator | |
Doubly robust estimator with variable selection (SCAD) and bias minimization | |
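For orientation, the three estimator families listed in the tables are commonly written as follows for the population mean. This is a sketch of the standard textbook forms, not necessarily the exact expressions the package implements. The symbols \(\hat{\pi}_k^A\) (estimated probability that unit \(k\) is included in \(S_A\)), \(\hat{y}_k\) (prediction from an outcome model fitted on \(S_A\)), \(\hat{N}_A\) and \(\hat{N}_B\) are notation introduced here for illustration.

```latex
\begin{align*}
\hat{N}_A &= \sum_{k \in S_A} \frac{1}{\hat{\pi}_k^A},
\qquad
\hat{N}_B = \sum_{k \in S_B} d_k, \\
\hat{\bar{\tau}}_Y^{\,IPW} &= \frac{1}{\hat{N}_A} \sum_{k \in S_A} \frac{y_k}{\hat{\pi}_k^A}, \\
\hat{\bar{\tau}}_Y^{\,MI} &= \frac{1}{\hat{N}_B} \sum_{k \in S_B} d_k \hat{y}_k, \\
\hat{\bar{\tau}}_Y^{\,DR} &= \frac{1}{\hat{N}_A} \sum_{k \in S_A} \frac{y_k - \hat{y}_k}{\hat{\pi}_k^A}
  + \frac{1}{\hat{N}_B} \sum_{k \in S_B} d_k \hat{y}_k .
\end{align*}
```

The doubly robust form adds a propensity-weighted residual correction to the mass imputation estimator, so it remains consistent if either the selection model or the outcome model is correctly specified.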
Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]
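library(survey)
library(nonprobsvy)
set.seed(1234567890)
N <- 1e6 ## 1000000 (population size)
n <- 1000 ## size of the probability sample
## auxiliary variables and error term
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
## target variables: linear (y1) and non-linear (y2) in x1
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
## inclusion probabilities for the non-probability (big data) sample
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
## sample indicators: big data sample (flag_bd1) and simple random sample (flag_srs)
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs, pop_size = N)
base_w_bd <- N/sum(population$flag_bd1)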
Declare a svydesign object with the survey package:
sample_prob <- svydesign(ids = ~1, weights = ~base_w_srs,
                         data = subset(population, flag_srs == 1),
                         fpc = ~pop_size)
sample_prob
#> Independent Sampling design
#> svydesign(ids = ~1, weights = ~base_w_srs, data = subset(population,
#>     flag_srs == 1), fpc = ~pop_size)
or with the srvyr package:
sample_prob <- srvyr::as_survey_design(.data = subset(population, flag_srs == 1),
                                       weights = base_w_srs)
sample_prob
#> Independent Sampling design (with replacement)
#> Called via srvyr
#> Sampling variables:
#> Data variables:
#>   - x1 (dbl), x2 (dbl), y1 (dbl), y2 (dbl), p1 (dbl), p2 (dbl), base_w_srs (dbl), flag_bd1 (int), flag_srs (dbl)
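Whichever constructor is used, it can be worth checking that the design object behaves as expected before passing it on. The following is a minimal, self-contained sketch on a small hypothetical population (the sizes `N` and `n`, the column `w` and the object names are assumptions for illustration, not part of the example above). It only uses `svydesign()`, `svymean()`, `svytotal()` and `weights()` from the survey package.

```r
library(survey)

set.seed(1)
N <- 10000   # hypothetical population size for this sketch
n <- 500     # hypothetical probability sample size

# Small population with one auxiliary variable and constant SRS weights N / n
pop <- data.frame(x1 = rnorm(N, mean = 1), pop_size = N, w = N / n)
srs <- pop[sample(N, n), ]   # simple random sample without replacement

des <- svydesign(ids = ~1, weights = ~w, data = srs, fpc = ~pop_size)

n_hat    <- sum(weights(des))            # design weights should sum to N
x1_mean  <- coef(svymean(~x1, des))      # design-based estimate of the mean of x1
x1_total <- coef(svytotal(~x1, des))     # design-based estimate of the total of x1

print(c(n_hat = n_hat, x1_mean, x1_total))
```

Under simple random sampling with equal weights, the design-based mean coincides with the unweighted sample mean, and the weights sum to the population size.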
Estimate the population means of y1 and y2 with the doubly robust estimator using IPW with calibration constraints; we specify that the auxiliary variables should not be combined for the inference.
result_dr <- nonprob(
  selection = ~ x2,
  outcome = y1 + y2 ~ x1 + x2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob
)
Results
result_dr
#> A nonprob object
#> - estimator type: doubly robust
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9500 (se=0.0414, ci=(2.8689, 3.0312))
#> - variable y2: 1.5762 (se=0.0498, ci=(1.4786, 1.6739))
Mass imputation estimator
result_mi <- nonprob(
  outcome = y1 + y2 ~ x1 + x2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob
)
Results
result_mi
#> A nonprob object
#> - estimator type: mass imputation
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9498 (se=0.0420, ci=(2.8675, 3.0321))
#> - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6398))
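To make the idea behind regression mass imputation concrete, here is a by-hand sketch in base R. It is not the package's implementation: it re-simulates a smaller population in the spirit of the example above (the size `N = 1e5`, the seed and all object names are assumptions), fits a linear outcome model on the non-probability sample, predicts for the probability sample, and takes a design-weighted mean of the predictions.

```r
set.seed(2024)
N <- 1e5    # smaller hypothetical population than in the main example
n <- 1000   # size of the probability sample

x1 <- rnorm(N, mean = 1)
x2 <- rexp(N)
y1 <- 1 + x1 + x2 + rnorm(N)
pop <- data.frame(x1, x2, y1)

# Non-probability sample: inclusion depends on x2 (logistic selection)
in_A <- rbinom(N, size = 1, prob = plogis(x2)) == 1
# Probability sample: simple random sample without replacement
in_B <- seq_len(N) %in% sample(N, n)

S_A <- pop[in_A, ]                 # y1 observed here
S_B <- pop[in_B, c("x1", "x2")]    # y1 deliberately dropped
d_B <- rep(N / n, n)               # design weights of the SRS

# Step 1: fit the outcome model on the non-probability sample
fit <- lm(y1 ~ x1 + x2, data = S_A)
# Step 2: impute y1 for every unit of the probability sample
y_hat_B <- predict(fit, newdata = S_B)
# Step 3: design-weighted mean of the imputed values
mi_mean <- sum(d_B * y_hat_B) / sum(d_B)

naive_mean <- mean(S_A$y1)   # uncorrected mean of the non-probability sample
true_mean  <- mean(pop$y1)   # known only because the data are simulated

print(c(true = true_mean, naive = naive_mean, mass_imputation = mi_mean))
```

Because selection depends on x2 and y1 increases with x2, the naive mean overstates the population mean, while the imputed estimate corrects for this as long as the outcome model is adequate.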
Inverse probability weighting estimator
result_ipw <- nonprob(
  selection = ~ x2,
  target = ~ y1 + y2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob)
Results
result_ipw
#> A nonprob object
#> - estimator type: inverse probability weighting
#> - method: logit (mle)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9981 (se=0.0137, ci=(2.9713, 3.0249))
#> - variable y2: 1.5906 (se=0.0137, ci=(1.5639, 1.6174))
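The logic of inverse probability weighting with a logit selection model can also be sketched by hand. The code below is a rough illustration in base R, not the package's code: it assumes the commonly used pseudo-log-likelihood \(\sum_{k \in S_A} \boldsymbol{x}_k^\top\boldsymbol{\theta} - \sum_{k \in S_B} d_k \log\{1 + \exp(\boldsymbol{x}_k^\top\boldsymbol{\theta})\}\), maximises it with `optim()`, and then forms a Hájek-type weighted mean. The population size, seed and all object names are assumptions made for this sketch.

```r
set.seed(2024)
N <- 1e5    # smaller hypothetical population than in the main example
n <- 1000   # size of the probability sample

x1 <- rnorm(N, mean = 1)
x2 <- rexp(N)
y1 <- 1 + x1 + x2 + rnorm(N)

in_A <- rbinom(N, size = 1, prob = plogis(x2)) == 1   # non-probability sample
in_B <- seq_len(N) %in% sample(N, n)                  # simple random sample

X_A <- cbind(1, x2[in_A])   # selection-model design matrix in S_A
X_B <- cbind(1, x2[in_B])   # same covariates in S_B
y_A <- y1[in_A]
d_B <- rep(N / n, n)        # design weights of the SRS

# Numerically stable log(1 + exp(e))
softplus <- function(e) pmax(e, 0) + log1p(exp(-abs(e)))

# Negative pseudo-log-likelihood, scaled by N to keep step sizes moderate
neg_pll <- function(theta) {
  -(sum(X_A %*% theta) - sum(d_B * softplus(drop(X_B %*% theta)))) / N
}
# Its gradient: -(sum_A x_k - sum_B d_k * pi_k * x_k) / N
neg_pll_grad <- function(theta) {
  pi_B <- plogis(drop(X_B %*% theta))
  -(colSums(X_A) - colSums(d_B * pi_B * X_B)) / N
}

fit <- optim(c(0, 0), neg_pll, neg_pll_grad, method = "BFGS")

pi_A <- plogis(drop(X_A %*% fit$par))          # estimated inclusion probabilities
ipw_mean <- sum(y_A / pi_A) / sum(1 / pi_A)    # Hajek-type IPW mean

naive_mean <- mean(y_A)
true_mean  <- mean(y1)   # known only because the data are simulated

print(c(true = true_mean, naive = naive_mean, ipw = ipw_mean))
```

The probability sample enters only through the second sum, where its design weights stand in for the unobserved population total; this is why auxiliary variables are needed in both samples but the target variable only in the non-probability one.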
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.