Background

In complex survey data, standard logistic regression requires the use of survey design weights (accounting for stratification, clustering, and unequal probabilities of selection) to obtain design-based unbiased estimates. However, when estimating the association between an exposure and a binary outcome, the relationship may be confounded by measured covariates. To address this, propensity-score weighting or prognostic-score weighting can be applied in addition to survey weights. Propensity-score weights balance the distribution of confounders across exposure groups, while prognostic-score weights adjust for covariates predictive of the outcome. Incorporating these weights allows us to estimate the exposure–outcome association adjusted for confounding, while still respecting the complex survey design. Formally, this approach produces doubly weighted regression models, where the combined weights (survey × propensity/prognostic) yield estimates that are both design-consistent and confounding-adjusted. In this package, we implement methods for binary, multinomial, and continuous exposures following established approaches in the survey and causal inference literature (e.g., [Austin, 2011; Lunceford & Davidian, 2004]). These methods enable researchers to report adjusted odds ratios that reflect the association between exposure and outcome under both design-based and confounder-adjusted considerations ([Ahmed and Dwivedi, 2026]). ## Objective This vignette is intended as a methodological and practical guide for applied researchers using complex survey data to estimate causal effects and model discrimination measures under propensity- and prognostic-score weighting.

svyCausalGLM package

Several R packages support causal and regression analyses under either complex survey designs or confounding adjustment, but typically not both simultaneously. The survey package enables design-based regression using sampling weights, strata, and clustering, ensuring unbiased population-level inference under complex survey designs. In contrast, packages for propensity score methods, such as TWANG ([Ridgeway et al, 2022] ) and PSweight ([Zhou et al, 2020)]) those described in Propensity Score Weighting in R, primarily focus on confounding adjustment using propensity-based weights, generally assuming independent and identically distributed observations and not explicitly incorporating survey design features. As a result, existing software typically applies survey weights alone or propensity/prognostic weights alone, with limited guidance or implementation for combining these approaches within a unified analytical framework. This gap is particularly relevant for observational studies using nationally representative health surveys, where valid inference requires both adjustment for complex sampling designs and control of confounding due to observed covariates. Recently, Ahmed and Dwivedi (2026) proposed a Stata module that integrates propensity score and prognostic score weighting with survey weights, allowing researchers to simultaneously account for complex survey designs and confounding, while also removing additional covariate effects on the exposure or outcome. Building on this methodological framework, the svyCausalGLM package provides an R-based implementation of survey-weighted generalized linear models that incorporate propensity score weighting, prognostic score weighting, or standard covariate adjustment within a design-based inference paradigm.

What does this Vignette include?

This vignette demonstrates the use of svyCausalGLM for conducting survey-weighted regression analyses with additional causal adjustments, enabling estimation of adjusted associations between exposure and outcome that are both design-consistent and confounding-adjusted. The package also provides tools for assessing model discrimination through the estimation of the area under the receiver operating characteristic curve (AUC), along with corresponding confidence intervals, after incorporating survey design features and causal weighting adjustments. The svyCausalGLM package provides functions to fit survey-weighted logistic regression models using design-based inference. It supports propensity-score weighting and prognostic score adjustment while accounting for complex survey design (PSU, strata, weights). This vignette demonstrates the workflow for: final_svyglm() – Standard survey-weighted GLM final_prop_svyglm() – Propensity-weighted survey GLM (binary, multinomial, or continuous exposures) final_prog_svyglm() – Prognostic-weighted survey GLM viz_auc_svyglm()-Estimating and Plotting AUC ## Example 1 A synthetic dataset with 800 observations was generated to resemble a complex survey, including PSU, strata, and sampling weights. The data contain a binary exposure, continuous and categorical covariates (age, sex, BMI), and a binary outcome generated from a logistic model. The dataset is used for demonstration purposes only.

set.seed(123)
n <- 800
synthetic_svy <- data.frame(
  psu     = sample(1:80, n, replace = TRUE),
  strata  = sample(1:40, n, replace = TRUE),
  weight  = runif(n, 0.5, 3),
  exposure = rbinom(n, 1, 0.45),
  age     = round(rnorm(n, 50, 12)),
  sex     = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  bmi     = round(rnorm(n, 27, 4), 1)
)
linpred <- -2 + 0.8 * synthetic_svy$exposure + 0.03 * synthetic_svy$age
synthetic_svy$outcome <- rbinom(n, 1, plogis(linpred))
head(synthetic_svy)
##   psu strata   weight exposure age    sex  bmi outcome
## 1  31     39 2.165017        0  36 Female 31.2       1
## 2  79     23 2.333317        0  86 Female 24.4       1
## 3  51      6 1.286851        1  45   Male 26.2       1
## 4  14     32 2.169295        1  57 Female 28.3       1
## 5  67     39 1.659717        0  61   Male 27.2       0
## 6  42     16 2.125707        1  46 Female 28.9       0

final_svyglm(): Survey-Weighted Logistic Regression

final_svyglm() fits a design-based survey-weighted logistic regression model using svyglm() with a quasi-binomial family, incorporating sampling weights, stratification, and clustering (PSUs) to produce population-representative estimates under complex survey designs (Lumley, 2010). The function estimates adjusted associations between a binary outcome and covariates, returning odds ratios, confidence intervals, and p-values in a publication ready format. It also producing specific twy-way interaction terms which facilitates assessing interaction among covariates. Unlike final_prop_svyglm() and final_prog_svyglm(), which further adjust for confounding through propensity score or prognostic score weighting, final_svyglm() relies solely on the original survey design weights for design-based inference. The function structure is given by

#final_svyglm(data, dep_var, covariates, id_var,  strata_var, weight_var, family = "binomial", level = 0.95, interaction_terms = NULL) 

Argument description

Argument Description
data Data frame containing survey data
dep_var Character. Binary outcome variable (0/1)
covariates Character vector of covariate names
id_var Character. Primary sampling unit (PSU) / cluster variable
strata_var Character. Strata variable
weight_var Character. Survey sampling weights
family Model family (currently "binomial" only)
level Confidence level for intervals (default = 0.95)
interaction_terms Optional interaction terms (e.g., "age:sex")
————————————————————————————

Important requirements

To ensure valid design-based inference, the following conditions must be met when using final_svyglm():

  1. Binary outcome: –The outcome variable specified by dep_var must be binary and coded as 0/1. Valid survey design variables The dataset must include: (i) a primary sampling unit (PSU) variable (id_var), (ii) a stratification variable (strata_var), and (iii) a positive survey weight variable (weight_var).
  2. Covariates present and non-missing: All covariates specified in covariates must exist in the data. Observations with missing values in the outcome, covariates, or survey design variables are automatically excluded prior to model fitting.
  3. Design-based interpretation: The model is fit using survey::svyglm() and supports design-based inference. Estimates reflect population-average associations, not prediction or causal effects.
  4. Model family: The function currently supports logistic regression via the quasi-binomial family, which provides robust variance estimation under complex survey designs.

Model output

final_svyglm() returns a list of class svyCausal containing the following elements:

model: The fitted survey-weighted logistic regression model (svyglm object), incorporating sampling weights, stratification, and clustering.

OR_table: A publication-ready table reporting odds ratios, confidence intervals, and p-values for all model covariates.

outcome: The observed binary outcome vector used in the fitted model after applying complete-case filtering.

predictions: Predicted outcome probabilities from the survey-weighted model.

final_weights: The final analysis weights extracted from the survey design object used in model estimation.

design: The survey design object (svydesign) used to fit the model, allowing further post-estimation analysis if needed.

Implementation of final_svyglm()

fit_main<-final_svyglm(data=synthetic_svy,
                         dep_var="outcome",
                         covariates = c("age", "sex", "bmi"),
                         id_var      = "psu",           # Changed from psu_var
                         strata_var  = "strata",
                         weight_var  = "weight",
                         family = "binomial",
                         level = 0.95,
                         interaction_terms = NULL)
fit_main$OR_table
##      Variable        OR     CI_low   CI_high      p_value
## 1 (Intercept) 0.1598227 0.04449707 0.5740445 5.009205e-03
## 2         age 1.0379061 1.02419482 1.0518008 5.641294e-08
## 3     sexMale 1.0464121 0.77018588 1.4217065 7.714083e-01
## 4         bmi 0.9923445 0.95355357 1.0327134 7.052279e-01

## Survey-Weighted Propensity Score Weight Adjustment

Categorical Exposure

For binary and multinomial exposures, final_prop_svyglm() first estimates propensity scores using logistic or multinomial regression models, respectively. Inverse probability of treatment weights (IPTW) are constructed and stabilized before being combined with the survey sampling weights. The final analysis is conducted using a survey-weighted generalized linear model, ensuring both confounding adjustment and valid design-based inference.

Continous Exposure

For continuous exposures, final_prop_svyglm() implements a stabilized generalized propensity score (GPS) approach, where the exposure is modeled using linear regression conditional on covariates. The conditional density of the exposure is evaluated under both the covariate-adjusted model and a null model, and the ratio of these densities is used to construct stabilized inverse probability weights. These weights are then combined multiplicatively with the survey sampling weights and used in a survey-weighted final model. This approach follows the methodology proposed by Ahmed and Dwivedi (2026), adapted here for use with complex survey designs via survey::svydesign(). final_prop_svyglm() fits a survey-weighted logistic regression model using complex survey design information (PSU, strata, weights) and applies inverse probability of treatment weighting (IPTW) for binary, multinomial, or continuous exposures.

How to use final_prop_svyglm()

#final_prop_svyglm(data,  dep_var,  covariates,  exposure,  id_var,   strata_var,  weight_var, exposure_type = "binary", outcome_covariates = NULL, level = 0.95)

Argument Description

Argument Description
data Data frame containing survey data
dep_var Character. Binary outcome variable (0/1)
covariates Character vector of covariate names
exposure Character. Treatment or exposure variable
id_var Character. Primary sampling unit (PSU) / cluster variable
strata_var Character. Strata variable
weight_var Character. Survey sampling weights
exposure_type Character. "binary", "multinomial", or "continuous"
outcome_covariates Character vector of additional covariate names
level Numeric. Confidence level for intervals (default = 0.95)
... Additional arguments passed to svyglm()
—————————————————————————————

Important requirement

Before using final_prop_svyglm(), ensure the following:

  1. Binary outcome: The dependent variable must be binary (0/1).

  2. Exposure: Can be binary, multinomial (factor), or continuous.

  3. Covariates: Must exist in the dataset; used for estimating propensity scores.

  4. Survey design: PSU, strata, and weights must be available and correctly specified.

  5. No missing values: Missing values in the outcome, exposure, or covariates are removed automatically with a warning.

  6. Sufficient data per category: Each level of exposure should have enough observations to avoid unstable estimates (default threshold <5 triggers a warning).

Model output

model: Survey-weighted GLM fitted to the outcome using the final weights.

OR_table: Odds ratios with confidence intervals and p-values for each covariate and exposure, reflecting the adjusted association.

outcome: Observed outcome values used in the model.

predictions: Predicted probabilities for each observation from the fitted model.

final_weights: Combined survey and propensity/prognostic weights used in the model fitting.

Note: The S3 print() method displays only the OR_table for concise reporting in publications

Example 2

A synthetic data set with n=1500 on all required variables including respondent specific identity, strata, survey weights, covariates (age and sex), and a binary exposure. The binary outcome is generated from binomial distribution with moderate probability which depends on age and exposure variables.

set.seed(123)
n <- 1500
dat <- data.frame(
  psu = sample(1:10, n, replace = TRUE),
  strata = sample(1:5, n, replace = TRUE),
  weight = runif(n, 0.5, 2),
  age = rnorm(n, 50, 10),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  exposure_bin = rbinom(n, 1, 0.5)
)
dat$outcome <- rbinom(n, 1, plogis(-2 + 0.03*dat$age + 0.5*dat$exposure_bin))
## Multinomial exposure
dat$exposure_3cat <- cut(dat$age,
                         breaks = quantile(dat$age, probs = c(0, 1/3, 2/3, 1)),
                         labels = c("Low", "Medium", "High"),
                         include.lowest = TRUE)
exp_eff <- ifelse(dat$exposure_3cat == "Low", 0,
                  ifelse(dat$exposure_3cat == "Medium", 0.6, 1.2))
dat$outcome_cat <- rbinom(n, 1, plogis(-3 + 0.02 * dat$age + 0.4*(dat$sex=="Male") + exp_eff))

Implementation of final_prop_svyglm()

## Binary exposure
fit_bin <- final_prop_svyglm(
  dat,
  dep_var = "outcome",
  covariates = c("age", "sex"),
  exposure = "exposure_bin",
  id_var = "psu",
  strata_var = "strata",
  weight_var = "weight"
)
fit_bin$OR_table
##       Variable        OR   CI_low  CI_high      p_value
## 1  (Intercept) 0.5254107 0.453905 0.608181 2.386256e-11
## 2 exposure_bin 2.0955821 1.618936 2.712562 7.151206e-07
## Continuous exposure
fit_cont <- final_prop_svyglm(
  dat,
  dep_var = "outcome",
  covariates = c("sex"),
  exposure = "age",
  id_var = "psu",
  strata_var = "strata",
  weight_var = "weight",
  exposure_type = "continuous"
)
fit_cont$OR_table
##      Variable        OR    CI_low   CI_high      p_value
## 1 (Intercept) 0.1912251 0.1207407 0.3028559 4.909163e-09
## 2         age 1.0283275 1.0189662 1.0377748 1.990557e-07
## Multinomial exposure
fit_multi <- final_prop_svyglm(
  dat,
  dep_var = "outcome_cat",
  covariates = c("age", "sex"),
  exposure = "exposure_3cat",
  id_var = "psu",
  strata_var = "strata",
  weight_var = "weight",
  exposure_type = "multinomial"
)
## # weights:  12 (6 variable)
## initial  value 1647.918433 
## iter  10 value 374.020705
## iter  20 value 115.910795
## iter  30 value 68.979377
## iter  40 value 41.468979
## iter  50 value 38.933063
## iter  60 value 32.616794
## iter  70 value 26.129921
## iter  80 value 23.198604
## iter  90 value 20.665621
## iter 100 value 18.692511
## final  value 18.692511 
## stopped after 100 iterations
fit_multi$OR_table
##              Variable        OR     CI_low   CI_high      p_value
## 1         (Intercept) 0.1278086 0.09465193 0.1725801 2.054922e-17
## 2 exposure_3catMedium 2.4095319 1.65964401 3.4982467 2.234288e-05
## 3   exposure_3catHigh 5.3007703 3.67644241 7.6427598 1.043197e-11

Survey-Weighted Prognostic Score Weight Adjustment

As an alternative to propensity score adjustment, svyCausalGLM also supports prognostic score-adjusted survey-weighted analysis. Prognostic scores summarize the association between covariates and the outcome, allowing adjustment for high-dimensional confounding and improved efficiency in outcome modeling. The prognostic score implementation follows the full-sample stabilized prognostic weighting framework described in Ahmed and Dwivedi (2026) and Hajian-Tilaki (2018), and is integrated with complex survey design features through survey-weighted estimation. final_prog_svyglm() fits a survey-weighted logistic regression model using complex survey design information (PSU, strata, weights).
It is designed for design-based inference, not prediction. The function returns a publication-ready table including odds ratios, confidence intervals, p-values, and a survey-weighted AUC (Somers’ C).

How to use final_prog_svyglm()

##final_prog_svyglm(data, dep_var, covariates, exposure, id_var,  strata_var,  weight_var, exposure_type = "binary", level = 0.95, ...)

Argument description

Argument Description
data Data frame containing survey data
dep_var Character. Binary outcome variable (0/1)
covariates Character vector of covariate names
exposure Character. Treatment or exposure variable
id_var Character. Primary sampling unit (PSU) / cluster variable
strata_var Character. Strata variable
weight_var Character. Survey sampling weights
exposure_type Character. "binary", "continuous", or "multinomial"
level Confidence level for intervals (default = 0.95)
... Additional arguments passed to svyglm()
————————————————————————————–

Important requirements

  1. Complex survey design variables: The input data must contain valid identifiers for primary sampling units (id_var), stratification (strata_var), and sampling weights (weight_var) compatible with survey::svydesign().

  2. Binary outcome variable: The outcome specified by dep_var must be binary (coded 0/1) and contain both events and non-events; models cannot be fitted if only one outcome level is present.

  3. Exposure variable: The exposure variable may be binary, categorical, or continuous and must be correctly coded and present for all included observations.

  4. Baseline covariates for prognostic modeling Covariates supplied in covariates should represent pre-exposure variables used to estimate the prognostic score and must not include post-exposure variables.

  5. Optional outcome covariates Additional covariates may be included in the final outcome model via outcome_covariates, provided they are not used to construct the prognostic score.

  6. Complete-case data for modeling variables Observations with missing values in the outcome, exposure, or covariates used in the prognostic or outcome models are removed prior to analysis.

  7. Correct specification of survey weights: The survey weights must be positive and finite; stabilized prognostic weights are multiplied by the base survey weights to preserve design-based inference.

  8. Sufficient sample size and outcome frequency: Very small numbers of events or non-events may lead to unstable estimates; warnings are issued when sparse outcomes are detected.

  9. Independence within sampling units: Observations are assumed independent within PSUs after accounting for the specified survey design.

  10. Design-based inference objective: The function is intended for population-level association estimation under complex survey designs rather than individual-level prediction.

prognostic_model A survey-weighted logistic regression model (svyglm) used to estimate the prognostic score, fitted using baseline covariates and the complex survey design. ## Model outputs –final_model: A survey-weighted logistic regression model (svyglm) for the association between the exposure and outcome, fitted using prognostic-adjusted survey weights.

OR_table: A publication-ready table summarizing exponentiated regression coefficients (odds ratios), confidence intervals, and p-values from the final model.

outcome: The binary outcome vector used in the final model after applying complete-case filtering.

predictions: Predicted outcome probabilities from the final prognostic-weighted survey model.

final_weights: The combined prognostic and survey weights applied in the final analysis.

Implementation of final_prog_svyglm()

fit_prog <- final_prog_svyglm(
  data        = synthetic_svy,
  dep_var     = "outcome",
  exposure    = "exposure",      # Added (required by your function)
  covariates  = c("age", "sex", "bmi"), # Changed from indep_vars
  id_var      = "psu",           # Changed from psu_var
  strata_var  = "strata",
  weight_var  = "weight"
)
fit_prog$OR_table
##      Variable        OR    CI_low  CI_high      p_value
## 1 (Intercept) 0.8628626 0.6880157 1.082144 2.013395e-01
## 2    exposure 2.6671830 1.9341982 3.677940 3.358171e-09

Tabulation and Visualization of Survey-weighted AUC

AUC estimation is crucial to assess the predictive performance of competing models even if the goal of the study is to assess the true association between the variables. Existing AUC estimation procedures although provide the AUC with confidence interval the do not allow survey-weight directly and can results bias performance measures [Ahmed and Dwivedi, 2026) and reference therin]. The weighted AUC (using any of three weighting schemes discussed above) is computed using trapezoidal integration of the weighted ROC curve and corresponds to the probability that a randomly selected weighted case has a higher predicted risk than a control. Variance estimation follows the Hanley–McNeil (1982) approximation adapted for weighted data, enabling valid confidence intervals (see Ahmed and Dwivedi (2026) for same formulation in stata). The viz_auc_svyglm() function extracts predictions, outcomes, and weights from any of the fit object discussed in this package, constructs the weighted ROC curve, computes weighted AUC, and generates a publication-ready plot. The viz_auc_svyglm() function produces a weighted ROC curve and reports a weighted AUC using survey-based models. It is intended to evaluate the discriminative ability of survey-weighted logistic regression models. All computations respect the supplied analysis weights and remain valid under unequal probability sampling and post-stratification, assuming the weights correctly reflect the survey design.

Statistical details: Weighted ROC and AUC

Let \(\hat{p}_i\) denote the predicted probability for subject \(i\), \(Y_i \in \{0,1\}\) the observed outcome, and \(w_i\) the final analysis weight.

After sorting observations in decreasing order of \(\hat{p}_i\), the weighted true positive rate (TPR) and false positive rate (FPR) at threshold \(t\) are defined as:

\[ \text{TPR}(t) = \frac{\sum_{i: \hat{p}_i \ge t} w_i Y_i} {\sum_i w_i Y_i}, \quad \text{FPR}(t) = \frac{\sum_{i: \hat{p}_i \ge t} w_i (1 - Y_i)} {\sum_i w_i (1 - Y_i)}. \]

The weighted area under the curve (AUC) is obtained by trapezoidal integration of TPR with respect to FPR. This estimator corresponds to the probability that a randomly selected weighted case has a higher predicted risk than a randomly selected weighted control.

Variance estimation follows the approximation proposed by Hanley and McNeil (1982), adapted by replacing sample counts with weighted counts of cases and controls.

How to use viz_auc_svyglm()

##viz_auc_svyglm(fit_object, title = "Weighted ROC Curve", line_color = "#0072B2")

Argument Description

Argument Description
fit_object An object of class “svyCausal” returned by final_svyglm(),
final_prop_svyglm() or final_prog_svyglm(),
and contain outcome, predicted probabilities, and final analysis weights.
title Character. Title of the ROC plot. Default is "Weighted ROC Curve".
line_color Character. Color of the ROC curve line. Default is "#0072B2".

Important requirements

fit_object must contain the following named elements:

  1. outcome: binary outcome vector coded as 0/1

  2. predictions: predicted probabilities from a survey-weighted logistic model

  3. final_weights: final analysis weights (e.g., IPTW × survey weights)

  4. Lengths of outcome, predictions, and final_weights must be equal after removal of missing values.

The ROC and AUC are computed using:

– Probability-weighted empirical distributions

– Trapezoidal integration, equivalent to Stata’s senspec

– A variance approximation based on Hanley & McNeil (1982)

The function is appropriate for:

–Complex survey designs

– Inverse probability weighted (IPW) causal models

Function outputs

plot: A ggplot2 ROC curve with Weighted False Positive Rate (FPR) and Weighted True Positive Rate (TPR) – `stats’: A table including AUC value, its standard error and 95% confidence intervals.

Implementation of viz_auc_svyglm()

viz_auc_svyglm(fit_main)
## $plot

## 
## $stats
##       AUC        SE       LCI       UCI 
## 0.6136489 0.0150437 0.5841633 0.6431346
viz_auc_svyglm(fit_bin)
## $plot

## 
## $stats
##         AUC          SE         LCI         UCI 
## 0.598019079 0.009356339 0.579680654 0.616357504
viz_auc_svyglm(fit_prog)
## $plot

## 
## $stats
##        AUC         SE        LCI        UCI 
## 0.62003454 0.01439754 0.59181536 0.64825372

Choosing the appropriate function

Reproducibility

All examples in this vignette are fully reproducible and rely only on simulated datasets. Random seeds are set where applicable.

Limitations

The methods implemented in svyCausalGLM rely on correct specification of the propensity or prognostic score models and assume no unmeasured confounding. The weighted AUC variance estimator is based on large-sample approximations and may be unstable in settings with extremely sparse outcomes or highly variable weights. Users are encouraged to assess weight distributions and conduct sensitivity analyses where appropriate.

References

–Austin PC. “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.” Multivariate Behavioral Research, 2011;46:399–424.

–Lunceford JK, Davidian M. “Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study.” Statistics in Medicine, 2004;23:2937–2960.

–Lumley T. Complex Surveys: A Guide to Analysis Using R. Wiley, 2010.

– Ahmed and Dwivedi. (2026):Computing adjusted effect size, area under the curve, and c-statistic for evaulating the association between uric acid and mortality in US adults using unweighted and survey-weighted regression, propensity, and prognostic score analyses.

–Ridgeway, Greg, Daniel F. McCaffrey, Andrew R. Morral, Matthew Cefalu, Lane F. Burgette, Joseph D. Pane, and Beth Ann Griffin, Toolkit for Weighting and Analysis of Nonequivalent Groups: A Tutorial for the R TWANG Package. Santa Monica, CA: RAND Corporation, 2022. https://www.rand.org/pubs/tools/TLA570-5.html.

–Zhou, T., Tong, G., Li, F., & Thomas, L. E. (2020). PSweight: an R package for propensity score weighting analysis. arXiv preprint arXiv:2010.08893.

–Hajian-Tilaki, K. (2018). Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian Journal of Internal Medicine.

–Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.

mirror server hosted at Truenetwork, Russian Federation.