Statistical Methodology

Overview

This vignette describes the statistical model, likelihood, prior distributions, and MCMC algorithm implemented in hhdynamics. For a practical workflow guide, see vignette("hhdynamics-intro").

The model is a discrete-time Bayesian household transmission model that separates community and within-household infection sources. It was originally described in Tsang et al. (2014) and has been extended here to support covariate effects, missing data imputation, and optional serial interval estimation.

Model

Study design

The data come from a case-ascertained household study. Each household is enrolled through an index case (the first identified case), and household contacts are followed prospectively for a defined period. The observed data for each contact include whether they became infected and, if so, the day of symptom onset. Covariates (age, sex, etc.) may be recorded for each individual.

Infection hazard

For household contact \(i\) in household \(h\) on day \(t\), the hazard of infection is:

\[ \lambda_i(t) = \left[\beta_c + \sum_{\substack{j \in \mathcal{I}_h \\ j \neq i}} \frac{\beta_h \cdot w(t - t_j) \cdot \exp(\mathbf{x}_j^\top \boldsymbol{\alpha}_{\text{inf}} + \varepsilon_j)}{(n_h - 1)^\gamma}\right] \cdot \exp(\mathbf{z}_i^\top \boldsymbol{\alpha}_{\text{sus}}) \]

where:

The daily probability that contact \(i\) becomes infected on day \(t\), given they are still susceptible, is:

\[ P(\text{infected on day } t \mid \text{susceptible at } t) = 1 - \exp(-\lambda_i(t)) \]

Covariates

Covariate effects enter the hazard multiplicatively on the log scale:

Factor covariates are encoded as dummy variables using treatment contrasts (the first level is the reference).

Likelihood

Individual likelihood

For contact \(i\) in household \(h\), let \(T_i\) be the infection day (if infected) or the end of follow-up plus one (if not infected), and let \(s_h\) be the start of the observation period (the index case’s onset day). Let \(\delta_i = 1\) if infected, \(0\) otherwise.

The log-likelihood contribution of contact \(i\) is:

\[ \ell_i = -\sum_{t=s_h}^{T_i - 2} \lambda_i(t) + \delta_i \cdot \log\left(1 - \exp(-\lambda_i(T_i - 1))\right) \]

The first term captures the survival probability (not being infected on days before \(T_i\)), and the second term captures the probability of infection on day \(T_i\) (only included if the contact was actually infected).

Full likelihood

The full data log-likelihood is the sum over all non-index contacts in all households:

\[ \ell(\boldsymbol{\theta}) = \sum_{h=1}^{H} \sum_{i=1}^{n_h - 1} \ell_i \]

where \(\boldsymbol{\theta} = (\beta_c, \beta_h, \gamma, \boldsymbol{\alpha}_{\text{inf}}, \boldsymbol{\alpha}_{\text{sus}}, \sigma_\varepsilon)\) is the full parameter vector.

When random effects are included, the likelihood also includes the random effects density:

\[ \ell_{\text{RE}} = \sum_{h=1}^{H} \sum_{j \in \mathcal{I}_h} \log \phi(\varepsilon_j; 0, \sigma_\varepsilon^2) \]

where \(\phi(\cdot; 0, \sigma^2)\) is the normal density.

Prior distributions

Parameter Prior Description
\(\sigma_\varepsilon\) Uniform(0.009, 5) RE standard deviation (fixed at 1 when random effects are disabled)
\(\beta_c\) Uniform(\(10^{-18}\), 9.99) Community infection rate
\(\beta_h\) Uniform(\(10^{-18}\), 9.99) Household transmission rate
\(\gamma\) Uniform(0, 1) Household size parameter (currently fixed at 0)
\(\alpha_{\text{inf},k}\) Normal(0, 3) Each infectivity covariate effect
\(\alpha_{\text{sus},k}\) Normal(0, 3) Each susceptibility covariate effect
Weibull shape Uniform(0.1, 10) SI shape (only when estimate_SI = TRUE)
Weibull scale Uniform(0.1, 20) SI scale (only when estimate_SI = TRUE)

The uniform priors on \(\beta_c\) and \(\beta_h\) are deliberately wide and effectively non-informative for realistic transmission rates. The Normal(0, 3) prior on covariate effects is weakly informative — it assigns most probability mass to relative risks between \(\exp(-6) \approx 0.002\) and \(\exp(6) \approx 400\), which is very permissive for epidemiological covariate effects.

MCMC algorithm

Metropolis-Hastings sampler

Parameters are updated one at a time via Metropolis-Hastings with Gaussian random-walk proposals. At iteration \(b\):

  1. For each parameter \(\theta_k\) with a free indicator (move[k] = 1):
    1. Propose \(\theta_k^* = \theta_k^{(b-1)} + \epsilon\), where \(\epsilon \sim \text{Normal}(0, \sigma_k^2)\).
    2. Compute the log acceptance ratio: \[ r = \left[\ell(\boldsymbol{\theta}^*) + \log \pi(\boldsymbol{\theta}^*)\right] - \left[\ell(\boldsymbol{\theta}^{(b-1)}) + \log \pi(\boldsymbol{\theta}^{(b-1)})\right] \]
    3. Accept with probability \(\min(1, \exp(r))\).

If the proposed parameter falls outside the prior support, the prior log-density is \(-\infty\) and the proposal is automatically rejected.

Adaptive proposal tuning

After the first 500 iterations, proposal standard deviations \(\sigma_k\) are set to the empirical posterior standard deviation of the chain so far, with multiplicative adjustments based on acceptance rates:

Acceptance rate Adjustment
< 10% \(\sigma_k \times 0.5\)
10–15% \(\sigma_k \times 0.8\)
15–20% \(\sigma_k \times 0.95\)
20–30% No change (target range)
30–40% \(\sigma_k \times 1.05\)
40–90% \(\sigma_k \times 1.2\)
> 90% \(\sigma_k \times 2.0\)

This adaptive scheme typically achieves stable acceptance rates within 1000–2000 iterations.

Data augmentation

At each MCMC iteration, after updating all parameters, the following latent variables are updated for each household member via a joint Metropolis-Hastings step:

  1. Infection times: For infected non-index contacts, a new onset time is proposed uniformly over the follow-up window. The proposal is accepted or rejected based on the full household likelihood ratio.

  2. Random effects (when enabled): For each infected individual, a new random effect \(\varepsilon_j^*\) is proposed from \(\varepsilon_j + \text{Normal}(0, \sigma_\varepsilon)\).

  3. Missing covariates: For individuals with missing factor covariates, a new level is proposed uniformly from all levels of the factor. This is a symmetric proposal, so no correction term is needed.

All three augmentation steps are proposed jointly for each household member and accepted or rejected together. This joint update is more efficient than separate updates because the onset time, random effect, and covariate value are correlated in the posterior. The augmentation step is parallelized across households using RcppParallel.

Serial interval estimation

When estimate_SI = TRUE, two additional parameters (Weibull shape and scale) are added to the MCMC. At each iteration, the serial interval PMF is recomputed from the current Weibull parameters:

\[ w(d) = F_W(d + 1; k, \lambda) - F_W(d; k, \lambda), \quad d = 1, 2, \ldots, 14 \]

where \(F_W(\cdot; k, \lambda)\) is the Weibull CDF with shape \(k\) and scale \(\lambda\). The Weibull parameters are updated via the same Metropolis-Hastings random-walk scheme as the other parameters.

Convergence diagnostics

Because the sampler uses a single chain with adaptive tuning, standard multi-chain diagnostics (such as \(\hat{R}\)) are not directly applicable. Instead, we recommend:

  1. Trace plots: plot_diagnostics(fit) shows trace plots and posterior densities for all parameters. Look for stationarity after burn-in and good mixing (no long excursions or sticky regions).

  2. Effective sample size (ESS): plot_diagnostics(fit, show_ess = TRUE) or table_parameters(fit, show_ess = TRUE) report ESS using the initial positive sequence estimator (Geyer, 1992). ESS values below 100 suggest poor mixing; consider increasing n_iteration or investigating identifiability.

  3. Acceptance rates: table_parameters(fit) shows acceptance rates. Rates outside 15–40% indicate poor tuning, which may occur if the burn-in was insufficient for the adaptive scheme to converge. In practice, the default burn-in of 5000 iterations is usually sufficient.

  4. Log-likelihood trace: fit$log_likelihood stores the full log-likelihood trace (before thinning) for visual inspection.

Interpretation of parameters

Base parameters

Covariate effects

Covariate effects are estimated on the log scale. The summary() method reports both the raw (log-scale) estimates and the exponentiated estimates:

For example, if the susceptibility effect for age group “6-17” is \(\hat{\alpha} = -0.3\) with 95% CrI \((-0.8, 0.2)\), then \(\exp(\hat{\alpha}) = 0.74\) with 95% CrI \((0.45, 1.22)\). This means children aged 6-17 have an estimated 26% lower susceptibility than the reference group, but the credible interval includes 1 (no difference).

Serial interval parameters

When estimate_SI = TRUE, the output includes si_shape and si_scale — the Weibull shape (\(k\)) and scale (\(\lambda\)) parameters. The mean serial interval is \(\lambda \cdot \Gamma(1 + 1/k)\) and the variance is \(\lambda^2[\Gamma(1 + 2/k) - \Gamma(1 + 1/k)^2]\).

Computational details

References

Tsang TK, Lau EHY, Cauchemez S, Cowling BJ. Association between antibody titers and protection against influenza virus infection within households. Journal of Infectious Diseases. 2014;210(5):684–692. https://doi.org/10.1093/infdis/jiu186

Geyer CJ. Practical Markov chain Monte Carlo. Statistical Science. 1992;7(4):473–483.

mirror server hosted at Truenetwork, Russian Federation.