---
title: "Simulation Validation of the Mixed-Subjects MML Estimator"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Simulation Validation of the Mixed-Subjects MML Estimator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", echo = FALSE)
library(knitr)

# Load a scenario's $summary from simulations/results/<name>.rds if present
# (so the tables reflect the most recent run); otherwise fall back to the cached
# values below. The simulations/ tree is not shipped with the package, so on a
# clean install the cached values are shown.
load_summary <- function(name) {
  candidates <- c(
    file.path("..", "simulations", "results", paste0(name, ".rds")),
    file.path("simulations", "results", paste0(name, ".rds"))
  )
  for (p in candidates) {
    if (file.exists(p)) {
      s <- tryCatch(readRDS(p)$summary, error = function(e) NULL)
      if (!is.null(s)) return(s)
    }
  }
  NULL
}

show_cols <- function(df, cols, ...) {
  cols <- intersect(cols, names(df))
  kable(df[, cols, drop = FALSE], row.names = FALSE, ...)
}
```

This vignette presents a simulation study validating the scalar 2PL marginal maximum-likelihood (MML) PPI++ estimator (`fit_mixed_subjects_mml`) and its ability-risk tuning (`tune_lambda_ability_risk`). The full reproducible harness lives in the package's `simulations/` directory; the tables below load the most recent run from `simulations/results/` when available and otherwise show cached values. Rep counts: $\lambda$-selection 100, coverage 200, downstream 100, cross-fit 100.

The study answers five questions:

1. Does $\lambda$-selection track predictor quality?
2. Do standard errors achieve appropriate coverage?
3. Does the method improve downstream scoring?
4. What is the role of cross-fitting?
5. Is coverage valid at the tuned $\lambda$?

## Design

Human responses come from a true 8-item 2PL model (`a ∈ [0.8, 1.6]`, `d ∈ [-1.1, 1.1]`) with `n = 400` human subjects and `N = 1200` generated subjects, abilities drawn from a standard normal. Four predictor regimes vary the quality of the paired LLM predictions, `F`:

```{r design-table}
design <- data.frame(
  Regime = c("R1: perfect prediction (F=Y)", "R2: same-DGP draw",
             "R3: independent noise", "R4: LLM shift"),
  `Predicted F` = c(
    "predicted = observed (exact)",
    "fresh binary draw from the true model",
    "binary draw from scrambled item parameters",
    "binary draw from attenuated/shifted parameters"),
  `Role` = c("perfect predictor", "modest real signal",
             "uninformative LLM", "biased but informative LLM"),
  check.names = FALSE
)
kable(design, caption = "Four predictor regimes (all binar responses).")
```

All responses (observed and predicted) are binary $Y_{ij}, F_{ij} \in \{0,1\}$. Note that the package does not currently accept probability predictions for `predicted`/`generated` (see the note at the end of Scenario 1) or polytomous responses. The control parameter $\lambda \in [0,1]$ sets how strongly the generated LLM sample is used after subtracting the paired LLM correction; $\lambda = 0$ is a human-only calibration.

## Does $\lambda$-selection track predictor quality?

For each regime we tune $\lambda$ by ability-score risk and record the selected value;
we also report the theoretical PPI++-score $\lambda$ for reference.

Here $\lambda$ is selected by `tune_lambda_ability_risk()`, which optimizes the risk directly. We do not bother using the cross-fit estimator, as we only are concerned with $\lambda$ magnitude.

```{r lambda-table}
lam <- load_summary("lambda_selection")
if (is.null(lam)) {
  lam <- data.frame(
    label       = c("R1: perfect (F=Y)", "R2: same-DGP draw",
                    "R3: independent noise", "R4: LLM shift"),
    mean_risk   = c(0.750, 0.119, 0.063, 0.105),
    median_risk = c(0.750, 0.108, 0.040, 0.104),
    prop_zero   = c(0.00, 0.07, 0.35, 0.16),
    mean_ppi    = c(0.750, 0.004, 0.000, 0.002)
  )
}
show_cols(lam, c("label", "mean_risk", "median_risk", "prop_zero", "mean_ppi"),
          caption = "Selected lambda by regime (ability-risk tuning).")
```

**Takeaways:**

- R1 (perfect predictor) produces $\lambda = 0.750$, exactly `N / (n + N) = 1200 / 1600`, the theoretical maximum for a perfect paired predictor at these values of $n$ and $N$. The PPI++ estimated $\lambda$ from minimizing $\text{Tr}\big [ \Sigma_\gamma \big]$ agrees.
- R3 (independent noise) produces $\lambda ≈ 0.06$ (median 0.04; about a third of reps select exactly 0). The uninformative predictor is correctly down-weighted to near zero. The small positive median is the optimizer resolving a shallow, noisy risk minimum.
- R2 (same-DGP) and R4 (LLM shift) produce $\lambda \approx 0.11$. Fresh real responses (R2) and a correlated-but-biased LLM (R4) each carry some score-level signal, and the criterion makes some use of it.

**Note that predictions must be sampled responses, not probabilities.** We require `predicted`/`generated` responses and reject probability inputs. This is not a convenience restriction. The response vector enters inside a log-sum over quadrature points, so feeding probabilities breaks the identity $\mathbb{E}\big[\nabla L_{gen}\big] = \mathbb{E}\big[\nabla L_{pred}\big]$ that makes the PPI loss correction mean-zero. The estimator then becomes biased at $\lambda > 0$, and at moderate $\lambda$ the objective is unbounded for discrimination parameters, causing the fit to diverge. Practically: if you have model-derived probabilities, sample responses from them (e.g. `rbinom`) before calibrating.

## Do standard errors achieve appropriate coverage?

We fix $\lambda$ = 0.5 and compare empirical coverage of Wald intervals built from two covariance estimates of the same fit: the Louis-corrected marginal sandwich (`vcov()` dispatch) and the EM complete-data Hessian bread (`vcov_mixed_subjects()`).

Note that the PPI correction `$\lambda$(L_gen − L_pred)` is mean-zero whenever the paired and generated pseudo-responses are drawn from the same distribution with the same ability spread, so the human term anchors the estimand to the true parameters. This holds for every simulation condition: R1, R2, R3 (useless LLM) and R4 (biased LLM). As such, the estimator stays consistent for the truth even when the LLM is uninformative or biased.

```{r coverage-table}
cov <- load_summary("coverage")
if (is.null(cov)) {
  cov <- data.frame(
    label        = c("R1 perfect (F=Y)", "R2 same-DGP draw",
                     "R3 independent noise", "R4 LLM shift"),
    louis_cov_90 = c(0.909, 0.916, 0.914, 0.905),
    louis_cov_95 = c(0.955, 0.957, 0.961, 0.954),
    em_cov_90    = c(0.713, 0.727, 0.721, 0.726),
    em_cov_95    = c(0.787, 0.797, 0.795, 0.792),
    mean_se_ratio = c(1.626, 1.616, 1.651, 1.622)
  )
}
show_cols(cov, c("label", "louis_cov_90", "louis_cov_95",
                 "em_cov_90", "em_cov_95", "mean_se_ratio"),
          caption = paste("Item-parameter CI coverage, all four regimes",
            "(200 reps). Nominal targets 0.90 and 0.95."))
```

See that the Louis correction restores nominal coverage while the EM bread in the sandwich covariance under-covers. Across all four conditions, including the useless LLM (R3) and the biased LLM (R4), Louis intervals attain ~0.91/0.96 against nominal 0.90/0.95, while the EM bread covers only ~0.71/0.79; it understates uncertainty because it ignores the missing information about each subject's latent ability.


## Does the method improve downstream scoring?

We score a held-out sample of 1000 subjects with known ability and compare ability-score RMSE for the human-only calibration ($\lambda$ = 0) and the tuned MML calibration. `mean_delta = RMSE(tuned) − RMSE(human)`; negative is an improvement. We report the paired 95% CI of `mean_delta` and `prop_improve` (the fraction of replications where tuned beats human-only) so a small stable effect can be told apart from noise.

```{r downstream-table}
dwn <- load_summary("downstream")
if (is.null(dwn)) {
  dwn <- data.frame(
    label       = c("R1: perfect (F=Y)", "R2: same-DGP draw",
                    "R3: independent noise", "R4: LLM shift"),
    mean_lambda = c(0.750, 0.119, 0.063, 0.105),
    rmse_human  = c(1.4961, 1.5023, 1.5037, 1.5064),
    rmse_tuned  = c(1.4923, 1.5005, 1.5030, 1.5046),
    mean_delta  = c(-0.0038, -0.0019, -0.0006, -0.0018),
    delta_lo    = c(-0.0073, -0.0027, -0.0009, -0.0030),
    delta_hi    = c(-0.0003, -0.0011, -0.0003, -0.0007),
    prop_improve = c(0.48, 0.59, 0.50, 0.50),
    bias_a      = c(0.010, 0.025, 0.035, 0.008)
  )
}
show_cols(dwn, c("label", "mean_lambda", "rmse_human", "rmse_tuned",
                 "mean_delta", "delta_lo", "delta_hi", "prop_improve", "bias_a"),
          caption = "Downstream ability-score RMSE.")
```

**Takeaways.**

- In these simulations the tuned RMSE did not increase average held-out RMSE relative to human-only in any regime. Every `mean_delta` is negative and its paired 95% CI excludes zero.
- Discrimination is unbiased at the selected $\lambda$ ($|\text{bias}_a| \leq 0.04$ everywhere), confirming that the tuner is not selecting degenerate fits.
- RMSE gains are small because an 8-item test is measurement-limited: ability scoring is dominated by the irreducible measurement error of a short test. We do still see an improvement in item parameter precision, which contributes to reduced calibration uncertainty.


## What is the role of cross-fitting?

Cross-fitting tunes $\lambda$ on held-out folds so a fold's own labels are not used to tune the $\lambda$ applied to its paired correction. We compare the non-cross-fitted tuner against `tune_lambda_ability_risk_crossfit` (MML per-fold tuning, scalar-mean MML final fit) on selected $\lambda$, item-parameter bias, coverage, and held-out RMSE.

```{r crossfit-table}
cf <- load_summary("crossfit")
if (is.null(cf)) {
  cf <- data.frame(
    label       = c("R1 perfect (F=Y)", "R2 same-DGP draw", "R4 LLM shift"),
    lambda_nocf = c(0.750, 0.119, 0.100),
    lambda_cf   = c(0.858, 0.135, 0.115),
    bias_a_nocf = c(0.0104, 0.0250, 0.0374),
    bias_a_cf   = c(0.0100, 0.0258, 0.0372),
    rmse_nocf   = c(1.4923, 1.5005, 1.5027),
    rmse_cf     = c(1.4927, 1.5005, 1.5029),
    cover_nocf  = c(0.954, 0.954, 0.953),
    cover_cf    = c(0.949, 0.956, 0.956)
  )
}
show_cols(cf, c("label", "lambda_nocf", "lambda_cf", "bias_a_nocf", "bias_a_cf",
                "rmse_nocf", "rmse_cf", "cover_nocf", "cover_cf"),
          caption = "Cross-fitted vs non-cross-fitted tuning (100 reps, R1/R2/R4).")
```

Cross-fitting does not change observed behavior in these simulations, but guards against finite-sample bias. Item-parameter bias, held-out
RMSE, and 95% Wald coverage are essentially identical between the non-cross-fitted and cross-fitted tuners (differences in the fourth decimal).  Because the downstream quantities are unchanged, the cheaper non-cross-fitted tuner carries value for intermediate testing and exploration, but prefer cross-fitting for final model fitting.

## Is coverage valid at the tuned $\lambda$?

Scenario 2 validated the covariance *formula* at a fixed $\lambda$, but the $\lambda$ we use is estimated from data, and `vcov()` treats $\lambda$ as fixed. As such, it does not propagate the uncertainty in $\hat{\lambda}$ into $\hat{\Sigma}_\gamma$. Additionally, choosing $\lambda$ on the same data used to estimate the item parameters induces finite-sample bias in those estimates (and, in principle, anti-conservative intervals). Split-sample (cross-fit) $\lambda$ estimation removes that bias.

This scenario measures the size of the effect by comparing coverage of the true item parameters at three operating points: fixed $\lambda$ = 0.5, the same-data tuned $\hat{\lambda}$ with `vcov(best_fit)`, and the cross-fit tuned $\hat{\lambda}$ with `vcov(final_fit)`. It reuses the Scenario 2 seeds, so the fixed-$\lambda$ column reproduces Scenario 2 exactly.

```{r coverage-tuned-table}
ct <- load_summary("coverage_tuned")
if (is.null(ct)) {
  ct <- data.frame(
    label        = c("R1 perfect (F=Y)", "R2 same-DGP draw",
                     "R3 independent noise", "R4 LLM shift"),
    lambda_sd    = c(0.751, 0.116, 0.053, 0.103),
    lambda_xf    = c(0.859, 0.135, 0.074, 0.130),
    fixed_95     = c(0.955, 0.957, 0.961, 0.954),
    samedata_95  = c(0.955, 0.956, 0.958, 0.958),
    crossfit_95  = c(0.953, 0.956, 0.958, 0.958),
    bias_a_fixed    = c(0.0097, 0.0280, 0.0365, 0.0207),
    bias_a_samedata = c(0.0063, 0.0224, 0.0286, 0.0159),
    bias_a_crossfit = c(0.0051, 0.0222, 0.0289, 0.0166)
  )
}
show_cols(ct, c("label", "lambda_sd", "lambda_xf",
                "fixed_95", "samedata_95", "crossfit_95"),
          caption = paste("95% coverage rates of the true item parameters at the",
            "fixed, same-data-tuned, and cross-fit-tuned λ (200 reps).",
            "lambda_sd / lambda_xf are the mean selected λ for each tuner."))
```

And the matching item-parameter (discrimination) bias:

```{r coverage-tuned-bias}
show_cols(ct, c("label", "bias_a_fixed", "bias_a_samedata", "bias_a_crossfit"),
          caption = paste("Mean discrimination bias E[a-hat - a] at each operating",
            "point. Same-data vs cross-fit differ by <= 0.001 except R1."))
```

**Takeaways.**

- Same-data tuning produces nominal coverage (95% coverage 0.955–0.958), and the finite-sample bias is small: same-data vs. cross-fit discrimination bias differ by $\leq 0.001$ in every regime except R1.
- Cross-fitting removes bias in the perfect predictor (R1), but this bias is small.
- Cross-fitting gives the same coverage (0.953–0.958) while also guarding against finite-sample tuning bias.

Cross-fitting is the principled default, despite the same-data tuning performing well. It is guaranteed free of the finite-sample tuning bias.

## Summary

| Question | Result |
|----------|--------|
| Does $\lambda$-selection track predictor quality? | Yes: better predictors select higher $\lambda$ |
| Do standard errors achieve appropriate coverage? | Yes: nominal coverage, including under a biased predictor (R3/R4) |
| Does the method improve downstream scoring? | No average harm demonstrated and per-rep improvement in about half of reps |
| What is the role of cross-fitting? | Cross-fitting removes finite sample bias, despite reporting similar bias/RMSE/coverage |
| Is coverage valid at the tuned $\lambda$? | Yes |


## Reproducing these results

From the package root (second argument is the number of cores):

```r
# Rscript simulations/run_lambda_selection.R 100 8
# Rscript simulations/run_coverage.R 200 8
# Rscript simulations/run_downstream.R 100 8
# Rscript simulations/run_crossfit.R 50 8
# Rscript simulations/figures.R
```

See `simulations/README.md` for the full design, the deterministic per-task seeding (results are identical serially or in parallel), and interpretation notes.
