--- title: "Simulation Validation of the Mixed-Subjects MML Estimator" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Simulation Validation of the Mixed-Subjects MML Estimator} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", echo = FALSE) library(knitr) # Load a scenario's $summary from simulations/results/.rds if present # (so the tables reflect the most recent run); otherwise fall back to the cached # values below. The simulations/ tree is not shipped with the package, so on a # clean install the cached values are shown. load_summary <- function(name) { candidates <- c( file.path("..", "simulations", "results", paste0(name, ".rds")), file.path("simulations", "results", paste0(name, ".rds")) ) for (p in candidates) { if (file.exists(p)) { s <- tryCatch(readRDS(p)$summary, error = function(e) NULL) if (!is.null(s)) return(s) } } NULL } show_cols <- function(df, cols, ...) { cols <- intersect(cols, names(df)) kable(df[, cols, drop = FALSE], row.names = FALSE, ...) } ``` This vignette presents a simulation study validating the scalar 2PL marginal maximum-likelihood (MML) PPI++ estimator (`fit_mixed_subjects_mml`) and its ability-risk tuning (`tune_lambda_ability_risk`). The full reproducible harness lives in the package's `simulations/` directory; the tables below load the most recent run from `simulations/results/` when available and otherwise show cached values. Rep counts: $\lambda$-selection 100, coverage 200, downstream 100, cross-fit 100. The study answers five questions: 1. Does $\lambda$-selection track predictor quality? 2. Do standard errors achieve appropriate coverage? 3. Does the method improve downstream scoring? 4. What is the role of cross-fitting? 5. Is coverage valid at the tuned $\lambda$? ## Design Human responses come from a true 8-item 2PL model (`a ∈ [0.8, 1.6]`, `d ∈ [-1.1, 1.1]`) with `n = 400` human subjects and `N = 1200` generated subjects, abilities drawn from a standard normal. Four predictor regimes vary the quality of the paired LLM predictions, `F`: ```{r design-table} design <- data.frame( Regime = c("R1: perfect prediction (F=Y)", "R2: same-DGP draw", "R3: independent noise", "R4: LLM shift"), `Predicted F` = c( "predicted = observed (exact)", "fresh binary draw from the true model", "binary draw from scrambled item parameters", "binary draw from attenuated/shifted parameters"), `Role` = c("perfect predictor", "modest real signal", "uninformative LLM", "biased but informative LLM"), check.names = FALSE ) kable(design, caption = "Four predictor regimes (all binar responses).") ``` All responses (observed and predicted) are binary $Y_{ij}, F_{ij} \in \{0,1\}$. Note that the package does not currently accept probability predictions for `predicted`/`generated` (see the note at the end of Scenario 1) or polytomous responses. The control parameter $\lambda \in [0,1]$ sets how strongly the generated LLM sample is used after subtracting the paired LLM correction; $\lambda = 0$ is a human-only calibration. ## Does $\lambda$-selection track predictor quality? For each regime we tune $\lambda$ by ability-score risk and record the selected value; we also report the theoretical PPI++-score $\lambda$ for reference. Here $\lambda$ is selected by `tune_lambda_ability_risk()`, which optimizes the risk directly. We do not bother using the cross-fit estimator, as we only are concerned with $\lambda$ magnitude. ```{r lambda-table} lam <- load_summary("lambda_selection") if (is.null(lam)) { lam <- data.frame( label = c("R1: perfect (F=Y)", "R2: same-DGP draw", "R3: independent noise", "R4: LLM shift"), mean_risk = c(0.750, 0.119, 0.063, 0.105), median_risk = c(0.750, 0.108, 0.040, 0.104), prop_zero = c(0.00, 0.07, 0.35, 0.16), mean_ppi = c(0.750, 0.004, 0.000, 0.002) ) } show_cols(lam, c("label", "mean_risk", "median_risk", "prop_zero", "mean_ppi"), caption = "Selected lambda by regime (ability-risk tuning).") ``` **Takeaways:** - R1 (perfect predictor) produces $\lambda = 0.750$, exactly `N / (n + N) = 1200 / 1600`, the theoretical maximum for a perfect paired predictor at these values of $n$ and $N$. The PPI++ estimated $\lambda$ from minimizing $\text{Tr}\big [ \Sigma_\gamma \big]$ agrees. - R3 (independent noise) produces $\lambda ≈ 0.06$ (median 0.04; about a third of reps select exactly 0). The uninformative predictor is correctly down-weighted to near zero. The small positive median is the optimizer resolving a shallow, noisy risk minimum. - R2 (same-DGP) and R4 (LLM shift) produce $\lambda \approx 0.11$. Fresh real responses (R2) and a correlated-but-biased LLM (R4) each carry some score-level signal, and the criterion makes some use of it. **Note that predictions must be sampled responses, not probabilities.** We require `predicted`/`generated` responses and reject probability inputs. This is not a convenience restriction. The response vector enters inside a log-sum over quadrature points, so feeding probabilities breaks the identity $\mathbb{E}\big[\nabla L_{gen}\big] = \mathbb{E}\big[\nabla L_{pred}\big]$ that makes the PPI loss correction mean-zero. The estimator then becomes biased at $\lambda > 0$, and at moderate $\lambda$ the objective is unbounded for discrimination parameters, causing the fit to diverge. Practically: if you have model-derived probabilities, sample responses from them (e.g. `rbinom`) before calibrating. ## Do standard errors achieve appropriate coverage? We fix $\lambda$ = 0.5 and compare empirical coverage of Wald intervals built from two covariance estimates of the same fit: the Louis-corrected marginal sandwich (`vcov()` dispatch) and the EM complete-data Hessian bread (`vcov_mixed_subjects()`). Note that the PPI correction `$\lambda$(L_gen − L_pred)` is mean-zero whenever the paired and generated pseudo-responses are drawn from the same distribution with the same ability spread, so the human term anchors the estimand to the true parameters. This holds for every simulation condition: R1, R2, R3 (useless LLM) and R4 (biased LLM). As such, the estimator stays consistent for the truth even when the LLM is uninformative or biased. ```{r coverage-table} cov <- load_summary("coverage") if (is.null(cov)) { cov <- data.frame( label = c("R1 perfect (F=Y)", "R2 same-DGP draw", "R3 independent noise", "R4 LLM shift"), louis_cov_90 = c(0.909, 0.916, 0.914, 0.905), louis_cov_95 = c(0.955, 0.957, 0.961, 0.954), em_cov_90 = c(0.713, 0.727, 0.721, 0.726), em_cov_95 = c(0.787, 0.797, 0.795, 0.792), mean_se_ratio = c(1.626, 1.616, 1.651, 1.622) ) } show_cols(cov, c("label", "louis_cov_90", "louis_cov_95", "em_cov_90", "em_cov_95", "mean_se_ratio"), caption = paste("Item-parameter CI coverage, all four regimes", "(200 reps). Nominal targets 0.90 and 0.95.")) ``` See that the Louis correction restores nominal coverage while the EM bread in the sandwich covariance under-covers. Across all four conditions, including the useless LLM (R3) and the biased LLM (R4), Louis intervals attain ~0.91/0.96 against nominal 0.90/0.95, while the EM bread covers only ~0.71/0.79; it understates uncertainty because it ignores the missing information about each subject's latent ability. ## Does the method improve downstream scoring? We score a held-out sample of 1000 subjects with known ability and compare ability-score RMSE for the human-only calibration ($\lambda$ = 0) and the tuned MML calibration. `mean_delta = RMSE(tuned) − RMSE(human)`; negative is an improvement. We report the paired 95% CI of `mean_delta` and `prop_improve` (the fraction of replications where tuned beats human-only) so a small stable effect can be told apart from noise. ```{r downstream-table} dwn <- load_summary("downstream") if (is.null(dwn)) { dwn <- data.frame( label = c("R1: perfect (F=Y)", "R2: same-DGP draw", "R3: independent noise", "R4: LLM shift"), mean_lambda = c(0.750, 0.119, 0.063, 0.105), rmse_human = c(1.4961, 1.5023, 1.5037, 1.5064), rmse_tuned = c(1.4923, 1.5005, 1.5030, 1.5046), mean_delta = c(-0.0038, -0.0019, -0.0006, -0.0018), delta_lo = c(-0.0073, -0.0027, -0.0009, -0.0030), delta_hi = c(-0.0003, -0.0011, -0.0003, -0.0007), prop_improve = c(0.48, 0.59, 0.50, 0.50), bias_a = c(0.010, 0.025, 0.035, 0.008) ) } show_cols(dwn, c("label", "mean_lambda", "rmse_human", "rmse_tuned", "mean_delta", "delta_lo", "delta_hi", "prop_improve", "bias_a"), caption = "Downstream ability-score RMSE.") ``` **Takeaways.** - In these simulations the tuned RMSE did not increase average held-out RMSE relative to human-only in any regime. Every `mean_delta` is negative and its paired 95% CI excludes zero. - Discrimination is unbiased at the selected $\lambda$ ($|\text{bias}_a| \leq 0.04$ everywhere), confirming that the tuner is not selecting degenerate fits. - RMSE gains are small because an 8-item test is measurement-limited: ability scoring is dominated by the irreducible measurement error of a short test. We do still see an improvement in item parameter precision, which contributes to reduced calibration uncertainty. ## What is the role of cross-fitting? Cross-fitting tunes $\lambda$ on held-out folds so a fold's own labels are not used to tune the $\lambda$ applied to its paired correction. We compare the non-cross-fitted tuner against `tune_lambda_ability_risk_crossfit` (MML per-fold tuning, scalar-mean MML final fit) on selected $\lambda$, item-parameter bias, coverage, and held-out RMSE. ```{r crossfit-table} cf <- load_summary("crossfit") if (is.null(cf)) { cf <- data.frame( label = c("R1 perfect (F=Y)", "R2 same-DGP draw", "R4 LLM shift"), lambda_nocf = c(0.750, 0.119, 0.100), lambda_cf = c(0.858, 0.135, 0.115), bias_a_nocf = c(0.0104, 0.0250, 0.0374), bias_a_cf = c(0.0100, 0.0258, 0.0372), rmse_nocf = c(1.4923, 1.5005, 1.5027), rmse_cf = c(1.4927, 1.5005, 1.5029), cover_nocf = c(0.954, 0.954, 0.953), cover_cf = c(0.949, 0.956, 0.956) ) } show_cols(cf, c("label", "lambda_nocf", "lambda_cf", "bias_a_nocf", "bias_a_cf", "rmse_nocf", "rmse_cf", "cover_nocf", "cover_cf"), caption = "Cross-fitted vs non-cross-fitted tuning (100 reps, R1/R2/R4).") ``` Cross-fitting does not change observed behavior in these simulations, but guards against finite-sample bias. Item-parameter bias, held-out RMSE, and 95% Wald coverage are essentially identical between the non-cross-fitted and cross-fitted tuners (differences in the fourth decimal). Because the downstream quantities are unchanged, the cheaper non-cross-fitted tuner carries value for intermediate testing and exploration, but prefer cross-fitting for final model fitting. ## Is coverage valid at the tuned $\lambda$? Scenario 2 validated the covariance *formula* at a fixed $\lambda$, but the $\lambda$ we use is estimated from data, and `vcov()` treats $\lambda$ as fixed. As such, it does not propagate the uncertainty in $\hat{\lambda}$ into $\hat{\Sigma}_\gamma$. Additionally, choosing $\lambda$ on the same data used to estimate the item parameters induces finite-sample bias in those estimates (and, in principle, anti-conservative intervals). Split-sample (cross-fit) $\lambda$ estimation removes that bias. This scenario measures the size of the effect by comparing coverage of the true item parameters at three operating points: fixed $\lambda$ = 0.5, the same-data tuned $\hat{\lambda}$ with `vcov(best_fit)`, and the cross-fit tuned $\hat{\lambda}$ with `vcov(final_fit)`. It reuses the Scenario 2 seeds, so the fixed-$\lambda$ column reproduces Scenario 2 exactly. ```{r coverage-tuned-table} ct <- load_summary("coverage_tuned") if (is.null(ct)) { ct <- data.frame( label = c("R1 perfect (F=Y)", "R2 same-DGP draw", "R3 independent noise", "R4 LLM shift"), lambda_sd = c(0.751, 0.116, 0.053, 0.103), lambda_xf = c(0.859, 0.135, 0.074, 0.130), fixed_95 = c(0.955, 0.957, 0.961, 0.954), samedata_95 = c(0.955, 0.956, 0.958, 0.958), crossfit_95 = c(0.953, 0.956, 0.958, 0.958), bias_a_fixed = c(0.0097, 0.0280, 0.0365, 0.0207), bias_a_samedata = c(0.0063, 0.0224, 0.0286, 0.0159), bias_a_crossfit = c(0.0051, 0.0222, 0.0289, 0.0166) ) } show_cols(ct, c("label", "lambda_sd", "lambda_xf", "fixed_95", "samedata_95", "crossfit_95"), caption = paste("95% coverage rates of the true item parameters at the", "fixed, same-data-tuned, and cross-fit-tuned λ (200 reps).", "lambda_sd / lambda_xf are the mean selected λ for each tuner.")) ``` And the matching item-parameter (discrimination) bias: ```{r coverage-tuned-bias} show_cols(ct, c("label", "bias_a_fixed", "bias_a_samedata", "bias_a_crossfit"), caption = paste("Mean discrimination bias E[a-hat - a] at each operating", "point. Same-data vs cross-fit differ by <= 0.001 except R1.")) ``` **Takeaways.** - Same-data tuning produces nominal coverage (95% coverage 0.955–0.958), and the finite-sample bias is small: same-data vs. cross-fit discrimination bias differ by $\leq 0.001$ in every regime except R1. - Cross-fitting removes bias in the perfect predictor (R1), but this bias is small. - Cross-fitting gives the same coverage (0.953–0.958) while also guarding against finite-sample tuning bias. Cross-fitting is the principled default, despite the same-data tuning performing well. It is guaranteed free of the finite-sample tuning bias. ## Summary | Question | Result | |----------|--------| | Does $\lambda$-selection track predictor quality? | Yes: better predictors select higher $\lambda$ | | Do standard errors achieve appropriate coverage? | Yes: nominal coverage, including under a biased predictor (R3/R4) | | Does the method improve downstream scoring? | No average harm demonstrated and per-rep improvement in about half of reps | | What is the role of cross-fitting? | Cross-fitting removes finite sample bias, despite reporting similar bias/RMSE/coverage | | Is coverage valid at the tuned $\lambda$? | Yes | ## Reproducing these results From the package root (second argument is the number of cores): ```r # Rscript simulations/run_lambda_selection.R 100 8 # Rscript simulations/run_coverage.R 200 8 # Rscript simulations/run_downstream.R 100 8 # Rscript simulations/run_crossfit.R 50 8 # Rscript simulations/figures.R ``` See `simulations/README.md` for the full design, the deterministic per-task seeding (results are identical serially or in parallel), and interpretation notes.