The mixed-subjects estimator calibrates a 2PL IRT model by minimizing a PPI++-style combined loss over observed human responses, paired predicted responses, and additional generated responses. Predicted and generated responses will, in practice, typically come from an LLM.
This vignette explains how to choose and which functions to use. The short answer:
| Task | Function |
|---|---|
| Recommended workflow (cross-fit) | tune_lambda_ability_risk_crossfit() |
| Exploratory tuning | tune_lambda_ability_risk() |
| Fast approximation (frozen EC) | tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects) |
| Theoretical diagnostic | tune_lambda_ppi_score() |
| Per-item tuning (experimental) | tune_lambda_ability_risk_item() |
All of these now use the marginal-MML estimator
([fit_mixed_subjects_mml()]) by default; the frozen
expected-count estimator is available via
fit_fn = fit_mixed_subjects but is discouraged (see
below).
Two objectives, two estimators
Why there are two lambda objectives:
tune_lambda_ppi_score()minimizes . This is the trace of the item-parameter covariance matrix and the tuning rule from the original PPI++ implementation.[^1] This is most useful as a tuning target when doing inference on estimated parameters (such as in causal inference applications).[^2] Here, this is merely implemented as a theoretical diagnostic.tune_lambda_ability_risk()minimizes . This is the propagated ability-score risk. This is the preferred practical criterion for IRT applications and optimizes the choice of to minimize measurement error in downstream ability estimation. Importantly, the PPI++ objective only cares about the trace of the , while this method uses information about scale structure derived from the off-diagonal covariance terms. Use this for operational calibration.
Why there are two estimators:
fit_mixed_subjects_mml() is the default estimator for
every tuner except tune_lambda_ability_risk_item.
It is a consistent iterative MML-based estimator that marginalizes over
the ability distribution. The implementation recomputes posteriors at
every gradient evaluation, and its sandwich covariance
(vcov_mixed_subjects_mml, called automatically via
vcov()) uses Louis’ observed-information correction for the
bread, providing proper coverage.[^3]
The original fit_mixed_subjects() uses frozen
expected-count posteriors and is still available by passing
fit_fn = fit_mixed_subjects, but it is highly
discouraged: the frozen posteriors create a gradient asymmetry
when the LLM item parameters differ from human parameters,
systematically inflating discriminations and driving
tune_lambda_ability_risk to select
even when the LLM is genuinely informative (it also requires a
slope_upper cap for stability). Use it only as a fast
approximation when computation time is a binding constraint. Note that
the current implementation of tune_lambda_ability_risk_item
uses this estimation procedure and should be considered
experimental.
Example data
library(mixedsubjectsirt)
set.seed(242424)
n_human <- 120
n_generated <- 350
n_items <- 5
human_pars <- data.frame(
item = paste0("Item", seq_len(n_items)),
a = c(0.8, 1.0, 1.2, 1.35, 1.55),
d = c(-0.8, -0.3, 0.1, 0.45, 0.9)
)
human_pars$b <- -human_pars$d / human_pars$a
llm_pars <- human_pars
llm_pars$a <- pmax(0.35, 0.9 * human_pars$a)
llm_pars$d <- human_pars$d + 0.25
llm_pars$b <- -llm_pars$d / llm_pars$a
theta_human <- rnorm(n_human)
observed <- simulate_2pl(theta_human, human_pars)
predicted <- simulate_2pl(theta_human, llm_pars)
generated <- simulate_2pl(rnorm(n_generated), llm_pars)
lambda_grid <- c(0, 0.25, 0.5, 0.75, 1)The examples use human_pars as initial_pars
for speed.
Ability-risk tuning: Minimizing
The key result from the Linking and
Gradient Asymmetry vignette is that when the LLM parameters differ
from human parameters, the frozen expected-count estimator drives
due to an artificial gradient asymmetry. The default MML estimator
removes this asymmetry, so with a good predictor
()
tune_lambda_ability_risk() correctly selects
.
Because fit_mixed_subjects_mml() is the default
fit_fn, no estimator argument is needed.
By default tune_lambda_ability_risk() chooses
by direct optimization of the risk over [0, 1]. Here we
pass method = "grid" with lambda_grid so the
summary shows the risk at each candidate, which handy for
visualizing the
-risk
surface. In practice you will likely omit both arguments and let it
optimize
directly.
ability_tuned_mml <- tune_lambda_ability_risk(
lambda_grid = lambda_grid,
observed = observed,
predicted = predicted,
generated = generated,
target_resp = observed,
initial_pars = human_pars,
method = "grid", # show the risk at each candidate value
control = list(maxit = 100)
)
ability_tuned_mml$summary[, c("lambda", "mean_param_var", "mean_total_risk",
"convergence", "selection_risk")]
#> lambda mean_param_var mean_total_risk convergence selection_risk
#> 1 0.00 0.5748958 0.5748958 0 0.5748958
#> 2 0.25 0.6728206 0.6728206 0 0.6728206
#> 3 0.50 1.0477476 1.0477476 0 1.0477476
#> 4 0.75 4.3282612 4.3282612 0 4.3282612
#> 5 1.00 113.3301603 113.3301603 0 113.3301603
ability_tuned_mml$best_lambda
#> [1] 0
tune_lambda_ability_risk(
observed = observed, predicted = predicted, generated = generated,
target_resp = observed, initial_pars = human_pars,
control = list(maxit = 100)
)$best_lambda
#> [1] 0selection_risk is Inf for any candidate
with non-zero convergence code or non-finite risk, protecting selection
from numerical failures.
We can inspect the components of the ability risk:
risk <- ability_risk(
resp = observed,
fit_or_pars = ability_tuned_mml$best_fit,
vcov = vcov(ability_tuned_mml$best_fit)
)
risk$summary
#> mean_param_var mean_squared_error mean_total_risk
#> 1 0.5748958 NA 0.5748958mean_squared_error is NA because
theta_true was not supplied; it is only computed in
simulation studies. mean_param_var is the propagated
item-parameter uncertainty. mean_total_risk is the ability
risk target that the method optimizes for.
In simulation studies, pass theta_true = theta_human to
also include squared ability-estimation error:
ability_tuned_truth <- tune_lambda_ability_risk(
lambda_grid = lambda_grid,
observed = observed,
predicted = predicted,
generated = generated,
target_resp = observed,
theta_true = theta_human,
initial_pars = human_pars,
fit_fn = fit_mixed_subjects_mml,
method = "grid",
control = list(maxit = 100)
)
ability_tuned_truth$summary[, c("lambda", "mean_param_var",
"mean_squared_error", "mean_total_risk")]
#> lambda mean_param_var mean_squared_error mean_total_risk
#> 1 0.00 0.5748958 4.455476 5.030372
#> 2 0.25 0.6728206 4.481827 5.154648
#> 3 0.50 1.0477476 4.517292 5.565039
#> 4 0.75 4.3282612 4.542339 8.870600
#> 5 1.00 113.3301603 6.078139 119.408300Cross-fit tuning (recommended workflow)
tune_lambda_ability_risk_crossfit() is the recommended
workflow for all serious analyses: it estimates
per fold on the other folds’ labels, so the selected
is not tuned on the same rows it is applied to. By default both the fold
tuning (fit_fn) and the final fit
(final_fit_fn) use the MML estimator, and the fold
’s
are averaged (weighted by fold size) into a single scalar for the final
full-sample fit. Two further arguments:
-
target_mode = "fixed"(default): the fulltarget_respis used for every fold’s risk evaluation, which is correct when the target is an operational scoring population independent of the calibration sample. -
target_mode = "row_aligned": subsetstarget_respto training rows, valid whentarget_resp = observed.
split_id <- rep(1:2, length.out = nrow(observed))
crossfit_tuned <- tune_lambda_ability_risk_crossfit(
lambda_grid = c(0, 0.5, 1),
observed = observed,
predicted = predicted,
generated = generated,
split_id = split_id,
initial_pars = human_pars,
n_quad = 7,
control = list(maxit = 100)
)
crossfit_tuned$lambda_by_split # per-fold tuned lambda
#> [1] 0.05572809 0.00000000
crossfit_tuned$lambda_final # fold-size-weighted scalar used for the final fit
#> [1] 0.02786405Frozen expected-count estimator (fast approximation)
Passing fit_fn = fit_mixed_subjects selects the older
frozen expected-count estimator. It is faster than the default MML path
but produces inflated discriminations when the LLM parameters differ
from human parameters, driving
even when the LLM is informative. As such, it is strongly
discouraged.
ability_tuned_ec <- tune_lambda_ability_risk(
lambda_grid = lambda_grid,
observed = observed,
predicted = predicted,
generated = generated,
target_resp = observed,
initial_pars = human_pars,
fit_fn = fit_mixed_subjects, # opt in to the frozen EC estimator
n_quad = 7,
slope_upper = 4, # required to prevent divergence
control = list(maxit = 100)
)
ability_tuned_ec$best_lambda
#> [1] 0.007928954slope_upper = 4 is required to prevent discriminations
from diverging. The default MML estimator does not need this cap because
it has no false minimum at large discrimination values.
Minimizing (diagnostic only)
tune_lambda_ppi_score() estimates the PPI++ Proposition
2 lambda using the same human posterior weights for both human and
paired-LLM score vectors. F = Y (identical predictions) gives exactly
.
ppi_score <- tune_lambda_ppi_score(
observed = observed,
predicted = predicted,
item_pars = human_pars,
n_generated = nrow(generated)
)
cat("PPI++ score lambda:", round(ppi_score$lambda, 3),
" r =", round(ppi_score$r, 3),
" N/(n+N) =", round(1 / (1 + ppi_score$r), 3), "\n")
#> PPI++ score lambda: 0 r = 0.343 N/(n+N) = 0.745A value near zero means the paired LLM responses do not systematically reduce gradient variance in the person-level score formulation. This is expected for stochastic binary LLM responses. For the formula recovers .
ppi_score_match <- tune_lambda_ppi_score(
observed = observed,
predicted = observed,
item_pars = human_pars,
n_generated = nrow(generated)
)
cat("PPI++ score lambda:", round(ppi_score_match$lambda, 3),
" r =", round(ppi_score$r, 3),
" N/(n+N) =", round(1 / (1 + ppi_score_match$r), 3), "\n")
#> PPI++ score lambda: 0.745 r = 0.343 N/(n+N) = 0.745Choosing a procedure
| Procedure | Objective | When to use |
|---|---|---|
tune_lambda_ability_risk_crossfit() |
, computed out-of-fold | Recommended workflow |
tune_lambda_ability_risk() |
, Louis bread (MML default) | Exploratory single-sample tuning |
tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects) |
, EM bread | Fast approximation; discouraged, requires
slope_upper
|
tune_lambda_ability_risk_item() |
Per-item risk (approx.) | Experimental; some items poor predictors |
tune_lambda_ppi_score() |
Theoretical diagnostic only |
Also note that the target population matters, as ability risk is
integrated over that population’s ability distribution.
target_resp = observed tunes for the observed human
response patterns. In operational scoring, you may use a larger target
matrix representing the actual scoring population while calibrating item
parameters on a pilot sample.