Choosing Lambda in Mixed-Subjects IRT

The mixed-subjects estimator calibrates a 2PL IRT model by minimizing a PPI++-style combined loss over observed human responses, paired predicted responses, and additional generated responses. Predicted and generated responses will, in practice, typically come from an LLM.

This vignette explains how to choose $\lambda$ and which functions to use. The short answer:

Task	Function
Recommended workflow (cross-fit)	`tune_lambda_ability_risk_crossfit()`
Exploratory tuning	`tune_lambda_ability_risk()`
Fast approximation (frozen EC)	`tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects)`
Theoretical diagnostic	`tune_lambda_ppi_score()`
Per-item tuning (experimental)	`tune_lambda_ability_risk_item()`

All of these now use the marginal-MML estimator ([fit_mixed_subjects_mml()]) by default; the frozen expected-count estimator is available via fit_fn = fit_mixed_subjects but is discouraged (see below).

Two objectives, two estimators

Why there are two lambda objectives:

tune_lambda_ppi_score() minimizes $\text{Tr}(\Sigma_\gamma)$ . This is the trace of the item-parameter covariance matrix and the tuning rule from the original PPI++ implementation.[^1] This is most useful as a tuning target when doing inference on estimated parameters (such as in causal inference applications).[^2] Here, this is merely implemented as a theoretical diagnostic.
tune_lambda_ability_risk() minimizes $E[g'\Sigma_\gamma g]$ . This is the propagated ability-score risk. This is the preferred practical criterion for IRT applications and optimizes the choice of $\lambda$ to minimize measurement error in downstream ability estimation. Importantly, the PPI++ objective only cares about the trace of the $\Sigma_\gamma$ , while this method uses information about scale structure derived from the off-diagonal covariance terms. Use this for operational calibration.

Why there are two estimators:

fit_mixed_subjects_mml() is the default estimator for every tuner except tune_lambda_ability_risk_item. It is a consistent iterative MML-based estimator that marginalizes over the ability distribution. The implementation recomputes posteriors at every gradient evaluation, and its sandwich covariance (vcov_mixed_subjects_mml, called automatically via vcov()) uses Louis’ observed-information correction for the bread, providing proper coverage.[^3]

The original fit_mixed_subjects() uses frozen expected-count posteriors and is still available by passing fit_fn = fit_mixed_subjects, but it is highly discouraged: the frozen posteriors create a gradient asymmetry when the LLM item parameters differ from human parameters, systematically inflating discriminations and driving tune_lambda_ability_risk to select $\lambda = 0$ even when the LLM is genuinely informative (it also requires a slope_upper cap for stability). Use it only as a fast approximation when computation time is a binding constraint. Note that the current implementation of tune_lambda_ability_risk_item uses this estimation procedure and should be considered experimental.

Example data

library(mixedsubjectsirt)

set.seed(242424)

n_human     <- 120
n_generated <- 350
n_items     <- 5

human_pars <- data.frame(
  item = paste0("Item", seq_len(n_items)),
  a    = c(0.8, 1.0, 1.2, 1.35, 1.55),
  d    = c(-0.8, -0.3, 0.1, 0.45, 0.9)
)
human_pars$b <- -human_pars$d / human_pars$a

llm_pars   <- human_pars
llm_pars$a <- pmax(0.35, 0.9 * human_pars$a)
llm_pars$d <- human_pars$d + 0.25
llm_pars$b <- -llm_pars$d / llm_pars$a

theta_human <- rnorm(n_human)
observed    <- simulate_2pl(theta_human, human_pars)
predicted   <- simulate_2pl(theta_human, llm_pars)
generated   <- simulate_2pl(rnorm(n_generated), llm_pars)

lambda_grid <- c(0, 0.25, 0.5, 0.75, 1)

The examples use human_pars as initial_pars for speed.

Ability-risk tuning: Minimizing $\mathbb{E}[g'\Sigma_\gamma g]$

The key result from the Linking and Gradient Asymmetry vignette is that when the LLM parameters differ from human parameters, the frozen expected-count estimator drives $\lambda \to 0$ due to an artificial gradient asymmetry. The default MML estimator removes this asymmetry, so with a good predictor ( $F = Y$ ) tune_lambda_ability_risk() correctly selects $\lambda > 0$ . Because fit_mixed_subjects_mml() is the default fit_fn, no estimator argument is needed.

By default tune_lambda_ability_risk() chooses $\lambda$ by direct optimization of the risk over [0, 1]. Here we pass method = "grid" with lambda_grid so the summary shows the risk at each candidate, which handy for visualizing the $\lambda$ -risk surface. In practice you will likely omit both arguments and let it optimize $\lambda$ directly.

ability_tuned_mml <- tune_lambda_ability_risk(
  lambda_grid  = lambda_grid,
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  initial_pars = human_pars,
  method       = "grid",                # show the risk at each candidate value
  control      = list(maxit = 100)
)

ability_tuned_mml$summary[, c("lambda", "mean_param_var", "mean_total_risk",
                               "convergence", "selection_risk")]
#>   lambda mean_param_var mean_total_risk convergence selection_risk
#> 1   0.00      0.5748958       0.5748958           0      0.5748958
#> 2   0.25      0.6728206       0.6728206           0      0.6728206
#> 3   0.50      1.0477476       1.0477476           0      1.0477476
#> 4   0.75      4.3282612       4.3282612           0      4.3282612
#> 5   1.00    113.3301603     113.3301603           0    113.3301603
ability_tuned_mml$best_lambda
#> [1] 0

tune_lambda_ability_risk(
  observed = observed, predicted = predicted, generated = generated,
  target_resp = observed, initial_pars = human_pars,
  control = list(maxit = 100)
)$best_lambda
#> [1] 0

selection_risk is Inf for any candidate with non-zero convergence code or non-finite risk, protecting selection from numerical failures.

We can inspect the components of the ability risk:


risk <- ability_risk(
  resp        = observed,
  fit_or_pars = ability_tuned_mml$best_fit,
  vcov        = vcov(ability_tuned_mml$best_fit)
)
risk$summary
#>   mean_param_var mean_squared_error mean_total_risk
#> 1      0.5748958                 NA       0.5748958

mean_squared_error is NA because theta_true was not supplied; it is only computed in simulation studies. mean_param_var is the propagated item-parameter uncertainty. mean_total_risk is the ability risk target that the method optimizes for.

In simulation studies, pass theta_true = theta_human to also include squared ability-estimation error:

ability_tuned_truth <- tune_lambda_ability_risk(
  lambda_grid  = lambda_grid,
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  theta_true   = theta_human,
  initial_pars = human_pars,
  fit_fn       = fit_mixed_subjects_mml,
  method       = "grid",
  control      = list(maxit = 100)
)

ability_tuned_truth$summary[, c("lambda", "mean_param_var",
                                 "mean_squared_error", "mean_total_risk")]
#>   lambda mean_param_var mean_squared_error mean_total_risk
#> 1   0.00      0.5748958           4.455476        5.030372
#> 2   0.25      0.6728206           4.481827        5.154648
#> 3   0.50      1.0477476           4.517292        5.565039
#> 4   0.75      4.3282612           4.542339        8.870600
#> 5   1.00    113.3301603           6.078139      119.408300

Cross-fit $\lambda$ tuning (recommended workflow)

tune_lambda_ability_risk_crossfit() is the recommended workflow for all serious analyses: it estimates $\lambda$ per fold on the other folds’ labels, so the selected $\lambda$ is not tuned on the same rows it is applied to. By default both the fold tuning (fit_fn) and the final fit (final_fit_fn) use the MML estimator, and the fold $\lambda$ ’s are averaged (weighted by fold size) into a single scalar for the final full-sample fit. Two further arguments:

target_mode = "fixed" (default): the full target_resp is used for every fold’s risk evaluation, which is correct when the target is an operational scoring population independent of the calibration sample.
target_mode = "row_aligned": subsets target_resp to training rows, valid when target_resp = observed.

split_id <- rep(1:2, length.out = nrow(observed))

crossfit_tuned <- tune_lambda_ability_risk_crossfit(
  lambda_grid  = c(0, 0.5, 1),
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  split_id     = split_id,
  initial_pars = human_pars,
  n_quad       = 7,
  control      = list(maxit = 100)
)

crossfit_tuned$lambda_by_split   # per-fold tuned lambda
#> [1] 0.05572809 0.00000000
crossfit_tuned$lambda_final      # fold-size-weighted scalar used for the final fit
#> [1] 0.02786405

Frozen expected-count estimator (fast approximation)

Passing fit_fn = fit_mixed_subjects selects the older frozen expected-count estimator. It is faster than the default MML path but produces inflated discriminations when the LLM parameters differ from human parameters, driving $\lambda \to 0$ even when the LLM is informative. As such, it is strongly discouraged.

ability_tuned_ec <- tune_lambda_ability_risk(
  lambda_grid  = lambda_grid,
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  initial_pars = human_pars,
  fit_fn       = fit_mixed_subjects,   # opt in to the frozen EC estimator
  n_quad       = 7,
  slope_upper  = 4,           # required to prevent divergence
  control      = list(maxit = 100)
)

ability_tuned_ec$best_lambda
#> [1] 0.007928954

slope_upper = 4 is required to prevent discriminations from diverging. The default MML estimator does not need this cap because it has no false minimum at large discrimination values.

Minimizing $\text{Tr}\big[\Sigma_\gamma\big]$ (diagnostic only)

tune_lambda_ppi_score() estimates the PPI++ Proposition 2 lambda using the same human posterior weights for both human and paired-LLM score vectors. F = Y (identical predictions) gives exactly $N/(n+N)$ .

ppi_score <- tune_lambda_ppi_score(
  observed    = observed,
  predicted   = predicted,
  item_pars   = human_pars,
  n_generated = nrow(generated)
)

cat("PPI++ score lambda:", round(ppi_score$lambda, 3),
    " r =", round(ppi_score$r, 3),
    " N/(n+N) =", round(1 / (1 + ppi_score$r), 3), "\n")
#> PPI++ score lambda: 0  r = 0.343  N/(n+N) = 0.745

A value near zero means the paired LLM responses do not systematically reduce gradient variance in the person-level score formulation. This is expected for stochastic binary LLM responses. For $F = Y$ the formula recovers $N/(n+N)$ .

ppi_score_match <- tune_lambda_ppi_score(
  observed    = observed,
  predicted   = observed,
  item_pars   = human_pars,
  n_generated = nrow(generated)
)

cat("PPI++ score lambda:", round(ppi_score_match$lambda, 3),
    " r =", round(ppi_score$r, 3),
    " N/(n+N) =", round(1 / (1 + ppi_score_match$r), 3), "\n")
#> PPI++ score lambda: 0.745  r = 0.343  N/(n+N) = 0.745

Choosing a procedure

Procedure	Objective	When to use
`tune_lambda_ability_risk_crossfit()`	$\mathbb{E}[g'\Sigma_\gamma g]$ , computed out-of-fold	Recommended workflow
`tune_lambda_ability_risk()`	$\mathbb{E}[g'\Sigma_\gamma g]$ , Louis bread (MML default)	Exploratory single-sample tuning
`tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects)`	$\mathbb{E}[g'\Sigma_\gamma g]$ , EM bread	Fast approximation; discouraged, requires `slope_upper`
`tune_lambda_ability_risk_item()`	Per-item risk (approx.)	Experimental; some items poor predictors
`tune_lambda_ppi_score()`	$\text{Tr}(\Sigma_\gamma)$	Theoretical diagnostic only

Also note that the target population matters, as ability risk is integrated over that population’s ability distribution. target_resp = observed tunes for the observed human response patterns. In operational scoring, you may use a larger target matrix representing the actual scoring population while calibrating item parameters on a pilot sample.