Experimental. Per-item lambda uses a frozen expected-count approximation, not the full marginal-MML objective. The IRT marginal likelihood integrates the joint response pattern and does not decompose item-wise, so a theoretically principled per-item marginal objective does not yet exist in this package. Results from per-item tuning should be treated as approximate and validated against the scalar-lambda MML baseline. See the documentation of
tune_lambda_ability_risk_item() for details.
Why per-item lambda?
A single global $\lambda$ is fragile: if even one item has poorly correlated LLM predictions, the ability-risk criterion may force the global optimum toward $\lambda = 0$, preventing all items from benefiting. Per-item $\lambda$ allows each item to draw on the LLM data at its own optimal level.
Consider a test with eight items where:
- Items 1–4 are straightforward factual questions; the LLM predicts these well.
- Items 5–8 require nuanced reasoning; the LLM is nearly random on these.
With a scalar $\lambda$, the four poor items push the optimum toward 0. With per-item $\lambda$, items 1–4 can take values up to about 0.7 while items 5–8 stay near 0.
Simulate a heterogeneous test
library(mixedsubjectsirt)
library(ggplot2)
set.seed(2026)
n_human <- 400
n_generated <- 1200
n_items <- 8
n_good <- 4 # items where LLM predicts well
true_pars <- data.frame(
item = paste0("Item", seq_len(n_items)),
a = seq(0.8, 1.6, length.out = n_items),
d = seq(-1.1, 1.1, length.out = n_items)
)
true_pars$b <- -true_pars$d / true_pars$a
theta_human <- rnorm(n_human)
observed <- simulate_2pl(theta_human, true_pars)
# LLM: good for items 1–4 (same DGP), poor for 5–8 (random noise)
llm_pars_good <- true_pars
llm_pars_poor <- true_pars
llm_pars_poor$a <- pmax(0.05, rnorm(n_items, 0, 0.1)) # near-random
llm_pars_poor$d <- rnorm(n_items, 0, 2)
llm_pars_poor$b <- -llm_pars_poor$d / llm_pars_poor$a
# Build predicted (same subjects as human)
predicted <- observed # F = Y for first 4 items
predicted[, 5:8] <- simulate_2pl(theta_human, llm_pars_poor)[, 5:8]
# Build generated
generated_good <- simulate_2pl(rnorm(n_generated), true_pars)
generated_poor <- simulate_2pl(rnorm(n_generated), llm_pars_poor)
generated <- cbind(generated_good[, 1:4], generated_poor[, 5:8])
colnames(generated) <- true_pars$item

The first four items have perfect paired predictions (F = Y); items 5–8 have near-random LLM predictions.
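For readers without the package at hand, the data-generating step can be sketched in base R. This is a minimal sketch, not the package's implementation: `simulate_2pl_sketch()` is a hypothetical name, and it assumes the slope-intercept form $P(Y = 1 \mid \theta) = \mathrm{logit}^{-1}(a\theta + d)$, which is consistent with the conversion b = -d / a used above.

```r
# Hypothetical stand-in for simulate_2pl(); NOT the package's implementation.
# Assumes the slope-intercept 2PL: P(Y = 1 | theta) = plogis(a * theta + d).
simulate_2pl_sketch <- function(theta, pars) {
  n_persons <- length(theta)
  n_items <- nrow(pars)
  # n_persons x n_items matrix of linear predictors a_j * theta_i + d_j
  eta <- outer(theta, pars$a) +
    matrix(pars$d, nrow = n_persons, ncol = n_items, byrow = TRUE)
  prob <- plogis(eta)
  # Independent Bernoulli draws given theta (local independence)
  resp <- matrix(rbinom(length(prob), size = 1, prob = prob),
                 nrow = n_persons, ncol = n_items)
  colnames(resp) <- as.character(pars$item)
  resp
}

set.seed(1)
toy_pars <- data.frame(item = c("A", "B"), a = c(1.0, 1.5), d = c(0.5, -0.5))
toy_resp <- simulate_2pl_sketch(rnorm(5), toy_pars)
dim(toy_resp)  # 5 persons by 2 items
```

Under this form, a discrimination near zero (as in llm_pars_poor) makes the response probability almost independent of ability, which is what makes the LLM responses for items 5–8 close to noise.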
Step 1: Fit 2PL baseline and get global scalar lambda
human_pars <- fit_2pl(observed, technical = list(NCYCLES = 500))$pars
global_tuned <- tune_lambda_ability_risk(
lambda_grid = seq(0, 1, by = 0.1),
observed = observed,
predicted = predicted,
generated = generated,
initial_pars = human_pars,
fit_fn = fit_mixed_subjects_mml,
n_quad = 11,
control = list(maxit = 200)
)
cat("Global scalar best lambda:", global_tuned$best_lambda, "\n")
#> Global scalar best lambda: 0.3897205

The global scalar $\lambda$ is forced to a compromise: the four poor items constrain it to a value smaller than what items 1–4 could support.
Step 2: PPI++ score per item (fast diagnostic)
tune_lambda_ppi_score_item() applies the Proposition 2
formula independently per item using the 2×2 diagonal block of
and the item-level sub-vectors of the score matrices. This is fast (no
fitting required) and shows which items are well-predicted.
ppi_item <- tune_lambda_ppi_score_item(
observed = observed,
predicted = predicted,
item_pars = human_pars,
n_generated = n_generated,
n_quad = 11
)
cat("Per-item PPI++ lambda:\n")
#> Per-item PPI++ lambda:
print(data.frame(item = ppi_item$item, lambda = round(ppi_item$lambda, 3)))
#> item lambda
#> 1 Item1 0.75
#> 2 Item2 0.75
#> 3 Item3 0.75
#> 4 Item4 0.75
#> 5 Item5 0.00
#> 6 Item6 0.00
#> 7 Item7 0.00
#> 8 Item8 0.00
cat("N/(n+N) upper bound:", round(n_generated / (n_human + n_generated), 3), "\n")
#> N/(n+N) upper bound: 0.75

Items 1–4 (F = Y) should show $\lambda$ at or near the $N/(n+N)$ upper bound; items 5–8 (random LLM) should show $\lambda \approx 0$.
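The two extremes can be reproduced with the simpler mean-estimation analogue of PPI++ power tuning, where $\lambda = \mathrm{Cov}(Y, F) / \{(1 + n/N)\,\mathrm{Var}(F)\}$. The sketch below is illustrative only: `ppi_lambda_mean()` is a hypothetical helper, and the package's per-item formula works on score vectors rather than on raw responses.

```r
# Mean-estimation analogue of PPI++ power tuning (illustration only; the
# package's per-item formula operates on score vectors, not raw responses).
# lambda = Cov(Y, F) / ((1 + n / N) * Var(F)), clipped to [0, 1].
ppi_lambda_mean <- function(y, f, n_generated) {
  n <- length(y)
  lambda <- cov(y, f) / ((1 + n / n_generated) * var(f))
  min(max(lambda, 0), 1)
}

set.seed(1)
n <- 400
N <- 1200
y <- rbinom(n, 1, 0.5)
f_perfect <- y                  # F = Y, as for items 1-4
f_random <- rbinom(n, 1, 0.5)   # F independent of Y, as for items 5-8

lambda_perfect <- ppi_lambda_mean(y, f_perfect, N)  # equals N / (n + N)
lambda_random <- ppi_lambda_mean(y, f_random, N)    # near 0, up to noise
```

With F = Y the ratio Cov(Y, F) / Var(F) is 1, so $\lambda$ reduces to $1/(1 + n/N) = N/(n+N)$, which is why a perfect predictor sits exactly at the upper bound.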
Step 3: Per-item ability-risk tuning
tune_lambda_ability_risk_item() uses coordinate descent: for each item $j$, it finds the $\lambda_j$ that minimizes ability-score risk while holding all other $\lambda_k$ ($k \neq j$) fixed. By default each coordinate is solved by direct 1-D optimization (continuous $\lambda_j$); pass method = "grid" to restrict it to the values in lambda_grid (which otherwise only bounds the search). Each evaluation fits with the frozen expected-count Q-function (not the full marginal-MML objective) because the IRT marginal likelihood does not decompose item-wise. Starting from the global scalar optimum (not from all-zeros) is essential; see the note on initialization below.
item_tuned <- tune_lambda_ability_risk_item(
lambda_grid = seq(0, 1, by = 0.25),
observed = observed,
predicted = predicted,
generated = generated,
initial_pars = human_pars,
init_lambda = global_tuned$best_lambda, # start from global best
n_quad = 11,
n_pass = 1,
control = list(maxit = 200)
)
cat("Per-item ability-risk lambda:\n")
#> Per-item ability-risk lambda:
print(data.frame(item = item_tuned$item, lambda = round(item_tuned$lambda, 3)))
#> item lambda
#> 1 Item1 0.39
#> 2 Item2 0.39
#> 3 Item3 0.39
#> 4 Item4 0.39
#> 5 Item5 0.00
#> 6 Item6 0.00
#> 7 Item7 0.00
#> 8 Item8 0.00

Items 1–4 should receive positive $\lambda$ (good predictor); items 5–8 should be near zero (poor predictor).
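The coordinate-descent loop described in this step can be sketched generically in base R. This is a toy under stated assumptions: `coord_descent_lambda()` and `toy_risk()` are hypothetical names, and the quadratic toy risk stands in for the ability-risk criterion, which in the package requires a model fit per evaluation.

```r
# Toy coordinate descent over a per-item lambda vector.
# `risk_fn` is a hypothetical stand-in for the ability-risk criterion.
coord_descent_lambda <- function(risk_fn, init_lambda, n_pass = 1,
                                 method = c("optimize", "grid"),
                                 lambda_grid = seq(0, 1, by = 0.25)) {
  method <- match.arg(method)
  lambda <- init_lambda
  for (pass in seq_len(n_pass)) {
    for (j in seq_along(lambda)) {
      # Risk as a function of lambda_j only; all other entries held fixed
      risk_j <- function(l) {
        cand <- lambda
        cand[j] <- l
        risk_fn(cand)
      }
      if (method == "grid") {
        # Restrict lambda_j to the grid values
        risks <- vapply(lambda_grid, risk_j, numeric(1))
        lambda[j] <- lambda_grid[which.min(risks)]
      } else {
        # Continuous 1-D search; the grid only supplies the bounds
        lambda[j] <- optimize(risk_j, interval = range(lambda_grid))$minimum
      }
    }
  }
  lambda
}

# Toy risk with a known minimizer: two "good" items and two "poor" items
target <- c(0.6, 0.6, 0.1, 0.1)
toy_risk <- function(l) sum((l - target)^2)

lambda_cont <- coord_descent_lambda(toy_risk, init_lambda = rep(0.4, 4))
lambda_grid_fit <- coord_descent_lambda(toy_risk, init_lambda = rep(0.4, 4),
                                        method = "grid")
```

Because the toy risk is separable, a single pass recovers the minimizer; a criterion that couples the items, as the ability-risk criterion does, may need more passes or a good starting point.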
Step 4: Compare scalar vs. per-item parameter recovery
fit_scalar <- global_tuned$best_fit
fit_per_item <- item_tuned$final_fit
rmse <- function(x, y) sqrt(mean((x - y)^2))
comparison <- data.frame(
item = true_pars$item,
true_a = round(true_pars$a, 3),
human_a = round(human_pars$a, 3),
scalar_a = round(fit_scalar$item_pars$a, 3),
item_a = if (is.null(fit_per_item)) NA_real_ else
round(fit_per_item$item_pars$a, 3)
)
knitr::kable(comparison, row.names = FALSE,
caption = "Discrimination recovery: scalar lambda vs. per-item lambda")

| item | true_a | human_a | scalar_a | item_a |
|---|---|---|---|---|
| Item1 | 0.800 | 0.727 | 0.751 | 0.715 |
| Item2 | 0.914 | 1.256 | 1.161 | 1.109 |
| Item3 | 1.029 | 1.117 | 1.125 | 1.026 |
| Item4 | 1.143 | 1.091 | 1.040 | 1.038 |
| Item5 | 1.257 | 1.758 | 1.583 | 1.766 |
| Item6 | 1.371 | 1.610 | 1.526 | 1.618 |
| Item7 | 1.486 | 1.369 | 1.457 | 1.371 |
| Item8 | 1.600 | 1.925 | 2.227 | 1.921 |
cat("RMSE(a) human-only: ",
round(rmse(human_pars$a, true_pars$a), 4), "\n")
#> RMSE(a) human-only: 0.2645
cat("RMSE(a) scalar MML: ",
round(rmse(fit_scalar$item_pars$a, true_pars$a), 4), "\n")
#> RMSE(a) scalar MML: 0.2755
if (!is.null(fit_per_item)) {
cat("RMSE(a) per-item MML: ",
round(rmse(fit_per_item$item_pars$a, true_pars$a), 4), "\n")
}
#> RMSE(a) per-item MML: 0.2479

Important note on initialization
Starting coordinate descent from all-zeros is not recommended. When all other items are at $\lambda = 0$, each single-item improvement is diluted across the full ability-risk criterion, making improvements hard to detect. The recommended workflow is:
- Fit the scalar global optimum with tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects_mml).
- Pass that global value as init_lambda to tune_lambda_ability_risk_item().
- The coordinate descent then identifies which items should deviate from the global default, raising $\lambda$ for well-predicted items and lowering it for poorly-predicted ones.
Approximation caveat
The per-item lambda coordinate descent uses the frozen
Q-function (not the full marginal-MML objective) for each
candidate evaluation. This is necessary because the IRT marginal
likelihood integrates the joint response pattern and does not decompose
item-wise. The approximation is good when initial_pars is
close to the converged parameters. For final reporting, always:
- Compare the per-item fit against the scalar MML baseline.
- Report per-item results as approximate / experimental.
- Use vcov(item_tuned$final_fit) for uncertainty (which applies vcov_mixed_subjects with the vector-lambda bread and meat scaling).
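Once a covariance matrix is available, standard errors and Wald intervals follow in the usual way. The sketch below uses a made-up 2×2 matrix in place of vcov(item_tuned$final_fit); the parameter names, the values, and the assumption that the matrix rows follow the order of the estimate vector are all illustrative and should be checked against the package documentation.

```r
# Toy covariance matrix standing in for vcov(item_tuned$final_fit).
# Names and values are made up; row/column order is assumed to match `est`.
V <- matrix(c(0.04, 0.01,
              0.01, 0.09),
            nrow = 2,
            dimnames = list(c("a1", "d1"), c("a1", "d1")))
est <- c(a1 = 1.10, d1 = -0.40)

# Standard errors are the square roots of the diagonal
se <- sqrt(diag(V))

# 95% Wald intervals: estimate +/- z * SE
z <- qnorm(0.975)
ci <- cbind(lower = est - z * se, upper = est + z * se)
```

Given the approximation caveat above, such intervals for per-item fits are best read alongside the corresponding intervals from the scalar MML baseline.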