Understanding Ability-Risk Tuning

Why this vignette exists

This vignette is meant to alleviate some common misunderstandings of mixed-subjects IRT calibration. While the core idea of augmenting human data with LLM-generated data and estimating a $\lambda$ value to tune how much you want the LLM-generated to contribute is straightforward, the nuances of the ability-risk objective are not. In this vignette, we will go into the data requirements for mixed-subject calibrations and the ability-risk objective itself to help set users up for success. If there is one major takeaway from this document, it is that the tuning parameter $\lambda$ is not asking whether LLM-generated responses look human in the aggregate. It is, instead, asking whether an LLM-based response-generation procedure can predict the row-level human response structure enough to reduce downstream ability-estimation error. The most important object is therefore the paired prediction matrix, $P$ .

Key Intuition

The three response matrices

Mixed-subjects IRT requires three item response matrices. Let $J$ be the number of items, $n$ be the number of observed human respondents, and $N$ be the number of additional generated respondents.

1. Observed human responses: $O$

This is the real human pilot calibration response matrix, with structure

$O \in \{0,1\}^{n \times J}.$

Each row is the full observed response string from one human respondent. Each column corresponds to an item. Each entry is an observed, dichotomously scored, response.

2. Paired LLM-predicted human responses: $P$

This is the LLM-predicted response matrix for the same human respondents in $O$ , with structure

$P \in \{0,1\}^{n \times J}.$

Pay special attention to the requirement of “same human respondents.” Row $i$ in $P$ must be predicted responses for the respondent in row $i$ in $O$ . Column $j$ in $P$ must correspond to column $j$ in $O$ . This is the diagnostic matrix. It tells the method whether the LLM response-generation procedure is informative about the human response process and defines the magnitude and confidence in the mixed-subjects correction.

Importantly, this means that when generating $P$ , users need to transmit both the content of the items being responded to and some sort of information about the rows of $O$ to the LLM generating the predicted responses. This can be in the form of narrative or covariate information about the respondents, held out item responses, or something else.

3. Additional LLM-generated responses: $G$

This is the larger synthetic or LLM-generated response matrix, with structure

$G \in \{0,1\}^{N \times J}.$

Typically, $N \gg n$ to maximize the potential improvement in post-calibration ability estimation precision. The rows in $G$ are not paired with rows in $O$ , but instead additional generated respondents. However, the crucial requirement is that $G$ is meant to be sampled from the same distribution as $P$ , meaning it should be created using a procedure that is as close as possible to the procedure used to create $P$ and some amount of information about the ability distribution in $P$ .

A useful way to remember the design is:

Matrix	Shape	Rows	Purpose
$O$	$n \times J$	observed human respondents	anchor the calibration target to the human population
$P$	$n \times J$	LLM predictions for those same human respondents	estimate how the LLM procedure agrees with or deviates from human responses
$G$	$N \times J$	additional LLM-generated rows	provide extra precision derived from synthetic information after correction

Another helpful rule of thumb is that if $P$ is not row-aligned with $O$ , you should expect $\lambda \to 0$ .

The mixed-subjects IRT objective

The recommended scalar- $\lambda$ workflow in mixedsubjectsirt uses a marginal maximum likelihood (MML) objective of the form

$\hat\gamma_\lambda =\arg\min_\gamma\big[L_O^{\mathrm{marg}}(\gamma)+\lambda\big(L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\big)\big].$

Here $\gamma = \{a_1, \ldots, a_J, d_1, \ldots, d_J \}$ is the vector of item parameters. For a 2PL model, we write item $j$ ’s response probability as

$p_j(\theta;\gamma_j) =\mathrm{logit}^{-1}(d_j + a_j \theta),$

where $d_j$ is the intercept and $a_j$ is the discrimination.

The three pieces of the objective each have different jobs in the mixed-subjects loss: $L_O^{\mathrm{marg}}(\gamma)$ is the usual human-data marginal IRT likelihood evaluated at $\gamma$ . This term anchors the calibration to humans. $L_G^{\mathrm{marg}}(\gamma)$ is the likelihood contribution from the additional generated response rows. $L_P^{\mathrm{marg}}(\gamma)$ is the correction term. It estimates what the same LLM-generation procedure says about the humans whose actual responses are observed.

The term $L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}$ is the prediction-powered correction. The generated matrix $G$ adds information, while the paired prediction matrix $P$ allows us to estimate the LLM procedure’s bias, noise, and covariance with the human response process.

What lambda is learning

The tuning parameter $\lambda$ weights how much the LLM-generated responses are allowed to contribute to parameter estimation. When $\lambda = 0$ ,

$L_O^{\mathrm{marg}}(\gamma) + \lambda \{L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\} = L_O^{\mathrm{marg}}(\gamma),$

so the method falls back to human-only calibration. When $\lambda > 0$ , the method borrows information from $G$ , corrected by $P$ .

This means $\lambda=0$ is not a failure of the software, instead it is a protective outcome that occurs when the LLM responses do not reduce expected ability-estimation error. A high $\lambda$ requires more than plausible synthetic response rows. It requires the paired LLM predictions in $P$ to track the human responses in $O$ in a way that is consistent across respondents.

Ability-risk tuning

There are two related (but distinct) tuning ideas in the package. The original PPI++ paper minimizes the standard errors of the estimated parameters (the trace of the covariance matrix). The function tune_lambda_ppi_score() implements this.

The function tune_lambda_ability_risk() asks a more practical psychometric question: Which value of $\lambda$ minimizes expected downstream ability-estimation error?

The approximate target is $\widehat R(\lambda) = \frac{1}{M} \sum_{m=1}^{M} g_m^\top \widehat\Sigma_{\gamma,\lambda} g_m,$ where:

$y_m$ is a target response pattern;
$\hat\theta(y_m;\hat\gamma_\lambda)$ is the ability estimate produced from that response pattern;
$g_m = \nabla_\gamma \hat\theta(y_m;\hat\gamma_\lambda)$ is gradient of the ability estimate, which captures the sensitivity of the ability estimate to item-parameter error;
$\widehat\Sigma_{\gamma,\lambda}$ is the full covariance matrix of the item-parameter estimates under the mixed-subjects fit.

$g_m^\top \widehat\Sigma_{\gamma,\lambda} g_m$ thus propagates item-parameter uncertainty and covariance structure into ability-estimation uncertainty.

This matters because ability-risk tuning is not the same as minimizing average item-parameter standard errors. The off-diagonal elements of the item-parameter covariance matrix now matter, describing how uncertainty in one item parameter moves with uncertainty in another. Some covariance patterns may cancel out for ability scoring; others may distort the scale in high-information regions. Ability-risk tuning weights these covariance patterns by their downstream impact on $\hat\theta$ , averaged over an expected ability distribution.

Why row alignment matters

The easiest way to understand $\lambda$ is to compare three cases.

Case A: perfect paired prediction

Suppose

$P = O.$

Here, the paired LLM prediction is exactly equal to the human response matrix. From the perspective of maximizing the contribution of $G$ to estimation, this is the best possible version of $P$ . In this case, $\lambda$ should be large, though not necessarily $\lambda = 1$ . The finite- $N$ benchmark is

$\lambda_{\max} = \frac{1}{1+n/N} = \frac{N}{n+N}.$

If $n=400$ and $N=1200$ , we see

$\lambda_{\max} = \frac{1200}{400+1200}=0.75.$

So even when $P=O$ , if $N \not\gg n$ , $\lambda < 1$ should be expected.

Case B: row-shuffled perfect predictions

Now suppose

$P = \mathrm{shuffle\_rows}(O).$

The marginal item means are identical. The total-score distribution is identical. The item difficulty information is identical. The only thing that’s changed is row $i$ in $P$ is no longer a prediction for row $i$ in $O$ . This should produce $\lambda \approx 0$ , because the row-aligned covariance structure has been destroyed. To observe this for both $\lambda$ tuning objectives, you can run:

predicted_shuffled <- observed[sample(nrow(observed)), ]

lambda_shuffled_ppi <- tune_lambda_ppi_score(
  observed    = observed,
  predicted   = predicted_shuffled,
  item_pars   = human_fit$pars,
  n_generated = nrow(generated)
)

lambda_shuffled_ppi$lambda

lambda_shuffled_ability_risk <- tune_lambda_ability_risk(
  observed    = observed,
  predicted   = predicted_shuffled,
  item_pars   = human_fit$pars,
  n_generated = nrow(generated)
)

lambda_shuffled_ability_risk$lambda

Case C: same DGP, fresh Bernoulli draw

Suppose both $O$ and $P$ are generated from the same IRT model:

$O_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\},$

$P_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\}.$

This is “same DGP,” but $P$ is still a fresh stochastic response. It is not the same as $O$ . The two matrices share person ability and item parameters, but they do not share response noise.

For a single item,

$\operatorname{Cov}(O_{ij},P_{ij}) = \operatorname{Var}_\theta[p_j(\theta)].$

But

$\operatorname{Var}(P_{ij}) = \operatorname{Var}_\theta[p_j(\theta)] + \mathbb{E}_\theta[p_j(\theta)\{1-p_j(\theta)\}].$

The Bernoulli noise in $P$ dilutes the control-variate signal. As a result, a fresh same-DGP draw may produce a modest $\lambda$ , not a large one. This distinction is important: Merely producing the same item parameters does not imply strong paired prediction.

What kind of LLM data produces higher lambda?

A useful $P$ matrix has to predict row-level response structure. A good $P$ should have these properties:

Same rows as $O$ : row $i$ in $P$ predicts row $i$ in $O$ .
Same item columns as $O$ .
Target response not leaked: when predicting $P_{ij}$ , the prompt must not include $O_{ij}$ .
Construct-relevant respondent information: covariates or context should help infer the respondent’s likely response.
Within-person response structure: the LLM should be able to infer something about the respondent and their knowledge or ability level from other responses or covariates.
Same procedure for $P$ and $G$ : the generated matrix should be produced by an analogous procedure to the paired predicted matrix.

One approach to row alignment: leave-one-item-out prediction

When you have a human response matrix $O$ , an approach to build $P$ is leave-one-item-out response prediction. For each respondent $i$ and item $j$ :

Mask response $O_{ij}$ .
Give the LLM the text and responses for the other items $O_{i,-j}$ .
Give the LLM the text for item $j$ .
Ask it to predict the response to item $j$ .
Store the result as $P_{ij}$ .

Another approach: covariate-based prediction

If you have construct-relevant covariates, you can build $P$ without using other item responses or augment the LOO-response prediction approach outlined above.

Example covariates include:

grade level;
age;
prior achievement;
language background;
prior placement scores;
response-time or engagement indicators;
classroom, school, or instructional context;
demographic variables, where appropriate and ethically justified.

Something that probably won’t work: item-text-only generation

Item-text-only generation usually predicts column properties (like item parameters and relative item spacings), not row-aligned responses. Row-aligned responses are important because they allow the method to link the underlying ability distributions between human respondents and LLM-generated respondents. This is why an item-text-only approach may produce synthetic data that looks plausible in aggregate, but still produces $\lambda=0$ .

How to generate $G$

The generated matrix $G$ should be produced using a procedure that mirrors the procedure used to produce $P$ .

If $P$ is generated with leave-one-item-out prompts, then $G$ should also be generated with leave-one-item-out or masked-item prompts.

One possible procedure:

Sample or resample a respondent profile.
Create a partial response context.
Mask one target item.
Ask the LLM to predict the masked response.
Repeat until a full generated response row is built.

The most important rule is that $G$ and $P$ should be generated by the same prediction mechanism. If $P$ is generated with row-aligned covariates and response history, but $G$ is generated by asking the LLM to invent full response strings from scratch, then $L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}$ may not be zero in expectation, producing asymptotically biased parameter estimates.

Summary

Mixed-subjects IRT is useful when the LLM response-generation procedure captures respondent-level response structure. The generated matrix $G$ matters, but the paired matrix $P$ is what lets the method learn whether $G$ should be trusted. The key understanding is that $G$ supplies extra synthetic rows, but $P$ tells us how much to trust them.

If $P$ is row-aligned and predictive of $O$ , $\lambda$ can be positive and the generated data can improve calibration. If $P$ only reproduces marginal item difficulty, or if its rows are not aligned with $O$ , then $\lambda$ should shrink toward zero. This is a feature, not a bug.

Technical Explanation

The Choosing Lambda vignette explains which tuning function to use and when. This section derives the mathematics those functions implement: how item-parameter uncertainty is propagated into ability scores, and why minimizing the resulting ability risk is a fundamentally different objective from the original PPI++ trace criterion.

Throughout, let $\gamma = \{a_1, \dots, a_J, d_1, \dots, d_J\}$ collect the $2J$ item parameters of a $J$ -item 2PL model, ordered as all discriminations followed by all intercepts. This is the ordering convention used by package functions like fit$par, vcov(), and ability_gradient(). The item response function is again

$p_{j}(\theta;\gamma_j) \;=\; \Pr(Y_j = 1 \mid \theta) \;=\; \operatorname{logit}^{-1}\!\big(d_j + a_j \theta\big).$

Overview: four objects, one objective

Ability-risk tuning chains four quantities together:

The mixed-subjects estimator $\hat\gamma(\lambda)$ , a function of the tuning parameter $\lambda$ .
Its sandwich covariance $\Sigma_\gamma(\lambda) = \operatorname{Cov}(\hat\gamma)$ .
The ability estimate $\hat\theta_i(\gamma)$ for a response pattern $y_i$ , together with its gradient $g_i = \partial \hat\theta_i / \partial \gamma$ .
The propagated risk $g_i' \Sigma_\gamma(\lambda)\, g_i$ , averaged over a target population.

Tuning chooses $\lambda$ to minimize that average. The sections below build up each link in turn.

1. The estimator and its estimating equation

The mixed-subjects estimator minimizes a PPI++-style combined objective over human (observed), paired-predicted, and generated responses,

$L_\lambda(\gamma) \;=\; L_O(\gamma) \;+\; \lambda\,\big[L_G(\gamma) - L_P(\gamma)\big],$

where each $L$ is a marginal (or Bock–Aitkin expected-count) negative log-likelihood. The estimator $\hat\gamma(\lambda)$ solves the estimating equation

$\Psi_\lambda(\gamma) \;=\; \psi_O(\gamma) + \lambda\,\big[\psi_G(\gamma) - \psi_P(\gamma)\big] \;=\; 0, \qquad \psi = \nabla_\gamma L.$

Setting $\lambda = 0$ recovers the human-only calibration; $\lambda > 0$ borrows strength from the LLM responses while the $-\psi_P$ term de-biases that contribution at the population level (the PPI++ correction). The per-person score contributions to $\psi$ are, for item $j$ ,

$s_{ij}^{a} = (y_{ij} - \bar p_{ij})\,\bar\theta_i, \qquad s_{ij}^{d} = (y_{ij} - \bar p_{ij}),$

evaluated under the posterior over $\theta$ .

2. The sandwich covariance of $\hat\gamma$

$\hat\gamma(\lambda)$ has asymptotic covariance of the form

$\Sigma_\gamma(\lambda) \;=\; A_\lambda^{-1}\, B_\lambda\, A_\lambda^{-1},$

with bread $A_\lambda = \mathbb{E}[\nabla_\gamma \Psi_\lambda]$ and meat $B_\lambda = \operatorname{Cov}(\Psi_\lambda)$ .

Bread. Normally we would combine the three Hessians block by block,

$A_\lambda \;=\; H_O + \lambda\,\big(H_G - H_P \big),$

where each $H$ is the appropriate Hessian. Since he MML estimator marginalizes over $\theta$ , its bread must use Louis’s (1982) observed-information identity $A^{\text{marg}} = H^{\text{comp}} - I^{\text{miss}}$ , subtracting the missing information louis_missing_info() from the complete-data Hessian. Using the complete-data Hessian alone would overstate efficiency. This is what the package computes by default.

Meat. The meat is the covariance of the labeled correction plus the independent generated contribution,

$B_\lambda \;=\; \frac{1}{n}\operatorname{Cov}\!\big(S_{\text{obs}} - \lambda S_{\text{pred}}\big) \;+\; \frac{\lambda^2}{N}\operatorname{Cov}\!\big(S_{\text{gen}}\big),$

with $n$ labeled and $N$ generated subjects. The vcov() S3 method dispatches automatically.

3. Ability scoring and the implicit gradient

Given item parameters $\gamma$ , the bounded maximum-likelihood ability estimate for response pattern $y_i$ solves the scoring equation

$S(\theta;\gamma, y_i) \;=\; \sum_{j} a_j\,\big(y_{ij} - p_j(\theta)\big) \;=\; 0,$

which score_theta() finds by 1-D optimization on the interval bounds. The risk machinery needs the sensitivity of that solution $\hat\theta_i$ to the item parameters, $g_i = \partial \hat\theta_i / \partial \gamma$ . Because $\hat\theta_i$ is defined implicitly by $S(\hat\theta_i; \gamma) = 0$ , the implicit function theorem gives

$\frac{\partial \hat\theta_i}{\partial \gamma_k} \;=\; -\,\Big(\frac{\partial S}{\partial \theta}\Big)^{-1} \frac{\partial S}{\partial \gamma_k}.$

The denominator is the (negative) test information at $\hat\theta_i$ ,

$\frac{\partial S}{\partial \theta} \;=\; -\sum_j a_j^2\, p_j(1 - p_j),$

and the numerators, for the discrimination and intercept of item $j$ , are

$\frac{\partial S}{\partial a_j} = (y_{ij} - p_j) - a_j\, p_j(1 - p_j)\,\hat\theta_i, \qquad \frac{\partial S}{\partial d_j} = -\,a_j\, p_j(1 - p_j).$

Where the gradient is undefined. The implicit-function argument requires an interior optimum. At a boundary estimate (all-correct or all-incorrect patterns push $\hat\theta_i$ to a bound), the score equation does not hold and the gradient is theoretically undefined; ability_gradient() returns NA for those rows, and they drop out of the risk average via na.rm = TRUE. Rows with vanishing test information $(|\partial S/\partial\theta| < \varepsilon)$ are treated the same way.

4. Delta-method propagation and the risk

With $g_i$ in hand, the delta method propagates item-parameter uncertainty into the score:

$\rho_i(\lambda) = \operatorname{Var}\big(\hat\theta_i\big) \;\approx\; g_i'\, \Sigma_\gamma(\lambda)\, g_i.$

Averaging over a target population of $M$ response patterns gives the scalar objective that the tuners minimize,

$R(\lambda) \;=\; \frac{1}{M}\sum_{i=1}^{M} \rho_i(\lambda) \;=\; \mathbb{E}_{\text{target}}\!\big[\, g'\,\Sigma_\gamma(\lambda)\, g \,\big].$

The expectation is over the target population’s ability distribution, which is why target_resp matters: tuning for the observed calibration sample (target_resp = observed) and tuning for a separate operational scoring population generally give different $\lambda$ .

5. Why this differs from the PPI++ trace objective

The original PPI++ tuning rule minimizes the trace of the item-parameter covariance,

$\lambda^{\star}_{\text{PPI}} = \arg\min_\lambda \operatorname{Tr}\big[\Sigma_\gamma(\lambda)\big],$

which tune_lambda_ppi_score() evaluates in closed form. Writing $\Sigma_\gamma = (\sigma_{kl})$ , the two objectives expand as

$\operatorname{Tr}(\Sigma_\gamma) = \sum_k \sigma_{kk}, \qquad g'\Sigma_\gamma g = \sum_{k} g_k^2\,\sigma_{kk} + \sum_{k \neq l} g_k g_l\,\sigma_{kl}.$

The trace sees only the diagonal variances and weights every parameter equally. The ability risk weights each variance by $g_k^2$ (how much that particular parameter moves the score) and uses the off-diagonal covariances $\sigma_{kl}$ , which encode the scale/identification structure of the 2PL. Errors in $a_j$ and $d_j$ that are correlated in a direction that leaves $\hat\theta$ unchanged are penalized by the trace but (correctly) ignored by the ability risk. The two criteria therefore select different $\lambda$ in general; use the ability risk for operational scoring and the trace as a theoretical diagnostic.

Summary

Symbol	Meaning	Computed by
$\hat\gamma(\lambda)$	Mixed-subjects item parameters	`fit_mixed_subjects_mml()`
$\Sigma_\gamma(\lambda)$	Sandwich covariance of $\hat\gamma$	`vcov()` → `vcov_mixed_subjects_mml()`
$\hat\theta_i$	Bounded ML ability score	`score_theta()`
$g_i = \partial\hat\theta_i/\partial\gamma$	Implicit ability gradient	`ability_gradient()`
$g_i'\Sigma_\gamma g_i$	Propagated score variance	`ability_risk()`
$R(\lambda) = \mathbb{E}[g'\Sigma_\gamma g]$	Ability-risk objective	`tune_lambda_ability_risk()`

For the practical workflow built on these pieces — cross-fitting, target populations, and the choice between estimators — see the Choosing Lambda vignette.

Why this vignette exists

Key Intuition

The three response matrices

1. Observed human responses: OO

2. Paired LLM-predicted human responses: PP

3. Additional LLM-generated responses: GG

The mixed-subjects IRT objective

What lambda is learning

Ability-risk tuning

Why row alignment matters

Case A: perfect paired prediction

Case B: row-shuffled perfect predictions

Case C: same DGP, fresh Bernoulli draw

What kind of LLM data produces higher lambda?

One approach to row alignment: leave-one-item-out prediction

Another approach: covariate-based prediction

Something that probably won’t work: item-text-only generation

How to generate GG

Summary

Technical Explanation

Overview: four objects, one objective

1. The estimator and its estimating equation

2. The sandwich covariance of γ̂\hat\gamma

3. Ability scoring and the implicit gradient

4. Delta-method propagation and the risk

5. Why this differs from the PPI++ trace objective

Summary

1. Observed human responses: $O$

2. Paired LLM-predicted human responses: $P$

3. Additional LLM-generated responses: $G$

How to generate $G$

2. The sandwich covariance of $\hat\gamma$