Why this vignette exists

This vignette addresses some common misunderstandings about mixed-subjects IRT calibration. The core idea is straightforward: augment human data with LLM-generated data and estimate a value $\lambda$ that tunes how much the LLM-generated data contribute. The nuances of the ability-risk objective are not. This vignette covers the data requirements for mixed-subjects calibration and the ability-risk objective itself, to help set users up for success. If there is one major takeaway, it is that the tuning parameter $\lambda$ is not asking whether LLM-generated responses look human in the aggregate. It is instead asking whether an LLM-based response-generation procedure can predict the row-level human response structure well enough to reduce downstream ability-estimation error. The most important object is therefore the paired prediction matrix, $P$.

Key Intuition

The three response matrices

Mixed-subjects IRT requires three item response matrices. Let $J$ be the number of items, $n$ be the number of observed human respondents, and $N$ be the number of additional generated respondents.

1. Observed human responses: $O$

This is the real human pilot calibration response matrix, with structure

$$O \in \{0,1\}^{n \times J}.$$

Each row is the full observed response string from one human respondent. Each column corresponds to an item. Each entry is an observed, dichotomously scored response.

2. Paired LLM-predicted human responses: $P$

This is the LLM-predicted response matrix for the same human respondents in $O$, with structure

$$P \in \{0,1\}^{n \times J}.$$

Pay special attention to the requirement of “same human respondents.” Row $i$ in $P$ must hold the predicted responses for the respondent in row $i$ of $O$. Column $j$ in $P$ must correspond to column $j$ in $O$. This is the diagnostic matrix. It tells the method whether the LLM response-generation procedure is informative about the human response process, and it determines both the magnitude of the mixed-subjects correction and the confidence placed in it.

Importantly, this means that when generating $P$, users need to give the LLM both the content of the items being responded to and respondent-level information about each row of $O$. That information can take the form of narrative or covariate descriptions of the respondents, held-out item responses, or something else.

3. Additional LLM-generated responses: $G$

This is the larger synthetic or LLM-generated response matrix, with structure

$$G \in \{0,1\}^{N \times J}.$$

Typically, $N \gg n$ to maximize the potential improvement in post-calibration ability estimation precision. The rows in $G$ are not paired with rows in $O$; they are additional generated respondents. However, the crucial requirement is that $G$ is meant to be sampled from the same distribution as $P$. It should therefore be created using a procedure as close as possible to the one used to create $P$, informed by some amount of information about the ability distribution represented in $P$.

A useful way to remember the design is:

| Matrix | Shape | Rows | Purpose |
|---|---|---|---|
| $O$ | $n \times J$ | observed human respondents | anchor the calibration target to the human population |
| $P$ | $n \times J$ | LLM predictions for those same human respondents | estimate how the LLM procedure agrees with or deviates from human responses |
| $G$ | $N \times J$ | additional LLM-generated rows | provide extra precision derived from synthetic information after correction |

Another helpful rule of thumb is that if $P$ is not row-aligned with $O$, you should expect $\lambda \to 0$.
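The shape requirements above can be checked mechanically before any fitting. The sketch below uses a hypothetical helper, check_response_matrices(), which is not part of the package. It only verifies dimensions and 0/1 coding and reports average row-wise agreement between the observed and paired-predicted matrices. It cannot prove that rows are correctly aligned; that has to come from how the paired predictions were generated.

```r
# Hypothetical helper (an illustration, not a mixedsubjectsirt function):
# structural checks on the observed (O), paired-predicted (P), and
# generated (G) response matrices.
check_response_matrices <- function(observed, predicted, generated) {
  stopifnot(
    is.matrix(observed), is.matrix(predicted), is.matrix(generated),
    identical(dim(observed), dim(predicted)),  # P must be n x J, like O
    ncol(generated) == ncol(observed),         # G must cover the same J items
    all(observed %in% c(0, 1)),
    all(predicted %in% c(0, 1)),
    all(generated %in% c(0, 1))
  )
  # Share of items on which row i of P matches row i of O, averaged over rows.
  agreement <- rowMeans(observed == predicted)
  list(
    n = nrow(observed),
    N = nrow(generated),
    J = ncol(observed),
    mean_row_agreement = mean(agreement)
  )
}

# Toy data: P is a perfect copy of O, and G has three times as many rows.
set.seed(1)
O <- matrix(rbinom(20 * 5, 1, 0.6), nrow = 20)
P <- O
G <- matrix(rbinom(60 * 5, 1, 0.6), nrow = 60)
chk <- check_response_matrices(O, P, G)
chk$mean_row_agreement
```

A low agreement value is a hint that the paired predictions carry little row-level signal, although agreement alone does not determine the tuned weight.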

The mixed-subjects IRT objective

The recommended scalar-$\lambda$ workflow in mixedsubjectsirt uses a marginal maximum likelihood (MML) objective of the form

$$\hat\gamma_\lambda =\arg\min_\gamma\big[L_O^{\mathrm{marg}}(\gamma)+\lambda\big(L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\big)\big].$$

Here $\gamma = \{a_1, \ldots, a_J, d_1, \ldots, d_J \}$ is the vector of item parameters. For a 2PL model, we write item $j$’s response probability as

$$p_j(\theta;\gamma_j) =\mathrm{logit}^{-1}(d_j + a_j \theta),$$

where $d_j$ is the intercept and $a_j$ is the discrimination.
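As a concrete reading of the 2PL formula, the response probability is the logistic function of the intercept plus the discrimination times ability. The function name p_2pl() below is a hypothetical illustration, not a package function; plogis() is the inverse logit in base R.

```r
# Hypothetical illustration of the 2PL item response function in
# slope-intercept form: p_j(theta) = logit^{-1}(d_j + a_j * theta).
p_2pl <- function(theta, a, d) plogis(d + a * theta)

# With intercept d = 0, a respondent at theta = 0 has probability 0.5.
p_mid <- p_2pl(theta = 0, a = 1.2, d = 0)

# Probabilities increase with ability when the discrimination is positive.
p_curve <- p_2pl(theta = c(-1, 0, 1), a = 1.5, d = -0.5)
```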

The three pieces of the objective each have a different job in the mixed-subjects loss:

  • $L_O^{\mathrm{marg}}(\gamma)$ is the usual human-data marginal IRT negative log-likelihood evaluated at $\gamma$. This term anchors the calibration to humans.
  • $L_G^{\mathrm{marg}}(\gamma)$ is the contribution from the additional generated response rows.
  • $L_P^{\mathrm{marg}}(\gamma)$ is the correction term. It estimates what the same LLM-generation procedure says about the humans whose actual responses are observed.

The term $L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}$ is the prediction-powered correction. The generated matrix $G$ adds information, while the paired prediction matrix $P$ allows us to estimate the LLM procedure’s bias, noise, and covariance with the human response process.

What lambda is learning

The tuning parameter $\lambda$ weights how much the LLM-generated responses are allowed to contribute to parameter estimation. When $\lambda = 0$,

$$L_O^{\mathrm{marg}}(\gamma) + \lambda \{L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\} = L_O^{\mathrm{marg}}(\gamma),$$

so the method falls back to human-only calibration. When $\lambda > 0$, the method borrows information from $G$, corrected by $P$.

This means $\lambda = 0$ is not a failure of the software. It is a protective outcome that occurs when the LLM responses do not reduce expected ability-estimation error. A high $\lambda$ requires more than plausible synthetic response rows. It requires the paired LLM predictions in $P$ to track the human responses in $O$ in a way that is consistent across respondents.

Ability-risk tuning

There are two related (but distinct) tuning ideas in the package. The original PPI++ paper minimizes the total variance of the estimated parameters (the trace of their covariance matrix). The function tune_lambda_ppi_score() implements this.

The function tune_lambda_ability_risk() asks a more practical psychometric question: which value of $\lambda$ minimizes expected downstream ability-estimation error?

The approximate target is

$$\widehat R(\lambda) = \frac{1}{M} \sum_{m=1}^{M} g_m^\top \widehat\Sigma_{\gamma,\lambda} g_m,$$

where:

  • $y_m$ is a target response pattern;
  • $\hat\theta(y_m;\hat\gamma_\lambda)$ is the ability estimate produced from that response pattern;
  • $g_m = \nabla_\gamma \hat\theta(y_m;\hat\gamma_\lambda)$ is the gradient of the ability estimate, which captures the sensitivity of the ability estimate to item-parameter error;
  • $\widehat\Sigma_{\gamma,\lambda}$ is the full covariance matrix of the item-parameter estimates under the mixed-subjects fit.

The quadratic form $g_m^\top \widehat\Sigma_{\gamma,\lambda} g_m$ thus propagates item-parameter uncertainty and covariance structure into ability-estimation uncertainty.

This matters because ability-risk tuning is not the same as minimizing average item-parameter standard errors. The off-diagonal elements of the item-parameter covariance matrix now matter: they describe how uncertainty in one item parameter moves with uncertainty in another. Some covariance patterns may cancel out for ability scoring; others may distort the scale in high-information regions. Ability-risk tuning weights these covariance patterns by their downstream impact on $\hat\theta$, averaged over an expected ability distribution.

Why row alignment matters

The easiest way to understand $\lambda$ is to compare three cases.

Case A: perfect paired prediction

Suppose

$$P = O.$$

Here, the paired LLM prediction is exactly equal to the human response matrix. From the perspective of maximizing the contribution of $G$ to estimation, this is the best possible version of $P$. In this case, $\lambda$ should be large, though not necessarily $\lambda = 1$. The finite-$N$ benchmark is

$$\lambda_{\max} = \frac{1}{1+n/N} = \frac{N}{n+N}.$$

If $n = 400$ and $N = 1200$, we see

$$\lambda_{\max} = \frac{1200}{400+1200} = 0.75.$$

So even when $P = O$, if $N \not\gg n$, $\lambda < 1$ should be expected.
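The benchmark is simple arithmetic, so it is easy to see how the ceiling moves with the size of the generated sample. The helper name lambda_max() is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: finite-N ceiling on lambda when P equals O exactly.
lambda_max <- function(n, N) N / (n + N)

lam_small <- lambda_max(n = 400, N = 1200)   # 0.75, as in the text
lam_large <- lambda_max(n = 400, N = 40000)  # about 0.99
```

The ceiling approaches 1 only as the generated sample becomes much larger than the human sample.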

Case B: row-shuffled perfect predictions

Now suppose

$$P = \mathrm{shuffle\_rows}(O).$$

The marginal item means are identical. The total-score distribution is identical. The item difficulty information is identical. The only thing that has changed is that row $i$ in $P$ is no longer a prediction for row $i$ in $O$. This should produce $\lambda \approx 0$, because the row-aligned covariance structure has been destroyed. To observe this for both $\lambda$ tuning objectives, you can run:

```r
# Break row alignment: same rows as `observed`, in a random order.
predicted_shuffled <- observed[sample(nrow(observed)), ]

# PPI++ trace-based tuning with the shuffled predictions.
lambda_shuffled_ppi <- tune_lambda_ppi_score(
  observed    = observed,
  predicted   = predicted_shuffled,
  item_pars   = human_fit$pars,
  n_generated = nrow(generated)
)

lambda_shuffled_ppi$lambda

# Ability-risk tuning with the same shuffled predictions.
lambda_shuffled_ability_risk <- tune_lambda_ability_risk(
  observed    = observed,
  predicted   = predicted_shuffled,
  item_pars   = human_fit$pars,
  n_generated = nrow(generated)
)

lambda_shuffled_ability_risk$lambda
```

Case C: same DGP, fresh Bernoulli draw

Suppose both $O$ and $P$ are generated from the same IRT model:

$$O_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\},$$

$$P_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\}.$$

This is “same DGP,” but $P$ is still a fresh stochastic response. It is not the same as $O$. The two matrices share person ability and item parameters, but they do not share response noise.

For a single item,

$$\operatorname{Cov}(O_{ij},P_{ij}) = \operatorname{Var}_\theta[p_j(\theta)].$$

But

$$\operatorname{Var}(P_{ij}) = \operatorname{Var}_\theta[p_j(\theta)] + \mathbb{E}_\theta[p_j(\theta)\{1-p_j(\theta)\}].$$

The Bernoulli noise in $P$ dilutes the control-variate signal. As a result, a fresh same-DGP draw may produce a modest $\lambda$, not a large one. This distinction is important: merely producing the same item parameters does not imply strong paired prediction.
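A small simulation makes the dilution visible for one item. This is a sketch under stated assumptions: a single 2PL item with arbitrarily chosen parameters, standard normal abilities, and a scalar mean-estimation analogue in which the variance-minimizing weight is proportional to $\operatorname{Cov}(O_{ij},P_{ij})/\operatorname{Var}(P_{ij})$. It is not the computation the package tuners perform.

```r
# Single-item illustration of Case C (assumed values, not package output).
set.seed(123)
n_sim <- 200000
a <- 1.2   # assumed discrimination
d <- 0     # assumed intercept

theta <- rnorm(n_sim)         # shared abilities
p     <- plogis(d + a * theta)
O_j   <- rbinom(n_sim, 1, p)  # "human" response
P_j   <- rbinom(n_sim, 1, p)  # fresh draw at the same theta

signal <- var(p)              # Var_theta[p_j(theta)]
noise  <- mean(p * (1 - p))   # E_theta[p_j(theta) {1 - p_j(theta)}]

ratio_theory <- signal / (signal + noise)
ratio_sim    <- cov(O_j, P_j) / var(P_j)
c(theory = ratio_theory, simulated = ratio_sim)
```

Both ratios fall well below 1, which is the sense in which a fresh same-DGP draw supports only a modest weight even though it is generated from the correct model.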

What kind of LLM data produces higher lambda?

A useful $P$ matrix has to predict row-level response structure. A good $P$ should have these properties:

  1. Same rows as $O$: row $i$ in $P$ predicts row $i$ in $O$.
  2. Same item columns as $O$.
  3. Target response not leaked: when predicting $P_{ij}$, the prompt must not include $O_{ij}$.
  4. Construct-relevant respondent information: covariates or context should help infer the respondent’s likely response.
  5. Within-person response structure: the LLM should be able to infer something about the respondent and their knowledge or ability level from other responses or covariates.
  6. Same procedure for $P$ and $G$: the generated matrix should be produced by an analogous procedure to the paired predicted matrix.

One approach to row alignment: leave-one-item-out prediction

When you have a human response matrix $O$, one approach to building $P$ is leave-one-item-out response prediction. For each respondent $i$ and item $j$:

  1. Mask response $O_{ij}$.
  2. Give the LLM the text and responses for the other items $O_{i,-j}$.
  3. Give the LLM the text for item $j$.
  4. Ask it to predict the response to item $j$.
  5. Store the result as $P_{ij}$.
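The steps above can be sketched as a double loop. Everything named here is hypothetical: build_paired_predictions() is not a package function, and predict_fn stands in for whatever LLM call you use. The toy predictor below is a majority vote over the respondent's other responses, included only so the sketch runs without an LLM.

```r
# Hypothetical sketch of leave-one-item-out prediction.
# `predict_fn` is a placeholder for an LLM call; it receives a context
# list and must return a single 0/1 (or TRUE/FALSE) prediction.
build_paired_predictions <- function(observed, item_text, predict_fn) {
  n <- nrow(observed)
  J <- ncol(observed)
  predicted <- matrix(NA_integer_, n, J, dimnames = dimnames(observed))
  for (i in seq_len(n)) {
    for (j in seq_len(J)) {
      context <- list(
        other_text      = item_text[-j],     # text of the other items
        other_responses = observed[i, -j],   # O_{i,-j}; O_ij is never passed
        target_text     = item_text[j]       # text of the masked item
      )
      predicted[i, j] <- as.integer(predict_fn(context))
    }
  }
  predicted
}

# Toy stand-in for the LLM: predict 1 if most other responses are 1.
toy_predict <- function(context) mean(context$other_responses) >= 0.5

O <- rbind(c(1, 1, 1, 0),
           c(0, 0, 0, 1),
           c(1, 0, 1, 1))
items <- paste("Item", 1:4)
P <- build_paired_predictions(O, items, toy_predict)
```

Because row $i$ of the result is built only from row $i$ of the input, row alignment holds by construction, and the masked response never enters its own prediction context.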

Another approach: covariate-based prediction

If you have construct-relevant covariates, you can build $P$ without using other item responses, or use them to augment the leave-one-item-out prediction approach outlined above.

Example covariates include:

  • grade level;
  • age;
  • prior achievement;
  • language background;
  • prior placement scores;
  • response-time or engagement indicators;
  • classroom, school, or instructional context;
  • demographic variables, where appropriate and ethically justified.

Something that probably won’t work: item-text-only generation

Item-text-only generation usually predicts column properties (like item parameters and relative item spacings), not row-aligned responses. Row-aligned responses are important because they allow the method to link the underlying ability distributions between human respondents and LLM-generated respondents. This is why an item-text-only approach may produce synthetic data that looks plausible in aggregate, but still produces $\lambda = 0$.

How to generate $G$

The generated matrix $G$ should be produced using a procedure that mirrors the procedure used to produce $P$.

If $P$ is generated with leave-one-item-out prompts, then $G$ should also be generated with leave-one-item-out or masked-item prompts.

One possible procedure:

  1. Sample or resample a respondent profile.
  2. Create a partial response context.
  3. Mask one target item.
  4. Ask the LLM to predict the masked response.
  5. Repeat until a full generated response row is built.

The most important rule is that $G$ and $P$ should be generated by the same prediction mechanism. If $P$ is generated with row-aligned covariates and response history, but $G$ is generated by asking the LLM to invent full response strings from scratch, then $L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}$ may not be zero in expectation, producing asymptotically biased parameter estimates.

Summary

Mixed-subjects IRT is useful when the LLM response-generation procedure captures respondent-level response structure. The generated matrix $G$ supplies extra synthetic rows, but the paired matrix $P$ is what tells the method how much to trust them.

If $P$ is row-aligned and predictive of $O$, $\lambda$ can be positive and the generated data can improve calibration. If $P$ only reproduces marginal item difficulty, or if its rows are not aligned with $O$, then $\lambda$ should shrink toward zero. This is a feature, not a bug.

Technical Explanation

The Choosing Lambda vignette explains which tuning function to use and when. This section derives the mathematics those functions implement: how item-parameter uncertainty is propagated into ability scores, and why minimizing the resulting ability risk is a fundamentally different objective from the original PPI++ trace criterion.

Throughout, let $\gamma = \{a_1, \dots, a_J, d_1, \dots, d_J\}$ collect the $2J$ item parameters of a $J$-item 2PL model, ordered as all discriminations followed by all intercepts. This is the ordering convention used by package functions like fit$par, vcov(), and ability_gradient(). The item response function is again

$$p_{j}(\theta;\gamma_j) \;=\; \Pr(Y_j = 1 \mid \theta) \;=\; \operatorname{logit}^{-1}\!\big(d_j + a_j \theta\big).$$

Overview: four objects, one objective

Ability-risk tuning chains four quantities together:

  1. The mixed-subjects estimator $\hat\gamma(\lambda)$, a function of the tuning parameter $\lambda$.
  2. Its sandwich covariance $\Sigma_\gamma(\lambda) = \operatorname{Cov}(\hat\gamma)$.
  3. The ability estimate $\hat\theta_i(\gamma)$ for a response pattern $y_i$, together with its gradient $g_i = \partial \hat\theta_i / \partial \gamma$.
  4. The propagated risk $g_i' \Sigma_\gamma(\lambda)\, g_i$, averaged over a target population.

Tuning chooses $\lambda$ to minimize that average. The sections below build up each link in turn.

1. The estimator and its estimating equation

The mixed-subjects estimator minimizes a PPI++-style combined objective over human (observed), paired-predicted, and generated responses,

$$L_\lambda(\gamma) \;=\; L_O(\gamma) \;+\; \lambda\,\big[L_G(\gamma) - L_P(\gamma)\big],$$

where each $L$ is a marginal (or Bock–Aitkin expected-count) negative log-likelihood. The estimator $\hat\gamma(\lambda)$ solves the estimating equation

$$\Psi_\lambda(\gamma) \;=\; \psi_O(\gamma) + \lambda\,\big[\psi_G(\gamma) - \psi_P(\gamma)\big] \;=\; 0, \qquad \psi = \nabla_\gamma L.$$

Setting $\lambda = 0$ recovers the human-only calibration; $\lambda > 0$ borrows strength from the LLM responses while the $-\psi_P$ term de-biases that contribution at the population level (the PPI++ correction). The per-person score contributions to $\psi$ are, for item $j$,

$$s_{ij}^{a} = (y_{ij} - \bar p_{ij})\,\bar\theta_i, \qquad s_{ij}^{d} = (y_{ij} - \bar p_{ij}),$$

evaluated under the posterior over $\theta$.

2. The sandwich covariance of $\hat\gamma$

$\hat\gamma(\lambda)$ has asymptotic covariance of the form

$$\Sigma_\gamma(\lambda) \;=\; A_\lambda^{-1}\, B_\lambda\, A_\lambda^{-1},$$

with bread $A_\lambda = \mathbb{E}[\nabla_\gamma \Psi_\lambda]$ and meat $B_\lambda = \operatorname{Cov}(\Psi_\lambda)$.

Bread. Normally we would combine the three Hessians block by block,

$$A_\lambda \;=\; H_O + \lambda\,\big(H_G - H_P \big),$$

where each $H$ is the appropriate Hessian. Since the MML estimator marginalizes over $\theta$, its bread must use Louis’s (1982) observed-information identity $A^{\text{marg}} = H^{\text{comp}} - I^{\text{miss}}$, subtracting the missing information (computed by louis_missing_info()) from the complete-data Hessian. Using the complete-data Hessian alone would overstate efficiency. The Louis-corrected bread is what the package computes by default.

Meat. The meat is the covariance of the labeled correction plus the independent generated contribution,

$$B_\lambda \;=\; \frac{1}{n}\operatorname{Cov}\!\big(S_{\text{obs}} - \lambda S_{\text{pred}}\big) \;+\; \frac{\lambda^2}{N}\operatorname{Cov}\!\big(S_{\text{gen}}\big),$$

with $n$ labeled and $N$ generated subjects. The vcov() S3 method dispatches automatically.
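The meat formula can be written out directly once per-person score matrices are available. The sketch below is illustrative only: meat_mixed() is a hypothetical name, the score matrices are assumed to have one row per person and one column per item parameter, and random numbers stand in for real per-person scores. It is not the package's internal implementation.

```r
# Hypothetical sketch of the meat B_lambda from per-person score matrices.
# S_obs and S_pred are n x K (same respondents, same row order);
# S_gen is N x K (independent generated respondents).
meat_mixed <- function(S_obs, S_pred, S_gen, lambda) {
  n <- nrow(S_obs)
  N <- nrow(S_gen)
  cov(S_obs - lambda * S_pred) / n + lambda^2 * cov(S_gen) / N
}

# Simulated stand-ins for score contributions (not real IRT scores).
set.seed(42)
K <- 4
S_obs  <- matrix(rnorm(50 * K), 50, K)
S_pred <- S_obs + matrix(rnorm(50 * K, sd = 0.5), 50, K)  # tracks S_obs row by row
S_gen  <- matrix(rnorm(200 * K), 200, K)

B_zero <- meat_mixed(S_obs, S_pred, S_gen, lambda = 0)
B_half <- meat_mixed(S_obs, S_pred, S_gen, lambda = 0.5)
c(trace_zero = sum(diag(B_zero)), trace_half = sum(diag(B_half)))
```

In this toy setup the predicted scores track the observed scores row by row, so subtracting a share of them shrinks the first term by more than the generated term adds. Shuffling the rows of S_pred would remove that benefit, which mirrors the row-alignment argument made earlier.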

3. Ability scoring and the implicit gradient

Given item parameters $\gamma$, the bounded maximum-likelihood ability estimate for response pattern $y_i$ solves the scoring equation

$$S(\theta;\gamma, y_i) \;=\; \sum_{j} a_j\,\big(y_{ij} - p_j(\theta)\big) \;=\; 0,$$

which score_theta() finds by 1-D optimization on the interval bounds. The risk machinery needs the sensitivity of that solution $\hat\theta_i$ to the item parameters, $g_i = \partial \hat\theta_i / \partial \gamma$. Because $\hat\theta_i$ is defined implicitly by $S(\hat\theta_i; \gamma) = 0$, the implicit function theorem gives

$$\frac{\partial \hat\theta_i}{\partial \gamma_k} \;=\; -\,\Big(\frac{\partial S}{\partial \theta}\Big)^{-1} \frac{\partial S}{\partial \gamma_k}.$$

The denominator is the (negative) test information at $\hat\theta_i$,

$$\frac{\partial S}{\partial \theta} \;=\; -\sum_j a_j^2\, p_j(1 - p_j),$$

and the numerators, for the discrimination and intercept of item $j$, are

$$\frac{\partial S}{\partial a_j} = (y_{ij} - p_j) - a_j\, p_j(1 - p_j)\,\hat\theta_i, \qquad \frac{\partial S}{\partial d_j} = -\,a_j\, p_j(1 - p_j).$$

Where the gradient is undefined. The implicit-function argument requires an interior optimum. At a boundary estimate (all-correct or all-incorrect patterns push $\hat\theta_i$ to a bound), the score equation does not hold and the gradient is theoretically undefined; ability_gradient() returns NA for those rows, and they drop out of the risk average via na.rm = TRUE. Rows with vanishing test information ($|\partial S/\partial\theta| < \varepsilon$) are treated the same way.
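The implicit-gradient formulas can be checked numerically on a small example. The functions below (score_eq(), solve_theta(), implicit_gradient()) are hypothetical stand-ins written for this sketch, not the package's score_theta() or ability_gradient(). The item parameters are arbitrary, the root is found with uniroot() on assumed bounds of $-6$ and $6$, and boundary handling is simplified to an all-correct or all-incorrect check.

```r
p_2pl <- function(theta, a, d) plogis(d + a * theta)

# Scoring equation S(theta; gamma, y) for one response pattern.
score_eq <- function(theta, a, d, y) sum(a * (y - p_2pl(theta, a, d)))

# Stand-in scorer: root of the scoring equation on assumed bounds.
solve_theta <- function(a, d, y, bounds = c(-6, 6)) {
  uniroot(score_eq, interval = bounds, a = a, d = d, y = y, tol = 1e-12)$root
}

# Gradient of theta-hat with respect to (a_1..a_J, d_1..d_J) via the
# implicit function theorem. Returns NA for all-0 or all-1 patterns.
implicit_gradient <- function(a, d, y, bounds = c(-6, 6)) {
  if (all(y == 1) || all(y == 0)) return(rep(NA_real_, 2 * length(a)))
  theta <- solve_theta(a, d, y, bounds)
  p <- p_2pl(theta, a, d)
  w <- p * (1 - p)
  dS_dtheta <- -sum(a^2 * w)          # negative test information
  dS_da <- (y - p) - a * w * theta    # numerators for discriminations
  dS_dd <- -a * w                     # numerators for intercepts
  -c(dS_da, dS_dd) / dS_dtheta
}

a <- c(1.0, 1.5, 0.8)    # assumed discriminations
d <- c(0.0, -0.5, 0.5)   # assumed intercepts
g_mixed   <- implicit_gradient(a, d, y = c(1, 0, 1))
g_perfect <- implicit_gradient(a, d, y = c(1, 1, 1))  # all NA
```

The returned vector follows the ordering used throughout this section: all discriminations first, then all intercepts. Comparing it against a central finite difference of solve_theta() is a quick way to confirm the signs.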

4. Delta-method propagation and the risk

With $g_i$ in hand, the delta method propagates item-parameter uncertainty into the score:

$$\rho_i(\lambda) = \operatorname{Var}\big(\hat\theta_i\big) \;\approx\; g_i'\, \Sigma_\gamma(\lambda)\, g_i.$$

Averaging over a target population of $M$ response patterns gives the scalar objective that the tuners minimize,

$$R(\lambda) \;=\; \frac{1}{M}\sum_{i=1}^{M} \rho_i(\lambda) \;=\; \mathbb{E}_{\text{target}}\!\big[\, g'\,\Sigma_\gamma(\lambda)\, g \,\big].$$

The expectation is over the target population’s ability distribution, which is why target_resp matters: tuning for the observed calibration sample (target_resp = observed) and tuning for a separate operational scoring population generally give different $\lambda$.

5. Why this differs from the PPI++ trace objective

The original PPI++ tuning rule minimizes the trace of the item-parameter covariance,

$$\lambda^{\star}_{\text{PPI}} = \arg\min_\lambda \operatorname{Tr}\big[\Sigma_\gamma(\lambda)\big],$$

which tune_lambda_ppi_score() evaluates in closed form. Writing $\Sigma_\gamma = (\sigma_{kl})$, the two objectives expand as

$$\operatorname{Tr}(\Sigma_\gamma) = \sum_k \sigma_{kk}, \qquad g'\Sigma_\gamma g = \sum_{k} g_k^2\,\sigma_{kk} + \sum_{k \neq l} g_k g_l\,\sigma_{kl}.$$

The trace sees only the diagonal variances and weights every parameter equally. The ability risk weights each variance by $g_k^2$ (how much that particular parameter moves the score) and uses the off-diagonal covariances $\sigma_{kl}$, which encode the scale/identification structure of the 2PL. Errors in $a_j$ and $d_j$ that are correlated in a direction that leaves $\hat\theta$ unchanged are penalized by the trace but (correctly) ignored by the ability risk. The two criteria therefore select different $\lambda$ in general; use the ability risk for operational scoring and the trace as a theoretical diagnostic.
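A two-parameter toy example shows how the criteria can disagree. The covariance matrices and gradient below are made-up numbers chosen for illustration, not output from any fit: both matrices have the same trace, but their off-diagonal entries have opposite signs.

```r
# Two made-up covariance matrices with identical diagonals.
Sigma_neg <- matrix(c(1, -0.9, -0.9, 1), 2, 2)  # errors move in opposite directions
Sigma_pos <- matrix(c(1,  0.9,  0.9, 1), 2, 2)  # errors move together

g <- c(1, 1)  # assumed ability gradient: both parameters push the score equally

trace_neg <- sum(diag(Sigma_neg))               # 2
trace_pos <- sum(diag(Sigma_pos))               # 2
risk_neg  <- drop(t(g) %*% Sigma_neg %*% g)     # 1 + 1 - 1.8 = 0.2
risk_pos  <- drop(t(g) %*% Sigma_pos %*% g)     # 1 + 1 + 1.8 = 3.8
```

The trace cannot tell the two matrices apart, while the propagated score variance differs by a factor of 19 because the negatively correlated errors largely cancel in the direction of $g$.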

Summary

| Symbol | Meaning | Computed by |
|---|---|---|
| $\hat\gamma(\lambda)$ | Mixed-subjects item parameters | fit_mixed_subjects_mml() |
| $\Sigma_\gamma(\lambda)$ | Sandwich covariance of $\hat\gamma$ | vcov(), vcov_mixed_subjects_mml() |
| $\hat\theta_i$ | Bounded ML ability score | score_theta() |
| $g_i = \partial\hat\theta_i/\partial\gamma$ | Implicit ability gradient | ability_gradient() |
| $g_i'\Sigma_\gamma g_i$ | Propagated score variance | ability_risk() |
| $R(\lambda) = \mathbb{E}[g'\Sigma_\gamma g]$ | Ability-risk objective | tune_lambda_ability_risk() |

For the practical workflow built on these pieces — cross-fitting, target populations, and the choice between estimators — see the Choosing Lambda vignette.