Title
Regularized Bayesian calibration and scoring of the WD-FAB IRT model improves predictive performance over marginal maximum likelihood
Authors
Abstract
Item response theory (IRT) is the statistical paradigm underlying a dominant family of generative probabilistic models for test responses, used to quantify traits in individuals relative to target populations. The graded response model (GRM) is a particular IRT model that is used for ordered polytomous test responses. Both the development and the application of the GRM and other IRT models require statistical decisions. For formulating these models (calibration), one needs to decide on methodologies for item selection, inference, and regularization. For applying these models (test scoring), one needs to make similar decisions, often prioritizing computational tractability and/or interpretability. In many applications, such as in the Work Disability Functional Assessment Battery (WD-FAB), tractability implies approximating an individual's score distribution using estimates of mean and variance, and obtaining that score conditional on only point estimates of the calibrated model. In this manuscript, we evaluate the calibration and scoring of models under this common use case using Bayesian cross-validation. Applied to the WD-FAB responses collected for the National Institutes of Health, we assess the predictive power of implementations of the GRM based on their ability to yield, on validation sets of respondents, ability estimates that are most predictive of patterns of item responses. Our main finding indicates that regularized Bayesian calibration of the GRM outperforms the regularization-free empirical Bayesian procedure of marginal maximum likelihood. We also motivate the use of compactly supported priors in test scoring.
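As background for the abstract above, the LaTeX sketch below writes out the conventional (Samejima) graded response model for ordered polytomous responses and the point-estimate-conditioned mean/variance scoring the abstract describes. The logistic link, the discrimination $a_j$ and threshold $b_{j,k}$ notation, and the EAP-style posterior summary are standard-textbook assumptions on my part, not parameterization details taken from this paper.

% Minimal compilable sketch of the standard GRM and point-estimate scoring.
% The logistic link and the (a_j, b_{j,k}) notation are conventional
% assumptions; the paper's exact parameterization may differ.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}

For item $j$ with ordered categories $k = 1, \dots, K_j$ and latent ability
$\theta$, the GRM models cumulative response probabilities with a logistic
link,
\begin{equation}
  P^{*}_{jk}(\theta) = \Pr(Y_j \ge k \mid \theta)
    = \frac{1}{1 + \exp\{-a_j(\theta - b_{j,k})\}},
  \qquad k = 2, \dots, K_j,
\end{equation}
with boundary conventions $P^{*}_{j1} \equiv 1$ and
$P^{*}_{j,K_j+1} \equiv 0$, and ordered thresholds
$b_{j,2} < \cdots < b_{j,K_j}$. Each category probability is the difference
of adjacent cumulative probabilities:
\begin{equation}
  \Pr(Y_j = k \mid \theta) = P^{*}_{jk}(\theta) - P^{*}_{j,k+1}(\theta).
\end{equation}
Scoring conditional on point estimates $(\hat{a}, \hat{b})$ of the
calibrated item parameters then summarizes the ability posterior by its
mean and variance,
\begin{equation}
  \hat{\theta} = \mathbb{E}\!\left[\theta \mid \mathbf{y}, \hat{a}, \hat{b}\right],
  \qquad
  \hat{\sigma}^2 = \operatorname{Var}\!\left[\theta \mid \mathbf{y}, \hat{a}, \hat{b}\right],
\end{equation}
where the posterior is
$\pi(\theta \mid \mathbf{y}, \hat{a}, \hat{b}) \propto
  \pi(\theta) \prod_{j} \Pr(Y_j = y_j \mid \theta, \hat{a}_j, \hat{b}_j)$.

\end{document}

In this notation, the paper's contrast is over how $(\hat{a}, \hat{b})$ are obtained (regularized Bayesian calibration versus marginal maximum likelihood) and over the choice of the scoring prior $\pi(\theta)$, for which the abstract motivates compactly supported alternatives.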