Summed Score Likelihood–Based Indices for Testing Latent Variable Distribution Fit in Item Response Theory

2017, Vol. 78(5), pp. 857-886
Author(s): Zhen Li, Li Cai

In standard item response theory (IRT) applications, the latent variable is typically assumed to be normally distributed. If the normality assumption is violated, the item parameter estimates can become biased. Summed score likelihood–based statistics may be useful for testing latent variable distribution fit. We develop Satorra–Bentler type moment adjustments to approximate the test statistics’ tail-area probability. A simulation study was conducted to examine the calibration and power of the unadjusted and adjusted statistics under various simulation conditions. Results show that the proposed indices have tail-area probabilities that can be closely approximated by central chi-squared random variables under the null hypothesis. Furthermore, the test statistics are focused. They are powerful for detecting latent variable distributional assumption violations, and not sensitive (correctly) to other forms of model misspecification such as multidimensionality. As a comparison, the goodness-of-fit statistic M2 has considerably lower power against latent variable nonnormality than the proposed indices. Empirical data from a patient-reported health outcomes study are used as an illustration.
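For readers who want to see the machinery involved, the R sketch below shows how summed-score probabilities implied by a 2PL model under an assumed N(0, 1) latent distribution can be computed with the Lord–Wingersky recursion and compared with observed summed-score counts through a Pearson-type discrepancy. It is only a minimal illustration of the summed-score likelihood idea, not the authors' adjusted indices; the item parameters and observed counts are made up for the example.

```r
## Illustrative 2PL item parameters and observed summed-score counts; neither
## comes from the paper.
a <- c(1.2, 0.8, 1.5, 1.0, 0.9)
b <- c(-0.5, 0.0, 0.5, 1.0, -1.0)

lw_recursion <- function(theta, a, b) {       # Lord-Wingersky recursion
  p <- plogis(a * (theta - b))                # 2PL correct-response probabilities
  s <- c(1 - p[1], p[1])                      # summed-score distribution after item 1
  for (j in 2:length(p)) {
    s <- c(s * (1 - p[j]), 0) + c(0, s * p[j])
  }
  s                                           # P(summed score = 0..J | theta)
}

## Rectangular quadrature over the assumed N(0, 1) latent distribution
theta <- seq(-4, 4, length.out = 61)
w     <- dnorm(theta); w <- w / sum(w)

## Model-implied (marginal) summed-score probabilities
expected <- rowSums(sapply(seq_along(theta),
                           function(q) lw_recursion(theta[q], a, b) * w[q]))

## Pearson-type discrepancy against observed summed-score counts (scores 0..5)
obs <- c(40, 110, 210, 260, 230, 150)
X2  <- sum((obs - sum(obs) * expected)^2 / (sum(obs) * expected))
```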

2019, Vol. 45(4), pp. 383-402
Author(s): Paul A. Jewsbury, Peter W. van Rijn

In large-scale educational assessment data consistent with a simple-structure multidimensional item response theory (MIRT) model, where every item measures only one latent variable, separate unidimensional item response theory (UIRT) models for each latent variable are often calibrated for practical reasons. While this approach can be valid for data from a linear test, unacceptable item parameter estimates are obtained when data arise from a multistage test (MST). We explore this situation from a missing data perspective and show mathematically that MST data will be problematic for calibrating multiple UIRT models but not MIRT models. This occurs because some items that were used in the routing decision are excluded from the separate UIRT models, due to measuring a different latent variable. Both simulated and real data from the National Assessment of Educational Progress are used to further confirm and explore the unacceptable item parameter estimates. The theoretical and empirical results confirm that only MIRT models are valid for item calibration of multidimensional MST data.
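A small R sketch of the mechanism (illustrative, not the authors' design): simple-structure two-dimensional data with a routing rule that uses stage-1 items from both dimensions, so the missingness in the stage-2 items of dimension 1 depends on items a separate UIRT calibration would never see.

```r
## Illustrative simulation, not the authors' design: routing uses the stage-1
## total over items from BOTH dimensions.
set.seed(1)
N <- 2000
theta <- matrix(rnorm(2 * N), N, 2)           # two independent latent traits

gen_items <- function(th, b) {                # Rasch-type response generator
  matrix(rbinom(length(th) * length(b), 1, plogis(outer(th, b, "-"))),
         ncol = length(b))
}

## Stage 1 (routing module): five items per dimension
r1 <- gen_items(theta[, 1], seq(-1, 1, length.out = 5))
r2 <- gen_items(theta[, 2], seq(-1, 1, length.out = 5))
hard <- rowSums(cbind(r1, r2)) >= 5           # route on the combined stage-1 total

## Stage 2: dimension-1 items only, easy or hard module depending on the route
s2_easy <- gen_items(theta[, 1], rep(-1, 5))
s2_hard <- gen_items(theta[, 1], rep( 1, 5))
s2_easy[hard, ]  <- NA                        # high scorers skip the easy module
s2_hard[!hard, ] <- NA                        # low scorers skip the hard module

## A separate UIRT calibration of dimension 1 would use cbind(r1, s2_easy,
## s2_hard) only; because the routing rule also depends on r2, the missing
## stage-2 responses are not ignorable for that model, while the full MIRT
## model retains all routing items and remains valid.
```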


2017, Vol. 43(3), pp. 259-285
Author(s): Yang Liu, Ji Seung Yang

The uncertainty arising from item parameter estimation is often not negligible and must be accounted for when calculating latent variable (LV) scores in item response theory (IRT). This is particularly the case when the calibration sample size is limited and/or the calibration IRT model is complex. In the current work, we treat two-stage IRT scoring as a predictive inference problem: The target of prediction is a random variable that follows the true posterior of the LV conditional on the response pattern being scored. Various Bayesian, fiducial, and frequentist prediction intervals of LV scores, which can be obtained from a simple yet generic Monte Carlo recipe, are evaluated and contrasted via simulations based on several measures of prediction quality. An empirical data example is also presented to illustrate the use of candidate methods.
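A minimal R sketch of one such Monte Carlo recipe, under assumed values: draw item parameters from an approximate sampling distribution around the calibration estimates, draw the latent variable from its posterior given the response pattern at each parameter draw, and take percentiles of the pooled draws as a prediction interval. The 2PL parameters, covariance matrix, and response pattern below are placeholders, not values from the paper.

```r
## Placeholder 2PL estimates, covariance, and response pattern; in practice
## `est` and `V` come from the item calibration stage.
set.seed(2)
J    <- 4
est  <- c(rep(1, J), c(-1, 0, 0.5, 1))        # stacked (a1..aJ, b1..bJ) estimates
V    <- diag(0.02, 2 * J)                     # asymptotic covariance (placeholder)
resp <- c(1, 1, 0, 1)                         # response pattern being scored

grid  <- seq(-4, 4, length.out = 81)
R     <- 2000
draws <- numeric(R)
L <- chol(V)

for (r in 1:R) {
  par <- est + drop(rnorm(2 * J) %*% L)       # item parameter draw ~ N(est, V)
  a <- par[1:J]; b <- par[(J + 1):(2 * J)]
  p <- plogis(sweep(outer(grid, b, "-"), 2, a, "*"))
  loglik <- log(p) %*% resp + log(1 - p) %*% (1 - resp)
  post <- drop(exp(loglik)) * dnorm(grid)
  post <- post / sum(post)
  draws[r] <- sample(grid, 1, prob = post)    # latent variable draw from the posterior
}

quantile(draws, c(0.025, 0.975))              # 95% prediction interval for the LV score
```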


Author(s): Riswan Riswan

An item response theory (IRT) model contains one or more parameters, and because these parameters are unknown they must be estimated. This paper aims (1) to determine the effect of sample size (N) on the stability of item parameter estimates, (2) to determine the effect of test length (n) on the stability of examinee parameter estimates, (3) to determine the effect of the IRT model on the stability of item and examinee parameter estimates, (4) to determine the joint effect of sample size and test length on item and examinee parameter estimates, and (5) to determine the joint effect of sample size, test length, and model on item and examinee parameter estimates. The paper reports a simulation study in which latent trait (θ) samples are drawn from a standard normal population, θ ~ N(0, 1), for specific sample sizes (N) and test lengths (n) under the 1PL, 2PL, and 3PL models, with data generated in WinGen. Item analysis was carried out using both the classical test theory approach and modern test theory (IRT), and the data were analyzed in R with the ltm package. The results show that the larger the sample size (N), the more stable the item parameter estimates, and the greater the test length (n), the more stable the examinee parameter (θ) estimates.
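One cell of such a simulation can be sketched in R roughly as follows (illustrative generating values; the ltm package is used for calibration as in the paper, assuming its usual 2PL formula interface). Repeating the cell over a grid of N and n values traces the stability pattern described above.

```r
## Illustrative generating values for one (N, n) cell; ltm is assumed to be
## installed and to follow its usual 2PL formula interface.
library(ltm)

set.seed(3)
N <- 1000                                     # sample size
n <- 20                                       # test length
a_true <- runif(n, 0.8, 2.0)                  # generating discriminations
b_true <- rnorm(n)                            # generating difficulties
theta  <- rnorm(N)                            # latent trait drawn from N(0, 1)

p    <- plogis(sweep(outer(theta, b_true, "-"), 2, a_true, "*"))
resp <- matrix(rbinom(N * n, 1, p), N, n)

fit <- ltm(resp ~ z1)                         # 2PL calibration with the ltm package
est <- coef(fit)                              # columns: Dffclt (b), Dscrmn (a)

## Recovery summaries; repeating this over N in {250, 500, 1000, 2000} and
## n in {10, 20, 40} traces the stability pattern described in the abstract.
rmse_b <- sqrt(mean((est[, "Dffclt"] - b_true)^2))
rmse_a <- sqrt(mean((est[, "Dscrmn"] - a_true)^2))
```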


2020, pp. 107699862095376
Author(s): Scott Monroe

This research proposes a new statistic for testing latent variable distribution fit for unidimensional item response theory (IRT) models. If the typical assumption of normality is violated, then item parameter estimates will be biased, and dependent quantities such as IRT score estimates will be adversely affected. The proposed statistic compares the specified latent variable distribution to the sample average of latent variable posterior distributions commonly used in IRT scoring. Formally, the statistic is an instantiation of a generalized residual and is thus asymptotically distributed as standard normal. Also, the statistic naturally complements residual-based item-fit statistics, as both are conditional on the latent trait, and can be presented with graphical plots. In addition, a corresponding unconditional statistic, which controls for multiple comparisons, is proposed. The statistics are evaluated using a simulation study, and empirical analyses are provided.
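The core comparison behind the statistic can be sketched in R as follows (illustrative item parameters and data, and without the generalized-residual standard errors that make the statistic a formal test): average the examinees' posterior distributions over a quadrature grid and set the average against the specified N(0, 1) weights.

```r
## Illustrative item parameters and data; the generating latent distribution
## is deliberately skewed so the average posterior departs from N(0, 1).
set.seed(4)
a <- c(1.3, 0.9, 1.6, 1.1, 0.8)
b <- c(-1.0, -0.3, 0.2, 0.8, 1.4)
theta_true <- as.numeric(scale(rexp(500)))               # skewed latent sample
p_true <- plogis(sweep(outer(theta_true, b, "-"), 2, a, "*"))
resp   <- matrix(rbinom(500 * 5, 1, p_true), 500, 5)

grid  <- seq(-4, 4, length.out = 41)
prior <- dnorm(grid); prior <- prior / sum(prior)        # specified N(0, 1) weights

post_weights <- function(resp_i) {                       # posterior of theta on the grid
  p <- plogis(sweep(outer(grid, b, "-"), 2, a, "*"))
  w <- drop(exp(log(p) %*% resp_i + log(1 - p) %*% (1 - resp_i))) * prior
  w / sum(w)
}

avg_post <- rowMeans(apply(resp, 1, post_weights))       # sample average of posteriors

## Positive residuals mark regions where the data imply more latent density
## than the assumed normal; plotted against `grid`, they give the kind of
## graphical display mentioned above.
residual <- avg_post - prior
```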


2020, pp. 001316442094989
Author(s): Joseph A. Rios, James Soland

As low-stakes testing becomes more common, low test-taking effort can pose a serious validity threat. One common solution to this problem is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated item response theory (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., two-parameter logistic [2PL]) in parameter estimation under simulated conditions, prior research has failed to examine its performance under violations to the model’s assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (Assumption 1) and is unrelated to the underlying ability of examinees (Assumption 2). Results demonstrated that, across conditions, the EM-IRT model provided item parameter estimates that were robust to violations of Assumption 1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when violating Assumption 2; nonetheless, these values were still lower than those from the 2PL model. In terms of mean ability estimates, model results indicated equal performance between the EM-IRT and 2PL models across conditions. Across both models, mean ability estimates were found to be biased by more than 0.25 SDs when violating Assumption 2. However, our accompanying empirical study suggested that this bias arose under extreme conditions that may not be present in some operational settings. Overall, these results suggest that the EM-IRT model provides superior item and equal mean ability parameter estimates in the presence of model violations under realistic conditions when compared with the 2PL model.
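The effort-moderated idea can be sketched for the scoring of a single examinee as follows (R, illustrative 2PL parameters; in practice the effort flags come from response-time thresholds): responses flagged as noneffortful simply drop out of the likelihood, as if those items had not been administered.

```r
## Illustrative 2PL parameters, responses, and effort flags; in practice the
## flags come from a response-time threshold.
a <- c(1.2, 0.9, 1.4, 1.0, 1.1)
b <- c(-0.8, -0.2, 0.3, 0.9, 1.5)
resp   <- c(1, 1, 0, 0, 0)
effort <- c(1, 1, 1, 0, 0)                    # 0 = flagged as noneffortful (rapid guess)

em_loglik <- function(theta) {
  p <- plogis(a * (theta - b))
  sum(effort * (resp * log(p) + (1 - resp) * log(1 - p)))  # flagged items contribute 0
}

grid <- seq(-4, 4, length.out = 81)
post <- exp(sapply(grid, em_loglik)) * dnorm(grid)
post <- post / sum(post)
eap_em <- sum(grid * post)                    # effort-moderated EAP score

## Setting `effort` to all 1s recovers the ordinary 2PL likelihood, the
## comparison model in the study summarized above.
```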


1992, Vol. 17(2), pp. 155-173
Author(s): Kentaro Yamamoto, John Mazzeo

In educational assessments, it is often necessary to compare the performance of groups of individuals who have been administered different forms of a test. If these groups are to be validly compared, all results need to be expressed on a common scale. When assessment results are to be reported using an item response theory (IRT) proficiency metric, as is done for the National Assessment of Educational Progress (NAEP), establishing a common metric becomes synonymous with expressing IRT item parameter estimates on a common scale. Procedures that accomplish this are referred to here as scale linking procedures. This chapter discusses the need for scale linking in NAEP and illustrates the specific procedures used to carry out the linking in the context of the major analyses conducted for the 1990 NAEP mathematics assessment.
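As a simple illustration of what expressing item parameters on a common scale involves, the R sketch below applies mean/sigma linking to common-item difficulty estimates; NAEP's operational linking procedures are more elaborate, and the parameter values here are made up.

```r
## Illustrative estimates of the same common items on two scales; not NAEP data.
b_base <- c(-1.10, -0.42, 0.15, 0.88, 1.35)   # difficulties on the base scale
b_new  <- c(-0.95, -0.30, 0.28, 0.96, 1.50)   # difficulties on the new scale
a_new  <- c( 1.20,  0.85, 1.40, 1.05, 0.95)   # discriminations on the new scale

A <- sd(b_base) / sd(b_new)                   # slope of the linear scale transformation
B <- mean(b_base) - A * mean(b_new)           # intercept

b_linked     <- A * b_new + B                 # difficulties expressed on the base scale
a_linked     <- a_new / A                     # discriminations expressed on the base scale
theta_linked <- function(theta_new) A * theta_new + B   # proficiency transformation
```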


2018
Author(s): Maxwell Hong, Alison Cheng

Self-report data are common in psychological and survey research. Unfortunately, many of these samples are plagued with careless responses from unmotivated participants. The purpose of this study is to propose and evaluate a robust estimation method for detecting careless, or unmotivated, responders while leveraging item response theory (IRT) person fit statistics. First, we outline a general framework for robust estimation specific to IRT models. Subsequently, we conduct a simulation study covering multiple conditions to evaluate the performance of the proposed method. Ultimately, we show how robust maximum marginal likelihood (RMML) estimation significantly improves detection rates for careless responders and reduces bias in item parameters across conditions. Furthermore, we apply our method to a real dataset to illustrate its utility. Our findings suggest that robust estimation coupled with person fit statistics offers a powerful procedure to identify careless respondents for further review and to provide more accurate item parameter estimates in the presence of careless responses.
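A standard ingredient of this kind of procedure is a person-fit statistic such as the standardized l_z index, sketched below in R (this is not the authors' full RMML estimator; the item parameters and response patterns are illustrative). Large negative values flag aberrant patterns, such as careless responding, for downweighting or review.

```r
## Standardized person-fit statistic l_z for a 2PL model; illustrative
## parameters and patterns, not the authors' procedure.
lz <- function(resp, theta, a, b) {
  p  <- plogis(a * (theta - b))
  l0 <- sum(resp * log(p) + (1 - resp) * log(1 - p))      # observed log-likelihood
  e  <- sum(p * log(p) + (1 - p) * log(1 - p))            # its expectation
  v  <- sum(p * (1 - p) * log(p / (1 - p))^2)             # its variance
  (l0 - e) / sqrt(v)
}

a <- c(1.3, 1.0, 1.5, 0.9, 1.2, 1.1)
b <- c(-2.0, -1.0, -0.3, 0.3, 1.0, 2.0)
lz(c(1, 1, 1, 1, 0, 0), theta = 0, a = a, b = b)   # consistent pattern
lz(c(0, 0, 1, 0, 1, 1), theta = 0, a = a, b = b)   # aberrant (careless-like) pattern
```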


Psychometrika, 2021
Author(s): Ron D. Hays, Karen L. Spritzer, Steven P. Reise

The reliable change index has been used to evaluate the significance of individual change in health-related quality of life. We estimate reliable change for two measures (physical function and emotional distress) in the Patient-Reported Outcomes Measurement Information System (PROMIS®) 29-item health-related quality of life measure (PROMIS-29 v2.1). Using two waves of data collected 3 months apart in a longitudinal observational study of chronic low back pain and chronic neck pain patients receiving chiropractic care, and simulations, we compare estimates of reliable change from classical test theory fixed standard errors with item response theory standard errors from the graded response model. We find that unless true change in the PROMIS physical function and emotional distress scales is substantial, classical test theory estimates of significant individual change are much more optimistic than estimates of change based on item response theory.
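The comparison comes down to which standard errors enter the reliable change index, as the R sketch below illustrates with made-up numbers (not PROMIS estimates): classical test theory uses one fixed standard error of measurement for every score, whereas IRT supplies score-specific standard errors from the test information function.

```r
## Reliable change index with made-up T-score values; not PROMIS estimates.
reliable_change <- function(t1, t2, se1, se2) {
  (t2 - t1) / sqrt(se1^2 + se2^2)             # conventionally significant at |RCI| > 1.96
}

## Classical test theory: one fixed SEM = SD * sqrt(1 - reliability) at both waves
sem_ctt <- 10 * sqrt(1 - 0.90)
reliable_change(t1 = 40, t2 = 45, se1 = sem_ctt, se2 = sem_ctt)

## IRT: conditional standard errors at each wave's score (e.g., from the graded
## response model's test information), typically larger away from the scale center
reliable_change(t1 = 40, t2 = 45, se1 = 4.5, se2 = 4.0)
```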

