Use of Restricted Item Response Theory Models for Examining the Stability of Item Parameter Estimates Over Time

1991 ◽  
Vol 4 (2) ◽  
pp. 125-141 ◽  
Author(s):  
Clement A. Stone ◽  
Suzanne Lane


Author(s):  
Riswan Riswan

Item response theory (IRT) models contain one or more parameters that are unknown and must be estimated. This paper aims (1) to determine the effect of sample size (N) on the stability of item parameter estimates, (2) to determine the effect of test length (n) on the stability of examinee parameter estimates, (3) to determine the effect of the model on the stability of item and examinee parameter estimates, (4) to determine the joint effect of sample size and test length on the stability of item and examinee parameter estimates, and (5) to determine the joint effect of sample size, test length, and model on the stability of item and examinee parameter estimates. This paper is a simulation study in which latent trait (θ) samples are drawn from a standard normal population, θ ~ N(0, 1), for specified sample sizes (N) and test lengths (n) under the 1PL, 2PL, and 3PL models using WinGen. Item analysis was carried out with both the classical test theory approach and the modern (item response theory) approach, and the data were analyzed in R with the ltm package. The results showed that the larger the sample size (N), the more stable the item parameter estimates, and the greater the test length (n), the more stable the examinee parameter (θ) estimates.
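
As a rough sketch of the kind of simulation the abstract describes (the sample sizes, test length, and generating parameters below are arbitrary illustration values, not the study's design), 2PL responses can be generated in base R and calibrated with the ltm package to compare item parameter recovery across sample sizes:

# Sketch: simulate 2PL responses and examine item parameter stability
# across two sample sizes (values chosen only for illustration).
library(ltm)

set.seed(123)
n_items <- 20                        # test length (n)
a_true  <- rlnorm(n_items, 0, 0.3)   # true discriminations
b_true  <- rnorm(n_items, 0, 1)      # true difficulties

simulate_2pl <- function(N) {
  theta <- rnorm(N, 0, 1)            # latent trait ~ N(0, 1)
  p <- plogis(outer(theta, b_true, "-") * rep(a_true, each = N))
  resp <- matrix(rbinom(N * n_items, 1, p), N, n_items)
  colnames(resp) <- paste0("Item", seq_len(n_items))
  resp
}

for (N in c(250, 2000)) {            # small vs. large sample size (N)
  resp <- simulate_2pl(N)
  fit  <- ltm(resp ~ z1)             # 2PL calibration with ltm
  est  <- coef(fit)                  # columns: difficulty, discrimination
  cat(sprintf("N = %d: RMSE(b) = %.3f, RMSE(a) = %.3f\n", N,
              sqrt(mean((est[, 1] - b_true)^2)),
              sqrt(mean((est[, 2] - a_true)^2))))
}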


2019 ◽  
Vol 45 (4) ◽  
pp. 383-402
Author(s):  
Paul A. Jewsbury ◽  
Peter W. van Rijn

In large-scale educational assessment data consistent with a simple-structure multidimensional item response theory (MIRT) model, where every item measures only one latent variable, separate unidimensional item response theory (UIRT) models for each latent variable are often calibrated for practical reasons. While this approach can be valid for data from a linear test, unacceptable item parameter estimates are obtained when data arise from a multistage test (MST). We explore this situation from a missing data perspective and show mathematically that MST data will be problematic for calibrating multiple UIRT models but not MIRT models. This occurs because some items that were used in the routing decision are excluded from the separate UIRT models, due to measuring a different latent variable. Both simulated and real data from the National Assessment of Educational Progress are used to further confirm and explore the unacceptable item parameter estimates. The theoretical and empirical results confirm that only MIRT models are valid for item calibration of multidimensional MST data.
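
A compact way to state the missing-data argument (a sketch of the standard ignorability condition, not the authors' exact derivation): in an MST, whether a second-stage block is observed depends only on responses to the routing items. If all routing items are included in the calibration model, the missingness mechanism depends only on observed, modeled data,

\Pr(R \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \theta) = \Pr(R \mid X_{\mathrm{obs}}),

and is therefore ignorable. When a separate UIRT model is fit for one latent variable, the routing items measuring the other latent variables are dropped from X_obs for that calibration, the condition above no longer holds with respect to the modeled data, and the item parameter estimates can be distorted.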


1992 ◽  
Vol 17 (2) ◽  
pp. 155-173 ◽  
Author(s):  
Kentaro Yamamoto ◽  
John Mazzeo

In educational assessments, it is often necessary to compare the performance of groups of individuals who have been administered different forms of a test. If these groups are to be validly compared, all results need to be expressed on a common scale. When assessment results are to be reported using an item response theory (IRT) proficiency metric, as is done for the National Assessment of Educational Progress (NAEP), establishing a common metric becomes synonymous with expressing IRT item parameter estimates on a common scale. Procedures that accomplish this are referred to here as scale linking procedures. This chapter discusses the need for scale linking in NAEP and illustrates the specific procedures used to carry out the linking in the context of the major analyses conducted for the 1990 NAEP mathematics assessment.
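
As a generic illustration of what a scale linking step does (a mean/sigma style linear transformation shown only as a sketch; NAEP's operational linking procedures are more involved and are not reproduced here), item parameter estimates from a new calibration can be placed on a base scale using the common items:

# Sketch: mean/sigma linking of 2PL/3PL item parameters (a, b) from a
# new calibration onto a base scale, using common items only.
# Hypothetical inputs: data frames with columns a and b for the same
# common items as estimated in each calibration.
link_mean_sigma <- function(base, new) {
  A <- sd(base$b) / sd(new$b)           # slope of the linear transformation
  B <- mean(base$b) - A * mean(new$b)   # intercept
  list(A = A, B = B,
       a_linked = new$a / A,            # transformed discriminations
       b_linked = A * new$b + B)        # transformed difficulties
}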


2020 ◽  
pp. 001316442094989
Author(s):  
Joseph A. Rios ◽  
James Soland

As low-stakes testing contexts increase, low test-taking effort may serve as a serious validity threat. One common solution to this problem is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated item response theory (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., two-parameter logistic [2PL]) in parameter estimation under simulated conditions, prior research has failed to examine its performance under violations of the model’s assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (Assumption 1) and is unrelated to the underlying ability of examinees (Assumption 2). Results demonstrated that, across conditions, the EM-IRT model provided item parameter estimates that were robust to violations of Assumption 1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when violating Assumption 2; nonetheless, these values were still lower than those for the 2PL model. In terms of mean ability estimates, results indicated equal performance between the EM-IRT and 2PL models across conditions. For both models, mean ability estimates were biased by more than 0.25 SDs when violating Assumption 2. However, our accompanying empirical study suggested that this biasing occurred under extreme conditions that may not be present in some operational settings. Overall, these results suggest that, under realistic violations of its assumptions, the EM-IRT model provides superior item parameter estimates and equally accurate mean ability estimates compared with the 2PL model.
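
A minimal sketch of the "treat noneffortful responses as missing" step that the EM-IRT approach builds on (the response-time threshold rule, data objects, and use of the ltm package below are illustrative assumptions, not the study's implementation):

# Sketch: flag rapid-guessing responses via a response-time threshold
# and score them as missing before item calibration.
library(ltm)

# resp: N x J 0/1 response matrix; rt: N x J response times in seconds
# (both hypothetical); thresholds: per-item time limits below which a
# response is treated as noneffortful.
em_calibrate <- function(resp, rt, thresholds) {
  noneffortful <- sweep(rt, 2, thresholds, "<")   # TRUE where response was too fast
  resp[noneffortful] <- NA                        # treat as missing
  ltm(resp ~ z1)                                  # 2PL fit on the remaining (effortful) responses
}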


2018 ◽  
Vol 43 (7) ◽  
pp. 512-526
Author(s):  
Kyung Yong Kim

When calibrating items using multidimensional item response theory (MIRT) models, item response theory (IRT) calibration programs typically set the probability density of latent variables to a multivariate standard normal distribution to handle three types of indeterminacies: (a) the location of the origin, (b) the unit of measurement along each coordinate axis, and (c) the orientation of the coordinate axes. However, by doing so, item parameter estimates obtained from two independent calibration runs on nonequivalent groups are on two different coordinate systems. To handle this issue and place all the item parameter estimates on a common coordinate system, a process called linking is necessary. Although various linking methods have been introduced and studied for the full MIRT model, little research has been conducted on linking methods for the bifactor model. Thus, the purpose of this study was to provide detailed descriptions of two separate calibration methods and the concurrent calibration method for the bifactor model and to compare the three linking methods through simulation. In general, the concurrent calibration method provided more accurate linking results than the two separate calibration methods, demonstrating better recovery of the item parameters, item characteristic surfaces, and expected score distribution.
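
For orientation, concurrent calibration amounts to a single estimation run over all examinees and items, so every item parameter estimate shares one coordinate system; a minimal sketch using the mirt package's bifactor fitting function (the data layout and function choice are assumptions for illustration, not the study's setup):

# Sketch: concurrent calibration of a bifactor model with the mirt package.
# Hypothetical setup: two nonequivalent groups took overlapping forms;
# items a group did not take are coded NA, and all items are calibrated
# in one run.
library(mirt)

# resp_all: stacked response matrix for both groups (rows = examinees,
#           columns = union of items, NA where an item was not presented)
# specific: integer vector assigning each item to its specific factor,
#           e.g. rep(1:3, each = 10) for 30 items and 3 specific factors
concurrent_bifactor <- function(resp_all, specific) {
  # A single calibration run; in practice a multiple-group version can
  # relax the implicit assumption of a common latent distribution.
  bfactor(resp_all, specific)
}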


2018 ◽  
Author(s):  
Maxwell Hong ◽  
Alison Cheng

Self-report data are common in psychological and survey research. Unfortunately, many of these samples are plagued with careless responses due to unmotivated participants. The purpose of this study is to propose and evaluate a robust estimation method for detecting careless, or unmotivated, responders while leveraging item response theory (IRT) person fit statistics. First, we outline a general framework for robust estimation specific to IRT models. Subsequently, we conduct a simulation study covering multiple conditions to evaluate the performance of the proposed method. Ultimately, we show how robust maximum marginal likelihood (RMML) estimation significantly improves detection rates for careless responders and reduces bias in item parameters across conditions. Furthermore, we apply our method to a real dataset to illustrate its utility. Our findings suggest that robust estimation coupled with person fit statistics offers a powerful procedure to identify careless respondents for further review, and to provide more accurate item parameter estimates in the presence of careless responses.
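
For context, a classical IRT person fit statistic such as lz, which a robust weighting scheme of this kind could draw on, can be computed from 2PL item parameters and a provisional ability estimate (a sketch with hypothetical inputs, not the authors' implementation):

# Sketch: the lz person-fit statistic under a 2PL model, computed from
# (hypothetical) item parameters a, b and a provisional theta per person.
# resp is assumed to be a complete 0/1 response matrix.
lz_statistic <- function(resp, theta, a, b) {
  # P[i, j] = model probability that person i answers item j correctly
  P  <- plogis(outer(theta, b, "-") * rep(a, each = length(theta)))
  l0 <- rowSums(resp * log(P) + (1 - resp) * log(1 - P))   # observed log-likelihood
  E  <- rowSums(P * log(P) + (1 - P) * log(1 - P))         # its expectation
  V  <- rowSums(P * (1 - P) * (log(P / (1 - P)))^2)        # its variance
  (l0 - E) / sqrt(V)            # strongly negative values suggest misfit
}

# Respondents with strongly negative lz (e.g., below -1.645) could be
# down-weighted during estimation or flagged for further review.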


2017 ◽  
Vol 78 (5) ◽  
pp. 857-886 ◽  
Author(s):  
Zhen Li ◽  
Li Cai

In standard item response theory (IRT) applications, the latent variable is typically assumed to be normally distributed. If the normality assumption is violated, the item parameter estimates can become biased. Summed score likelihood–based statistics may be useful for testing latent variable distribution fit. We develop Satorra–Bentler type moment adjustments to approximate the test statistics’ tail-area probability. A simulation study was conducted to examine the calibration and power of the unadjusted and adjusted statistics in various simulation conditions. Results show that the proposed indices have tail-area probabilities that can be closely approximated by central chi-squared random variables under the null hypothesis. Furthermore, the test statistics are focused. They are powerful for detecting latent variable distributional assumption violations, and not sensitive (correctly) to other forms of model misspecification such as multidimensionality. As a comparison, the goodness-of-fit statistic M2 has considerably lower power against latent variable nonnormality than the proposed indices. Empirical data from a patient-reported health outcomes study are used as illustration.
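
The ingredient underlying summed score likelihood-based statistics is the model-implied summed score distribution; a minimal sketch of the standard Lord-Wingersky recursion that produces it under a 2PL model follows (the item parameters, quadrature grid, and model choice are illustrative assumptions, not the authors' code). Comparing this distribution with the observed summed score frequencies is the kind of contrast such statistics formalize.

# Sketch: model-implied summed score probabilities under a 2PL model,
# via the Lord-Wingersky recursion and normal quadrature over theta.
summed_score_dist <- function(a, b, nodes = seq(-4, 4, length.out = 61)) {
  w <- dnorm(nodes); w <- w / sum(w)              # quadrature weights
  probs <- sapply(seq_along(nodes), function(q) {
    p <- plogis(a * (nodes[q] - b))               # P(correct) per item at theta_q
    s <- 1                                        # P(score = 0) with no items yet
    for (j in seq_along(p)) {                     # add items one at a time
      s <- c(s * (1 - p[j]), 0) + c(0, s * p[j])
    }
    s                                             # P(score = 0..J | theta_q)
  })
  drop(probs %*% w)                               # marginalize over theta
}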

