Robust Maximum Marginal Likelihood (RMML) Estimation for Item Response Theory Models

2018 ◽  
Author(s):  
Maxwell Hong ◽  
Alison Cheng

Self-report data are common in psychological and survey research. Unfortunately, many of these samples are plagued with careless responses due to unmotivated participants. The purpose of this study is to propose and evaluate a robust estimation method in order to detect careless, or unmotivated, responders while leveraging Item Response Theory (IRT) person fit statistics. First, we outline a general framework for robust estimation specific to IRT models. Subsequently, we conduct a simulation study covering multiple conditions to evaluate the performance of the proposed method. Ultimately, we show how robust maximum marginal likelihood (RMML) estimation significantly improves detection rates for careless responders and reduces bias in item parameters across conditions. Furthermore, we apply our method to a real dataset to illustrate the utility of the proposed method. Our findings suggest that robust estimation coupled with person fit statistics offers a powerful procedure to identify careless respondents for further review, and to provide more accurate item parameter estimates in the presence of careless responses.
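The abstract does not say which person fit statistic is paired with RMML. As a minimal sketch of the general idea, the snippet below computes the widely used standardized log-likelihood statistic (lz) under a 2PL model. The item parameters, response patterns, and function names are hypothetical, and the item parameters are assumed to be already calibrated; this is not the authors' procedure.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a keyed (1) response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def lz_statistic(responses, theta, items):
    """Standardized log-likelihood person-fit statistic (lz).

    responses: list of 0/1 item scores for one respondent
    theta: that respondent's ability estimate
    items: list of (a, b) tuples, assumed already calibrated
    """
    loglik = expected = variance = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        q = 1.0 - p
        loglik += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (loglik - expected) / math.sqrt(variance)


# Hypothetical five-item test, ordered from easy to hard.
items = [(1.2, -1.5), (1.0, -0.5), (1.1, 0.0), (0.9, 0.5), (1.3, 1.5)]
consistent = [1, 1, 1, 0, 0]  # passes easy items, fails hard ones
aberrant = [0, 0, 0, 1, 1]    # the reverse pattern

lz_consistent = lz_statistic(consistent, 0.0, items)
lz_aberrant = lz_statistic(aberrant, 0.0, items)
```

Large negative lz values indicate response patterns that are unlikely under the model, which is why such statistics are used to flag respondents for further review.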

2011 ◽  
Vol 35 (8) ◽  
pp. 604-622 ◽  
Author(s):  
Hirotaka Fukuhara ◽  
Akihito Kamata

A differential item functioning (DIF) detection method for testlet-based data was proposed and evaluated in this study. The proposed DIF model is an extension of a bifactor multidimensional item response theory (MIRT) model for testlets. Unlike traditional item response theory (IRT) DIF models, the proposed model takes testlet effects into account, thus estimating DIF magnitude appropriately when a test is composed of testlets. A fully Bayesian estimation method was adopted for parameter estimation. The recovery of parameters was evaluated for the proposed DIF model. Simulation results revealed that the proposed bifactor MIRT DIF model produced better estimates of DIF magnitude and higher DIF detection rates than the traditional IRT DIF model for all simulation conditions. A real data analysis was also conducted by applying the proposed DIF model to a statewide reading assessment data set.


Author(s):  
Riswan Riswan

An Item Response Theory (IRT) model contains one or more parameters. These parameters are unknown, so they must be estimated. This paper aims to determine (1) the effect of sample size (N) on the stability of item parameter estimates; (2) the effect of test length (n) on the stability of examinee parameter estimates; (3) the effect of the model on the stability of item and examinee parameter estimates; (4) the effect of sample size and test length on the stability of item and examinee parameter estimates; and (5) the effect of sample size, test length, and model on the stability of item and examinee parameter estimates. This paper is a simulation study in which the latent trait (θ) is drawn from a standard normal population, N(0, 1), with specific sample sizes (N) and test lengths (n), under the 1PL, 2PL, and 3PL models using WinGen. Item analysis was carried out using both classical test theory and modern test theory (IRT). The data were analyzed in R with the ltm package. The results showed that the larger the sample size (N), the more stable the estimated parameters. Likewise, the greater the test length (n), the more stable the estimated examinee parameter (θ).
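The data-generation step described above can be sketched roughly as follows. The study itself used WinGen and the R ltm package; this Python stand-in, with its function names, item parameters, and seed, is purely an assumption for illustration.

```python
import math
import random


def p_3pl(theta, a=1.0, b=0.0, c=0.0):
    """3PL response probability; c=0 gives the 2PL, and a=1, c=0 the 1PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(n_examinees, items, seed=0):
    """Draw theta ~ N(0, 1) per examinee and simulate 0/1 responses.

    items: list of (a, b, c) tuples (hypothetical generating parameters).
    Returns (thetas, response_matrix).
    """
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    data = [
        [1 if rng.random() < p_3pl(t, a, b, c) else 0 for (a, b, c) in items]
        for t in thetas
    ]
    return thetas, data


# Three hypothetical 3PL items: easy, medium, hard, each with guessing 0.2.
items_3pl = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (0.8, 1.0, 0.2)]
thetas, data = simulate_responses(n_examinees=500, items=items_3pl, seed=42)

# Proportion of keyed responses per item, a quick sanity check on difficulty.
prop_correct = [
    sum(row[j] for row in data) / len(data) for j in range(len(items_3pl))
]
```

Repeating such a simulation over several values of N and n, and refitting the model each time, is one way to study how stable the estimates are.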


2020 ◽  
pp. 001316442094989
Author(s):  
Joseph A. Rios ◽  
James Soland

As low-stakes testing contexts increase, low test-taking effort may serve as a serious validity threat. One common solution to this problem is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated item response theory (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., two-parameter logistic [2PL]) in parameter estimation under simulated conditions, prior research has failed to examine its performance under violations to the model’s assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (Assumption 1) and is unrelated to the underlying ability of examinees (Assumption 2). Results demonstrated that, across conditions, the EM-IRT model provided robust item parameter estimates to violations of Assumption 1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when violating Assumption 2; nonetheless, these values were still lower than the 2PL model. In terms of mean ability estimates, model results indicated equal performance between the EM-IRT and 2PL models across conditions. Across both models, mean ability estimates were found to be biased by more than 0.25 SDs when violating Assumption 2. However, our accompanying empirical study suggested that this biasing occurred under extreme conditions that may not be present in some operational settings. Overall, these results suggest that the EM-IRT model provides superior item and equal mean ability parameter estimates in the presence of model violations under realistic conditions when compared with the 2PL model.
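The core idea of treating noneffortful responses as missing can be sketched as a likelihood that simply skips flagged responses. The abstract does not say how noneffortful responses are identified; the response-time thresholds, function names, and parameter values below are assumptions made for illustration only.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def flag_noneffortful(response_times, thresholds):
    """Flag a response as noneffortful when its time is below the item threshold.

    Using response-time thresholds is an assumption of this sketch.
    """
    return [rt < thr for rt, thr in zip(response_times, thresholds)]


def effort_moderated_loglik(theta, responses, flags, items):
    """2PL log-likelihood that skips responses flagged as noneffortful."""
    total = 0.0
    for u, flagged, (a, b) in zip(responses, flags, items):
        if flagged:
            continue  # treated as missing: contributes nothing to the likelihood
        p = p_2pl(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


# Hypothetical examinee: four items, two answered implausibly fast.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 0.5)]
responses = [1, 0, 1, 0]
response_times = [12.0, 1.5, 9.0, 2.0]  # seconds
thresholds = [3.0, 3.0, 3.0, 3.0]       # seconds

flags = flag_noneffortful(response_times, thresholds)
loglik = effort_moderated_loglik(0.0, responses, flags, items)
```

Under this treatment, the two rapid responses carry no information about ability, so only the first and third items enter the log-likelihood.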


2019 ◽  
Vol 45 (4) ◽  
pp. 383-402
Author(s):  
Paul A. Jewsbury ◽  
Peter W. van Rijn

In large-scale educational assessment data consistent with a simple-structure multidimensional item response theory (MIRT) model, where every item measures only one latent variable, separate unidimensional item response theory (UIRT) models for each latent variable are often calibrated for practical reasons. While this approach can be valid for data from a linear test, unacceptable item parameter estimates are obtained when data arise from a multistage test (MST). We explore this situation from a missing data perspective and show mathematically that MST data will be problematic for calibrating multiple UIRT models but not MIRT models. This occurs because some items that were used in the routing decision are excluded from the separate UIRT models, due to measuring a different latent variable. Both simulated and real data from the National Assessment of Educational Progress are used to further confirm and explore the unacceptable item parameter estimates. The theoretical and empirical results confirm that only MIRT models are valid for item calibration of multidimensional MST data.


1992 ◽  
Vol 17 (2) ◽  
pp. 155-173 ◽  
Author(s):  
Kentaro Yamamoto ◽  
John Mazzeo

In educational assessments, it is often necessary to compare the performance of groups of individuals who have been administered different forms of a test. If these groups are to be validly compared, all results need to be expressed on a common scale. When assessment results are to be reported using an item response theory (IRT) proficiency metric, as is done for the National Assessment of Educational Progress (NAEP), establishing a common metric becomes synonymous with expressing IRT item parameter estimates on a common scale. Procedures that accomplish this are referred to here as scale linking procedures. This chapter discusses the need for scale linking in NAEP and illustrates the specific procedures used to carry out the linking in the context of the major analyses conducted for the 1990 NAEP mathematics assessment.
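The abstract does not specify which linking procedure was used for NAEP. As a generic illustration of what putting item parameter estimates on a common scale involves, the sketch below applies the textbook mean-sigma method to a set of common items; the function names and difficulty values are hypothetical.

```python
import statistics


def mean_sigma_constants(b_new, b_ref):
    """Slope A and intercept B that map the new scale onto the reference scale.

    b_new, b_ref: difficulty estimates of the same common items on each scale.
    """
    A = statistics.pstdev(b_ref) / statistics.pstdev(b_new)
    B = statistics.mean(b_ref) - A * statistics.mean(b_new)
    return A, B


def link_items(items, A, B):
    """Transform (a, b) pairs onto the reference scale: a* = a / A, b* = A*b + B."""
    return [(a / A, A * b + B) for a, b in items]


# Hypothetical common-item difficulties on the new scale, and the same items
# on a reference scale that is stretched by 2 and shifted by 0.5.
b_new = [-1.0, -0.2, 0.4, 1.1]
b_ref = [2.0 * b + 0.5 for b in b_new]

A, B = mean_sigma_constants(b_new, b_ref)
linked = link_items([(1.0, b) for b in b_new], A, B)
```

Because abilities transform as θ* = Aθ + B, the product a(θ − b) is unchanged by the transformation, so response probabilities are preserved while all estimates share one metric.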


2017 ◽  
Vol 78 (5) ◽  
pp. 857-886 ◽  
Author(s):  
Zhen Li ◽  
Li Cai

In standard item response theory (IRT) applications, the latent variable is typically assumed to be normally distributed. If the normality assumption is violated, the item parameter estimates can become biased. Summed score likelihood–based statistics may be useful for testing latent variable distribution fit. We develop Satorra–Bentler type moment adjustments to approximate the test statistics’ tail-area probability. A simulation study was conducted to examine the calibration and power of the unadjusted and adjusted statistics in various simulation conditions. Results show that the proposed indices have tail-area probabilities that can be closely approximated by central chi-squared random variables under the null hypothesis. Furthermore, the test statistics are focused. They are powerful for detecting latent variable distributional assumption violations, and not sensitive (correctly) to other forms of model misspecification such as multidimensionality. As a comparison, the goodness-of-fit statistic M2 has considerably lower power against latent variable nonnormality than the proposed indices. Empirical data from a patient-reported health outcomes study are used as illustration.
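The abstract does not describe how the summed score likelihoods are computed. A common way to obtain them is the Lord-Wingersky recursion, sketched below for dichotomous 2PL items; the item parameters and function names are hypothetical, and this is not necessarily the authors' implementation.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a keyed (1) response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def summed_score_likelihood(theta, items):
    """P(summed score = s | theta) for s = 0..n via the Lord-Wingersky recursion.

    items: list of (a, b) tuples for dichotomous items.
    """
    dist = [1.0]  # before any item, the score is 0 with probability 1
    for a, b in items:
        p = p_2pl(theta, a, b)
        new = [0.0] * (len(dist) + 1)
        for s, prob in enumerate(dist):
            new[s] += prob * (1.0 - p)  # item scored 0: score stays at s
            new[s + 1] += prob * p      # item scored 1: score moves to s + 1
        dist = new
    return dist


# Three hypothetical items with b = 0, evaluated at theta = 0, so each
# item has probability 0.5 and the scores follow a Binomial(3, 0.5).
likelihoods = summed_score_likelihood(0.0, [(1.0, 0.0), (1.5, 0.0), (0.7, 0.0)])
```

Integrating these conditional likelihoods over the assumed latent distribution gives model-implied summed score probabilities, which can then be compared with the observed score distribution.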


2013 ◽  
Vol 756-759 ◽  
pp. 2620-2624 ◽  
Author(s):  
Peng Dong Du ◽  
Yan Hua Chu

Building on a detailed examination of parameter estimation methods for the two-parameter logistic model (2PLM) in item response theory and of genetic algorithms, this paper proposes a genetic-algorithm-based parameter estimation method for the 2PLM, together with a corresponding program for estimating the parameters of different items. Drawing on genetic coding and genetic analysis, it proposes improvement strategies for the genetic operators and a strategy to accelerate the algorithm's convergence. A verification program was written and compared with the BILOG software, which is popular abroad; the results showed that, within a certain range of error, the proposed algorithm can converge to the optimal solution.
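The abstract gives no detail on the encoding or operators used. The sketch below is therefore only a generic illustration of fitting one 2PL item with a genetic algorithm: it assumes known abilities, a real-valued encoding, tournament selection, arithmetic crossover, Gaussian mutation, and elitism. All names, bounds, and settings are hypothetical and do not reproduce the paper's improved operators.

```python
import math
import random


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def loglik(params, thetas, responses):
    """Log-likelihood of one item's (a, b) given known abilities."""
    a, b = params
    total = 0.0
    for t, u in zip(thetas, responses):
        p = min(max(p_2pl(t, a, b), 1e-10), 1.0 - 1e-10)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def clip(params):
    """Keep a in [0.2, 3.0] and b in [-4.0, 4.0] (assumed search bounds)."""
    a, b = params
    return (min(max(a, 0.2), 3.0), min(max(b, -4.0), 4.0))


def ga_fit_item(thetas, responses, pop_size=30, generations=40, seed=0):
    """Return (best (a, b), best fitness per generation)."""
    rng = random.Random(seed)

    def fitness(ind):
        return loglik(ind, thetas, responses)

    pop = [(rng.uniform(0.2, 3.0), rng.uniform(-4.0, 4.0)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        next_pop = [elite]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            w = rng.random()                           # arithmetic crossover
            child = (w * p1[0] + (1 - w) * p2[0], w * p1[1] + (1 - w) * p2[1])
            child = (child[0] + rng.gauss(0.0, 0.1),   # Gaussian mutation
                     child[1] + rng.gauss(0.0, 0.1))
            next_pop.append(clip(child))
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical data: 300 examinees answering one item with a = 1.2, b = 0.3.
data_rng = random.Random(1)
thetas = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
responses = [1 if data_rng.random() < p_2pl(t, 1.2, 0.3) else 0 for t in thetas]
best, history = ga_fit_item(thetas, responses)
```

Elitism guarantees that the best log-likelihood never decreases between generations, which makes convergence easy to monitor through `history`.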

