Calibrating the Medical Council of Canada’s Qualifying Examination Part I using an integrated item response theory framework: a comparison of models and designs

Purpose: This research compared methods, based on item response theory (IRT), of calibrating the multiple-choice question (MCQ) and clinical decision-making (CDM) components of the Medical Council of Canada’s Qualifying Examination Part I (MCCQEI). Methods: The data consisted of test results from 8,213 first-time applicants to the MCCQEI in the spring and fall 2010 and 2011 test administrations. The data set contained several thousand multiple-choice items and several hundred CDM cases. Four dichotomous calibrations were run using BILOG-MG 3.0. All 3 mixed item format calibrations (dichotomous MCQ responses and polytomous CDM case scores) were conducted using PARSCALE 4. Results: The 2-PL model had identical numbers of items with chi-square values at or below a Type I error rate of 0.01 (83/3,499 or 0.02). In all 3 polytomous models, whether the MCQs were anchored or run concurrently with the CDM cases, the results suggested very poor fit. All IRT abilities estimated from the dichotomous calibration designs correlated very highly with each other. IRT-based pass-fail rates were extremely similar, not only across calibration designs and methods but also with regard to the actual decisions reported to candidates. The largest difference in pass rates was 4.78 percentage points, which occurred between the mixed format concurrent 2-PL graded response model (pass rate = 80.43%) and the dichotomous anchored 1-PL calibrations (pass rate = 85.21%). Conclusion: Simpler calibration designs with dichotomized items should be implemented. The dichotomous calibrations provided better fit of the item response matrix than the more complex, polytomous calibrations.
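The 1-PL and 2-PL models named in this abstract differ only in whether every item shares one discrimination parameter. The following is a minimal Python sketch of the standard logistic item response function, offered for orientation only. The function name `irf_2pl` and all parameter values are illustrative assumptions; they are not taken from the MCCQEI calibrations, and the sketch does not reproduce BILOG-MG or PARSCALE. It also omits the scaling constant (commonly 1.7) that some programs apply to the exponent.

```python
import math


def irf_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2-PL logistic model.

    theta: candidate ability
    a:     item discrimination (hold it constant across items for a 1-PL form)
    b:     item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, chosen only for illustration.
p_at_difficulty = irf_2pl(theta=0.5, a=1.2, b=0.5)  # ability equals difficulty
p_above = irf_2pl(theta=1.5, a=1.2, b=0.5)          # ability above difficulty
p_below = irf_2pl(theta=-0.5, a=1.2, b=0.5)         # ability below difficulty
```

When ability equals difficulty the model gives a probability of one half, and the probability rises with ability at a rate governed by `a`.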


Mokken Scale Analysis: Discussion and Application

Advances in Social Sciences Research Journal ◽

10.14738/assrj.83.9949 ◽

2021 ◽

Vol 8 (3) ◽

pp. 672-695

Author(s):

Thomas DeVaney

Keyword(s):

Item Response Theory ◽

Item Response ◽

Rating Scale ◽

Scale Analysis ◽

Guttman Scale ◽

Response Theory ◽

Data Set ◽

Mokken Scale Analysis ◽

Irt Models ◽

Guttman Scaling

This article presents a discussion and illustration of Mokken scale analysis (MSA), a nonparametric form of item response theory (IRT), in relation to common IRT models such as Rasch and Guttman scaling. The procedure can be used for the dichotomous and ordinal polytomous data commonly collected with questionnaires. The assumptions of MSA are discussed, as are the characteristics that differentiate a Mokken scale from a Guttman scale. MSA is illustrated using the mokken package in RStudio and a data set that included over 3,340 responses to a modified version of the Statistical Anxiety Rating Scale. Issues addressed in the illustration include monotonicity, scalability, and invariant ordering. The R script for the illustration is included.
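Scalability in MSA is usually summarized with Loevinger's H coefficient: the sum of observed pairwise item covariances divided by the sum of the largest covariances the item marginals would allow. The sketch below computes H for dichotomous items in plain Python. It is an illustration under stated assumptions, not the article's R script and not the API of the mokken package; the function name `scalability_h` and both small data sets are hypothetical.

```python
from itertools import combinations


def scalability_h(responses: list[list[int]]) -> float:
    """Loevinger's H for dichotomous (0/1) items.

    responses: one row per respondent, one column per item.
    Assumes no item is answered identically by everyone; otherwise the
    denominator can be zero.
    """
    n = len(responses)
    n_items = len(responses[0])
    # Proportion of positive responses for each item.
    p = [sum(row[i] for row in responses) / n for i in range(n_items)]
    observed = 0.0
    maximum = 0.0
    for i, j in combinations(range(n_items), 2):
        # Joint proportion of respondents endorsing both items.
        p_ij = sum(row[i] * row[j] for row in responses) / n
        observed += p_ij - p[i] * p[j]
        # Largest covariance possible given the two item marginals.
        maximum += min(p[i], p[j]) - p[i] * p[j]
    return observed / maximum


# Hypothetical data: a perfect Guttman pattern, then the same pattern
# with one respondent who violates the item ordering.
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
with_error = guttman + [[0, 1, 0]]

h_guttman = scalability_h(guttman)
h_with_error = scalability_h(with_error)
```

A perfect Guttman pattern reaches the upper bound of H, and each Guttman error pulls the coefficient below it, which is one way to see how a Mokken scale relaxes the deterministic Guttman model.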


Examination of Different Item Response Theory Models on Tests Composed of Testlets

Journal of Education and Learning ◽

10.5539/jel.v6n4p113 ◽

2017 ◽

Vol 6 (4) ◽

pp. 113

Author(s):

Esin Yilmaz Kogar ◽

Hülya Kelecioglu

Keyword(s):

Item Response Theory ◽

Sample Size ◽

Item Response ◽

Data Sets ◽

Meaningful Difference ◽

Response Theory ◽

Size Change ◽

Data Set ◽

Testlet Response Theory ◽

Item Response Theory Models

The purpose of this research is first to estimate the item and ability parameters, and the standard errors of those parameters, obtained from unidimensional item response theory (UIRT), bifactor (BIF), and testlet response theory (TRT) models for tests that include testlets, as the number of testlets, the number of independent items, and the sample size change, and then to compare the results. The mathematics test from PISA 2012 was used as the data collection tool, and 36 items were used to build six data sets containing different numbers of testlets and independent items. From these data sets, samples of three different sizes (250, 500, and 1,000 persons) were then selected at random. The findings showed that the lowest mean error values were generally obtained from UIRT, and that TRT yielded lower mean error estimates than BIF. Under all conditions, the models that take local dependency into account provided better model-data fit than UIRT; there was generally no meaningful difference between BIF and TRT, and both models can be used for these data sets. When there was a meaningful difference between the two models, BIF generally yielded the better result. In addition, for every sample size and data set, the correlations of the item and ability parameters, and of the errors of those parameters, were generally high.


Obtaining Classical Reliability Terms from Item Response Theory in Multiple Choice Tests

Ankara Universitesi Egitim Bilimleri Fakultesi Dergisi ◽

10.1501/egifak_0000000138 ◽

2006 ◽

pp. 001-018

Author(s):

Halil YURDUGÜL

Keyword(s):

Item Response Theory ◽

Item Response ◽

Multiple Choice ◽

Response Theory ◽

Multiple Choice Tests ◽

Choice Tests


A Bifactor Multidimensional Item Response Theory Model for Differential Item Functioning Analysis on Testlet-Based Items

Applied Psychological Measurement ◽

10.1177/0146621611428447 ◽

2011 ◽

Vol 35 (8) ◽

pp. 604-622 ◽

Cited By ~ 16

Author(s):

Hirotaka Fukuhara ◽

Akihito Kamata

Keyword(s):

Item Response Theory ◽

Differential Item Functioning ◽

Item Response ◽

Estimation Method ◽

Multidimensional Item Response Theory ◽

Multidimensional Item Response ◽

Response Theory ◽

Data Set ◽

Detection Rates ◽

Item Functioning

A differential item functioning (DIF) detection method for testlet-based data was proposed and evaluated in this study. The proposed DIF model is an extension of a bifactor multidimensional item response theory (MIRT) model for testlets. Unlike traditional item response theory (IRT) DIF models, the proposed model takes testlet effects into account, thus estimating DIF magnitude appropriately when a test is composed of testlets. A fully Bayesian estimation method was adopted for parameter estimation. The recovery of parameters was evaluated for the proposed DIF model. Simulation results revealed that the proposed bifactor MIRT DIF model produced better estimates of DIF magnitude and higher DIF detection rates than the traditional IRT DIF model for all simulation conditions. A real data analysis was also conducted by applying the proposed DIF model to a statewide reading assessment data set.


Item Response Theory Applied to Combinations of Multiple-Choice and Constructed-Response Items—Approximation Methods for Scale Scores

Test Scoring ◽

10.4324/9781410604729-15 ◽

2001 ◽

pp. 305-354

Keyword(s):

Item Response Theory ◽

Item Response ◽

Multiple Choice ◽

Approximation Methods ◽

Response Theory ◽

Constructed Response ◽

Scale Scores


Robustness of Projective IRT to Misspecification of the Underlying Multidimensional Model

Applied Psychological Measurement ◽

10.1177/0146621620909894 ◽

2020 ◽

Vol 44 (5) ◽

pp. 362-375

Author(s):

Tyler Strachan ◽

Edward Ip ◽

Yanyan Fu ◽

Terry Ackerman ◽

Shyh-Huei Chen ◽

...

Keyword(s):

Item Response Theory ◽

Item Response ◽

Real Data ◽

Model Parameters ◽

Simulation Studies ◽

Response Theory ◽

Computational Stability ◽

Data Set ◽

Response Data ◽

Higher Dimensional

As a method to derive a “purified” measure along a dimension of interest from response data that are potentially multidimensional in nature, the projective item response theory (PIRT) approach requires first fitting a multidimensional item response theory (MIRT) model to the data before projecting onto a dimension of interest. This study aims to explore how accurate the PIRT results are when the estimated MIRT model is misspecified. Specifically, we focus on using a (potentially misspecified) two-dimensional (2D)-MIRT for projection because of its advantages, including interpretability, identifiability, and computational stability, over higher dimensional models. Two large simulation studies (I and II) were conducted. Both studies examined whether the fitting of a 2D-MIRT is sufficient to recover the PIRT parameters when multiple nuisance dimensions exist in the test items, which were generated, respectively, under compensatory MIRT and bifactor models. Various factors were manipulated, including sample size, test length, latent factor correlation, and number of nuisance dimensions. The results from simulation studies I and II showed that the PIRT was overall robust to a misspecified 2D-MIRT. Smaller third and fourth simulation studies were done to evaluate recovery of the PIRT model parameters when the correctly specified higher dimensional MIRT or bifactor model was fitted with the response data. In addition, a real data set was used to illustrate the robustness of PIRT.


Comparison of Finite State Score Theory, Classical Test Theory, and Item Response Theory in Scoring Multiple-Choice Items

Educational and Psychological Measurement ◽

10.1177/0013164497057004004 ◽

1997 ◽

Vol 57 (4) ◽

pp. 580-589 ◽

Cited By ~ 5

Author(s):

Joyce L. Ndalichako ◽

W. Todd Rogers

Keyword(s):

Item Response Theory ◽

Item Response ◽

Classical Test Theory ◽

Multiple Choice ◽

Test Theory ◽

Response Theory ◽

Classical Test ◽

Finite State ◽

Multiple Choice Items ◽

State Score


The Psychometrics Tests Properties of Multiple Choice and Completion Test “A comparison Study by Using Item Response Theory”

Journal of Educational & Psychological Sciences ◽

10.12785/jeps/130313 ◽

2012 ◽

Vol 13 (03) ◽

pp. 375-404

Author(s):

Hamdy Y. Abu Jarad

Keyword(s):

Item Response Theory ◽

Item Response ◽

Multiple Choice ◽

Comparison Study ◽

Response Theory ◽

Completion Test


Establishing Thresholds for Meaningful Within-individual Change Using Longitudinal Item Response Theory

10.21203/rs.3.rs-371137/v1 ◽

2021 ◽

Author(s):

Jakob Bue Bjorner ◽

Berend Terluin ◽

Andrew Trigg ◽

Jinxiang Hu ◽

Keri J.S. Brady ◽

...

Keyword(s):

Item Response Theory ◽

Item Response ◽

Individual Change ◽

Response Theory ◽

Traditional Methods ◽

Data Set ◽

Score Improvement ◽

Patient Reported ◽

Longitudinal Item Response

PURPOSE: Thresholds for meaningful within-individual change (MWIC) are useful for interpreting patient-reported outcome measures (PROM). Transition ratings (TR) have been recommended as anchors to establish MWIC. Traditional statistical methods for analyzing MWIC such as mean change analysis, receiver operating characteristic (ROC) analysis, and predictive modeling ignore problems of floor/ceiling effects and measurement error in the PROM scores and the TR item. We present a novel approach to MWIC estimation for multi-item scales using longitudinal item response theory (LIRT). METHODS: A Graded Response LIRT model for baseline and follow-up PROM data was expanded to include a TR item measuring latent change. The LIRT threshold parameter for the TR established the MWIC threshold on the latent metric, from which the observed PROM score MWIC threshold was estimated. We compared the LIRT approach and traditional methods using an example data set with baseline and three follow-up assessments differing by magnitude of score improvement, variance of score improvement, and baseline-follow-up score correlation. RESULTS: The LIRT model provided good fit to the data. LIRT estimates of observed PROM MWIC varied between 3 and 4 points score improvement. In contrast, results from traditional methods varied from 2 points to 10 points, strongly associated with the proportion of self-rated improvement. Best agreement between methods was seen when approximately 50% rated their health as improved. CONCLUSION: Results from traditional analyses of anchor-based MWIC are impacted by study conditions. LIRT constitutes a promising and more robust analytic approach to identifying thresholds for MWIC.
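The graded response model underlying this approach assigns each ordered response category a probability equal to the difference between adjacent cumulative logistic curves. The Python sketch below shows only that cross-sectional building block, assuming Samejima's standard formulation. It does not implement the longitudinal model or the transition-rating extension described in the abstract, and the function name `grm_category_probs` and the parameter values are hypothetical.

```python
import math


def grm_category_probs(theta: float, a: float, thresholds: list[float]) -> list[float]:
    """Category probabilities under a graded response model.

    theta:      latent trait level
    a:          item discrimination
    thresholds: increasing threshold parameters b_1 < ... < b_{K-1}
    Returns K probabilities, one per ordered response category.
    """
    # Cumulative probabilities of responding in category k or higher,
    # bracketed by 1 (lowest category or higher) and 0 (above the highest).
    cumulative = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    # Each category probability is the drop between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical four-category item, chosen only for illustration.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
```

Because the thresholds are ordered, every category probability is non-negative and the probabilities sum to one at any trait level.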


Quantitatively ranking incorrect responses to multiple-choice questions using item response theory

Physical Review Physics Education Research ◽

10.1103/physrevphyseducres.16.010107 ◽

2020 ◽

Vol 16 (1) ◽

Author(s):

Trevor I. Smith ◽

Kyle J. Louis ◽

Bartholomew J. Ricci ◽

Nasrine Bendjilali

Keyword(s):

Item Response Theory ◽

Item Response ◽

Multiple Choice ◽

Multiple Choice Questions ◽

Response Theory
