IRT Models
Recently Published Documents


TOTAL DOCUMENTS: 311 (five years: 94)
H-INDEX: 28 (five years: 4)

2022, pp. 001316442110634
Author(s): Patrick D. Manapat, Michael C. Edwards

When fitting unidimensional item response theory (IRT) models, the population distribution of the latent trait (θ) is often assumed to be normally distributed. However, some psychological theories would suggest a nonnormal θ. For example, some clinical traits (e.g., alcoholism, depression) are believed to follow a positively skewed distribution where the construct is low for most people, medium for some, and high for few. Failure to account for nonnormality may compromise the validity of inferences and conclusions. Although corrections have been developed to account for nonnormality, these methods can be computationally intensive and have not yet been widely adopted. Previous research has recommended implementing nonnormality corrections when θ is not “approximately normal.” This research focused on examining how far θ can deviate from normal before the normality assumption becomes untenable. Specifically, our goal was to identify the type(s) and degree(s) of nonnormality that result in unacceptable parameter recovery for the graded response model (GRM) and 2-parameter logistic model (2PLM).
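
As a rough illustration of the simulation setup described above, the following base-R sketch generates two-parameter logistic (2PL) responses under a positively skewed θ. The item parameters and the gamma-based skew are hypothetical choices for illustration, not the conditions used in the study.

## Simulate 2PL responses under a positively skewed latent trait (illustrative only)
set.seed(1)
n_persons <- 2000
a <- c(1.2, 0.8, 1.5, 1.0, 2.0)    # discriminations (hypothetical)
b <- c(-1.0, -0.5, 0.0, 0.5, 1.0)  # difficulties (hypothetical)

## Positively skewed theta, standardized to mean 0 and SD 1
theta_raw  <- rgamma(n_persons, shape = 2, rate = 1)
theta_skew <- (theta_raw - mean(theta_raw)) / sd(theta_raw)

## 2PL item response function: P(X = 1 | theta) = 1 / (1 + exp(-a * (theta - b)))
p2pl <- function(theta, a, b) plogis(a * (theta - b))

## Persons x items response matrix
probs <- sapply(seq_along(a), function(j) p2pl(theta_skew, a[j], b[j]))
resp  <- (matrix(runif(length(probs)), nrow = n_persons) < probs) * 1

## Calibrating `resp` under a normality assumption and comparing the estimates
## with the generating a and b is one way to quantify parameter recovery.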


Author(s): Natalie Förster, Jörg-Tobias Kuhn

Abstract. To monitor students’ progress and adapt instruction to students’ needs, teachers increasingly use repeated assessments with equivalent tests. The present study investigates whether equivalent reading tests can be successfully developed via rule-based item design. Based on theoretical considerations, we identified three item features for reading comprehension at the word, sentence, and text levels, respectively, which should influence the difficulty and time intensity of reading processes. Using optimal design algorithms, a design matrix was calculated, and four equivalent test forms of the German reading test series for second graders (quop-L2) were developed. A total of N = 7,751 students completed the tests. We estimated item difficulty and time intensity parameters as well as person ability and speed parameters using bivariate item response theory (IRT) models, and we investigated the influence of item features on item parameters. Results indicate that all item properties significantly affected either item difficulty or response time. Moreover, as indicated by the IRT-based test information functions and analyses of variance, the four test forms showed similar levels of difficulty and time intensity at the word, sentence, and text levels (all η2 < .002). Results were successfully cross-validated using a sample of N = 5,654 students.
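
A minimal sketch of the underlying idea, relating estimated item difficulties to rule-based item features through a linear model; the binary feature codes and effect sizes below are invented for illustration and do not correspond to the quop-L2 design.

## Relate item difficulties to rule-based item features (illustrative only)
set.seed(2)
n_items <- 40
features <- data.frame(word_freq_low    = rbinom(n_items, 1, 0.5),
                       sentence_complex = rbinom(n_items, 1, 0.5),
                       text_inference   = rbinom(n_items, 1, 0.5))
beta <- c(0.6, 0.4, 0.8)                                  # hypothetical feature effects
difficulty <- drop(as.matrix(features) %*% beta) + rnorm(n_items, sd = 0.2)

## How much of the item-parameter variation do the rule-based features explain?
summary(lm(difficulty ~ ., data = features))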


Author(s): Lennart Schneider, Carolin Strobl, Achim Zeileis, Rudolf Debelak

Abstract. The detection of differential item functioning (DIF) is a central topic in psychometrics and educational measurement. In the past few years, a new family of score-based tests of measurement invariance has been proposed, which allows the detection of DIF along arbitrary person covariates in a variety of item response theory (IRT) models. This paper illustrates the application of these tests within the R system for statistical computing, making them accessible to a broad range of users. This presentation also includes IRT models for which these tests have not previously been investigated, such as the generalized partial credit model. The paper has three goals: First, we review the ideas behind score-based tests of measurement invariance. Second, we describe the implementation of these tests within the R system for statistical computing, which is based on the interaction of the R packages mirt, psychotools and strucchange. Third, we illustrate the application of this software and the interpretation of its output in two empirical datasets. The complete R code for reproducing our results is reported in the paper.
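
A hedged sketch of the kind of workflow the paper describes: fit an IRT model and apply a score-based test of parameter stability along a person covariate. The simulated data, the covariate `age`, and the exact sctest() arguments are assumptions for illustration; consult the packages' documentation and the paper's supplementary code for the authoritative interface.

library(psychotools)   # raschmodel()
library(strucchange)   # sctest()

set.seed(3)
n     <- 500
age   <- runif(n, 20, 60)                      # hypothetical person covariate
theta <- rnorm(n)
b     <- c(-1, -0.5, 0, 0.5, 1)                # hypothetical item difficulties
resp  <- sapply(b, function(bj) rbinom(n, 1, plogis(theta - bj)))

fit <- raschmodel(resp)                        # Rasch model fit via psychotools
## Score-based test of measurement invariance along the continuous covariate
## ("DM" = double-maximum functional; argument names assumed as in the paper)
sctest(fit, order.by = age, functional = "DM")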


Author(s): Ewa Genge, Francesco Bartolucci

Abstract. We analyze the changing attitudes toward immigration in EU host countries in recent years (2010–2018) on the basis of European Social Survey data. These data are collected through a questionnaire whose items concern different aspects of the immigration phenomenon. For this analysis, we rely on a latent class approach, considering a variety of models that allow for: (1) multidimensionality; (2) discreteness of the latent trait distribution; (3) time-constant and time-varying covariates; and (4) sample weights. Through these models we identify latent classes of Europeans with similar levels of immigration acceptance, and we study the effect of different socio-economic covariates on the probability of belonging to these classes, for which we provide a specific interpretation. In this way we show which countries tend to be more or less positive toward immigration, and we analyze the temporal dynamics of the phenomenon under study.
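
One common way to let covariates enter such latent class models is through a multinomial logit for class membership; the base-R sketch below computes membership probabilities under purely illustrative coefficients and is not the authors' parameterization.

## Covariate-dependent class membership probabilities (illustrative only)
class_membership <- function(x, gamma) {
  # x: covariate vector including an intercept; gamma: one coefficient row per
  # non-reference class, the first class serving as reference (all-zero coefficients)
  eta <- c(0, as.matrix(gamma) %*% x)
  exp(eta) / sum(exp(eta))
}

gamma <- rbind(c(-0.5,  0.02,  0.3),   # class 2 vs class 1 (hypothetical)
               c( 0.8, -0.01, -0.6))   # class 3 vs class 1 (hypothetical)
x <- c(1, 45, 1)                       # intercept, age, higher-education indicator
class_membership(x, gamma)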


2021, Vol 6
Author(s): Susan Embretson

An important feature of learning maps, such as Dynamic Learning Maps and Enhanced Learning Maps, is their ability to accommodate nation-wide specifications of standards, such as the Common Core State Standards, within the map nodes along with relevant instruction. These features are especially useful for remedial instruction, provided that accurate diagnosis is available. Year-end achievement tests are potentially useful in this regard. Unfortunately, the current use of total scores or area sub-scores is neither sufficiently precise nor sufficiently reliable to diagnose mastery at the node level, especially when students vary in their patterns of mastery. The current study examines varying approaches to using the year-end test for diagnosis. Prediction at the item level was obtained using parameters from varying item response theory (IRT) models. The results support using mixture-class IRT models to predict mastery, in which either items or node scores vary in difficulty for students in different latent classes. Not only did the mixture models fit better, but trait score reliability was also maintained for the predictions of node mastery.
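
The classification step behind such mixture models can be illustrated with a toy computation: given class-specific item difficulties, the posterior probability of class membership follows from Bayes' rule. All numbers below are hypothetical.

## Posterior class membership in a two-class mixture Rasch setup (illustrative)
b_class <- rbind(c(-1.0,  0.0, 1.0, 2.0),   # item difficulties, latent class 1
                 c(-2.0, -1.0, 0.0, 1.0))   # item difficulties, latent class 2
prior <- c(0.6, 0.4)                        # latent class proportions
theta <- 0.3                                # person ability (treated as known here)
x     <- c(1, 1, 0, 0)                      # observed item responses

## Likelihood of the response pattern under each class (Rasch IRF via plogis)
lik <- apply(b_class, 1, function(b) prod(dbinom(x, 1, plogis(theta - b))))

## Bayes' rule; node mastery could then be predicted from the class profile
posterior <- prior * lik / sum(prior * lik)
posterior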


2021, Vol 6
Author(s): Shenghai Dai, Thao Thu Vo, Olasunkanmi James Kehinde, Haixia He, Yu Xue, ...

The implementation of polytomous item response theory (IRT) models such as the graded response model (GRM) and the generalized partial credit model (GPCM) to inform instrument design and validation has been increasing across social and educational contexts where rating scales are commonly used. The performance of such models has not been fully investigated and compared across conditions with common survey-specific characteristics such as short test length, small sample size, and data missingness. The purpose of the current simulation study is to inform the literature and guide the implementation of the GRM and the GPCM under these conditions. For item parameter estimation, results suggest a sample size of at least 300 and/or an instrument length of at least five items for both models. The performance of the GPCM is stable across instrument lengths, while that of the GRM improves notably as the instrument length increases. For person parameters, the GRM yields more accurate estimates when the proportion of missing data is small, whereas the GPCM is favored in the presence of a large amount of missingness. Further, it is not recommended to compare the GRM and the GPCM based on test information. Relative model fit indices (AIC, BIC, log-likelihood) may lack power when the sample size is below 300 and the instrument length is below five items. A synthesis of the patterns in the results, as well as recommendations for the implementation of polytomous IRT models, is presented and discussed.
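
For readers unfamiliar with the two models, the base-R sketch below contrasts GRM and GPCM category probability curves for a single four-category item; the parameter values are illustrative and unrelated to the simulation conditions.

## GRM vs. GPCM category probabilities for one item (illustrative parameters)
theta <- seq(-3, 3, by = 0.1)
a <- 1.3                        # discrimination
b <- c(-1.0, 0.0, 1.2)          # GRM thresholds / GPCM step parameters

## GRM: cumulative probabilities P(X >= k), category probabilities by differencing
grm_cum <- sapply(b, function(bk) plogis(a * (theta - bk)))
grm_p   <- cbind(1 - grm_cum[, 1],
                 grm_cum[, 1] - grm_cum[, 2],
                 grm_cum[, 2] - grm_cum[, 3],
                 grm_cum[, 3])

## GPCM: adjacent-category (divide-by-total) formulation
gpcm_num <- cbind(1, exp(t(apply(sapply(b, function(bk) a * (theta - bk)), 1, cumsum))))
gpcm_p   <- gpcm_num / rowSums(gpcm_num)

matplot(theta, grm_p,  type = "l", lty = 1, ylab = "P(category)", main = "GRM")
matplot(theta, gpcm_p, type = "l", lty = 1, ylab = "P(category)", main = "GPCM")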


2021, pp. 001316442110453
Author(s): Gabriel Nagy, Esther Ulitzsch

Disengaged item responses pose a threat to the validity of the results provided by large-scale assessments. Several procedures for identifying disengaged responses on the basis of observed response times have been suggested, and item response theory (IRT) models for response engagement have been proposed. We outline that response time-based procedures for classifying response engagement and IRT models for response engagement are based on common ideas, and we propose the distinction between independent and dependent latent class IRT models. In all IRT models considered, response engagement is represented by an item-level latent class variable, but the models assume that response times either reflect or predict engagement. We summarize existing IRT models that belong to each group and extend them to increase their flexibility. Furthermore, we propose a flexible multilevel mixture IRT framework in which all IRT models can be estimated by means of marginal maximum likelihood. The framework is based on the widely used Mplus software, thereby making the procedure accessible to a broad audience. The procedures are illustrated on the basis of publicly available large-scale data. Our results show that the different IRT models for response engagement provided slightly different adjustments of item parameters and of individuals’ proficiency estimates relative to a conventional IRT model.
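
As a point of contrast with the latent-class models discussed above, the sketch below shows the simpler response time threshold approach to flagging disengaged responses; the simulated times and the 3-second cut-off are arbitrary illustrations, not recommendations.

## Threshold-based flagging of rapid (possibly disengaged) responses (illustrative)
set.seed(4)
log_rt <- rnorm(1000, mean = log(20), sd = 0.5)          # engaged responses
log_rt[sample(1000, 100)] <- rnorm(100, log(1.5), 0.3)   # rapid responses
rt <- exp(log_rt)

threshold <- 3                   # seconds; arbitrary cut-off
engaged   <- rt >= threshold     # classify each response
table(engaged)
## In the IRT models compared in the paper, engagement is instead an item-level
## latent class variable that response times either reflect or predict.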


Psychometrika, 2021
Author(s): Steven P. Reise, Han Du, Emily F. Wong, Anne S. Hubbard, Mark G. Haviland

Abstract. Item response theory (IRT) model applications extend well beyond cognitive ability testing, and various patient-reported outcome (PRO) measures are among the more prominent examples. PRO (and similar) constructs differ from cognitive ability constructs in many ways, and these differences have model-fitting implications. With a few notable exceptions, however, most IRT applications to PRO constructs rely on traditional IRT models, such as the graded response model. We review some notable differences between cognitive and PRO constructs and how these differences can present challenges for traditional IRT model applications. We then apply two models (the traditional graded response model and an alternative log-logistic model) to depression measure data drawn from the Patient-Reported Outcomes Measurement Information System project. We do not claim that one model is “a better fit” or more “valid” than the other; rather, we show that the log-logistic model may be more consistent with the construct of depression as a unipolar phenomenon. Clearly, the graded response and log-logistic models can lead to different conclusions about the psychometrics of an instrument and the scaling of individual differences. We underscore, too, that, in general, the question of which model is more appropriate cannot be decided by fit index comparisons alone; such decisions may require integrating psychometrics with theory and research findings on the construct of interest.
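
The contrast between the two models can be visualized with item response curves: a logistic curve on an unbounded (bipolar) trait versus a log-logistic curve defined only for a nonnegative (unipolar) trait. The parameterization of the log-logistic function below is an assumption for illustration and may differ from the model used in the paper.

## Logistic vs. unipolar log-logistic item response curves (illustrative)
theta_norm <- seq(-3, 3, by = 0.1)      # bipolar trait (usual GRM scaling)
theta_uni  <- seq(0.01, 6, by = 0.1)    # unipolar trait (severity >= 0)

p_logit <- plogis(1.2 * (theta_norm - 0.5))                   # logistic IRF
p_llog  <- (0.8 * theta_uni^1.5) / (1 + 0.8 * theta_uni^1.5)  # assumed log-logistic form

plot(theta_norm, p_logit, type = "l", xlab = "theta", ylab = "P(response)")
plot(theta_uni,  p_llog,  type = "l", xlab = "theta (unipolar)", ylab = "P(response)")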


2021
Author(s): Masaki Uto

Abstract. Performance assessment, in which human raters assess examinee performance in a practical task, often involves the use of a scoring rubric consisting of multiple evaluation items to increase the objectivity of evaluation. However, even when a rubric is used, assigned scores are known to depend on characteristics of the rubric’s evaluation items and of the raters, thus decreasing the accuracy of ability measurement. To resolve this problem, item response theory (IRT) models that can estimate examinee ability while considering the effects of these characteristics have been proposed. These IRT models assume unidimensionality, meaning that a rubric measures one latent ability. In practice, however, this assumption might not be satisfied because a rubric’s evaluation items are often designed to measure multiple sub-abilities that constitute a targeted ability. To address this issue, this study proposes a multidimensional IRT model for rubric-based performance assessment. Specifically, the proposed model is formulated as a multidimensional extension of a generalized many-facet Rasch model. Moreover, a No-U-Turn variant of the Hamiltonian Monte Carlo algorithm is adopted as the parameter estimation method for the proposed model. The proposed model is useful not only for improving the accuracy of ability measurement but also for detailed analysis of rubric quality and rubric construct validity. The study demonstrates the effectiveness of the proposed model through simulation experiments and application to real data.
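
A minimal sketch of the kind of category probability that many-facet Rasch-type models assign to a rubric score, with examinee, evaluation-item, and rater terms entering additively; the parameterization is illustrative, and the paper's generalized, multidimensional model adds further parameters.

## Many-facet Rasch-style category probabilities for one rubric item (illustrative)
mfrm_probs <- function(theta, beta_item, rho_rater, steps) {
  # cumulative category "step" terms; category 0 has a linear term of 0
  lin <- cumsum(theta - beta_item - rho_rater - steps)
  num <- exp(c(0, lin))
  num / sum(num)
}

## Hypothetical values: ability, evaluation-item difficulty, rater severity,
## and two step parameters for a three-category rubric item
mfrm_probs(theta = 0.5, beta_item = 0.2, rho_rater = -0.1, steps = c(-0.8, 0.8))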

