The Hierarchical Rater Model for Rated Test Items and its Application to Large-Scale Educational Assessment Data

2002 ◽  
Vol 27 (4) ◽  
pp. 341-384 ◽  
Author(s):  
Richard J. Patz ◽  
Brian W. Junker ◽  
Matthew S. Johnson ◽  
Louis T. Mariano

Open-ended or “constructed” student responses to test items have become a stock component of standardized educational assessments. Digital imaging of examinee work now enables a distributed rating process to be flexibly managed, and allocation designs that involve as many as six or more ratings for a subset of responses are now feasible. In this article we develop Patz’s (1996) hierarchical rater model (HRM) for polytomous item response data scored by multiple raters, and show how it can be used to scale examinees and items, to model aspects of consensus among raters, and to model individual rater severity and consistency effects. The HRM treats examinee responses to open-ended items as unobserved discrete variables, and it explicitly models the “proficiency” of raters in assigning accurate scores as well as the proficiency of examinees in providing correct responses. We show how the HRM “fits in” to the generalizability theory framework that has been the traditional tool of analysis for rated item response data, and give some relationships between the HRM, the design effects correction of Bock, Brennan and Muraki (1999), and the rater bundle model of Wilson and Hoskens (2002). Using simulated and real data, we compare the HRM to the conventional IRT Facets model for rating data (e.g., Linacre, 1989; Engelhard, 1994, 1996), and we explore ways that information from HRM analyses may improve the quality of the rating process.
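
A minimal simulation sketch of the HRM's two-stage structure may help fix ideas: an "ideal" rating category for each examinee is drawn from a polytomous IRT model (a generalized partial credit model here), and each rater then reports a possibly distorted category whose distribution reflects that rater's severity (a shift) and consistency (a spread). This is an illustration under assumed parameter values, not the authors' estimation code; names such as simulate_hrm and the particular rater distribution are hypothetical stand-ins for the HRM's signal-detection stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def gpcm_probs(theta, a, b_steps):
    """Generalized partial credit model category probabilities for one item.
    b_steps has K-1 step parameters for K categories (0..K-1)."""
    terms = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b_steps)))))
    e = np.exp(terms - terms.max())
    return e / e.sum()

def rater_probs(ideal, K, severity, sd):
    """Discrete rating distribution centered near the ideal category, shifted by the
    rater's severity and spread by sd (a rough stand-in for the HRM rating stage)."""
    cats = np.arange(K)
    w = np.exp(-0.5 * ((cats - (ideal + severity)) / sd) ** 2)
    return w / w.sum()

def simulate_hrm(n_persons=500, K=4):
    a, b_steps = 1.0, [-0.5, 0.0, 0.8]   # one item, illustrative values
    severities = [0.0, 0.4, -0.6]        # rating-center shift: 0 = accurate, + = lenient, - = severe
    sds = [0.4, 0.4, 0.8]                # larger spread = less consistent rater
    theta = rng.normal(size=n_persons)
    ideal = np.array([rng.choice(K, p=gpcm_probs(t, a, b_steps)) for t in theta])
    ratings = np.empty((n_persons, len(severities)), dtype=int)
    for r, (sev, sd) in enumerate(zip(severities, sds)):
        for i, xi in enumerate(ideal):
            ratings[i, r] = rng.choice(K, p=rater_probs(xi, K, sev, sd))
    return ideal, ratings

ideal, ratings = simulate_hrm()
print("exact agreement with the ideal category, by rater:",
      (ratings == ideal[:, None]).mean(axis=0))
```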

2020 ◽  
Vol 44 (5) ◽  
pp. 362-375
Author(s):  
Tyler Strachan ◽  
Edward Ip ◽  
Yanyan Fu ◽  
Terry Ackerman ◽  
Shyh-Huei Chen ◽  
...  

As a method to derive a “purified” measure along a dimension of interest from response data that are potentially multidimensional in nature, the projective item response theory (PIRT) approach requires first fitting a multidimensional item response theory (MIRT) model to the data before projecting onto the dimension of interest. This study explores how accurate the PIRT results are when the estimated MIRT model is misspecified. Specifically, we focus on using a (potentially misspecified) two-dimensional (2D)-MIRT for projection because of its advantages, including interpretability, identifiability, and computational stability, over higher dimensional models. Two large simulation studies (I and II) were conducted. Both examined whether fitting a 2D-MIRT is sufficient to recover the PIRT parameters when multiple nuisance dimensions exist in the test items; the data were generated under compensatory MIRT models in Study I and under bifactor models in Study II. Various factors were manipulated, including sample size, test length, latent factor correlation, and number of nuisance dimensions. The results of both studies showed that PIRT was overall robust to a misspecified 2D-MIRT. Smaller third and fourth simulation studies evaluated recovery of the PIRT model parameters when the correctly specified higher dimensional MIRT or bifactor model was fitted to the response data. In addition, a real data set was used to illustrate the robustness of PIRT.
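
To make the notion of projection concrete, the sketch below numerically integrates a compensatory 2D item response function over the nuisance dimension conditional on the dimension of interest, assuming a standard bivariate normal latent distribution with correlation rho. It illustrates what a projected (unidimensional) item response curve looks like; it is not the PIRT estimation procedure evaluated in the article, and the parameter values are made up.

```python
import numpy as np

def irf_2d(theta1, theta2, a1, a2, d):
    """Compensatory 2D logistic item response function."""
    return 1.0 / (1.0 + np.exp(-(a1 * theta1 + a2 * theta2 + d)))

def projected_irf(theta1, a1, a2, d, rho, n_quad=61):
    """Project the 2D IRF onto theta1 by integrating out theta2 given theta1,
    assuming (theta1, theta2) are standard bivariate normal with correlation rho."""
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)  # probabilists' Hermite quadrature
    w = w / w.sum()                                    # weights for N(0, 1)
    theta2 = rho * theta1 + np.sqrt(1.0 - rho**2) * z  # theta2 | theta1 ~ N(rho*theta1, 1-rho^2)
    return np.sum(w * irf_2d(theta1, theta2, a1, a2, d))

# Example: an item loading on both dimensions, projected onto theta1
for t1 in (-2.0, 0.0, 2.0):
    p_at_mean = irf_2d(t1, 0.0, a1=1.2, a2=0.8, d=0.3)
    p_proj = projected_irf(t1, a1=1.2, a2=0.8, d=0.3, rho=0.5)
    print(f"theta1={t1:+.1f}  2D IRF at theta2=0: {p_at_mean:.3f}  projected IRF: {p_proj:.3f}")
```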


2021 ◽  
Author(s):  
Benjamin Domingue ◽  
Dimiter Dimitrov

A recently developed framework of measurement, referred to as the Delta-scoring (or D-scoring) method (DSM; e.g., Dimitrov, 2016, 2018, 2020), is gaining attention in the field of educational measurement and is widely used in large-scale assessments at the National Center for Assessment in Saudi Arabia. The D-scores obtained under the DSM range from 0 to 1 and indicate how much (what proportion) of the ability measured by a test of binary items is demonstrated by the examinee. This study examines whether the D-scale is an interval scale and how D-scores compare to IRT ability scores (thetas) in terms of intervalness by testing the axioms of additive conjoint measurement (ACM). The approach to testing is ConjointChecks (Domingue, 2014), which implements a Bayesian method for evaluating whether the axioms are violated in a given empirical item response data set. The results indicate that the D-scores, computed under the DSM, produce fewer violations of the ordering axioms of ACM than do the IRT “theta” scores. The conclusion is that the DSM produces a dependable D-scale in terms of the essential property of intervalness.
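
A simplified, non-Bayesian illustration of the kind of ordering (single cancellation) check that ConjointChecks formalizes: band examinees by score and items by difficulty, form the matrix of observed proportions correct, and count cells where the row or column ordering is violated. The actual procedure of Domingue (2014) evaluates these axioms within a Bayesian framework that accounts for sampling error; the data and banding below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def proportion_matrix(responses, person_scores, item_difficulty, n_bands=5):
    """Rows: person-score bands (low to high); columns: item-difficulty bands
    (easy to hard). Each cell is the observed proportion correct."""
    p_edges = np.quantile(person_scores, np.linspace(0, 1, n_bands + 1)[1:-1])
    i_edges = np.quantile(item_difficulty, np.linspace(0, 1, n_bands + 1)[1:-1])
    p_band = np.digitize(person_scores, p_edges)
    i_band = np.digitize(item_difficulty, i_edges)
    M = np.full((n_bands, n_bands), np.nan)
    for r in range(n_bands):
        for c in range(n_bands):
            cell = responses[np.ix_(p_band == r, i_band == c)]
            if cell.size:
                M[r, c] = cell.mean()
    return M

def ordering_violations(M):
    """Count adjacent-cell violations: higher person bands should do better (rows),
    harder item bands should be answered correctly less often (columns)."""
    row_viol = np.nansum(np.diff(M, axis=0) < 0)
    col_viol = np.nansum(np.diff(M, axis=1) > 0)
    return int(row_viol), int(col_viol)

# Hypothetical data: 1000 examinees, 30 Rasch-like items
theta = rng.normal(size=(1000, 1))
b = np.linspace(-2, 2, 30)
responses = (rng.random((1000, 30)) < 1 / (1 + np.exp(-(theta - b)))).astype(int)
M = proportion_matrix(responses, responses.mean(axis=1), b)
print("row/column ordering violations:", ordering_violations(M))
```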


2021 ◽  
Author(s):  
Ben Stenhaug ◽  
Michael C. Frank ◽  
Benjamin Domingue

Differential item functioning (DIF) is a popular technique within the item response theory framework for detecting test items that are biased against particular demographic groups. The last thirty years have brought significant methodological advances in detecting DIF. Still, typical methods, such as matching on sum scores or identifying anchor items, are based exclusively on internal criteria and therefore rely on a crucial piece of circular logic: items with DIF are identified via an assumption that other items do not have DIF. This logic is an attempt to solve an easy-to-overlook identification problem that arises at the beginning of most DIF detection analyses. We explore this problem, which we describe as the Fundamental DIF Identification Problem, in depth here. We suggest three steps for determining whether it is surmountable and whether DIF detection results can be trusted. (1) Examine raw item response data for potential DIF. To this end, we introduce a new graphical method for visualizing potential DIF in raw item response data. (2) Compare the results of a variety of methods. These methods, which we describe in detail, include commonly used anchor item methods, recently proposed anchor point methods, and our suggested adaptations. (3) Interpret results in light of the possibility of DIF methods failing. We illustrate the basic challenge and the methodological options using the classic verbal aggression data and a simulation study. We recommend best practices for cautious DIF detection.
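
As a rough sketch of step (1), the code below conditions on the rest score (sum score excluding the studied item) and compares an item's proportion correct between two groups within each score stratum; a persistent gap across strata is a visual flag for potential DIF. This is a generic score-stratified comparison in the spirit of the step described above, not necessarily the specific graphical method introduced in the article, and all data and names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def dif_profile(responses, group, item, n_strata=6):
    """Proportion correct on `item` by rest-score stratum, separately per group.
    `responses` is persons x items (0/1); `group` is a 0/1 vector."""
    rest_score = responses.sum(axis=1) - responses[:, item]  # rest score avoids self-matching
    edges = np.quantile(rest_score, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.digitize(rest_score, edges)
    return {g: [responses[(group == g) & (stratum == s), item].mean()
                for s in range(n_strata)]
            for g in (0, 1)}

# Hypothetical example: item 0 is made harder for group 1
rng = np.random.default_rng(2)
n, J = 2000, 20
theta = rng.normal(size=n)
group = rng.integers(0, 2, size=n)
b = rng.normal(size=J)
logits = theta[:, None] - b[None, :]
logits[:, 0] -= 0.8 * group            # extra difficulty of item 0 for group 1
responses = (rng.random((n, J)) < 1 / (1 + np.exp(-logits))).astype(int)

curves = dif_profile(responses, group, item=0)
for g, c in curves.items():
    plt.plot(range(len(c)), c, marker="o", label=f"group {g}")
plt.xlabel("rest-score stratum (low to high)")
plt.ylabel("proportion correct on item 0")
plt.legend()
plt.show()
```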


2019 ◽  
Vol 45 (4) ◽  
pp. 383-402
Author(s):  
Paul A. Jewsbury ◽  
Peter W. van Rijn

In large-scale educational assessment data consistent with a simple-structure multidimensional item response theory (MIRT) model, where every item measures only one latent variable, separate unidimensional item response theory (UIRT) models for each latent variable are often calibrated for practical reasons. While this approach can be valid for data from a linear test, unacceptable item parameter estimates are obtained when data arise from a multistage test (MST). We explore this situation from a missing data perspective and show mathematically that MST data will be problematic for calibrating multiple UIRT models but not MIRT models. This occurs because some items that were used in the routing decision are excluded from the separate UIRT models, due to measuring a different latent variable. Both simulated and real data from the National Assessment of Educational Progress are used to further confirm and explore the unacceptable item parameter estimates. The theoretical and empirical results confirm that only MIRT models are valid for item calibration of multidimensional MST data.
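
The mechanism can be seen in a few lines of simulation (illustrative only, not the NAEP analysis): the routing decision at stage 1 uses items from both latent variables, so when the dimension-2 items are later calibrated with a separate UIRT model that drops the dimension-1 items, the missingness pattern in the stage-2 blocks still depends on variables the UIRT model never sees, and the missing data are not ignorable for that calibration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 5000, 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # simple-structure 2D abilities

def sim_block(th, b):
    """Rasch responses for ability vector th and item difficulties b."""
    return (rng.random((len(th), len(b))) < 1 / (1 + np.exp(-(th[:, None] - b[None, :])))).astype(int)

# Stage 1 (routing block): 5 items on dimension 1 and 5 on dimension 2
r1 = sim_block(theta[:, 0], np.linspace(-1, 1, 5))
r2 = sim_block(theta[:, 1], np.linspace(-1, 1, 5))
route_hard = (r1.sum(axis=1) + r2.sum(axis=1)) >= 6   # routing uses BOTH dimensions

# Stage 2: each examinee sees either an easy or a hard dimension-2 block
easy_b, hard_b = np.linspace(-2, 0, 10), np.linspace(0, 2, 10)
stage2 = np.where(route_hard[:, None],
                  sim_block(theta[:, 1], hard_b),
                  sim_block(theta[:, 1], easy_b))

# A separate UIRT calibration of dimension 2 drops the dimension-1 routing items,
# yet those items helped decide who answered which stage-2 block:
print("mean theta1, routed to hard block:", theta[route_hard, 0].mean())
print("mean theta1, routed to easy block:", theta[~route_hard, 0].mean())
```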


2015 ◽  
Vol 2015 ◽  
pp. 1-13
Author(s):  
ByoungWook Kim ◽  
JaMee Kim ◽  
WonGyu Lee

Item response data are the n × m data formed by the responses of m examinees to a questionnaire consisting of n items; they are used to estimate examinee abilities and item parameters in educational evaluation. For such estimates to be valid, the simulation input data must reflect reality. This paper presents an effective combination of the genetic algorithm (GA) and Monte Carlo methods for generating item response data that serve as simulation input similar to real data. To this end, we generated four types of item response data using Monte Carlo methods and the GA and evaluated how closely the generated data reproduce the real item response data in terms of the item parameters (item difficulty and discrimination). We adopt two measures, root mean square error and Kullback-Leibler divergence, to compare item parameters between the real data and the four types of generated data. The results show that applying the GA to an initial population generated by Monte Carlo is the most effective approach for generating item response data that most closely resemble real item response data. This study is meaningful in that it shows the GA contributes to the generation of more realistic simulation input data.
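
A compact, hypothetical sketch of the combination described above: Monte Carlo draws of 0/1 response matrices under a 2PL model seed a genetic algorithm whose fitness is the closeness of classical item statistics (item p-values here, for brevity, rather than the IRT difficulty and discrimination used in the paper) to those of the real data. The selection, crossover, and mutation settings are illustrative, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(4)

def monte_carlo_matrix(n_persons, a, b):
    """Draw a 0/1 response matrix under a 2PL model with random abilities."""
    theta = rng.normal(size=(n_persons, 1))
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return (rng.random(p.shape) < p).astype(int)

def fitness(candidate, target_pvalues):
    """Negative RMSE between the candidate's item p-values and the target's."""
    return -np.sqrt(np.mean((candidate.mean(axis=0) - target_pvalues) ** 2))

def ga_refine(population, target_pvalues, n_gen=50, mut_rate=0.002):
    for _ in range(n_gen):
        scores = np.array([fitness(m, target_pvalues) for m in population])
        order = np.argsort(scores)[::-1]
        parents = [population[i] for i in order[: len(population) // 2]]  # truncation selection
        children = []
        for _ in range(len(population) - len(parents)):
            p1, p2 = rng.choice(len(parents), 2, replace=False)
            mask = rng.random(parents[0].shape) < 0.5         # uniform crossover
            child = np.where(mask, parents[p1], parents[p2])
            flip = rng.random(child.shape) < mut_rate          # bit-flip mutation
            children.append(np.abs(child - flip.astype(int)))
        population = parents + children
    return max(population, key=lambda m: fitness(m, target_pvalues))

# "Real" data stand-in, then GA refinement of a Monte Carlo initial population
a, b = rng.uniform(0.8, 2.0, 30), rng.normal(size=30)
real = monte_carlo_matrix(1000, a, b)
population = [monte_carlo_matrix(1000, a, b) for _ in range(20)]
best = ga_refine(population, real.mean(axis=0))
print("RMSE of item p-values after GA:", -fitness(best, real.mean(axis=0)))
```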


2005 ◽  
Author(s):  
Yanyan Sheng

As item response theory models gain increased popularity in large-scale educational and measurement testing situations, many studies have been conducted on the development and application of unidimensional and multidimensional models. However, to date, no study has looked at models in the IRT framework with an overall ability dimension underlying all test items and several ability dimensions specific to each subtest. This study proposes such a model and compares it with conventional IRT models using Bayesian methodology. The results suggest that the proposed model offers a better way to represent test situations that existing models do not capture. The model specifications for the proposed model also have implications for test developers in test design. In addition, the proposed IRT model can be applied in other areas, such as intelligence or psychology, among others.
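
A short data-generating sketch of the kind of structure described above, in which every item loads on an overall ability and on one subtest-specific ability. The logistic parameterization and the loading ranges are assumptions for illustration; the dissertation's exact specification (and its Bayesian estimation) may differ.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_general_specific(n_persons=1000, items_per_subtest=(10, 10, 10)):
    """Simulate binary responses where each item loads on an overall ability and
    on one subtest-specific ability (a bifactor-like structure, for illustration)."""
    K = len(items_per_subtest)
    theta_g = rng.normal(size=n_persons)        # overall ability
    theta_s = rng.normal(size=(n_persons, K))   # subtest-specific abilities
    blocks = []
    for k, J in enumerate(items_per_subtest):
        a_g = rng.uniform(0.8, 1.5, J)          # loadings on the overall dimension
        a_s = rng.uniform(0.5, 1.2, J)          # loadings on subtest k only
        b = rng.normal(size=J)
        z = a_g * theta_g[:, None] + a_s * theta_s[:, [k]] - b
        blocks.append((rng.random(z.shape) < 1 / (1 + np.exp(-z))).astype(int))
    return np.hstack(blocks)

responses = simulate_general_specific()
print(responses.shape)   # (1000, 30)
```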


1993 ◽  
Vol 18 (1) ◽  
pp. 41-68 ◽  
Author(s):  
Ratna Nandakumar ◽  
William Stout

This article provides a detailed investigation of Stout’s statistical procedure (the computer program DIMTEST) for testing the hypothesis that an essentially unidimensional latent trait model fits observed binary item response data from a psychological test. One finding was that DIMTEST may fail to perform as desired in the presence of guessing when coupled with many highly discriminating items. A revision of DIMTEST is proposed to overcome this limitation. Also, an automatic approach is devised to determine the size of the assessment subtests. Further, an adjustment is made to the estimated standard error of the statistic on which DIMTEST depends. These three refinements have led to an improved procedure that is shown in simulation studies to adhere closely to the nominal level of significance while achieving considerably greater power. Finally, DIMTEST is validated on a selection of real data sets.


2016 ◽  
Vol 76 (6) ◽  
pp. 954-975 ◽  
Author(s):  
Dimiter M. Dimitrov

This article describes an approach to test scoring, referred to as delta scoring (D-scoring), for tests with dichotomously scored items. The D-scoring uses information from item response theory (IRT) calibration to facilitate computations and interpretations in the context of large-scale assessments. The D-score is computed from the examinee’s response vector, weighted by the expected difficulties (not “easiness”) of the test items. The expected difficulty of each item is obtained as an analytic function of its IRT parameters. The D-scores are independent of the sample of test-takers because they are based on expected item difficulties. It is shown that the D-scale performs considerably better than the IRT logit scale by criteria of scale intervalness. To equate D-scales, it is sufficient to rescale the item parameters, thus avoiding the tedious and error-prone procedures of mapping test characteristic curves under the method of IRT true-score equating, which is often used in the practice of large-scale testing. The proposed D-scaling has proved promising in its current piloting with large-scale assessments, and the hope is that it can efficiently complement IRT procedures in the practice of large-scale testing in education and psychology.
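
A hedged sketch of the scoring computation described above: here the expected difficulty of each item is taken as one minus its expected proportion correct under a standard normal ability distribution (a function of the item's IRT parameters), and the D-score is the difficulty-weighted share of the maximum possible weighted score, which lies in [0, 1]. The exact analytic weighting used by Dimitrov (2016) may differ in detail; this only shows the shape of the calculation.

```python
import numpy as np

def expected_difficulty(a, b, c=0.0, n_quad=61):
    """Expected difficulty of a 3PL item: 1 - E[P(theta)] with theta ~ N(0, 1)."""
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)
    w = w / w.sum()
    p = c + (1 - c) / (1 + np.exp(-a * (z - b)))
    return 1.0 - np.sum(w * p)

def d_score(x, deltas):
    """Difficulty-weighted proportion of the maximum weighted score (in [0, 1])."""
    x, deltas = np.asarray(x, float), np.asarray(deltas, float)
    return np.sum(x * deltas) / np.sum(deltas)

# Illustrative 5-item test
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b = np.array([-1.0, -0.3, 0.0, 0.6, 1.2])
deltas = np.array([expected_difficulty(ai, bi) for ai, bi in zip(a, b)])
print("expected difficulties:", np.round(deltas, 3))
print("D-score for response vector [1,1,1,0,0]:", round(d_score([1, 1, 1, 0, 0], deltas), 3))
```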


2017 ◽  
Vol 42 (4) ◽  
pp. 467-490 ◽  
Author(s):  
Minjeong Jeon ◽  
Paul De Boeck ◽  
Wim van der Linden

We present a novel application of a generalized item response tree model to investigate test takers’ answer change behavior. The model allows us to simultaneously model the observed patterns of the initial and final responses after an answer change as a function of a set of latent traits and item parameters. The proposed application is illustrated with large-scale mathematics test items. We also describe how the estimated results can be used to study the benefits of answer change and to further detect potential academic cheating.
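
As a hedged sketch of the data preparation behind an item response tree analysis of answer changes: each person-item record can be expanded into pseudo-items for the tree's internal nodes, which are then fit with standard IRT software. The two-node coding below (was the initial response correct; given a change, did the final response end up correct) is one plausible coding chosen for illustration, not necessarily the tree used in the article.

```python
import numpy as np
import pandas as pd

def expand_answer_change(initial_correct, final_correct, changed):
    """Map (initial, final, changed) per person-item record to pseudo-item responses.
    Node 1: correctness of the initial response (always observed).
    Node 2: defined only when the answer was changed: 1 if the final response is
    correct, 0 otherwise; NaN (structurally missing) when no change occurred."""
    node1 = initial_correct.astype(float)
    node2 = np.where(changed == 1, final_correct.astype(float), np.nan)
    return pd.DataFrame({"node1_initial_correct": node1,
                         "node2_change_ends_correct": node2})

# Tiny hypothetical record set: columns are initial, final, changed
records = np.array([[0, 1, 1],   # wrong -> changed -> right
                    [1, 1, 0],   # right, kept
                    [1, 0, 1],   # right -> changed -> wrong
                    [0, 0, 0]])  # wrong, kept
pseudo = expand_answer_change(records[:, 0], records[:, 1], records[:, 2])
print(pseudo)
```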

