An Approach to Scoring and Equating Tests With Binary Items

2016 ◽ Vol 76 (6) ◽ pp. 954-975
Author(s): Dimiter M. Dimitrov

This article describes an approach to test scoring, referred to as delta scoring (D-scoring), for tests with dichotomously scored items. D-scoring uses information from item response theory (IRT) calibration to facilitate computations and interpretations in the context of large-scale assessments. The D-score is computed from the examinee's response vector weighted by the expected difficulties (not "easiness") of the test items, where the expected difficulty of each item is obtained as an analytic function of its IRT parameters. Because they are based on expected item difficulties, D-scores are independent of the sample of test-takers. It is shown that the D-scale performs better than the IRT logit scale by criteria of scale intervalness. To equate D-scales, it is sufficient to rescale the item parameters, thus avoiding the tedious and error-prone mapping of test characteristic curves required under IRT true-score equating, which is often used in the practice of large-scale testing. The proposed D-scaling has proved promising in its current piloting with large-scale assessments, and the hope is that it can efficiently complement IRT procedures in large-scale testing in education and psychology.
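As a rough sketch of the scoring idea (assumed forms, not the article's exact equations): if each item's expected difficulty is taken as delta_i = 1 - E[P_i(theta)] under a standard normal ability distribution, a D-score can be computed as the difficulty-weighted proportion of correct responses.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def expected_difficulty(a, b, n_nodes=61):
    """delta = 1 - E[P(theta)] over a standard normal ability
    distribution, approximated on a weighted grid (assumed form)."""
    theta = np.linspace(-4.0, 4.0, n_nodes)
    w = np.exp(-0.5 * theta ** 2)
    w /= w.sum()                      # normalize to a discrete N(0, 1)
    return 1.0 - float(np.sum(w * p_2pl(theta, a, b)))

def d_score(responses, a, b):
    """Difficulty-weighted proportion of correct responses in [0, 1]."""
    delta = np.array([expected_difficulty(ai, bi) for ai, bi in zip(a, b)])
    return float(responses @ delta / delta.sum())

# Hypothetical three-item test
a = [1.2, 0.8, 1.5]          # discriminations
b = [-0.5, 0.0, 1.0]         # difficulties
u = np.array([1, 1, 0])      # examinee's scored response vector
print(round(d_score(u, a, b), 3))
```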


2017 ◽ Vol 42 (4) ◽ pp. 467-490
Author(s): Minjeong Jeon ◽ Paul De Boeck ◽ Wim van der Linden

We present a novel application of a generalized item response tree model to investigate test takers' answer-change behavior. The model allows us to simultaneously analyze the observed patterns of initial and final responses after an answer change as a function of a set of latent traits and item parameters. The proposed application is illustrated with large-scale mathematics test items. We also describe how the estimated results can be used to study the benefits of answer change and to detect potential academic cheating.
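As a rough illustration of the tree idea (a simplified, assumed structure, not the authors' exact parameterization), each scored (initial, final) response pair can be decomposed into binary pseudo-items, one per internal node of the tree, which are then modeled with IRT:

```python
def tree_pseudo_items(initial, final):
    """Decompose a scored (initial, final) response pair into binary
    pseudo-items for a simple answer-change tree (assumed structure):
    node 1: was the initial response correct?
    node 2: was the answer changed?
    node 3: if changed, was the final response correct? (else missing)
    """
    changed = int(initial != final)  # simplification for dichotomous scores
    node3 = final if changed else None
    return initial, changed, node3

print(tree_pseudo_items(0, 1))  # wrong -> right: (0, 1, 1)
print(tree_pseudo_items(1, 1))  # kept right:     (1, 0, None)
```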



2014 ◽ Vol 22 (1) ◽ pp. 94-105
Author(s): Mohsen Tavakol ◽ Mohammad Rahimi-Madiseh ◽ Reg Dennick

Background and Purpose: Although the importance of item response theory (IRT) has been emphasized in health and medical education, in practice, few psychometricians in nurse education have used these methods to create tests that discriminate well at any level of student ability. The purpose of this study is to evaluate the psychometric properties of a real objective test using three-parameter IRT. Methods: Three-parameter IRT was used to monitor and improve the quality of the test items. Results: Item parameter indices, item characteristic curves (ICCs), test information functions, and test characteristic curves revealed aberrant items that do not assess the construct being measured. Conclusions: The results of this study provide useful information for educators to improve the quality of assessment, teaching strategies, and curricula.
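For reference, the three-parameter logistic (3PL) model underlying this kind of analysis has the standard form P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))). A minimal sketch of the ICC and item information computations:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL item characteristic curve with discrimination a,
    difficulty b, and pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

theta = np.linspace(-3, 3, 7)
print(p_3pl(theta, a=1.0, b=0.0, c=0.2).round(3))
print(item_information_3pl(theta, a=1.0, b=0.0, c=0.2).round(3))
```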



2021
Author(s): Marc Diederichs ◽ Timo Friedel Mitze ◽ Felix Schulz ◽ Klaus Waelde

The city of Augustusburg allowed, inter alia, restaurants and hotels to open, in conjunction with large-scale testing. We evaluate this testing-and-opening (T&O) experiment by comparing the evolution of case rates in Augustusburg with the evolution in other communities of Saxony. We have access to small-scale SARS-CoV-2 infection data at the community level (Gemeinde) rather than the county level (Landkreis) usually used for disease surveillance. Despite data challenges, we conclude that T&O did not lead to any increase in case rates in Augustusburg compared to its control group. When we measure the effect of T&O on cumulative cases, we find a small increase in Augustusburg; this difference almost completely disappears once we control for the higher case rates that result from more testing. Generally speaking, T&O worked much better than in comparable projects elsewhere.
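A minimal sketch of the comparison logic, with hypothetical numbers (the study's actual estimator and data are more involved): a difference-in-differences contrast of case rates before and after T&O between the treated community and its control group.

```python
# Hypothetical weekly case rates per 100,000 (not the study's data)
treated_pre, treated_post = 120.0, 95.0   # Augustusburg
control_pre, control_post = 110.0, 90.0   # control communities

# Difference-in-differences: change in treated minus change in control
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"DiD estimate: {did:+.1f} cases per 100,000")  # -5.0
```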



Foundations ◽ 2021 ◽ Vol 1 (1) ◽ pp. 116-144
Author(s): Alexander Robitzsch

This article investigates the comparison of two groups based on the two-parameter logistic item response model. It is assumed that there is random differential item functioning (DIF) in item difficulties and item discriminations. The group difference is estimated using separate calibration with subsequent linking, as well as concurrent calibration. The following linking methods are compared: mean-mean linking, log-mean-mean linking, invariance alignment, Haberman linking, asymmetric and symmetric Haebara linking, different recalibration linking methods, anchored item parameters, and concurrent calibration. It is analytically shown that log-mean-mean linking and mean-mean linking provide consistent estimates if the random DIF effects have zero means. The performance of the linking methods was evaluated through a simulation study. It turned out that (log-)mean-mean and Haberman linking performed best, followed by symmetric Haebara linking and a newly proposed recalibration linking method. Interestingly, linking methods frequently found in applications (i.e., asymmetric Haebara linking, recalibration linking used in a variant in current large-scale assessment studies, anchored item parameters, concurrent calibration) perform worse in the presence of random DIF. In line with the previous literature, differences between linking methods turned out to be negligible in the absence of random DIF. The different linking methods were also applied in an empirical example that performed a linking of PISA 2006 to PISA 2009 for Austrian students. This application showed that estimated trends in the means and standard deviations depended on the chosen linking method and the employed item response model.
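As a rough illustration of the simplest of these methods (a standard textbook form of mean-mean linking; the study's variants differ in detail), two separately calibrated 2PL parameter sets can be placed on a common scale by matching the means of the common items' discriminations and difficulties:

```python
import numpy as np

def mean_mean_linking(a_x, b_x, a_y, b_y):
    """Link scale Y onto scale X via theta_X = A * theta_Y + B,
    using the means of the common items' parameters."""
    A = np.mean(a_y) / np.mean(a_x)
    B = np.mean(b_x) - A * np.mean(b_y)
    return A, B

# Hypothetical common-item parameters from two separate calibrations
a_x = np.array([1.0, 1.2, 0.8]); b_x = np.array([-0.2, 0.5, 1.1])
a_y = np.array([1.1, 1.3, 0.9]); b_y = np.array([-0.6, 0.1, 0.7])
A, B = mean_mean_linking(a_x, b_x, a_y, b_y)
print(f"A = {A:.3f}, B = {B:.3f}")

# Transform Y's parameters onto X's scale
a_y_on_x = a_y / A
b_y_on_x = A * b_y + B
```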



2005
Author(s): Yanyan Sheng

As item response theory (IRT) models gain popularity in large-scale educational and measurement testing, many studies have been conducted on the development and application of unidimensional and multidimensional models. To date, however, no study has examined IRT models with an overall ability dimension underlying all test items together with specific ability dimensions for each subtest. This study proposes such a model and compares it with conventional IRT models using Bayesian methodology. The results suggest that the proposed model offers a better way to represent test situations not captured by existing models. The specification of the proposed model also has implications for test developers on test design. In addition, the proposed IRT model can be applied in other areas, such as intelligence or psychology, among others.
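A minimal sketch of such a structure (an assumed additive form, not necessarily the dissertation's exact specification): each item loads on the overall ability and on the specific ability of its subtest.

```python
import numpy as np

def p_general_specific(theta_g, theta_s, a_g, a_s, b):
    """Item response probability with a general ability theta_g and a
    subtest-specific ability theta_s (additive logistic form, assumed)."""
    return 1.0 / (1.0 + np.exp(-(a_g * theta_g + a_s * theta_s - b)))

# Hypothetical: examinee with high overall but low subtest-specific ability
print(round(p_general_specific(theta_g=1.0, theta_s=-0.5,
                               a_g=1.2, a_s=0.7, b=0.3), 3))
```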



2019 ◽ Vol 44 (3) ◽ pp. 215-218
Author(s): Kyung Yong Kim ◽ Uk Hyun Cho

Item response theory (IRT) true-score equating for the bifactor model is often conducted by first numerically integrating out specific factors from the item response function and then applying the unidimensional IRT true-score equating method to the marginalized bifactor model. However, an alternative procedure for obtaining the marginalized bifactor model is through projecting the nuisance dimensions of the bifactor model onto the dominant dimension. Projection, which can be viewed as an approximation to numerical integration, has an advantage over numerical integration in providing item parameters for the marginalized bifactor model; therefore, projection could be used with existing equating software packages that require item parameters. In this paper, IRT true-score equating results obtained with projection are compared to those obtained with numerical integration. Simulation results show that the two procedures provide very similar equating results.
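A minimal sketch of the numerical-integration route (the projection formulas are given in the paper): the specific factor of a bifactor 2PL item is integrated out over a standard normal distribution with Gauss-Hermite quadrature, yielding a marginal response function of the general factor alone.

```python
import numpy as np

def p_bifactor(theta_g, theta_s, a_g, a_s, b):
    """Bifactor 2PL item response function (general + specific factor)."""
    return 1.0 / (1.0 + np.exp(-(a_g * theta_g + a_s * theta_s - b)))

def p_marginal(theta_g, a_g, a_s, b, n_nodes=21):
    """Integrate the specific factor out over N(0, 1) using
    Gauss-Hermite quadrature (probabilists' weight function)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / weights.sum()   # normalize to N(0, 1) expectation
    return float(np.sum(weights * p_bifactor(theta_g, nodes, a_g, a_s, b)))

print(round(p_marginal(theta_g=0.5, a_g=1.3, a_s=0.8, b=0.2), 3))
```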



2016 ◽ Vol 16 (1) ◽ pp. 28-39
Author(s): Sugiharto Sugiharto

This study examines differences in score fairness, based on item response theory, between two scoring models applied to students' answers in Junior High Schools throughout the city of Palangka Raya. The sample was drawn from 17 state and private Junior High Schools across the city. The students' answers were scored under two models: a punishment (penalty) score and a correct score. Before the test was administered, it was validated both for content and against empirical data; of the 40 test items, 30 proved valid. The proportion of fair scores was estimated with the BILOG-MG program, and the data were analyzed with a two-group difference-of-proportions (Z) test. The analysis yielded Z = -2.806 against a critical value of -1.65, so Z falls outside the acceptance region of H0, indicating a significant difference in proportions between the two scoring models. It can be concluded that students scored under the punishment-score model have a better fairness index than students scored under the correct-score model.
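For reference, the two-group difference-of-proportions test has the standard pooled form below (the counts are hypothetical; the abstract does not report the underlying proportions):

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample Z test for a difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts of 'fair' scores in each scoring-model group
z = two_proportion_z(x1=140, n1=400, x2=180, n2=400)
print(round(z, 3))  # compare to the one-tailed critical value -1.65
```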


