An Approach to Scoring and Equating Tests With Binary Items

2016 ◽ Vol 76 (6) ◽ pp. 954-975
Author(s): Dimiter M. Dimitrov

This article describes an approach to test scoring, referred to as delta scoring (D-scoring), for tests with dichotomously scored items. D-scoring uses information from item response theory (IRT) calibration to facilitate computations and interpretations in the context of large-scale assessments. The D-score is computed from the examinee's response vector weighted by the expected difficulties (not "easiness") of the test items, where the expected difficulty of each item is obtained as an analytic function of its IRT parameters. Because they are based on expected item difficulties, D-scores are independent of the sample of test-takers. It is shown that the D-scale performs better than the IRT logit scale by criteria of scale intervalness. To equate D-scales, it is sufficient to rescale the item parameters, thus avoiding the tedious and error-prone mapping of test characteristic curves required under IRT true-score equating, which is often used in the practice of large-scale testing. The proposed D-scaling has proved promising in its current piloting with large-scale assessments, and the hope is that it can efficiently complement IRT procedures in large-scale testing in education and psychology.
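As a rough sketch of the scoring idea (assumed forms, not the article's exact equations): if each item's expected difficulty is taken as delta_i = 1 - E[P_i(theta)] under a standard normal ability distribution, a D-score can be computed as the difficulty-weighted proportion of correct responses.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def expected_difficulty(a, b, n_nodes=61):
    """delta = 1 - E[P(theta)] over a standard normal ability
    distribution, approximated on a weighted grid (assumed form)."""
    theta = np.linspace(-4.0, 4.0, n_nodes)
    w = np.exp(-0.5 * theta ** 2)
    w /= w.sum()                      # normalize to a discrete N(0, 1)
    return 1.0 - float(np.sum(w * p_2pl(theta, a, b)))

def d_score(responses, a, b):
    """Difficulty-weighted proportion of correct responses in [0, 1]."""
    delta = np.array([expected_difficulty(ai, bi) for ai, bi in zip(a, b)])
    return float(responses @ delta / delta.sum())

# Hypothetical three-item test
a = [1.2, 0.8, 1.5]          # discriminations
b = [-0.5, 0.0, 1.0]         # difficulties
u = np.array([1, 1, 0])      # examinee's scored response vector
print(round(d_score(u, a, b), 3))
```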


2017 ◽ Vol 42 (4) ◽ pp. 467-490
Author(s): Minjeong Jeon ◽ Paul De Boeck ◽ Wim van der Linden

We present a novel application of a generalized item response tree model to investigate test takers' answer-change behavior. The model allows us to simultaneously analyze the observed patterns of initial and final responses after an answer change as a function of a set of latent traits and item parameters. The proposed application is illustrated with large-scale mathematics test items. We also describe how the estimated results can be used to study the benefits of answer change and to detect potential academic cheating.
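As a rough illustration of the tree idea (a simplified, assumed structure, not the authors' exact parameterization), each scored (initial, final) response pair can be decomposed into binary pseudo-items, one per internal node of the tree, which are then modeled with IRT:

```python
def tree_pseudo_items(initial, final):
    """Decompose a scored (initial, final) response pair into binary
    pseudo-items for a simple answer-change tree (assumed structure):
    node 1: was the initial response correct?
    node 2: was the answer changed?
    node 3: if changed, was the final response correct? (else missing)
    """
    changed = int(initial != final)  # simplification for dichotomous scores
    node3 = final if changed else None
    return initial, changed, node3

print(tree_pseudo_items(0, 1))  # wrong -> right: (0, 1, 1)
print(tree_pseudo_items(1, 1))  # kept right:     (1, 0, None)
```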



2014 ◽ Vol 22 (1) ◽ pp. 94-105
Author(s): Mohsen Tavakol ◽ Mohammad Rahimi-Madiseh ◽ Reg Dennick

Background and Purpose: Although the importance of item response theory (IRT) has been emphasized in health and medical education, in practice, few psychometricians in nurse education have used these methods to create tests that discriminate well at any level of student ability. The purpose of this study is to evaluate the psychometric properties of a real objective test using three-parameter IRT. Methods: Three-parameter IRT was used to monitor and improve the quality of the test items. Results: Item parameter indices, item characteristic curves (ICCs), test information functions, and test characteristic curves revealed aberrant items that do not assess the construct being measured. Conclusions: The results of this study provide useful information for educators to improve the quality of assessment, teaching strategies, and curricula.
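For reference, the three-parameter logistic (3PL) model underlying this kind of analysis has the standard form P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))). A minimal sketch of the ICC and item information computations:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL item characteristic curve with discrimination a,
    difficulty b, and pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

theta = np.linspace(-3, 3, 7)
print(p_3pl(theta, a=1.0, b=0.0, c=0.2).round(3))
print(item_information_3pl(theta, a=1.0, b=0.0, c=0.2).round(3))
```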



2021
Author(s): Marc Diederichs ◽ Timo Friedel Mitze ◽ Felix Schulz ◽ Klaus Waelde

The city of Augustusburg allowed, inter alia, restaurants and hotels to open, in conjunction with large-scale testing. We evaluate this testing-and-opening (T&O) experiment by comparing the evolution of case rates in Augustusburg with the evolution in other communities of Saxony. We have access to small-scale SARS-CoV-2 infection data at the community level (Gemeinde) rather than the county level (Landkreis) usually used for disease surveillance. Despite data challenges, we conclude that T&O did not lead to any increase in case rates in Augustusburg compared to its control group. When we measure the effect of T&O on cumulative cases, we find a small increase in Augustusburg; this difference almost completely disappears once we control for the higher case rates that result from more testing. Generally speaking, T&O worked much better than in comparable projects elsewhere.
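A minimal sketch of the comparison logic, with hypothetical numbers (the study's actual estimator and data are more involved): a difference-in-differences contrast of case rates before and after T&O between the treated community and its control group.

```python
# Hypothetical weekly case rates per 100,000 (not the study's data)
treated_pre, treated_post = 120.0, 95.0   # Augustusburg
control_pre, control_post = 110.0, 90.0   # control communities

# Difference-in-differences: change in treated minus change in control
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"DiD estimate: {did:+.1f} cases per 100,000")  # -5.0
```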



Foundations ◽ 2021 ◽ Vol 1 (1) ◽ pp. 116-144
Author(s): Alexander Robitzsch

This article investigates the comparison of two groups based on the two-parameter logistic item response model. It is assumed that there is random differential item functioning (DIF) in item difficulties and item discriminations. The group difference is estimated using separate calibration with subsequent linking, as well as concurrent calibration. The following linking methods are compared: mean-mean linking, log-mean-mean linking, invariance alignment, Haberman linking, asymmetric and symmetric Haebara linking, different recalibration linking methods, anchored item parameters, and concurrent calibration. It is analytically shown that log-mean-mean linking and mean-mean linking provide consistent estimates if the random DIF effects have zero means. The performance of the linking methods was evaluated through a simulation study. It turned out that (log-)mean-mean and Haberman linking performed best, followed by symmetric Haebara linking and a newly proposed recalibration linking method. Interestingly, linking methods frequently found in applications (i.e., asymmetric Haebara linking, recalibration linking used in a variant in current large-scale assessment studies, anchored item parameters, concurrent calibration) perform worse in the presence of random DIF. In line with the previous literature, differences between linking methods turned out to be negligible in the absence of random DIF. The different linking methods were also applied in an empirical example that performed a linking of PISA 2006 to PISA 2009 for Austrian students. This application showed that estimated trends in the means and standard deviations depended on the chosen linking method and the employed item response model.
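As a rough illustration of the simplest of these methods (a standard textbook form of mean-mean linking; the study's variants differ in detail), two separately calibrated 2PL parameter sets can be placed on a common scale by matching the means of the common items' discriminations and difficulties:

```python
import numpy as np

def mean_mean_linking(a_x, b_x, a_y, b_y):
    """Link scale Y onto scale X via theta_X = A * theta_Y + B,
    using the means of the common items' parameters."""
    A = np.mean(a_y) / np.mean(a_x)
    B = np.mean(b_x) - A * np.mean(b_y)
    return A, B

# Hypothetical common-item parameters from two separate calibrations
a_x = np.array([1.0, 1.2, 0.8]); b_x = np.array([-0.2, 0.5, 1.1])
a_y = np.array([1.1, 1.3, 0.9]); b_y = np.array([-0.6, 0.1, 0.7])
A, B = mean_mean_linking(a_x, b_x, a_y, b_y)
print(f"A = {A:.3f}, B = {B:.3f}")

# Transform Y's parameters onto X's scale
a_y_on_x = a_y / A
b_y_on_x = A * b_y + B
```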



2005
Author(s): Yanyan Sheng

As item response theory (IRT) models gain popularity in large-scale educational and measurement testing, many studies have been conducted on the development and application of unidimensional and multidimensional models. To date, however, no study has examined IRT models with an overall ability dimension underlying all test items together with specific ability dimensions for each subtest. This study proposes such a model and compares it with conventional IRT models using Bayesian methodology. The results suggest that the proposed model offers a better way to represent test situations not captured by existing models. The specification of the proposed model also has implications for test developers on test design. In addition, the proposed IRT model can be applied in other areas, such as intelligence or psychology, among others.
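A minimal sketch of such a structure (an assumed additive form, not necessarily the dissertation's exact specification): each item loads on the overall ability and on the specific ability of its subtest.

```python
import numpy as np

def p_general_specific(theta_g, theta_s, a_g, a_s, b):
    """Item response probability with a general ability theta_g and a
    subtest-specific ability theta_s (additive logistic form, assumed)."""
    return 1.0 / (1.0 + np.exp(-(a_g * theta_g + a_s * theta_s - b)))

# Hypothetical: examinee with high overall but low subtest-specific ability
print(round(p_general_specific(theta_g=1.0, theta_s=-0.5,
                               a_g=1.2, a_s=0.7, b=0.3), 3))
```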



2019 ◽ Vol 44 (3) ◽ pp. 215-218
Author(s): Kyung Yong Kim ◽ Uk Hyun Cho

Item response theory (IRT) true-score equating for the bifactor model is often conducted by first numerically integrating out specific factors from the item response function and then applying the unidimensional IRT true-score equating method to the marginalized bifactor model. However, an alternative procedure for obtaining the marginalized bifactor model is through projecting the nuisance dimensions of the bifactor model onto the dominant dimension. Projection, which can be viewed as an approximation to numerical integration, has an advantage over numerical integration in providing item parameters for the marginalized bifactor model; therefore, projection could be used with existing equating software packages that require item parameters. In this paper, IRT true-score equating results obtained with projection are compared to those obtained with numerical integration. Simulation results show that the two procedures provide very similar equating results.
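A minimal sketch of the numerical-integration route (the projection formulas are given in the paper): the specific factor of a bifactor 2PL item is integrated out over a standard normal distribution with Gauss-Hermite quadrature, yielding a marginal response function of the general factor alone.

```python
import numpy as np

def p_bifactor(theta_g, theta_s, a_g, a_s, b):
    """Bifactor 2PL item response function (general + specific factor)."""
    return 1.0 / (1.0 + np.exp(-(a_g * theta_g + a_s * theta_s - b)))

def p_marginal(theta_g, a_g, a_s, b, n_nodes=21):
    """Integrate the specific factor out over N(0, 1) using
    Gauss-Hermite quadrature (probabilists' weight function)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / weights.sum()   # normalize to N(0, 1) expectation
    return float(np.sum(weights * p_bifactor(theta_g, nodes, a_g, a_s, b)))

print(round(p_marginal(theta_g=0.5, a_g=1.3, a_s=0.8, b=0.2), 3))
```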



2016 ◽ Vol 16 (1) ◽ pp. 28-39
Author(s): Sugiharto Sugiharto

This study examines differences in score fairness, based on item response theory, between two scoring models applied to students' answers in Junior High Schools throughout the city of Palangka Raya. The sample was drawn from 17 state and private Junior High Schools across the city. The students' answers were scored under two models: a punishment (penalty) score and a correct score. Before the test was administered, it was validated both for content and against empirical data; of the 40 test items, 30 proved valid. The proportion of fair scores was estimated with the BILOG-MG program, and the data were analyzed with a two-group difference-of-proportions (Z) test. The analysis yielded Z = -2.806 against a critical value of -1.65, so Z falls outside the acceptance region of H0, indicating a significant difference in proportions between the two scoring models. It can be concluded that students scored under the punishment-score model have a better fairness index than students scored under the correct-score model.
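For reference, the two-group difference-of-proportions test has the standard pooled form below (the counts are hypothetical; the abstract does not report the underlying proportions):

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample Z test for a difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts of 'fair' scores in each scoring-model group
z = two_proportion_z(x1=140, n1=400, x2=180, n2=400)
print(round(z, 3))  # compare to the one-tailed critical value -1.65
```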


