On the Far Reaching Effects of Using Biased Estimates of Score Reliability: An Examination of the Problem in 20 Data Analyses

2001 ◽  
Vol 89 (2) ◽  
pp. 291-307 ◽  
Author(s):  
Gilbert Becker

Violation of either of two basic assumptions in classical test theory may lead to biased estimates of reliability. Violation of the assumption of essential tau-equivalence may produce underestimates, and the presence of correlated errors among measurement units may result in overestimates. Many researchers do not fully appreciate how commonly the circumstances producing this problem arise. This article surveys a variety of settings in which biased reliability estimates may be found, in an effort to increase awareness of the prevalence of the problem.
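Both biases the abstract describes can be shown with simple population covariance algebra. The sketch below is illustrative only (the loadings and error variances are arbitrary choices, not values from the article): unequal loadings violate essential tau-equivalence and make coefficient alpha an underestimate, while a positive error covariance makes alpha an overestimate.

```python
import numpy as np

def alpha_from_cov(cov):
    """Cronbach's alpha computed from a population covariance matrix."""
    k = cov.shape[0]
    return k / (k - 1) * (1 - np.trace(cov) / cov.sum())

lam = np.array([0.3, 0.5, 0.9])    # unequal loadings: tau-equivalence violated
theta = np.ones(3)                 # uncorrelated unit error variances

cov = np.outer(lam, lam) + np.diag(theta)
true_rel = lam.sum() ** 2 / cov.sum()        # reliability of the sum score
print(alpha_from_cov(cov) < true_rel)        # alpha underestimates: True

# Now add a correlated error between items 1 and 2: alpha exceeds the
# (now lower) true reliability.
cov_ce = cov.copy()
cov_ce[0, 1] += 1.0
cov_ce[1, 0] += 1.0
true_rel_ce = lam.sum() ** 2 / cov_ce.sum()
print(alpha_from_cov(cov_ce) > true_rel_ce)  # alpha overestimates: True
```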

2001 ◽  
Vol 89 (2) ◽  
pp. 403-424 ◽  
Author(s):  
Gilbert Becker

Two assumptions in classical test theory, essential tau-equivalence and independence of measurement errors, may when violated produce attenuated or inflated estimates of reliability, respectively. Inflation stemming from correlated errors can be controlled by a procedure in which systematically created equivalent halves of a given measuring instrument are administered across two occasions. When poor approximations to equivalent halves are constructed for this purpose, however, distortion in the opposite direction may result, sometimes quite large when measuring instruments are not essentially tau-equivalent (or, at the practical level, unidimensional). The nature of these decrements is discussed and illustrated, and a number of procedures for eliminating them are introduced.
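The half-test procedure described above is typically stepped up with the Spearman-Brown formula; a minimal sketch (the correlations below are hypothetical) shows how a weaker correlation between poorly matched halves deflates the full-length estimate:

```python
def spearman_brown(r_half):
    """Step a half-test correlation up to full-length reliability:
    rel = 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)

# Well-matched equivalent halves administered across two occasions
print(round(spearman_brown(0.80), 3))  # 0.889

# Poorly matched halves correlate less, deflating the estimate
print(round(spearman_brown(0.60), 3))  # 0.75
```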


2014 ◽  
Vol 35 (4) ◽  
pp. 250-261 ◽  
Author(s):  
Matthias Ziegler ◽  
Arthur Poropat ◽  
Julija Mell

Short personality questionnaires are increasingly used in research and practice, with some scales including as few as two to five items per personality domain. Despite the frequency of their use, these short scales are often criticized on the basis of their reduced internal consistencies and their purported failure to assess the breadth of broad constructs, such as the Big 5 factors of personality. One reason for this might be the use of principles rooted in Classical Test Theory during test construction. In this study, Generalizability Theory is used to compare psychometric properties of different scales based on the NEO-PI-R and BFI, two widely used personality questionnaire families. Applying both Classical Test Theory (CTT) and Generalizability Theory (GT) made it possible to identify the inner workings of test shortening. CTT-based analyses indicated that longer is generally better for reliability, whereas GT differentiated between reliability for relative and absolute decisions and revealed how different variance sources affect test score reliability estimates. These variance sources differed with scale length, and only GT allowed clear description of these internal consequences, allowing more effective identification of the advantages and disadvantages of shorter and longer scales. Most importantly, the findings highlight the potential error-proneness of focusing solely on reliability and scale length in test construction. Practical as well as theoretical consequences are discussed.
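The relative/absolute distinction in GT can be sketched with the standard coefficients for a crossed person-by-item design; this is an illustrative sketch, not the study's analysis, and the variance components below are hypothetical:

```python
def g_coefficient(var_person, var_person_item, n_items):
    """Generalizability coefficient for relative decisions (p x i design):
    person variance over person variance plus mean interaction/error."""
    return var_person / (var_person + var_person_item / n_items)

def phi_coefficient(var_person, var_item, var_person_item, n_items):
    """Dependability (phi) coefficient for absolute decisions:
    item main-effect variance also counts as error."""
    return var_person / (var_person + (var_item + var_person_item) / n_items)

# Halving a 10-item scale lowers both coefficients (hypothetical components)
print(round(g_coefficient(1.0, 2.0, 10), 3))        # 0.833
print(round(g_coefficient(1.0, 2.0, 5), 3))         # 0.714
print(round(phi_coefficient(1.0, 0.5, 2.0, 10), 3)) # 0.8
```

For absolute decisions the phi coefficient can never exceed the corresponding G coefficient, since it adds the item variance to the error term.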


2020 ◽  
Author(s):  
Peter E Clayson ◽  
Scott Baldwin ◽  
Michael J. Larson

In studies of event-related brain potentials (ERPs), difference scores between conditions in a task are frequently used to isolate neural activity for use as a dependent or independent variable. Adequate score reliability is a prerequisite for studies examining relationships between ERPs and external correlates, but there is a widely held view that difference scores are inherently unreliable and unsuitable for studies of individual differences. This view fails to consider the nuances of difference score reliability that are relevant to ERP research. In the present study, we provide formulas from classical test theory and generalizability theory for estimating the internal consistency of subtraction-based and residualized difference scores. These formulas are then applied to error-related negativity (ERN) and reward positivity (RewP) difference scores from the same sample of 117 participants. Analyses demonstrate that ERN difference scores can be reliable, which supports their use in studies of individual differences. However, RewP difference scores yielded poor reliability due to the high correlation between the constituent reward and non-reward ERPs. Findings emphasize that difference score reliability largely depends on the internal consistency of constituent scores and the correlation between those scores. Furthermore, generalizability theory estimates yielded higher internal consistency estimates for subtraction-based difference scores than classical test theory estimates did. Despite some beliefs that difference scores are inherently unreliable, ERP difference scores can show adequate reliability and be useful for isolating neural activity in studies of individual differences.
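The dependence of difference score reliability on the constituent reliabilities and their intercorrelation follows the standard CTT formula for subtraction-based difference scores; the values below are hypothetical illustrations, not the study's ERN/RewP estimates:

```python
def diff_score_reliability(r11, r22, r12, sd1=1.0, sd2=1.0):
    """CTT reliability of a subtraction-based difference score X1 - X2."""
    num = sd1 ** 2 * r11 + sd2 ** 2 * r22 - 2 * r12 * sd1 * sd2
    den = sd1 ** 2 + sd2 ** 2 - 2 * r12 * sd1 * sd2
    return num / den

# Reliable constituents, moderate correlation: difference score holds up
print(round(diff_score_reliability(0.90, 0.90, 0.30), 3))  # 0.857

# Same constituent reliabilities, high correlation: reliability collapses
print(round(diff_score_reliability(0.90, 0.90, 0.80), 3))  # 0.5
```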


2021 ◽  
Vol 104 (3) ◽  
pp. 003685042110283 ◽  
Author(s):  
Meltem Yurtcu ◽  
Hülya Kelecioglu ◽  
Edward L Boone

Bayesian Nonparametric (BNP) modelling can be used to obtain more detailed information in test equating studies and to increase equating accuracy by accounting for covariates. In this study, two covariates, one continuous and one discrete, are included in the equating under a BNP model. Equated scores were obtained under a single-group design with a small sample and compared with the mean and linear equating methods of Classical Test Theory. Of the three methods, the equated scores obtained with the BNP model produced a distribution closest to that of the target test. Obtaining good results with small error even in small samples is what makes equating studies valuable, yet including covariates in classical test equating rests on assumptions that cannot be met with small groups. The BNP model is therefore more beneficial than frequentist methods under this limitation. It also yields information about booklets and covariates alongside the equated scores, making it possible to compare subcategories, which can indicate the presence of differential item functioning (DIF). The BNP model can thus be used actively in test equating studies: it permits equating even in a small sample, allows the characteristics of individual participants to be examined at the same time, and yields scores closer to those of the target test.
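The two CTT baselines used for comparison are simple transformations of the score scale; a minimal sketch with hypothetical means and standard deviations (not the study's data):

```python
def mean_equate(x, mu_x, mu_y):
    """Mean equating: shift form-X scores by the difference in means."""
    return x + (mu_y - mu_x)

def linear_equate(x, mu_x, sd_x, mu_y, sd_y):
    """Linear equating: match both mean and standard deviation of form Y."""
    return mu_y + (sd_y / sd_x) * (x - mu_x)

# A raw score of 20 on form X, with hypothetical moments for both forms
print(mean_equate(20, 25.0, 30.0))              # 25.0
print(linear_equate(20, 25.0, 5.0, 30.0, 8.0))  # 22.0
```

Mean equating preserves the score spread; linear equating rescales it, which is why the two can move the same raw score in different directions.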


Assessment ◽  
2021 ◽  
pp. 107319112199416
Author(s):  
Desirée Blázquez-Rincón ◽  
Juan I. Durán ◽  
Juan Botella

A reliability generalization meta-analysis was carried out to estimate the average reliability of the seven-item, 5-point Likert-type Fear of COVID-19 Scale (FCV-19S), one of the most widespread scales developed around the COVID-19 pandemic. Different reliability coefficients from classical test theory and the Rasch Measurement Model were meta-analyzed, heterogeneity among the most reported reliability estimates was examined by searching for moderators, and a predictive model to estimate the expected reliability was proposed. At least one reliability estimate was available for a total of 44 independent samples out of 42 studies, with Cronbach's alpha the most frequently reported coefficient. The coefficients exhibited pooled estimates ranging from .85 to .90. The moderator analyses led to a predictive model in which the standard deviation of scores explained 36.7% of the total variability among alpha coefficients. The FCV-19S has been shown to be consistently reliable regardless of the moderator variables examined.
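One common way to pool alpha coefficients in reliability generalization is Bonett's ln(1 - alpha) transformation with inverse-variance weights; the sketch below assumes that approach with a fixed-effect model and hypothetical sample data, not the meta-analysis actually reported:

```python
import math

def pool_alphas(alphas, ns, k=7):
    """Fixed-effect pooling of Cronbach's alphas via Bonett's
    ln(1 - alpha) transformation; k is the number of items
    (7 for the FCV-19S)."""
    ls = [math.log(1 - a) for a in alphas]
    ws = [((k - 1) * (n - 2)) / (2 * k) for n in ns]  # inverse variances
    l_bar = sum(w * l for w, l in zip(ws, ls)) / sum(ws)
    return 1 - math.exp(l_bar)

# Hypothetical alphas and sample sizes from three studies
print(round(pool_alphas([0.85, 0.88, 0.90], [200, 350, 500]), 3))
```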

