Guidelines for Authors Reporting Score Reliability Estimates

2011 ◽  
pp. 90-101 ◽  
Author(s):  
Bruce Thompson


2001 ◽  
Vol 89 (2) ◽  
pp. 291-307 ◽  
Author(s):  
Gilbert Becker

Violation of either of two basic assumptions in classical test theory may lead to biased estimates of reliability. Violation of the assumption of essential tau-equivalence may produce underestimates, and the presence of correlated errors among measurement units may result in overestimates. Many researchers do not fully appreciate how widespread the circumstances are in which these problems can arise. This article surveys a variety of settings in which biased reliability estimates may be found, in an effort to increase awareness of the prevalence of the problem.
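To make the two bias directions concrete, here is a minimal simulation sketch in Python; the loadings, the shared-error structure, and the sample size are assumptions for illustration, not values from the article.

```python
# A minimal simulation sketch (illustrative assumptions, not code from the
# article): alpha on congeneric items (unequal loadings, so essential
# tau-equivalence is violated) underestimates the true reliability of the
# sum score, while alpha on items sharing a correlated-error factor
# overestimates it.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200_000, 4

def cronbach_alpha(items):
    c = np.cov(items, rowvar=False)
    return (k / (k - 1)) * (1 - np.trace(c) / c.sum())

t = rng.normal(size=n)              # true scores, unit variance
e = rng.normal(size=(n, k))         # independent unit-variance item errors

# Case 1: congeneric items -> alpha underestimates true reliability.
lam = np.array([0.3, 0.5, 0.9, 1.3])          # unequal loadings (assumed)
x1 = t[:, None] * lam + e
rel1 = lam.sum() ** 2 / (lam.sum() ** 2 + k)  # true reliability ~ .69
print(f"tau-equivalence violated: alpha = {cronbach_alpha(x1):.2f}, "
      f"true = {rel1:.2f}")                    # alpha ~ .63

# Case 2: correlated errors via a shared nuisance factor -> overestimate.
s = rng.normal(size=n)                         # error shared by all items
x2 = t[:, None] + 0.6 * s[:, None] + e
rel2 = k**2 / (k**2 * (1 + 0.36) + k)          # true reliability ~ .62
print(f"correlated errors: alpha = {cronbach_alpha(x2):.2f}, "
      f"true = {rel2:.2f}")                    # alpha ~ .84
```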


2017 ◽  
Vol 21 (3) ◽  
pp. 255-268
Author(s):  
Meghan K. Crouch ◽  
Diane E. Mack ◽  
Philip M. Wilson ◽  
Matthew Y. W. Kwan

Using reliability generalization analysis, the purpose of this study was to characterize the average score reliability and the variability of score reliability estimates, and to explore characteristics (e.g., sample size) that influence the reliability of scores across studies using the Scales of Psychological Wellbeing (PWB; Ryff, 1989, 2014). Published studies were included in this investigation if they appeared in a peer-reviewed journal, used 1 or more PWB subscales, estimated coefficient alpha value(s) for the PWB subscale(s), and were written in English. Of the 924 articles generated by the search strategy, a total of 264 were included in the final sample for meta-analysis. The average value reported for coefficient alpha referencing the composite PWB scale was 0.858, with mean coefficient alphas ranging from 0.722 for the autonomy subscale to 0.801 for the self-acceptance subscale. The 95% prediction interval for the composite PWB was [.653, .996], and the lower bounds of the prediction intervals for the individual subscales were all above .350. Moderator analyses revealed significant differences in score reliability estimates across select sample and test characteristics; most notably, R² values linked with test length ranged from 40% to 71%. Concerns were identified with the use of the 3-item versions of the PWB subscales, which reinforces claims advanced by Ryff (2014). Suggestions for researchers using the PWB are advanced that span measurement considerations and standards of reporting. Psychological researchers who calculate score reliability estimates within their own work should recognize the implications of coefficient alpha values for validity, null hypothesis significance testing, and effect sizes.
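As an illustration of the pooling step behind numbers like these, the sketch below combines hypothetical study-level alphas under a random-effects model and forms a 95% prediction interval; it assumes Bonett's ln(1 − α) transformation and DerSimonian-Laird estimation, and none of the values come from the study itself.

```python
# A hedged sketch of one reliability generalization step: pooling alphas
# across studies and forming a 95% prediction interval. All inputs are
# hypothetical; the transformation is Bonett's ln(1 - alpha), and tau^2 is
# estimated with the DerSimonian-Laird method.
import numpy as np
from scipy import stats

alphas = np.array([0.72, 0.80, 0.85, 0.78, 0.88, 0.83])  # hypothetical
ns     = np.array([150, 320, 90, 210, 500, 175])          # sample sizes
k      = 7                                                # items per scale

t = np.log(1 - alphas)                  # transformed effect sizes
v = 2 * k / ((k - 1) * (ns - 2))        # approximate sampling variances

w = 1 / v                               # fixed-effect weights
mu_fe = (w * t).sum() / w.sum()
q = (w * (t - mu_fe) ** 2).sum()        # heterogeneity statistic Q
c = w.sum() - (w ** 2).sum() / w.sum()
tau2 = max(0.0, (q - (len(t) - 1)) / c) # between-study variance

w_re = 1 / (v + tau2)                   # random-effects weights
mu = (w_re * t).sum() / w_re.sum()
se = np.sqrt(1 / w_re.sum())
crit = stats.t.ppf(0.975, df=len(t) - 2)
half = crit * np.sqrt(tau2 + se ** 2)

# ln(1 - alpha) is decreasing in alpha, so endpoints swap on back-transform.
print(f"pooled alpha = {1 - np.exp(mu):.3f}")
print(f"95% PI = [{1 - np.exp(mu + half):.3f}, {1 - np.exp(mu - half):.3f}]")
```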


2010 ◽  
Vol 107 (1) ◽  
pp. 95-112 ◽  
Author(s):  
Jung Hee Ha ◽  
Sang Min Lee ◽  
Ana Puig

Perfectionism has been identified as a common concern among clients who seek counseling services. For more than 20 years, the Frost Multidimensional Perfectionism Scale (F-MPS) has been used extensively to measure individuals' perfectionism. The current study used reliability generalization to identify the average score reliability as well as variables explaining the variability of score reliability. Typical reliability across subscale scores ranged from .71 to .86, with the Doubts about Actions subscale showing the least variability and the Organization subscale showing the most. In addition, sex, language, and the standard deviation of the scale scores were statistically significantly related to reliability estimates.
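The kind of moderator analysis reported here can be sketched as an inverse-variance weighted regression of transformed alphas on a study characteristic; the data below are hypothetical, with the scale-score standard deviation as the single moderator.

```python
# A minimal weighted-regression sketch (hypothetical data) of a reliability
# generalization moderator analysis: transformed alphas regressed on the
# standard deviation of scale scores, weighted by inverse sampling variance.
import numpy as np

alphas = np.array([0.71, 0.78, 0.86, 0.74, 0.83])       # hypothetical alphas
sds    = np.array([12.1, 15.4, 19.8, 13.0, 18.2])       # scale-score SDs
vs     = np.array([0.004, 0.003, 0.002, 0.005, 0.002])  # sampling variances

y = np.log(1 - alphas)                        # Bonett-transformed outcomes
X = np.column_stack([np.ones_like(sds), sds]) # intercept + moderator
W = np.diag(1 / vs)                           # inverse-variance weights

beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(f"intercept = {beta[0]:.3f}, slope on SD = {beta[1]:.4f}")
# A negative slope on this scale means larger score SDs go with larger
# alphas, consistent with reliability rising as observed variance rises.
```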


2005 ◽  
Vol 27 (1) ◽  
pp. 71 ◽  
Author(s):  
Paul Westrick

This study examines the piloting of a commercially produced test of English, the Quick Placement Test – Pen and Paper Test (QPT-PPT). In consecutive administrations of two versions of the test with 161 first-year students at a Japanese university, the test results failed to discriminate among students of varying proficiencies. Narrow score ranges, low score reliability estimates, and large standard errors of measurement characterized the results. Item analysis revealed that most of the test items did little to separate high- and low-scoring students. The data also suggest that test anxiety, familiarity with the test format, and test-taking skills were important factors in the test scores.
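The diagnostics named in the abstract can be computed in a few lines; the sketch below uses simulated 0/1 responses (161 examinees, an assumed 60 items) rather than the study's data, and shows coefficient alpha, the standard error of measurement SD·√(1 − α), and corrected item-total discrimination.

```python
# An illustrative sketch (simulated responses, not the study's data) of the
# diagnostics the abstract reports: coefficient alpha, the standard error of
# measurement, and corrected item-total discrimination indices.
import numpy as np

rng = np.random.default_rng(1)
n_students, k = 161, 60                  # 60 items is an assumption
# Uniformly easy, mutually independent items mimic a non-discriminating test.
resp = (rng.random((n_students, k)) < 0.85).astype(float)

total = resp.sum(axis=1)
alpha = (k / (k - 1)) * (1 - resp.var(axis=0, ddof=1).sum()
                         / total.var(ddof=1))
sem = total.std(ddof=1) * np.sqrt(1 - max(alpha, 0.0))

# Corrected item-total correlation: each item against the rest of the test.
disc = np.array([np.corrcoef(resp[:, j], total - resp[:, j])[0, 1]
                 for j in range(k)])

print(f"alpha = {alpha:.2f}, SEM = {sem:.2f}")
print(f"median corrected item-total r = {np.median(disc):.2f}")  # near zero
```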


2016 ◽  
Vol 34 (2) ◽  
pp. 271-289 ◽  
Author(s):  
Chih-Kai Lin

Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the technical complexity involved in estimating score reliability from sparse-rated data. Examining the estimation precision of reliability is of great importance because the utility of any performance-based language test depends on its reliability. Results suggest that when some raters are expected to have greater score variability than other raters (e.g., a mixture of novice and experienced raters being deployed in a rating session), the subdividing method is recommended as it yields more precise reliability estimates. When all raters are expected to exhibit similar variability in their scoring, both the rating and subdividing methods are equally precise in estimating score reliability, and the rating method is recommended for operational use, as it is easier to implement in practice. Informed by these methodological results, the current study also demonstrates a step-by-step analysis for investigating the score reliability from sparse-rated data taken from a large-scale English speaking proficiency test. Implications for operational performance-based language tests are discussed.
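For orientation, the sketch below estimates variance components and a generalizability coefficient for a fully crossed persons × raters design with simulated scores; sparse-rated data would additionally require the rating or subdividing adaptations the article compares, which are not reproduced here.

```python
# A compact G-theory sketch (simulated, fully crossed p x r design): variance
# components from ANOVA mean squares and the relative generalizability
# coefficient. All effect sizes below are assumed values.
import numpy as np

rng = np.random.default_rng(2)
n_p, n_r = 30, 4                                   # persons, raters (assumed)
person = rng.normal(0, 1.0, n_p)                   # person (universe) effects
rater  = rng.normal(0, 0.3, n_r)                   # rater severity effects
scores = person[:, None] + rater[None, :] + rng.normal(0, 0.5, (n_p, n_r))

grand = scores.mean()
ss_p  = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_r  = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r

ms_p  = ss_p / (n_p - 1)
ms_r  = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

var_pr_e = ms_pr                              # residual (pr,e) component
var_p = max(0.0, (ms_p - ms_pr) / n_r)        # person variance
var_r = max(0.0, (ms_r - ms_pr) / n_p)        # rater variance

g = var_p / (var_p + var_pr_e / n_r)          # relative G coefficient
print(f"sigma2_p = {var_p:.2f}, sigma2_r = {var_r:.2f}, "
      f"sigma2_pr,e = {var_pr_e:.2f}, E(rho2) = {g:.2f}")
```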


2014 ◽  
Vol 30 (2) ◽  
pp. 130-139 ◽  
Author(s):  
Gilles E. Gignac

Researchers have the implicit option of calculating internal consistency reliability (coefficient α) for total scale scores derived from multidimensional inventories based on either the inter-item correlation matrix (item unit-level) or the inter-subscale correlation matrix (subscale unit-level). It is demonstrated that item unit-level and subscale unit-level reliability estimates often diverge substantially in practice; specifically, the item unit-level estimate is often larger than the corresponding subscale unit-level estimate. It is recommended that, when the underlying model is multidimensional, researchers who calculate total scale score reliability at the item unit-level apply a model-based approach to the estimation of internal consistency reliability (i.e., omega hierarchical). If omega hierarchical cannot be applied for any particular reason, it is recommended that total scale score reliabilities be calculated at the subscale unit-level of analysis, not the item unit-level.
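The divergence is easy to reproduce. The sketch below simulates a three-subscale inventory with a weak general factor and strong group factors (all loading values assumed, not taken from the paper) and computes alpha from the inter-item matrix and from the inter-subscale matrix.

```python
# A self-contained sketch (simulated structure, not the paper's data) showing
# how item unit-level alpha can exceed subscale unit-level alpha when an
# inventory is multidimensional.
import numpy as np

def alpha_from_cov(c):
    k = c.shape[0]
    return (k / (k - 1)) * (1 - np.trace(c) / c.sum())

rng = np.random.default_rng(3)
n, n_sub, n_item = 100_000, 3, 4     # 3 subscales x 4 items (assumed)
g_load, grp_load = 0.3, 0.7          # weak general, strong group factors

gen = rng.normal(size=n)
grp = rng.normal(size=(n, n_sub))
items = np.concatenate(
    [g_load * gen[:, None] + grp_load * grp[:, [s]]
     + rng.normal(size=(n, n_item)) for s in range(n_sub)], axis=1)

item_alpha = alpha_from_cov(np.cov(items, rowvar=False))
sub_sums = items.reshape(n, n_sub, n_item).sum(axis=2)
sub_alpha = alpha_from_cov(np.cov(sub_sums, rowvar=False))

print(f"item unit-level alpha     = {item_alpha:.2f}")  # ~ .66
print(f"subscale unit-level alpha = {sub_alpha:.2f}")   # ~ .27
# The item-level value is inflated by strong within-subscale covariance;
# only the general-factor variance is shared across subscales.
```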


2013 ◽  
Vol 34 (1) ◽  
pp. 32-40 ◽  
Author(s):  
Matthias Ziegler ◽  
Christoph Kemper ◽  
Beatrice Rammstedt

The present research aimed at constructing a questionnaire measuring overclaiming tendencies (VOC-T-bias) as an indicator of self-enhancement. The approach used also allows estimation of a score for vocabulary knowledge, the accuracy index (VOC-T-accuracy), using signal detection theory. For construction purposes, an online study was conducted with N = 1,176 participants. The resulting questionnaire, named the Vocabulary and Overclaiming Test (VOC-T), was investigated with regard to its psychometric properties in two further studies. Study 2 used data from a population-representative sample (N = 527), and Study 3 was another online survey (N = 933). Results show that reliability estimates were satisfactory for the VOC-T-bias index and the VOC-T-accuracy index. Overclaiming did not correlate with knowledge, but it was sensitive to self-enhancement, supporting the construct validity of the test scores. The VOC-T-accuracy index in turn covaried with general knowledge and even more so with verbal knowledge, which also supports construct validity. Moreover, the VOC-T-accuracy index had a meaningful correlation with age in both validation studies. All in all, the psychometric properties can be regarded as sufficient to recommend the VOC-T for research purposes.
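For orientation, the sketch below shows the textbook signal detection computation that separates an accuracy index from a bias index; the counts and the log-linear correction are illustrative, and the VOC-T's exact scoring rules may differ.

```python
# A hedged sketch of standard signal detection indices for overclaiming:
# accuracy (d') and response bias (c) from rates of claiming to know real
# words versus foils. Counts are hypothetical; the VOC-T's scoring rules
# are not reproduced here.
from scipy.stats import norm

def sdt_indices(hits, n_real, false_alarms, n_foils):
    # Log-linear correction keeps the rates strictly inside (0, 1).
    h = (hits + 0.5) / (n_real + 1)
    f = (false_alarms + 0.5) / (n_foils + 1)
    z_h, z_f = norm.ppf(h), norm.ppf(f)
    accuracy = z_h - z_f        # d': discrimination of real words from foils
    bias = -(z_h + z_f) / 2     # c: lower values = more willing to claim
    return accuracy, bias

# Hypothetical respondent: claims 30 of 36 real words and 4 of 12 foils.
acc, bias = sdt_indices(hits=30, n_real=36, false_alarms=4, n_foils=12)
print(f"accuracy (d') = {acc:.2f}, bias (c) = {bias:.2f}")
```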


2020 ◽  
Author(s):  
Kristy Martire ◽  
Agnes Bali ◽  
Kaye Ballantyne ◽  
Gary Edmond ◽  
Richard Kemp ◽  
...  

We do not know how often false positive reports are made in a range of forensic science disciplines. In the absence of this information, it is important to understand the naive beliefs that potential jurors hold about the reliability of forensic science evidence, because it is these beliefs that will shape evaluations at trial. This descriptive study adds to our knowledge about naive beliefs by: 1) measuring jury-eligible (lay) perceptions of reliability for the largest range of forensic science disciplines to date, over three waves of data collection between 2011 and 2016 (n = 674); 2) calibrating reliability ratings against false positive report estimates; and 3) comparing lay reliability estimates with those of an opportunity sample of forensic practitioners (n = 53). Overall, the data suggest that both jury-eligible participants and practitioners consider forensic evidence highly reliable. When compared with the best or most plausible estimates of reliability and error in the forensic sciences, these views appear to overestimate reliability and underestimate the frequency of false positive errors. This result highlights the importance of collecting and disseminating empirically derived estimates of false positive error rates to ensure that practitioners and potential jurors have a realistic impression of the value of forensic science evidence.

