Guidelines for Authors Reporting Score Reliability Estimates

2011 ◽  
pp. 90-101 ◽  
Author(s):  
Bruce Thompson


2001 ◽  
Vol 89 (2) ◽  
pp. 291-307 ◽  
Author(s):  
Gilbert Becker

Violation of either of two basic assumptions in classical test theory may lead to biased estimates of reliability. Violation of the assumption of essential tau-equivalence may produce underestimates, and the presence of correlated errors among measurement units may result in overestimates. Many researchers do not fully appreciate how widespread the circumstances are in which these problems can arise. This article surveys a variety of settings in which biased reliability estimates may be found, in an effort to increase awareness of the prevalence of the problem.
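To make the two bias directions concrete, here is a minimal simulation sketch in Python; the loadings, the shared-error structure, and the sample size are assumptions for illustration, not values from the article.

```python
# A minimal simulation sketch (illustrative assumptions, not code from the
# article): alpha on congeneric items (unequal loadings, so essential
# tau-equivalence is violated) underestimates the true reliability of the
# sum score, while alpha on items sharing a correlated-error factor
# overestimates it.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200_000, 4

def cronbach_alpha(items):
    c = np.cov(items, rowvar=False)
    return (k / (k - 1)) * (1 - np.trace(c) / c.sum())

t = rng.normal(size=n)              # true scores, unit variance
e = rng.normal(size=(n, k))         # independent unit-variance item errors

# Case 1: congeneric items -> alpha underestimates true reliability.
lam = np.array([0.3, 0.5, 0.9, 1.3])          # unequal loadings (assumed)
x1 = t[:, None] * lam + e
rel1 = lam.sum() ** 2 / (lam.sum() ** 2 + k)  # true reliability ~ .69
print(f"tau-equivalence violated: alpha = {cronbach_alpha(x1):.2f}, "
      f"true = {rel1:.2f}")                    # alpha ~ .63

# Case 2: correlated errors via a shared nuisance factor -> overestimate.
s = rng.normal(size=n)                         # error shared by all items
x2 = t[:, None] + 0.6 * s[:, None] + e
rel2 = k**2 / (k**2 * (1 + 0.36) + k)          # true reliability ~ .62
print(f"correlated errors: alpha = {cronbach_alpha(x2):.2f}, "
      f"true = {rel2:.2f}")                    # alpha ~ .84
```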


2017 ◽  
Vol 21 (3) ◽  
pp. 255-268
Author(s):  
Meghan K. Crouch ◽  
Diane E. Mack ◽  
Philip M. Wilson ◽  
Matthew Y. W. Kwan

Using reliability generalization analysis, the purpose of this study was to characterize the average score reliability and the variability of score reliability estimates, and to explore characteristics (e.g., sample size) that influence the reliability of scores across studies using the Scales of Psychological Wellbeing (PWB; Ryff, 1989, 2014). Published studies were included in this investigation if they appeared in a peer-reviewed journal, used 1 or more PWB subscales, estimated coefficient alpha value(s) for the PWB subscale(s), and were written in English. Of the 924 articles generated by the search strategy, a total of 264 were included in the final sample for meta-analysis. The average value reported for coefficient alpha referencing the composite PWB scale was 0.858, with mean coefficient alphas ranging from 0.722 for the autonomy subscale to 0.801 for the self-acceptance subscale. The 95% prediction interval for the composite PWB was [.653, .996], and the lower bounds of the prediction intervals for the individual subscales were all above .350. Moderator analyses revealed significant differences in score reliability estimates across select sample and test characteristics; most notably, R² values linked with test length ranged from 40% to 71%. Concerns were identified with the use of the 3-item versions of the PWB subscales, which reinforces claims advanced by Ryff (2014). Suggestions for researchers using the PWB are advanced that span measurement considerations and standards of reporting. Psychological researchers who calculate score reliability estimates within their own work should recognize the implications of coefficient alpha values for validity, null hypothesis significance testing, and effect sizes.
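As an illustration of the pooling step behind numbers like these, the sketch below combines hypothetical study-level alphas under a random-effects model and forms a 95% prediction interval; it assumes Bonett's ln(1 − α) transformation and DerSimonian-Laird estimation, and none of the values come from the study itself.

```python
# A hedged sketch of one reliability generalization step: pooling alphas
# across studies and forming a 95% prediction interval. All inputs are
# hypothetical; the transformation is Bonett's ln(1 - alpha), and tau^2 is
# estimated with the DerSimonian-Laird method.
import numpy as np
from scipy import stats

alphas = np.array([0.72, 0.80, 0.85, 0.78, 0.88, 0.83])  # hypothetical
ns     = np.array([150, 320, 90, 210, 500, 175])          # sample sizes
k      = 7                                                # items per scale

t = np.log(1 - alphas)                  # transformed effect sizes
v = 2 * k / ((k - 1) * (ns - 2))        # approximate sampling variances

w = 1 / v                               # fixed-effect weights
mu_fe = (w * t).sum() / w.sum()
q = (w * (t - mu_fe) ** 2).sum()        # heterogeneity statistic Q
c = w.sum() - (w ** 2).sum() / w.sum()
tau2 = max(0.0, (q - (len(t) - 1)) / c) # between-study variance

w_re = 1 / (v + tau2)                   # random-effects weights
mu = (w_re * t).sum() / w_re.sum()
se = np.sqrt(1 / w_re.sum())
crit = stats.t.ppf(0.975, df=len(t) - 2)
half = crit * np.sqrt(tau2 + se ** 2)

# ln(1 - alpha) is decreasing in alpha, so endpoints swap on back-transform.
print(f"pooled alpha = {1 - np.exp(mu):.3f}")
print(f"95% PI = [{1 - np.exp(mu + half):.3f}, {1 - np.exp(mu - half):.3f}]")
```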


2010 ◽  
Vol 107 (1) ◽  
pp. 95-112 ◽  
Author(s):  
Jung Hee Ha ◽  
Sang Min Lee ◽  
Ana Puig

Perfectionism has been identified as a common concern among clients who seek counseling services. For more than 20 years, the Frost Multidimensional Perfectionism Scale (F-MPS) has been used extensively to measure individuals' perfectionism. The current study used reliability generalization to identify the average score reliability as well as variables explaining the variability of score reliability. Typical reliability across subscale scores ranged from .71 to .86, with the Doubts about Actions subscale showing the least variability and the Organization subscale showing the most. In addition, sex, language, and the standard deviation of the scale scores were statistically significantly related to reliability estimates.
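The kind of moderator analysis reported here can be sketched as an inverse-variance weighted regression of transformed alphas on a study characteristic; the data below are hypothetical, with the scale-score standard deviation as the single moderator.

```python
# A minimal weighted-regression sketch (hypothetical data) of a reliability
# generalization moderator analysis: transformed alphas regressed on the
# standard deviation of scale scores, weighted by inverse sampling variance.
import numpy as np

alphas = np.array([0.71, 0.78, 0.86, 0.74, 0.83])       # hypothetical alphas
sds    = np.array([12.1, 15.4, 19.8, 13.0, 18.2])       # scale-score SDs
vs     = np.array([0.004, 0.003, 0.002, 0.005, 0.002])  # sampling variances

y = np.log(1 - alphas)                        # Bonett-transformed outcomes
X = np.column_stack([np.ones_like(sds), sds]) # intercept + moderator
W = np.diag(1 / vs)                           # inverse-variance weights

beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(f"intercept = {beta[0]:.3f}, slope on SD = {beta[1]:.4f}")
# A negative slope on this scale means larger score SDs go with larger
# alphas, consistent with reliability rising as observed variance rises.
```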


2005 ◽  
Vol 27 (1) ◽  
pp. 71 ◽  
Author(s):  
Paul Westrick

This study examines the piloting of a commercially produced test of English, the Quick Placement Test – Pen and Paper Test (QPT-PPT). In consecutive administrations of two versions of the test with 161 first-year students at a Japanese university, the test results failed to discriminate among students of varying proficiencies. Narrow score ranges, low score reliability estimates, and large standard errors of measurement characterized the results. Item analysis revealed that most of the test items did little to separate high- and low-scoring students. The data also suggest that test anxiety, familiarity with the test format, and test-taking skills were important factors in the test scores.
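The diagnostics named in the abstract can be computed in a few lines; the sketch below uses simulated 0/1 responses (161 examinees, an assumed 60 items) rather than the study's data, and shows coefficient alpha, the standard error of measurement SD·√(1 − α), and corrected item-total discrimination.

```python
# An illustrative sketch (simulated responses, not the study's data) of the
# diagnostics the abstract reports: coefficient alpha, the standard error of
# measurement, and corrected item-total discrimination indices.
import numpy as np

rng = np.random.default_rng(1)
n_students, k = 161, 60                  # 60 items is an assumption
# Uniformly easy, mutually independent items mimic a non-discriminating test.
resp = (rng.random((n_students, k)) < 0.85).astype(float)

total = resp.sum(axis=1)
alpha = (k / (k - 1)) * (1 - resp.var(axis=0, ddof=1).sum()
                         / total.var(ddof=1))
sem = total.std(ddof=1) * np.sqrt(1 - max(alpha, 0.0))

# Corrected item-total correlation: each item against the rest of the test.
disc = np.array([np.corrcoef(resp[:, j], total - resp[:, j])[0, 1]
                 for j in range(k)])

print(f"alpha = {alpha:.2f}, SEM = {sem:.2f}")
print(f"median corrected item-total r = {np.median(disc):.2f}")  # near zero
```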


2016 ◽  
Vol 34 (2) ◽  
pp. 271-289 ◽  
Author(s):  
Chih-Kai Lin

Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the technical complexity involved in estimating score reliability from sparse-rated data. Examining the estimation precision of reliability is of great importance because the utility of any performance-based language test depends on its reliability. Results suggest that when some raters are expected to have greater score variability than other raters (e.g., a mixture of novice and experienced raters being deployed in a rating session), the subdividing method is recommended as it yields more precise reliability estimates. When all raters are expected to exhibit similar variability in their scoring, both the rating and subdividing methods are equally precise in estimating score reliability, and the rating method is recommended for operational use, as it is easier to implement in practice. Informed by these methodological results, the current study also demonstrates a step-by-step analysis for investigating the score reliability from sparse-rated data taken from a large-scale English speaking proficiency test. Implications for operational performance-based language tests are discussed.
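For orientation, the sketch below estimates variance components and a generalizability coefficient for a fully crossed persons × raters design with simulated scores; sparse-rated data would additionally require the rating or subdividing adaptations the article compares, which are not reproduced here.

```python
# A compact G-theory sketch (simulated, fully crossed p x r design): variance
# components from ANOVA mean squares and the relative generalizability
# coefficient. All effect sizes below are assumed values.
import numpy as np

rng = np.random.default_rng(2)
n_p, n_r = 30, 4                                   # persons, raters (assumed)
person = rng.normal(0, 1.0, n_p)                   # person (universe) effects
rater  = rng.normal(0, 0.3, n_r)                   # rater severity effects
scores = person[:, None] + rater[None, :] + rng.normal(0, 0.5, (n_p, n_r))

grand = scores.mean()
ss_p  = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_r  = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r

ms_p  = ss_p / (n_p - 1)
ms_r  = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

var_pr_e = ms_pr                              # residual (pr,e) component
var_p = max(0.0, (ms_p - ms_pr) / n_r)        # person variance
var_r = max(0.0, (ms_r - ms_pr) / n_p)        # rater variance

g = var_p / (var_p + var_pr_e / n_r)          # relative G coefficient
print(f"sigma2_p = {var_p:.2f}, sigma2_r = {var_r:.2f}, "
      f"sigma2_pr,e = {var_pr_e:.2f}, E(rho2) = {g:.2f}")
```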


2014 ◽  
Vol 30 (2) ◽  
pp. 130-139 ◽  
Author(s):  
Gilles E. Gignac

Researchers have the implicit option of calculating internal consistency reliability (coefficient α) for total scale scores derived from multidimensional inventories based on either the inter-item correlation matrix (item unit-level) or the inter-subscale correlation matrix (subscale unit-level). It is demonstrated that item unit-level and subscale unit-level reliability estimates often diverge substantially in practice; specifically, the item unit-level estimate is often larger than the corresponding subscale unit-level estimate. It is recommended that, when the underlying model is multidimensional, researchers who calculate total scale score reliability at the item unit-level apply a model-based approach to the estimation of internal consistency reliability (i.e., omega hierarchical). If omega hierarchical cannot be applied for any particular reason, it is recommended that total scale score reliabilities be calculated at the subscale unit-level of analysis, not the item unit-level.
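The divergence is easy to reproduce. The sketch below simulates a three-subscale inventory with a weak general factor and strong group factors (all loading values assumed, not taken from the paper) and computes alpha from the inter-item matrix and from the inter-subscale matrix.

```python
# A self-contained sketch (simulated structure, not the paper's data) showing
# how item unit-level alpha can exceed subscale unit-level alpha when an
# inventory is multidimensional.
import numpy as np

def alpha_from_cov(c):
    k = c.shape[0]
    return (k / (k - 1)) * (1 - np.trace(c) / c.sum())

rng = np.random.default_rng(3)
n, n_sub, n_item = 100_000, 3, 4     # 3 subscales x 4 items (assumed)
g_load, grp_load = 0.3, 0.7          # weak general, strong group factors

gen = rng.normal(size=n)
grp = rng.normal(size=(n, n_sub))
items = np.concatenate(
    [g_load * gen[:, None] + grp_load * grp[:, [s]]
     + rng.normal(size=(n, n_item)) for s in range(n_sub)], axis=1)

item_alpha = alpha_from_cov(np.cov(items, rowvar=False))
sub_sums = items.reshape(n, n_sub, n_item).sum(axis=2)
sub_alpha = alpha_from_cov(np.cov(sub_sums, rowvar=False))

print(f"item unit-level alpha     = {item_alpha:.2f}")  # ~ .66
print(f"subscale unit-level alpha = {sub_alpha:.2f}")   # ~ .27
# The item-level value is inflated by strong within-subscale covariance;
# only the general-factor variance is shared across subscales.
```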


2013 ◽  
Vol 34 (1) ◽  
pp. 32-40 ◽  
Author(s):  
Matthias Ziegler ◽  
Christoph Kemper ◽  
Beatrice Rammstedt

The present research aimed at constructing a questionnaire measuring overclaiming tendencies (VOC-T-bias) as an indicator of self-enhancement. The approach used also allows estimation of a score for vocabulary knowledge, the accuracy index (VOC-T-accuracy), using signal detection theory. For construction purposes, an online study was conducted with N = 1,176 participants. The resulting questionnaire, named the Vocabulary and Overclaiming Test (VOC-T), was investigated with regard to its psychometric properties in two further studies. Study 2 used data from a population-representative sample (N = 527), and Study 3 was another online survey (N = 933). Results show that reliability estimates were satisfactory for the VOC-T-bias index and the VOC-T-accuracy index. Overclaiming did not correlate with knowledge, but it was sensitive to self-enhancement, supporting the construct validity of the test scores. The VOC-T-accuracy index in turn covaried with general knowledge and even more so with verbal knowledge, which also supports construct validity. Moreover, the VOC-T-accuracy index had a meaningful correlation with age in both validation studies. All in all, the psychometric properties can be regarded as sufficient to recommend the VOC-T for research purposes.
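For orientation, the sketch below shows the textbook signal detection computation that separates an accuracy index from a bias index; the counts and the log-linear correction are illustrative, and the VOC-T's exact scoring rules may differ.

```python
# A hedged sketch of standard signal detection indices for overclaiming:
# accuracy (d') and response bias (c) from rates of claiming to know real
# words versus foils. Counts are hypothetical; the VOC-T's scoring rules
# are not reproduced here.
from scipy.stats import norm

def sdt_indices(hits, n_real, false_alarms, n_foils):
    # Log-linear correction keeps the rates strictly inside (0, 1).
    h = (hits + 0.5) / (n_real + 1)
    f = (false_alarms + 0.5) / (n_foils + 1)
    z_h, z_f = norm.ppf(h), norm.ppf(f)
    accuracy = z_h - z_f        # d': discrimination of real words from foils
    bias = -(z_h + z_f) / 2     # c: lower values = more willing to claim
    return accuracy, bias

# Hypothetical respondent: claims 30 of 36 real words and 4 of 12 foils.
acc, bias = sdt_indices(hits=30, n_real=36, false_alarms=4, n_foils=12)
print(f"accuracy (d') = {acc:.2f}, bias (c) = {bias:.2f}")
```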


2020 ◽  
Author(s):  
Kristy Martire ◽  
Agnes Bali ◽  
Kaye Ballantyne ◽  
Gary Edmond ◽  
Richard Kemp ◽  
...  

We do not know how often false positive reports are made in a range of forensic science disciplines. In the absence of this information, it is important to understand the naive beliefs that potential jurors hold about the reliability of forensic science evidence, because it is these beliefs that will shape evaluations at trial. This descriptive study adds to our knowledge about naive beliefs by: 1) measuring jury-eligible (lay) perceptions of reliability for the largest range of forensic science disciplines to date, over three waves of data collection between 2011 and 2016 (n = 674); 2) calibrating reliability ratings against false positive report estimates; and 3) comparing lay reliability estimates with those of an opportunity sample of forensic practitioners (n = 53). Overall, the data suggest that both jury-eligible participants and practitioners consider forensic evidence highly reliable. When compared with the best or most plausible estimates of reliability and error in the forensic sciences, these views appear to overestimate reliability and underestimate the frequency of false positive errors. This result highlights the importance of collecting and disseminating empirically derived estimates of false positive error rates to ensure that practitioners and potential jurors have a realistic impression of the value of forensic science evidence.

