Meeting assumptions in the estimation of reliability

2021 ◽  
Vol 21 (4) ◽  
pp. 1021-1027
Author(s):  
Brian P. Shaw

Researchers and psychometricians have long used Cronbach’s α as a measure of reliability. However, there have been growing calls to replace Cronbach’s α with measures whose assumptions are more defensible. One of the most commonly recommended and straightforward alternatives is McDonald’s ω. After a review of reliability and its estimation in Stata, I introduce the community-contributed command omegacoef, which reports McDonald’s ω in a format similar to that of the base alpha command. omegacoef gives Stata users an easy way to compute reliability estimates with confidence that the necessary statistical assumptions are met.
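As a rough illustration of the two coefficients discussed in this abstract, the Python sketch below computes Cronbach’s α directly from an item-score matrix and ω (omega total) from a single-factor solution’s standardized loadings. The data, loadings, and function names are hypothetical; this is not a substitute for omegacoef’s model-based estimation in Stata.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha from an (n_persons, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def mcdonald_omega(loadings, error_vars):
    """Omega total: squared sum of standardized single-factor loadings
    over total variance (true-score variance plus error variances)."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(error_vars, dtype=float)
    return lam.sum() ** 2 / (lam.sum() ** 2 + theta.sum())

# Hypothetical standardized loadings; uniqueness = 1 - loading**2
lam = [0.7, 0.6, 0.8, 0.5]
omega = mcdonald_omega(lam, [1 - l ** 2 for l in lam])
```

The contrast makes α’s stricter assumption visible: α treats all items as equally related to the construct (essential tau-equivalence), while ω lets each item carry its own loading.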

2013 ◽  
Vol 34 (1) ◽  
pp. 32-40 ◽  
Author(s):  
Matthias Ziegler ◽  
Christoph Kemper ◽  
Beatrice Rammstedt

The present research aimed to construct a questionnaire measuring overclaiming tendencies (VOC-T-bias) as an indicator of self-enhancement. The approach also allows estimation of a vocabulary-knowledge score, the accuracy index (VOC-T-accuracy), using signal detection theory. For construction purposes, an online study was conducted with N = 1,176 participants. The resulting questionnaire, named the Vocabulary and Overclaiming Test (VOC-T), was investigated with regard to its psychometric properties in two further studies. Study 2 used data from a population-representative sample (N = 527), and Study 3 was another online survey (N = 933). Reliability estimates were satisfactory for both the VOC-T-bias index and the VOC-T-accuracy index. Overclaiming did not correlate with knowledge, but it was sensitive to self-enhancement, supporting the construct validity of the test scores. The VOC-T-accuracy index, in turn, covaried with general knowledge and even more so with verbal knowledge, which also supports construct validity. Moreover, the VOC-T-accuracy index had a meaningful correlation with age in both validation studies. All in all, the psychometric properties can be regarded as sufficient to recommend the VOC-T for research purposes.
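The bias and accuracy indices described here follow standard signal detection logic: claims of knowing real words are hits, and claims of knowing nonexistent foils are false alarms. A minimal sketch of one common parameterization (the function name and the specific d′/c formulation are illustrative assumptions, not necessarily the authors’ exact scoring):

```python
from statistics import NormalDist

def overclaiming_indices(hit_rate, false_alarm_rate):
    """Signal-detection accuracy (d') and bias (c).
    hit_rate: proportion of real words claimed as known.
    false_alarm_rate: proportion of nonexistent foils claimed as known."""
    z = NormalDist().inv_cdf
    zh, zf = z(hit_rate), z(false_alarm_rate)
    accuracy = zh - zf        # d': separates real knowledge from claiming
    bias = -0.5 * (zh + zf)   # c: lower values = stronger overclaiming
    return accuracy, bias

acc, bias = overclaiming_indices(0.8, 0.2)
```

Separating the two indices is what lets overclaiming (bias) be scored independently of vocabulary knowledge (accuracy), which is why the two can fail to correlate, as reported above.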


2020 ◽  
Author(s):  
Kristy Martire ◽  
Agnes Bali ◽  
Kaye Ballantyne ◽  
Gary Edmond ◽  
Richard Kemp ◽  
...  

We do not know how often false positive reports are made in a range of forensic science disciplines. In the absence of this information, it is important to understand the naive beliefs that potential jurors hold about the reliability of forensic science evidence, because these beliefs will shape evaluations at trial. This descriptive study adds to our knowledge of naive beliefs by: 1) measuring jury-eligible (lay) perceptions of reliability for the largest range of forensic science disciplines to date, over three waves of data collection between 2011 and 2016 (n = 674); 2) calibrating reliability ratings against false positive report estimates; and 3) comparing lay reliability estimates with those of an opportunity sample of forensic practitioners (n = 53). Overall, the data suggest that both jury-eligible participants and practitioners consider forensic evidence highly reliable. When compared with the best or plausible estimates of reliability and error in the forensic sciences, these views appear to overestimate reliability and underestimate the frequency of false positive errors. This result highlights the importance of collecting and disseminating empirically derived estimates of false positive error rates, so that practitioners and potential jurors have a realistic impression of the value of forensic science evidence.


1993 ◽  
Vol 73 (2) ◽  
pp. 499-505 ◽  
Author(s):  
Scott W. Brown ◽  
Mary M. Brown

One of the issues facing researchers studying very select populations is how to obtain reliability estimates for their instruments. When the populations, and the resulting samples, are very small and select, gathering data for typical reliability estimates becomes difficult. As a result, many researchers set aside concerns about the reliability of their instrumentation and forge ahead with data collection. In response to this concern, Bandura's model of social cognition and Wolpe's model of systematic desensitization were combined and applied to a group of 90 undergraduates completing a Communication Satisfaction Scale designed to assess the attitudes of intubated patients in a hospital Intensive Care Unit. Stimuli (textual, auditory, and visual) were provided to sensitize the subjects to the intubation procedure and to enable them to imagine what it is like to be an intubated patient. The subjects responded to 10 Likert-format items focusing on the communication issues of intubated patients. Internal consistency (Cronbach's alpha) was 0.83 for the entire scale. The results are discussed within both a social cognition and a measurement framework. Although the resulting reliabilities cannot be applied directly to the intubated sample, the procedure may provide critical feedback to researchers and instrument developers before the actual administration of the instrument in research.


2021 ◽  
pp. 027112142098171
Author(s):  
Michael D. Toland ◽  
Jennifer Grisham ◽  
Misti Waddell ◽  
Rebecca Crawford ◽  
David M. Dueber

Rasch and classification analyses were conducted on a field-test version of the Assessment, Evaluation, and Programming System Test—Third Edition (AEPS-3), a curriculum-based assessment used with young children from birth to age 6 years. First, the psychometric properties of data from each developmental area of the AEPS-3 field-test version were evaluated. Next, cutoff scores at 6-month age intervals were created, and the validity of those cutoff scores was evaluated. Rasch modeling indicated acceptable model fit statistics, with reasonable reliability estimates within each developmental area. Classification results showed that the cutoff scores accurately classified a high percentage of eligible children. Findings suggest that scores from the field-test version of the AEPS-3 are reliable within developmental areas. To the extent allowed by state criteria, early childhood interventionists could use the field-test version of the AEPS-3 to determine or corroborate eligibility for special education services.


2021 ◽  
pp. 001316442110089
Author(s):  
Yuanshu Fu ◽  
Zhonglin Wen ◽  
Yang Wang

Composite reliability, or coefficient omega, can be estimated using structural equation modeling. It is usually estimated under the basic independent clusters model of confirmatory factor analysis (ICM-CFA). However, because of cross-loadings, the model fit of an exploratory structural equation model (ESEM) is often substantially better than that of ICM-CFA. The present study first illustrates how to estimate composite reliability under ESEM and then compares ESEM and ICM-CFA composite reliability estimates across various numbers of indicators per factor, target factor loadings, cross-loadings, and sample sizes. The results showed no apparent difference between ESEM and ICM-CFA estimates of composite reliability, and the rotation type did not affect the composite reliability estimates generated by ESEM. An empirical example provides further support for the simulation results. Based on the present study, we suggest that if the model fit of ESEM (regardless of the rotation criterion used) is acceptable but that of ICM-CFA is not, the composite reliability estimates based on the two models should nonetheless be similar. If the target factor loadings are relatively small, researchers should increase the number of indicators per factor or increase the sample size.
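Once a loading matrix has been estimated by either method, composite reliability is a closed-form function of the loadings and uniquenesses. A sketch under assumed standardized items and orthogonal factors; all numbers are illustrative, not taken from the study:

```python
import numpy as np

def composite_reliability(target_loadings, error_vars, cross_loadings=None):
    """Coefficient omega for a unit-weighted composite, assuming
    standardized items and orthogonal factors.  Pass a 2-D
    (items x other_factors) cross-loading matrix for an ESEM-style
    solution; omit it for ICM-CFA (cross-loadings fixed to zero)."""
    lam = np.asarray(target_loadings, dtype=float)
    theta = np.asarray(error_vars, dtype=float)
    true_var = lam.sum() ** 2                   # target-factor variance
    cross_var = 0.0
    if cross_loadings is not None:
        C = np.asarray(cross_loadings, dtype=float)
        cross_var = (C.sum(axis=0) ** 2).sum()  # other factors' share
    return true_var / (true_var + cross_var + theta.sum())

# Hypothetical ESEM solution: small cross-loadings on one other factor
esem = composite_reliability([0.70, 0.65, 0.75],
                             [0.5000, 0.5750, 0.4311],
                             [[0.10], [-0.05], [0.08]])
# Hypothetical ICM-CFA solution: same target loadings, no cross-loadings
icm = composite_reliability([0.70, 0.65, 0.75], [0.5000, 0.5750, 0.4311])
```

With these illustrative numbers the two estimates differ only in the third decimal place (about .743 vs. .745), mirroring the study’s conclusion that small cross-loadings barely move the composite reliability estimate.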


Assessment ◽  
2021 ◽  
pp. 107319112110176
Author(s):  
David M. Dueber ◽  
Michael D. Toland ◽  
John Eric Lingat ◽  
Abigail M. A. Love ◽  
Chen Qiu ◽  
...  

To investigate the effect of using negatively oriented items, we wrote semantic reversals of the items in the Rosenberg Self-Esteem Scale, the UCLA Loneliness Scale, and the General Belongingness Scale and used them to create four experimental conditions. Participants (N = 2,019) were recruited through Amazon’s Mechanical Turk. Data were assessed for dimensionality, item functioning, instrument properties, and associations with other variables. Regarding dimensionality, although a two-factor model (positively vs. negatively oriented factors) exhibited better fit than a unidimensional model across all conditions, bifactor indices supported a unidimensional interpretation of the data. With respect to item functioning, factor loadings were nearly invariant across conditions, but thresholds were not. Concerning instrument properties, the inclusion of negatively oriented items resulted in lower mean scores and higher score variances, and instruments containing both positively and negatively oriented items showed lower reliability estimates than those with only one orientation. For associations with other variables, path coefficients in a model where loneliness mediates the effects of belongingness on life satisfaction and self-esteem varied across conditions. Findings suggest that negatively oriented items have a minor impact on instrument quality but influence the measurement model and path coefficients.


2021 ◽  
pp. 109442812199908
Author(s):  
Yin Lin

Forced-choice (FC) assessments of noncognitive psychological constructs (e.g., personality, behavioral tendencies) are popular in high-stakes organizational testing scenarios (e.g., informing hiring decisions) because of their enhanced resistance to response distortion (e.g., faking good, impression management). The measurement precision of FC assessment scores used to inform personnel decisions is of paramount importance in practice. Current publications report different types of reliability estimates for FC assessment scores, and consensus on best practices appears to be lacking. To provide understanding and structure around the reporting of FC reliability, this study systematically examined different reliability estimation methods for Thurstonian IRT-based FC assessment scores: their theoretical differences are discussed, and their numerical differences are illustrated through a series of simulation and empirical studies. In doing so, the study provides a practical guide for appraising different reliability estimation methods for IRT-based FC assessment scores.


Assessment ◽  
2021 ◽  
pp. 107319112199416
Author(s):  
Desirée Blázquez-Rincón ◽  
Juan I. Durán ◽  
Juan Botella

A reliability generalization meta-analysis was carried out to estimate the average reliability of the seven-item, 5-point Likert-type Fear of COVID-19 Scale (FCV-19S), one of the most widely used scales developed around the COVID-19 pandemic. Different reliability coefficients from classical test theory and the Rasch measurement model were meta-analyzed, heterogeneity among the most commonly reported reliability estimates was examined by searching for moderators, and a predictive model for the expected reliability was proposed. At least one reliability estimate was available for 44 independent samples from 42 studies, with Cronbach’s alpha the most frequently reported coefficient. Pooled estimates of the coefficients ranged from .85 to .90. The moderator analyses led to a predictive model in which the standard deviation of scores explained 36.7% of the total variability among alpha coefficients. The FCV-19S was shown to be consistently reliable regardless of the moderator variables examined.
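One common device in reliability generalization work is to average alpha coefficients on the ln(1 − α) scale (a Bonett-type transformation) and back-transform, since alphas are bounded and skewed. The sketch below uses simple sample-size weights; the meta-analysis above fit more elaborate models, so this is an assumption-laden illustration only:

```python
import math

def pool_alphas(alphas, ns):
    """Sample-size-weighted pooled Cronbach's alpha: average the
    coefficients on the ln(1 - alpha) scale, then back-transform."""
    t = [math.log(1 - a) for a in alphas]
    tbar = sum(n * ti for n, ti in zip(ns, t)) / sum(ns)
    return 1 - math.exp(tbar)

pooled = pool_alphas([0.85, 0.90], [100, 100])
```

Pooling two equal-n samples with alphas of .85 and .90 this way yields roughly .88, inside the .85–.90 range of pooled estimates reported above.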


2021 ◽  
pp. 109442812110115
Author(s):  
Ze Zhu ◽  
Alan J. Tomassetti ◽  
Reeshad S. Dalal ◽  
Shannon W. Schrader ◽  
Kevin Loo ◽  
...  

Policy capturing is a widely used technique, but the temporal stability of policy-capturing judgments has long been a cause for concern. This article emphasizes the importance of reporting reliability estimates, in particular test-retest reliability, in policy-capturing studies. We found that only 164 of 955 policy-capturing studies (17.17%) reported a test-retest reliability estimate. We then conducted a reliability generalization meta-analysis on the policy-capturing studies that did report test-retest reliability estimates and obtained an average reliability estimate of .78. We additionally examined 16 potential methodological and substantive antecedents of test-retest reliability (equivalent to moderators in validity generalization studies). Test-retest reliability was robust to variation in 14 of the 16 factors examined, but it was higher in paper-and-pencil studies than in web-based studies and higher for behavioral intention judgments than for other (e.g., attitudinal and perceptual) judgments. We provide an agenda for future research. Finally, we offer several best-practice recommendations for researchers (and journal reviewers) regarding (a) reporting test-retest reliability, (b) designing policy-capturing studies for appropriate reporting, and (c) properly interpreting test-retest reliability in policy-capturing studies.

