Measuring Agreement for Ordered Ratings in 3 x 3 Tables

2006 ◽  
Vol 45 (05) ◽  
pp. 541-547 ◽  
Author(s):  
P. Aubas ◽  
F. Seguret ◽  
A. Kramar ◽  
P. Dujols ◽  
D. Neveu

Summary Objectives: When two raters consider a qualitative variable ordered according to three categories, the qualitative agreement is commonly assessed with a symmetrically weighted kappa statistic. However, these statistics can present paradoxes, since they may be insensitive to variations of either complete agreements or disagreements. Methods: Agreement may be summarized by the relative amounts of complete agreements, partial and maximal disagreements beyond chance. Fixing the marginal totals and the trace, we computed symmetrically weighted kappa statistics and we developed a new statistic for qualitative agreements. Data sets from the literature were used to illustrate the methods. Results: We show that agreement may be better assessed with the unweighted kappa index, κc, and a new statistic ζ, which assesses the excess of maximal disagreements with respect to the partial ones, and does not depend on a particular weighting system. When ζ is equal to zero, maximal and partial disagreements beyond chance are equal. With its estimated large sample variance, we compared the values of two contingency tables. Conclusions: The (κc, ζ) pair is sensitive to variations in agreements and/or disagreements and enables locating the difference between two qualitative agreements. The qualitative agreement is better with increasing values of κc and ζ.
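For readers unfamiliar with the statistics named in this abstract, the following is a minimal sketch of how unweighted and symmetrically weighted kappa can be computed from a 3 x 3 table of counts. The function name `weighted_kappa`, the weighting labels, and the example table are illustrative assumptions and do not come from the article; the ζ statistic is not reproduced here because the abstract does not give its formula.

```python
def weighted_kappa(table, weights="linear"):
    """Cohen's kappa for a square k x k table of counts (rows: rater A, columns: rater B).

    weights: "unweighted", "linear", or "quadratic" agreement weights.
    These labels are illustrative choices, not taken from the article.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def w(i, j):
        # Agreement weight: 1 on the diagonal, decreasing with category distance.
        if weights == "unweighted":
            return 1.0 if i == j else 0.0
        d = abs(i - j) / (k - 1)
        return 1.0 - d if weights == "linear" else 1.0 - d * d

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(w(i, j) * table[i][j] / n for i, j in cells)      # observed weighted agreement
    p_exp = sum(w(i, j) * row_p[i] * col_p[j] for i, j in cells)  # agreement expected by chance
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical 3 x 3 table: no maximal disagreements, 20 partial disagreements.
example = [
    [20, 5, 0],
    [5, 20, 5],
    [0, 5, 20],
]
print(round(weighted_kappa(example, "unweighted"), 3))  # 0.624
print(round(weighted_kappa(example, "linear"), 3))      # 0.709
```

The weighted value exceeds the unweighted one here because every disagreement in the example table is only one category apart and therefore receives partial credit.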

Healthcare ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 810
Author(s):  
Areej Y. Bayahya ◽  
Wadee Alhalabi ◽  
Sultan H. AlAmri

Smart health technology includes physical sensors, intelligent sensors, and output advice to help monitor patients’ health and adjust their behavior. Virtual reality (VR) plays an increasingly large role in improving health outcomes, being used in a variety of medical specialties including robotic surgery, diagnosis of some difficult diseases, and virtual reality pain distraction for severe burn patients. Smart VR health technology acts as a decision support system in the diagnostic testing of patients as they perform real-world tasks in virtual reality (e.g., navigation). In this study, a non-invasive, cognitive computerized test based on 3D virtual environments for detecting the main symptoms of dementia (memory loss, visuospatial defects, and spatial navigation) is proposed. In a recent study, the system was tested on 115 real patients, of whom thirty had dementia, sixty-five were cognitively healthy, and twenty had mild cognitive impairment (MCI). The performance of the VR system was compared with the Mini-Cog test, which is used to assess cognitively impaired patients in the traditional diagnosis system at the clinic. It was observed that visuospatial and memory recall scores in both clinical diagnosis and the VR system were lower for dementia patients than for MCI patients, and lower for MCI patients than for the control group. Furthermore, there is a perfect agreement between the standard methods in functional evaluation and navigational ability in our system, where P-value in weighted Kappa statistic = 100%, and between Mini-Cog-clinical diagnosis vs. VR scores, where P-value in weighted Kappa statistic = 93%.


2009 ◽  
Vol 46 (6) ◽  
pp. 648-653 ◽  
Author(s):  
Piotr Fudalej ◽  
Maria Hortis-Dzierzbicka ◽  
Zofia Dudkiewicz ◽  
Gunvor Semb

Objective: To compare the dental arch relationship following one-stage repair of unilateral cleft lip and palate (UCLP) in Warsaw with a matched sample of patients treated by the Oslo Cleft Team. Material: Study models of 61 children (mean age, 11.2; SD, 1.7) with a nonsyndromic complete UCLP consecutively treated with one-stage closure of the cleft at 9.2 months (range, 6.0 to 15.8 months; SD, 2.0) by the Warsaw Cleft Team at the Institute of Mother and Child, Poland, were compared with a sample drawn from a consecutive series of patients with UCLP treated by the Oslo Cleft Team and matched for age, gender, and soft tissue band. Methods: The study models were given random numbers to blind their origin. Four examiners rated the dental arch relationship using the GOSLON Yardstick. The strength of agreement of rating was assessed with weighted Kappa statistics. An independent t-test was carried out to compare the GOSLON scores between Warsaw and Oslo samples, and Fisher's exact tests were performed to evaluate the difference of distribution of the GOSLON scores. Results: The intrarater and interrater agreements were high (K ≥ .800). No difference in dental arch relationship between Warsaw and Oslo groups was found (mean GOSLON score = 2.68 and 2.65 for Warsaw and Oslo samples, respectively). The distribution of the GOSLON grades was similar in both groups. Conclusions: The dental arch relationship following one-stage repair (Warsaw protocol) was comparable with the outcome of the Oslo Cleft Team's protocol.


2003 ◽  
Vol 1860 (1) ◽  
pp. 103-108 ◽  
Author(s):  
Shawn Landers ◽  
Wael Bekheet ◽  
Lynne Falls

Like many provincial and municipal agencies, the British Columbia Ministry of Transportation (BCMoT) contracts out the collection of pavement surface condition data. Because BCMoT is committed to contracts with multiple private contractors, quality assurance (QA) plays a critical role in ensuring that the data are collected accurately and repeatably from year to year. Comprehensive QA testing procedures for surface distress data have been developed and implemented since the data collection has been based on visual ratings with event boards. Control sites that are manually surveyed are used to evaluate whether the contractor is correctly applying the BCMoT pavement surface distress rating system. To date, the QA testing has been based on a composite-index–based criterion for assessing the level of agreement and supplemented with the detailed severity and density rating data. However, the use of a composite index presents some limitations related to the model formulation and weightings assigned to particular distress types. Although the detailed ratings are useful as a diagnostic tool to pinpoint discrepancies, in the disaggregated format they are not suitable as acceptance criteria for QA testing. Although not widely used in engineering, Cohen’s weighted kappa statistic has been applied since the 1960s in other areas to assess the level of agreement beyond chance among raters. The statistic was therefore identified as a possible solution for improving the ministry’s QA surface distress testing process by providing an overall measure of the level of agreement between the detailed manual benchmark survey and the contractor severity and density ratings. The application of Cohen’s weighted kappa statistic to visual surface distress survey QA testing is described, using the BCMoT survey and testing procedures as a case study.


2000 ◽  
Vol 15 (S1) ◽  
pp. 29-33 ◽  
Author(s):  
E. Van Horn ◽  
C. Manley ◽  
D. Leddy ◽  
D. Cicchetti ◽  
P. Tyrer

Summary Purpose: To assess the validity of a quick assessment instrument (10 minutes) for assessing personality status, the Rapid Personality Assessment Schedule (PAS-R). Subjects and methods: The PAS-R was evaluated in psychotic patients recruited in one of the centres involved in a multicentre randomised controlled trial of intensive vs standard case management (the UK700 case management trial). Patients were assessed using both a full version of the PAS (PAS-I – ICD version) and the PAS-R. The weighted kappa statistic was used to gauge the (criterion-related) validity of the PAS-R using the PAS-I as the gold standard. Both measures code personality status using a four-point rating of severity in addition to recording individual categories of personality disorder. Results: One hundred fifty-five (77%) of 201 patients recruited were assessed with both instruments. The weighted kappa statistic was 0.31, suggesting only moderate agreement between the PAS-I and PAS-R instruments under the four-point rating format, and 0.39 for the dichotomous personality disorder/no disorder separation. The sensitivity (64%) and specificity (82%) of the PAS-R in predicting PAS-I personality disorder were as satisfactory as for other screening instruments but still somewhat disappointing, and the PAS-R had an overall diagnostic accuracy of 78%. Conclusion: The PAS-R is a quick and rough method of detecting personality abnormality but is not a substitute for a fuller assessment.


Blood ◽  
2013 ◽  
Vol 122 (21) ◽  
pp. 1108-1108
Author(s):  
Abigail T. Lang ◽  
Linda P. Grooms ◽  
Mollie Sturm ◽  
Michelle Walsh ◽  
Terah Koch ◽  
...  

Abstract Background: The introduction of bleeding assessment tools (BATs) to quantify the presence and severity of commonly reported bleeding symptoms has received increased interest over the past decade. Bleeding scores, along with laboratory data and family history, can assist the clinician in the assessment of a suspected mild bleeding disorder (MBD). While clinician-administered BATs have been utilized frequently, implementation and validation of the accuracy of a self-report or parent-proxy BAT have yet to be investigated. The primary objective of this study was to determine the accuracy of a parent-administered BAT by measuring the level of agreement between parent and clinician responses to the Condensed MCMDM-1VWD Bleeding Questionnaire. Methods: Our study population included children aged 0-19 years presenting to the hematology clinic at Nationwide Children's Hospital (Columbus, OH) for initial evaluation of a suspected MBD or for follow-up evaluation of a previously diagnosed MBD. At the time of the visit, the parent/caregiver completed a short demographic survey and a modified version (targeted for a 6th grade comprehension level) of the Condensed MCMDM-1VWD Bleeding Questionnaire. The treating provider also completed the BAT by interviewing the patient and his/her caregiver; clinicians were blinded to the results of the parent BAT. Both the parent and clinician versions of the BAT were scored and analyzed in the same manner for ease of comparison. We calculated the percentage of agreement and weighted kappa statistic for individual bleeding symptoms as well as the mean across all questionnaire items. We also examined the agreement between caregiver and clinician responses with regard to patient age, gender, diagnosis (new versus follow-up patient), and parent education level. Results: To date, we have enrolled 55 eligible patients. The overall mean bleeding score (BS) as calculated from the parent-report BAT was 5.98 (range: -1 to 25), while the mean BS for the clinician-report BAT was 3.87 (range: 0 to 16). The mean percentage of agreement between parents and clinicians across all items was 76% (range: 58-98%). The mean weighted kappa statistic was 0.31 (range: -0.04 to 0.79), representing fair agreement (based on Landis and Koch criteria); the mean Gwet's AC1 (an alternative kappa statistic) was 0.72 (range: 0.48 to 0.98), representing substantial agreement. Overall, 20% of parent and clinician total bleeding scores matched exactly, and an additional 42% of parent and clinician scores varied by only one to two points. 82% of the study population had an abnormal total bleeding score (defined as ≥2) when rated by parents and 78% had an abnormal total score when rated by clinicians (82% agreement, kappa = 0.43, Gwet's AC1 = 0.73). Tests for equal kappa coefficients did not show significant differences in agreement between parents and clinicians when compared by patient gender, age, diagnosis, or parent education level. Discussion: To our knowledge, the results of a patient and/or parent-administered BAT score have not been studied to determine their accuracy and feasibility of use as a screening method for patients with a suspected MBD. While parents tended to over-report bleeding as compared to clinicians, overall, parent and clinician bleeding scores were similar in our study, and these results lend support for the potential use of a modified proxy-report BAT in a clinic setting. Additional research into the construct of the parent-administered BAT is needed to further improve the accuracy of parent-reported bleeding symptoms. Disclosures: Lang: OSUCOM Bennett Medical Student Research Scholarship: Research Funding; ASH HONORS Award: Research Funding.
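The gap this abstract reports between kappa and Gwet's AC1 at the same percentage of agreement can be illustrated with a small sketch. It assumes the standard two-rater definitions: Cohen's kappa uses the product of the raters' marginal proportions as chance agreement, while AC1 uses the averaged marginals π_k and chance agreement Σ π_k(1 − π_k) / (q − 1). The function name and the 2 x 2 counts below are hypothetical and are not the study's data.

```python
def cohen_kappa_and_gwet_ac1(table):
    """Return (percent agreement, Cohen's kappa, Gwet's AC1) for a q x q count table.

    Rows are one rater's categories, columns the other's. Unweighted versions only.
    """
    q = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(q)]
    col_p = [sum(table[i][j] for i in range(q)) / n for j in range(q)]
    p_obs = sum(table[i][i] for i in range(q)) / n

    # Cohen: chance agreement from the product of the two raters' marginals.
    pe_kappa = sum(row_p[i] * col_p[i] for i in range(q))
    # Gwet: chance agreement from the averaged marginals pi_k.
    pi = [(row_p[i] + col_p[i]) / 2 for i in range(q)]
    pe_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)

    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return p_obs, kappa, ac1


# Hypothetical counts with a highly prevalent "abnormal" category (first row/column).
counts = [
    [40, 5],
    [4, 1],
]
p_obs, kappa, ac1 = cohen_kappa_and_gwet_ac1(counts)
print(round(p_obs, 2), round(kappa, 3), round(ac1, 3))  # 0.82 0.082 0.776
```

With one category dominating both raters' marginals, kappa's chance-agreement term approaches the observed agreement and the coefficient collapses, whereas AC1 stays close to the raw agreement. This is the usual explanation for the two statistics diverging under skewed prevalence.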


1989 ◽  
Vol 9 (1) ◽  
pp. 67-74 ◽  
Author(s):  
Donald W. Stewart ◽  
David Koulack

Retrospective dream reports from 179 undergraduate students were scored by two independent raters in an attempt to assess the reliability of a newly-developed rating system for lucid dream content. The weighted Kappa statistic, which provides an index of chance-corrected interrater agreement for qualitative data, was used to assess the reliability of the ratings. The results indicated that the lucid dream rating system could be reliably used to identify different types of lucid dream content in dream reports, but some categories were less efficacious than others. Suggested revisions to the rating system are discussed.


1995 ◽  
Vol 6 (2) ◽  
pp. 90-95 ◽  
Author(s):  
Martin C Tammemagi ◽  
John W Frank ◽  
Michael LeBlanc ◽  
Harvey Artsob

Objectives: Lyme disease has been increasingly diagnosed throughout North America since the late 1970s. The clinical diagnosis and epidemiological monitoring of Lyme disease are aided by serological testing for the etiological agent, Borrelia burgdorferi. Numerous authorities have questioned the reproducibility of these serological tests. This study assessed the intra- and interlaboratory reproducibility of an ELISA used to aid in the diagnosis of Lyme disease. Methods: Twenty-seven sera from cases and noncases were tested by three laboratories. Two of the laboratories repeated the tests once. These testings were part of the 1991 quality control assessment of provincial laboratories carried out by the Laboratory Centre for Disease Control (LCDC), Ottawa. Results: The mean weighted kappa statistics were 0.87 for interlaboratory comparisons and 0.89 for intralaboratory comparisons. Conclusions: Overall, the ELISA assessed in this study demonstrated good to excellent intra- and interlaboratory reproducibility in the LCDC 1991 quality control assessment when the data were assessed in the categorical scale using the weighted kappa statistic. Generalization of these findings to clinical laboratory settings must be done with caution.


2013 ◽  
Vol 8 (2) ◽  
pp. 254
Author(s):  
Carol Perryman

Objectives – To compare PubMed and Google Scholar results for content relevance and article quality. Design – Bibliometric study. Setting – Department of Internal Medicine at Texas Tech University Health Sciences Center. Methods – Four clinical searches were conducted in both PubMed and Google Scholar. Search methods were described as “real world” (p. 216) behaviour, with the searchers familiar with content, though not expert at retrieval techniques. The first 20 results from each search were evaluated for relevance to the initial question, as well as for quality. Relevance was determined based on one author’s subjective assessment of information in the title and abstract, when available, and then tested by two other authors, with discrepancies discussed and resolved. Items were assigned to one of three categories: relevant, possibly relevant, and not relevant to the question, with reviewer agreement measured using a weighted kappa statistic. The quality of items found to be ‘relevant’ and ‘possibly relevant’ was measured by impact factor ratings from Thomson Reuters (ISI) Web of Knowledge, when available, as well as information obtained by SCOPUS on the number of times items were cited. Main Results – Google Scholar results were judged to be more relevant and of higher quality than results obtained from PubMed. Google Scholar results were also older on average, while PubMed retrieved items from a larger number of unique journals. Conclusion – In agreement with earlier research, the authors recommended that searchers use both PubMed and Google Scholar to improve on the quality and relevance of results. Searches in the two resources identify unique items based upon the ranking algorithms involved.

