A practical application of analysing weighted kappa for panels of experts and EQA schemes in pathology

2011 ◽  
Vol 64 (3) ◽  
pp. 257-260 ◽  
Author(s):  
Karen C Wright ◽  
Patricia Harnden ◽  
Sue Moss ◽  
Dan M Berney ◽  
Jane Melia

Background: Kappa statistics are frequently used to analyse observer agreement for panels of experts and External Quality Assurance (EQA) schemes and generally treat all disagreements as total disagreement. However, the differences between ordered categories may not be of equal importance (eg, the difference between grades 1 vs 2 compared with 1 vs 3). Weighted kappa can be used to adjust for this when comparing a small number of readers, but this has not as yet been applied to the large number of readers typical of a national EQA scheme. Aim: To develop and validate a method for applying weighted kappa to a large number of readers within the context of a real dataset: the UK National Urological Pathology EQA Scheme for prostatic biopsies. Methods: Data on Gleason grade recorded by 19 expert readers were extracted from the fixed text responses of 20 cancer cases from four circulations of the EQA scheme. Composite kappa, currently used to compute an unweighted kappa for large numbers of readers, was compared with the mean kappa for all pairwise combinations of readers. Weighted kappa generalised for multiple readers was compared with the newly developed ‘pairwise-weighted’ kappa. Results: For unweighted analyses, the median increase from composite to pairwise kappa was 0.006 (range −0.005 to +0.052). The difference between the pairwise-weighted kappa and generalised weighted kappa for multiple readers never exceeded ±0.01. Conclusion: Pairwise-weighted kappa is a suitable and highly accurate approximation to weighted kappa for multiple readers.
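A minimal sketch of the pairwise idea described in the Methods: compute a weighted kappa for every pair of readers and average the results. The function names, the choice of linear or quadratic disagreement weights, and the toy ratings are assumptions for illustration; the abstract does not state which weighting scheme the authors used.

```python
from itertools import combinations


def weighted_kappa(a, b, n_cat, power=1):
    """Cohen's weighted kappa for two readers rating the same cases.

    a, b   : lists of category indices in 0..n_cat-1 (one entry per case)
    power  : 1 gives linear disagreement weights, 2 gives quadratic (assumed options)
    """
    n = len(a)
    # Observed joint proportions of (reader a category, reader b category).
    obs = [[0.0] * n_cat for _ in range(n_cat)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_cat)]
    col = [sum(obs[i][j] for i in range(n_cat)) for j in range(n_cat)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_cat):
        for j in range(n_cat):
            w = (abs(i - j) / (n_cat - 1)) ** power
            num += w * obs[i][j]
            den += w * row[i] * col[j]
    # den is zero only if both readers put every case in the same single category.
    return 1.0 - num / den


def pairwise_weighted_kappa(ratings, n_cat, power=1):
    """Mean weighted kappa over all pairs of readers.

    ratings : list of per-reader rating lists, all covering the same cases
    """
    pairs = list(combinations(ratings, 2))
    return sum(weighted_kappa(a, b, n_cat, power) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical data: 3 readers, 6 cases, 3 ordered grades coded 0..2.
    toy = [
        [0, 1, 2, 1, 0, 2],
        [0, 1, 2, 2, 0, 2],
        [0, 2, 2, 1, 1, 2],
    ]
    print(pairwise_weighted_kappa(toy, n_cat=3))
```

With 19 readers, as in the scheme described above, this averages over 171 reader pairs, so the cost grows quadratically with the number of readers but stays trivial at this scale.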

2009 ◽  
Vol 46 (6) ◽  
pp. 648-653 ◽  
Author(s):  
Piotr Fudalej ◽  
Maria Hortis-Dzierzbicka ◽  
Zofia Dudkiewicz ◽  
Gunvor Semb

Objective: To compare the dental arch relationship following one-stage repair of unilateral cleft lip and palate (UCLP) in Warsaw with a matched sample of patients treated by the Oslo Cleft Team. Material: Study models of 61 children (mean age, 11.2; SD, 1.7) with a nonsyndromic complete UCLP consecutively treated with one-stage closure of the cleft at 9.2 months (range, 6.0 to 15.8 months; SD, 2.0) by the Warsaw Cleft Team at the Institute of Mother and Child, Poland, were compared with a sample drawn from a consecutive series of patients with UCLP treated by the Oslo Cleft Team and matched for age, gender, and soft tissue band. Methods: The study models were given random numbers to blind their origin. Four examiners rated the dental arch relationship using the GOSLON Yardstick. The strength of agreement of rating was assessed with weighted Kappa statistics. An independent t-test was carried out to compare the GOSLON scores between Warsaw and Oslo samples, and Fisher's exact tests were performed to evaluate differences in the distribution of the GOSLON scores. Results: The intrarater and interrater agreements were high (K ≥ .800). No difference in dental arch relationship between Warsaw and Oslo groups was found (mean GOSLON score = 2.68 and 2.65 for Warsaw and Oslo samples, respectively). The distribution of the GOSLON grades was similar in both groups. Conclusions: The dental arch relationship following one-stage repair (Warsaw protocol) was comparable with the outcome of the Oslo Cleft Team's protocol.


Blood ◽  
2011 ◽  
Vol 118 (21) ◽  
pp. 145-145 ◽  
Author(s):  
Mohamed L. Sorror ◽  
Fabiana Ostronoff ◽  
Rainer Storb ◽  
Smita Bhatia ◽  
Richard T. Maziarz ◽  
...  

Abstract 145

In 2005, the HCT-CI was introduced as a weighted scoring system to predict mortality risk following allogeneic HCT. Since then, not all investigators were able to validate the HCT-CI after testing in their respective institutions. In 2007, a collaborative multi-institutional study was initiated to investigate 1) whether the HCT-CI was predictive of outcomes across different institutions, 2) the degree of homogeneity of outcome prediction, and 3) the reasons for lack of agreement among investigators. To this end, data were collected from 3347 consecutive patients (pts) treated with allogeneic HCT between 2000 and 2006 from HLA-matched related or unrelated donors at 5 institutions. All data were collected by a single investigator, blinded from the final outcomes of pts, to ensure consistent comorbidity coding. Numbers of pts, percentages of available comorbidity data, and other transplant and pt characteristics were statistically significantly different among institutions (Table 1). Pts missing comorbidity or other covariate data were excluded from further analyses, yielding a final sample size of 2523.

Table 1: Pre-transplant risk factors among the five institutions

| Institutions | A (n=1073), % | B (n=973), % | C (n=336), % | D (n=237), % | E (n=206), % | p |
|---|---|---|---|---|---|---|
| Missing comorbidity data | <1202623 | | | | | <0.001 |
| HCT-CI scores: 0 | 29 | 30 | 32 | 42 | 32 | <0.001 |
| HCT-CI scores: 1–2 | 34 | 28 | 29 | 28 | 22 | |
| HCT-CI scores: ≥3 | 37 | 43 | 39 | 30 | 46 | |
| Donor: Unrelated | 50 | 38 | 51 | 40 | 31 | <0.001 |
| Age, years: ≥50 | 42 | 29 | 47 | 21 | 51 | <0.001 |
| Conditioning regimens: High-dose | 53 | 67 | 79 | 67 | 46 | <0.001 |
| Conditioning regimens: Reduced-intensity | 13 | 29 | 10 | 13 | 31 | |
| Conditioning regimens: Nonmyeloablative | 34 | 4 | 10 | 21 | 23 | |
| ATG in regimen | 11431514 | | | | | <0.001 |
| Diagnoses: Myeloid | 63 | 56 | 59 | 57 | 51 | <0.001 |
| Diagnoses: Lymphoid | 28 | 41 | 38 | 25 | 46 | |
| Diagnoses: Other cancers | 2 | 3 | 1 | 3 | 1 | |
| Diagnoses: Non-malignant diseases | 7 | 0 | 2 | 15 | 4 | |
| Disease risk: High | 59 | 62 | 67 | 51 | 67 | <0.001 |
| Stem cell source: Marrow | 19 | 19 | 24 | 56 | 10 | <0.001 |
| Pt CMV: Positive | 56 | 73 | 70 | 65 | 51 | <0.001 |
| KPS: ≤80 | 29 | 18 | 30 | 38 | 25 | <0.001 |
| Prior regimens: ≥4 | 23 | 22 | 24 | 20 | 30 | 0.25 |

Overall, pts with HCT-CI scores of 0 vs. 1–2 vs. ≥3 had 2-year non-relapse mortality (NRM) rates of 14%, 23%, and 39% (p <0.0001), respectively, and 2-year overall survival (OS) rates of 74%, 61%, and 39% (p <0.0001), respectively. Proportional hazards models were used to estimate the hazard ratio (HR) for NRM and OS associated with HCT-CI scores in each of the 5 institutions (Table 2). The models were adjusted for covariates in Table 1. Increased HCT-CI scores were associated with increases in the HR for NRM and OS across all 5 institutions and these increases were highly statistically significant except for institution E, which had the smallest sample size. Of note, the magnitudes of increases in HRs were not entirely comparable across institutions. In a unified model including all institutions, we found a statistically significant lack of homogeneity across institutions for the HRs associated with scores 1–2 (p=0.03) and ≥3 (p=0.04) for NRM and with scores ≥3 (p=0.01) for OS but not with scores 1–2 for OS (p=0.18). We also found a statistically significant, independent impact of institution on NRM (p=0.001) and OS (p<0.001).

Table 2: Multivariate risk model

| Institutions | NRM HR, score 0 | NRM HR, score 1–2 | NRM HR, score ≥3 | NRM p | OS HR, score 0 | OS HR, score 1–2 | OS HR, score ≥3 | OS p |
|---|---|---|---|---|---|---|---|---|
| A | 1.0 | 1.4 | 2.5 | <0.0001 | 1.0 | 1.36 | 2.23 | <0.0001 |
| B | 1.0 | 2.88 | 4.15 | <0.0001 | 1.0 | 1.88 | 2.77 | <0.0001 |
| C | 1.0 | 1.3 | 3.62 | <0.0001 | 1.0 | 1.33 | 3.28 | <0.0001 |
| D | 1.0 | 1.65 | 6.89 | <0.0001 | 1.0 | 1.84 | 5.81 | <0.0001 |
| E | 1.0 | 1.76 | 2.66 | 0.09 | 1.0 | 1.13 | 2.28 | 0.09 |

We then assessed, among 80 pts from institution A, the inter-observer variability in scoring comorbidity between two individual investigators and between each of them and unknown individuals from a pool of other evaluators. Weighted kappa statistics were highest (0.59) between two single evaluators and lowest between each and multiple evaluators (0.43 and 0.55, respectively). The principal investigator then developed a comprehensive guideline to code comorbidities and used it to train the other single investigator in a single session. Additional evaluation of inter-observer agreement demonstrated marked improvement of the weighted kappa statistic to 0.78. The reported disagreements on the validity of the HCT-CI may be explained by different institutional experiences in managing transplant pts, small number of pts at some institutions, and inter-observer variability in score assignment. The HCT-CI is valid to discriminate relative risks of mortalities after HCT across different institutions and should be used regularly for counseling pts and clinical trial design. Efforts to improve methods for coding comorbidity are in progress.

Disclosures: No relevant conflicts of interest to declare.


2006 ◽  
Vol 45 (05) ◽  
pp. 541-547 ◽  
Author(s):  
P. Aubas ◽  
F. Seguret ◽  
A. Kramar ◽  
P. Dujols ◽  
D. Neveu

Summary Objectives: When two raters consider a qualitative variable ordered according to three categories, the qualitative agreement is commonly assessed with a symmetrically weighted kappa statistic. However, these statistics can present paradoxes, since they may be insensitive to variations of either complete agreements or disagreements. Methods: Agreement may be summarized by the relative amounts of complete agreements, partial and maximal disagreements beyond chance. Fixing the marginal totals and the trace, we computed symmetrically weighted kappa statistics and we developed a new statistic for qualitative agreements. Data sets from the literature were used to illustrate the methods. Results: We show that agreement may be better assessed with the unweighted kappa index, κc, and a new statistic ζ, which assesses the excess of maximal disagreements with respect to the partial ones, and does not depend on a particular weighting system. When ζ is equal to zero, maximal and partial disagreements beyond chance are equal. With its estimated large sample variance, we compared the values of two contingency tables. Conclusions: The (κc, ζ) pair is sensitive to variations in agreements and/or disagreements and enables locating the difference between two qualitative agreements. The qualitative agreement is better with increasing values of κc and ζ.
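For orientation, the weighted kappa family discussed in this summary is conventionally written as below for two raters and three ordered categories. The notation is an assumption for illustration, not taken from the paper: $p_{ij}$ is the observed proportion of cases placed in category $i$ by the first rater and $j$ by the second, $p_{i\cdot}$ and $p_{\cdot j}$ are the marginal proportions, and $w_{ij}$ are disagreement weights. The new statistic ζ is not defined in this summary, so it is not reproduced here.

```latex
\kappa_w \;=\; 1 - \frac{\sum_{i=1}^{3}\sum_{j=1}^{3} w_{ij}\, p_{ij}}
                        {\sum_{i=1}^{3}\sum_{j=1}^{3} w_{ij}\, p_{i\cdot}\, p_{\cdot j}},
\qquad w_{ii} = 0,\quad w_{ij} = w_{ji}.
```

Setting $w_{ij} = 1$ for every $i \neq j$ recovers the unweighted kappa. With three categories, a weighting scheme comes down to how the partial-disagreement weights $w_{12}$ and $w_{23}$ are set relative to the maximal-disagreement weight $w_{13}$, so tables with the same margins but a different mix of partial and maximal disagreements can yield the same $\kappa_w$.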


2020 ◽  
Vol 14 (6) ◽  
pp. 529-536
Author(s):  
Jennifer C. Laine ◽  
Susan A. Novotny ◽  
Stefan Huhnstock ◽  
Andrew J. Ries ◽  
John E. Tis ◽  
...  

Purpose: The modified lateral pillar classification (mLPC) is used for prognostication in the fragmentation stage of Legg-Calvé-Perthes disease. Previous reliability assessments of mLPC range from fair to good agreement when evaluated by a small number of observers with pre-selected radiographs. The purpose of this study was to determine the inter-observer and intra-observer reliability of mLPC performed by a group of international paediatric orthopaedic surgeons. Surgeons self-selected the radiograph for mLPC assessment, as would be done clinically. Methods: In total, 40 Perthes cases with serial radiographs were selected. For each case, 26 surgeons independently selected a radiograph and assigned mLPC and 21 raters re-evaluated the same 40 cases to establish intra-observer reliability. Rater performance was determined through surgeon consensus using the mode mLPC as ‘gold standard’. Inter-observer and intra-observer reliability data were analysed using weighted kappa statistics. Results: The weighted kappa for inter-observer correlation for mLPC was 0.64 (95% confidence interval: 0.55 to 0.74) and was 0.82 (range: 0.35 to 0.99) for intra-observer correlation. Individual surgeon’s overall performance varied from 48% to 88% agreement. Surgeon mLPC performance was not influenced by years of experience (p = 0.51). Radiograph selection did not influence gold standard assignment of mLPC. There was greater agreement on cases of mild B hips and severe C hips. Conclusions: mLPC has low good inter-observer agreement when performed by a large number of surgeons with varied experience. Surgeons frequently chose different radiographs, with no impact on mLPC agreement. Further refinement is needed to help differentiate hips on the border of group B and C. Level of evidence: III
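The consensus step in the Methods (mode as ‘gold standard’, then per-surgeon agreement) can be sketched as follows. This is an illustration only: the function names, the tie-breaking rule, and the toy ratings are assumptions, since the abstract does not say how tied modes were handled.

```python
from collections import Counter


def modal_rating(case_ratings):
    """Most frequent rating for one case.

    Ties are broken by first-seen order, which is an assumption made here
    for simplicity, not a rule taken from the study.
    """
    return Counter(case_ratings).most_common(1)[0][0]


def agreement_with_mode(ratings):
    """Percent agreement of each rater with the modal ('gold standard') rating.

    ratings[r][c] is the class assigned by rater r to case c.
    """
    n_cases = len(ratings[0])
    gold = [modal_rating([rater[c] for rater in ratings]) for c in range(n_cases)]
    return [
        100.0 * sum(x == g for x, g in zip(rater, gold)) / n_cases
        for rater in ratings
    ]


if __name__ == "__main__":
    # Hypothetical ratings: 3 raters, 4 cases.
    toy = [
        ["A", "B", "C", "B"],
        ["A", "B", "C", "C"],
        ["B", "B", "C", "C"],
    ]
    print(agreement_with_mode(toy))
```

Because each rater contributes to the mode they are scored against, agreement figures computed this way are somewhat optimistic when the number of raters is small.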


2018 ◽  
Vol 1 (1) ◽  
pp. 6-21 ◽  
Author(s):  
I. K. Razumova ◽  
N. N. Litvinova ◽  
M. E. Shvartsman ◽  
A. Yu. Kuznetsov

Introduction. The paper presents survey results on the awareness and practice of Open Access scholarly publishing among Russian academics. Materials and Methods. We employed methods of statistical analysis of survey results. Materials comprise results of data processing of a Russian survey conducted in 2018 and published results of the latest international surveys. The survey comprised 1383 respondents from 182 organizations. We performed comparative studies of the responses from academics and research institutions as well as different research areas. The study compares results obtained in Russia with the recently published results of surveys conducted in the United Kingdom and Europe. Results. Our findings show that 95% of Russian respondents support open access, 94% agree to post their publications in open repositories and 75% have experience in open access publishing. We did not find any difference in the awareness of and attitude towards open access among seven reference groups. Our analysis revealed a difference in the structure of open access publications between authors from universities and research institutes. Discussion and Conclusions. Results reveal a high level of awareness of and support for open access, and successful practice of open access publishing, in the Russian scholarly community. The results for Russia closely resemble those for UK academics. Governmental open access policies and programs would foster the practical realization of open access in Russia.


2021 ◽  
Vol 256 ◽  
pp. 19-43
Author(s):  
Jennifer L. Castle ◽  
Jurgen A. Doornik ◽  
David F. Hendry

The Covid-19 pandemic has put forecasting under the spotlight, pitting epidemiological models against extrapolative time-series devices. We have been producing real-time short-term forecasts of confirmed cases and deaths using robust statistical models since 20 March 2020. The forecasts are adaptive to abrupt structural change, a major feature of the pandemic data due to data measurement errors, definitional and testing changes, policy interventions, technological advances and rapidly changing trends. The pandemic has also led to abrupt structural change in macroeconomic outcomes. Using the same methods, we forecast aggregate UK unemployment over the pandemic. The forecasts rapidly adapt to the employment policies implemented when the UK entered the first lockdown. The difference between our statistical and theory-based forecasts provides a measure of the effect of furlough policies on stabilising unemployment, establishing useful scenarios had furlough policies not been implemented.


2021 ◽  
pp. 1-27
Author(s):  
Sonia Oreffice ◽  
Climent Quintana-Domeque

Abstract We investigate gender differences across multiple dimensions after 3 months of the first UK lockdown of March 2020, using an online sample of approximately 1,500 Prolific respondents resident in the UK. We find that women's mental health was worse than men's along the four metrics we collected data on, that women were more concerned about getting and spreading the virus, and that women perceived the virus as more prevalent and lethal than men did. Women were also more likely to expect a new lockdown or virus outbreak by the end of 2020, and were more pessimistic about the contemporaneous and future state of the UK economy, as measured by their forecasted contemporaneous and future unemployment rates. We also show that between earlier in 2020 before the outbreak of the Coronavirus pandemic and June 2020, women had increased childcare and housework more than men. Neither the gender gaps in COVID-19-related health and economic concerns nor the gender gaps in the increase in hours of childcare and housework can be accounted for by a rich set of control variables. Instead, we find that the gender gap in mental health can be partially accounted for by the difference in COVID-19-related health concerns between men and women.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Renata Zelic ◽  
Francesca Giunchi ◽  
Luca Lianas ◽  
Cecilia Mascia ◽  
Gianluigi Zanetti ◽  
...  

Abstract Virtual microscopy (VM) holds promise to reduce subjectivity as well as intra- and inter-observer variability for the histopathological evaluation of prostate cancer. We evaluated (i) the repeatability (intra-observer agreement) and reproducibility (inter-observer agreement) of the 2014 Gleason grading system and other selected features using standard light microscopy (LM) and an internally developed VM system, and (ii) the interchangeability of LM and VM. Two uro-pathologists reviewed 413 cores from 60 Swedish men diagnosed with non-metastatic prostate cancer 1998–2014. Reviewer 1 performed two reviews using both LM and VM. Reviewer 2 performed one review using both methods. The intra- and inter-observer agreement within and between LM and VM were assessed using Cohen’s kappa and Bland and Altman’s limits of agreement. We found good repeatability and reproducibility for both LM and VM, as well as interchangeability between LM and VM, for primary and secondary Gleason pattern, Gleason Grade Groups, poorly formed glands, cribriform pattern and comedonecrosis but not for the percentage of Gleason pattern 4. Our findings confirm the non-inferiority of VM compared to LM. The repeatability and reproducibility of percentage of Gleason pattern 4 were poor regardless of method used, warranting further investigation and improvement before it is used in clinical practice.
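For a continuous measure such as the percentage of Gleason pattern 4, Bland and Altman's limits of agreement are the mean of the paired differences plus or minus 1.96 standard deviations of those differences. The sketch below shows that calculation; the function name and the paired LM/VM values are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def limits_of_agreement(x, y, z=1.96):
    """Bland-Altman bias and approximate 95% limits of agreement.

    x, y : paired measurements of the same items by two methods or reviews
    Returns (bias, lower limit, upper limit), using the sample SD of differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - z * sd, bias + z * sd


if __name__ == "__main__":
    # Hypothetical percentages of pattern 4 for the same cores under LM and VM.
    lm = [10, 20, 30, 40]
    vm = [8, 22, 29, 41]
    print(limits_of_agreement(lm, vm))
```

Wide limits relative to the clinically relevant range indicate poor agreement even when the bias is close to zero, which is why the method complements kappa for non-categorical features.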


2020 ◽  
pp. jech-2020-214770
Author(s):  
Elizabeth Richardson ◽  
Martin Taulbut ◽  
Mark Robinson ◽  
Andrew Pulford ◽  
Gerry McCartney

Background: Life expectancy (LE) improvements have stalled, and UK tax and welfare ‘reforms’ have been proposed as a cause. We estimated the effects of tax and welfare reforms from 2010/2011 to 2021/2022 on LE and inequalities in LE in Scotland. Methods: We applied a published estimate of the cumulative income impact of the reforms to the households within Scottish Index of Multiple Deprivation (SIMD) quintiles. We estimated the impact on LE by applying a rate ratio for the impact of income on mortality rates (by age group, sex and SIMD quintile) and calculating the difference between inflation-only changes in benefits and the reforms. Results: We estimated that changes to household income resulting from the reforms would result in an additional 1041 (+3.7%) female deaths and 1013 (+3.8%) male deaths. These deaths represent an estimated reduction of female LE from 81.6 years to 81.2 years (−20 weeks), and male LE from 77.6 years to 77.2 years (−23 weeks). Cuts to benefits and tax credits were modelled to have the most detrimental impact on LE, and these were estimated to be most severe in the most deprived areas. The modelled impact on inequalities in LE was widening of the gap between the most and least deprived 20% of areas by a further 21 weeks for females and 23 weeks for males. Interpretation: This study provides further evidence that austerity, in the form of cuts to social security benefits, is likely to be an important cause of stalled LE across the UK.

