Study of Existing Metrics Used in Measurement of Ideation Effectiveness

Author(s):  
Ramesh Srivathsavai ◽  
Nicole Genco ◽  
Katja Hölttä-Otto ◽  
Carolyn C. Seepersad

In recent years, many new idea generation methods have been developed to generate innovative concepts. The effectiveness of those methods is evaluated by applying a set of metrics to the resulting concepts. Several metrics have been proposed for this purpose, including quality, novelty, and variety metrics, but the inter-rater reliability of those metrics has not been investigated extensively. In this paper, the inter-rater reliability of three existing metrics is analyzed by applying them to the results of a representative idea generation study. The effects on inter-rater agreement of analyzing concepts at the overall concept level versus the feature level are investigated, along with the impacts of alternative scales for specific metrics. In general, the inter-rater reliability of the metrics is found to be relatively low, with the most reliable results obtained at the feature level. The use of different scales also affects inter-rater reliability, but the effect is less significant. In addition to their low levels of repeatability, the metrics differ in how novelty is appraised.

2017 ◽  
Vol 181 (24) ◽  
pp. 655-655 ◽  
Author(s):  
Rafael Alzola Domingo ◽  
Chris M Riggs ◽  
David S Gardner ◽  
Sarah L Freeman

Superficial digital flexor tendon (SDFT) tendinopathy is an important musculoskeletal problem in horses. The study objective was to validate an ultrasonographic scoring system for SDFT injuries. Ultrasonographic images from 14 Thoroughbred racehorses with SDFT lesions (seven core; seven diffuse) and two controls were blindly assessed by five clinicians on two occasions. Ultrasonographic parameters evaluated were: type and extent of the injury, location, echogenicity, cross-sectional area and longitudinal fibre pattern of the maximal injury zone (MIZ). Inter-rater variability and intra-rater reliability were assessed using Kendall’s coefficient of concordance (KC) and Lin’s concordance correlation coefficient (LC), respectively. Type of injury (core vs. diffuse) had perfect inter/intra-rater agreement. Cases with core lesions had very strong inter-rater agreement (KC ≥0.74, P<0.001) and intra-rater reliability (LC ≥0.73) for all parameters apart from echogenicity. Cases with diffuse lesions had strong inter-rater agreement (KC ≥0.62) for all parameters, but weak agreement for echogenicity (KC=0.22); intra-rater reliability was excellent for MIZ location and fibre pattern (LC ≥0.82), and moderate (LC ≥0.58) for cross-sectional area and number of zones affected. This scoring system was reliable and repeatable for all parameters, except for echogenicity. A validated scoring system will facilitate reliable recording of SDFT injuries and inter-study meta-analyses.
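
The abstract above reports inter-rater agreement as Kendall's coefficient of concordance. A minimal sketch of how that coefficient is computed is given here, assuming every rater supplies a complete ranking with no ties (ordinal scores with ties, as in the study, would need the tie-corrected form). The function name `kendalls_w` and the sample rankings are hypothetical and are not taken from the study.

```python
def kendalls_w(rankings):
    """Kendall's W for m raters who each rank the same n items (ranks 1..n, no ties)."""
    m = len(rankings)
    if m < 2:
        raise ValueError("need at least two raters")
    n = len(rankings[0])
    if n < 2 or any(len(r) != n for r in rankings):
        raise ValueError("every rater must rank the same n >= 2 items")

    # Sum of the ranks each item received across all raters.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    # Expected rank sum per item if the raters agreed no better than chance.
    mean_total = m * (n + 1) / 2
    # Squared deviations of the rank sums from that expectation.
    s = sum((t - mean_total) ** 2 for t in totals)
    # Normalise by the maximum possible S, so that 1 means identical rankings.
    return 12 * s / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    # Invented rankings of four cases by three raters (illustration only).
    print(kendalls_w([[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 3, 4]]))
```

W ranges from 0 (no concordance) to 1 (all raters produce the same ranking), which is why thresholds such as KC ≥0.74 can be read as strong agreement.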


2014 ◽  
Vol 26 (5) ◽  
pp. 825-836 ◽  
Author(s):  
Martin Nikolaus Dichter ◽  
Christian G. G. Schwab ◽  
Gabriele Meyer ◽  
Sabine Bartholomeyczik ◽  
Olga Dortmann ◽  
...  

ABSTRACT Background: Quality of life (Qol) is an increasingly used outcome measure in dementia research. The QUALIDEM is a dementia-specific and proxy-rated Qol instrument. We aimed to determine the inter-rater and intra-rater reliability in residents with dementia in German nursing homes. Methods: The QUALIDEM consists of nine subscales that were applied to a sample of 108 people with mild to severe dementia and six consecutive subscales that were applied to a sample of 53 people with very severe dementia. The proxy raters were 49 registered nurses and nursing assistants. Inter-rater and intra-rater reliability scores were calculated on the subscale and item level. Results: None of the QUALIDEM subscales showed strong inter-rater reliability based on the single-measure Intra-Class Correlation Coefficient (ICC) for absolute agreement ≥ 0.70. Based on the average-measure ICC for four raters, eight subscales for people with mild to severe dementia (care relationship, positive affect, negative affect, restless tense behavior, social relations, social isolation, feeling at home and having something to do) and five subscales for very severe dementia (care relationship, negative affect, restless tense behavior, social relations and social isolation) yielded a strong inter-rater agreement (ICC: 0.72–0.86). All of the QUALIDEM subscales, regardless of dementia severity, showed strong intra-rater agreement. The ICC values ranged between 0.70 and 0.79 for people with mild to severe dementia and between 0.75 and 0.87 for people with very severe dementia. Conclusions: This study demonstrated insufficient inter-rater reliability and sufficient intra-rater reliability for all subscales of both versions of the German QUALIDEM. The degree of inter-rater reliability can be improved by collaborative Qol rating by more than one nurse. The development of a measurement manual with accurate item definitions and a standardized education program for proxy raters is recommended.


2019 ◽  
Vol 91 (1) ◽  
pp. 75-81 ◽  
Author(s):  
Leonhard A Bakker ◽  
Carin D Schröder ◽  
Harold H G Tan ◽  
Simone M A G Vugts ◽  
Ruben P A van Eijk ◽  
...  

Objective The Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) is widely applied to assess disease severity and progression in patients with motor neuron disease (MND). The objective of the study is to assess the inter-rater and intra-rater reproducibility, i.e., the inter-rater and intra-rater reliability and agreement, of a self-administration version of the ALSFRS-R for use in apps, online platforms, clinical care and trials. Methods The self-administration version of the ALSFRS-R was developed based on both patient and expert feedback. To assess the inter-rater reproducibility, 59 patients with MND filled out the ALSFRS-R online and were subsequently assessed on the ALSFRS-R by three raters. To assess the intra-rater reproducibility, patients were invited on two occasions to complete the ALSFRS-R online. Reliability was assessed with intraclass correlation coefficients, agreement was assessed with Bland-Altman plots and paired samples t-tests, and internal consistency was examined with Cronbach’s coefficient alpha. Results The self-administration version of the ALSFRS-R demonstrated excellent inter-rater and intra-rater reliability. The assessment of inter-rater agreement demonstrated small systematic differences between patients and raters and acceptable limits of agreement. The assessment of intra-rater agreement demonstrated no systematic changes between time points; limits of agreement were 4.3 points for the total score and ranged from 1.6 to 2.4 points for the domain scores. Coefficient alpha values were acceptable. Discussion The self-administration version of the ALSFRS-R demonstrates high reproducibility and can be used in apps and online portals for both individual comparisons, facilitating the management of clinical care and group comparisons in clinical trials.
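
The agreement analysis in the abstract above rests on Bland-Altman limits of agreement. As a rough sketch of that calculation, assuming the conventional definition (mean difference ± 1.96 standard deviations of the paired differences), the following may help; the function name `bland_altman` and the scores are invented for illustration and are not data from the study.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return (bias, lower limit, upper limit) for paired scores from two sources."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired scores")
    # Paired differences, e.g. patient self-rating minus rater-assessed score.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)    # systematic difference between the two sources
    spread = stdev(diffs)  # sample standard deviation of the differences
    # 95% limits of agreement under an approximate normality assumption.
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


if __name__ == "__main__":
    # Hypothetical total scores for five patients (illustration only).
    self_rated = [40, 35, 42, 30, 38]
    rater_assessed = [39, 36, 40, 31, 37]
    bias, lower, upper = bland_altman(self_rated, rater_assessed)
    print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias near zero corresponds to "no systematic changes", while the width of the limits indicates how far two individual scores can plausibly differ.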


2014 ◽  
Author(s):  
Paul Walsh ◽  
Justin M. Thornton ◽  
Nicholas Walker ◽  
John Gary McCoy ◽  
Joe Baal ◽  
...  

Objectives To measure inter-rater agreement of overall clinical appearance of febrile children aged less than 24 months and to compare methods for doing so. Study Design and setting We performed an observational study of inter-rater reliability of the assessment of febrile children in a county hospital emergency department serving a mixed urban and rural population. Two emergency medicine healthcare providers independently evaluated the overall clinical appearance of children less than 24 months of age who had presented for fever. They recorded the initial ‘gestalt’ assessment of whether or not the child was ill appearing or if they were unsure. They then repeated this assessment after examining the child. Each rater was blinded to the other’s assessment. Our primary analysis was graphical. We also calculated Cohen’s κ, Gwet’s agreement coefficient and other measures of agreement and weighted variants of these. We examined the effect of time between exams and patient and provider characteristics on inter-rater agreement. Results We analyzed 159 of the 173 patients enrolled. Median age was 9.5 months (lower and upper quartiles 4.9-14.6), 99/159 (62%) were boys and 22/159 (14%) were admitted. Overall 118/159 (74%) and 119/159 (75%) were classified as well appearing on initial ‘gestalt’ impression by both examiners. Summary statistics varied from 0.223 for weighted κ to 0.635 for Gwet’s AC2. Inter-rater agreement was affected by the time interval between the evaluations and the age of the child but not by the experience levels of the rater pairs. Classifications of ‘not ill appearing’ were more reliable than others. Conclusion The inter-rater reliability of emergency providers' assessment of overall clinical appearance was adequate when described graphically and by Gwet’s AC. Different summary statistics yield different results for the same dataset.
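
The closing point of the abstract above, that different summary statistics give different results for the same data, can be illustrated with a small sketch. It computes unweighted Cohen's κ and Gwet's AC1 for two raters; the study itself also used weighted variants and AC2, which are not reproduced here. The function name `agreement_stats` and the counts are invented for illustration and are not the study's data.

```python
from collections import Counter


def agreement_stats(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("need at least two categories in use")

    # Proportion of subjects on which the two raters gave the same label.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)

    # Cohen: chance agreement is the product of each rater's marginal rates.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    # Gwet AC1: chance agreement is built from the average marginal rate.
    pi = {c: (count_a[c] + count_b[c]) / (2 * n) for c in categories}
    pe_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (q - 1)

    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return p_obs, kappa, ac1


if __name__ == "__main__":
    # Hypothetical 100 children: 90 rated well by both, 2 ill by both, 8 split.
    a = ["well"] * 90 + ["well"] * 4 + ["ill"] * 4 + ["ill"] * 2
    b = ["well"] * 90 + ["ill"] * 4 + ["well"] * 4 + ["ill"] * 2
    p_obs, kappa, ac1 = agreement_stats(a, b)
    print(f"observed={p_obs:.2f}, kappa={kappa:.2f}, AC1={ac1:.2f}")
```

With one category dominating, as when most children appear well, the raw agreement is high but κ is pulled down by its chance-agreement term, whereas AC1 stays close to the raw agreement.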


Stroke ◽  
2014 ◽  
Vol 45 (suppl_1) ◽  
Author(s):  
Christoph Griessenauer ◽  
Paul Foreman ◽  
Mohammadali Shoja ◽  
Kimberly Kicielinski ◽  
John Deveikis ◽  
...  

Background: Traumatic aneurysms occur in 10-20% of blunt traumatic extracranial carotid artery injuries. There is currently no standardized method for characterization of traumatic aneurysms. This study presents a systematic method for aneurysm characterization on both digital subtraction angiography (DSA) and CT angiography (CTA). Methods: Four raters, including one vascular neurosurgeon, one neuroradiologist, and two senior neurosurgical residents, independently reviewed 15 CTAs and 13 DSAs obtained at the time of diagnosis of the traumatic aneurysm. Raters were asked to categorize the aneurysms as either ‘saccular’ or ‘fusiform’ and obtain measurements. Saccular aneurysm size was defined as the greatest linear distance between the expected location of the normal artery wall and the outer edge of the aneurysm lumen (‘depth’). Fusiform aneurysm size was defined as the depth and longitudinal extent (‘length’) parallel to the normal artery. The size of the aneurysm (‘aneurysm plus parent artery’) in relationship to the normal artery (‘parent artery’) was assessed as well. Assessments of five scans of each imaging modality were repeated for measurement of intra-rater reliability. Fleiss's free-marginal multi-rater kappa (κ), Cohen’s kappa (κ), and intraclass correlation coefficient (ICC) were applied to determine inter- and intra-rater reliability. Results: Inter-rater agreement on aneurysm shape, ‘saccular’ versus ‘fusiform’, was almost perfect for CTA (κ = 0.82) and DSA (κ = 0.897). Agreement on aneurysm ‘depth’, ‘length’, ‘aneurysm plus parent artery’, and ‘parent artery’ for CTA and DSA was excellent (ICC > 0.75). Intra-rater agreement on aneurysm shape was substantial to almost perfect (κ > 0.6) in all four raters. Conclusions: This study demonstrates a clinically oriented, standardized method to characterize traumatic aneurysms with remarkable inter- and intra-rater reliability. This approach may help to define this disease entity more clearly and better understand the natural history. While certain characteristics of traumatic aneurysms may be associated with low risk and treatment with antithrombotic therapy may be sufficient, other characteristics may carry increased risk warranting endovascular repair.


Author(s):  
Nicole Banting ◽  
Emily K. Schaeffer ◽  
Jeffrey Bone ◽  
Eva Habib ◽  
Nikki Hooper ◽  
...  

Abstract Background Fractures through the physis account for 18–30% of paediatric fractures and can lead to growth arrest in 5–10% of these cases. Long-term radiographic follow-up is usually necessary to monitor for signs of growth arrest at the affected physis. Given plain radiographs of a physeal fracture obtained throughout patient follow-up, different surgeons may hold different opinions about whether or not early growth arrest has occurred despite using identical radiographs to guide decision-making. This study aims to assess the inter-rater and intra-rater reliability of early growth arrest diagnosis among orthopaedic surgeons given a set of identical plain radiographs. Methods A retrospective chart review was conducted on patients aged 2–18 years previously treated for a physeal fracture at a paediatric tertiary care hospital between 2011 and 2018. De-identified anteroposterior (AP) and lateral radiographs of 39 patients from the date of injury and minimum one-year post-injury were administered in a survey to international paediatric orthopaedic surgeons. Each surgeon was asked whether they would diagnose the patient with growth arrest based on the radiographs provided. Surgeons were asked to complete this process again two weeks after the initial review, but using identical shuffled radiographs. Inter-rater and intra-rater reliability was calculated using appropriate kappa statistics. Results A total of 11 paediatric orthopaedic surgeons completed the first round of the survey, and 9 of these 11 completed the second round. The inter-rater reliability for the first round was 0.22 [95% CI (0.06, 0.35)] and 0.21 [95% CI (0.02, 0.32)] for the second round. The average kappa for intra-rater reliability was −0.05 [95% CI (−0.31, 0.21)]. Comparison by injury side showed no significant variation in diagnosis {p = 0.509, OR = 0.90, [95% CI (0.67, 1.22)]}, while comparison by location of injury varied significantly (p = 0.003). Conclusions Radiographic diagnosis of growth arrest among paediatric orthopaedic surgeons demonstrated ‘fair’ inter-rater agreement and no intra-rater agreement, suggesting critical differences in identifying growth arrest on plain radiographs. Further research is necessary to develop an improved diagnostic approach for growth arrest among orthopaedic surgeons. Level of Evidence Diagnostic level III.


Author(s):  
Zukiswa Zingela ◽  
Louise Stroud ◽  
Johan Cronje ◽  
Max Fink ◽  
Stephan van Wyk

Abstract Background Clinical assessment of catatonia includes the use of diagnostic systems, such as the Diagnostic and Statistical Manual, Fifth Edition (DSM-5) and the International Classification of Disease, Tenth Revision (ICD-10), or screening tools such as the Bush Francis Catatonia Screening Instrument (BFCSI)/Bush Francis Catatonia Rating Scale (BFCRS) and the Braunig Catatonia Rating Scale. In this study, we describe the inter-rater reliability (IRR), utilizing the BFCSI, BFCRS, and DSM-5 to screen for catatonia. Methods Data from 10 participants recruited as part of a larger prevalence study (of 135 participants) were used to determine the IRR by five assessors after they were trained in the application of the 14-item BFCSI, 23-item BFCRS, and DSM-5 to assess catatonia in new admissions. Krippendorff’s α was used to compute the IRR, and Spearman’s correlation was used to determine the concordance between screening tools. The study site was a 35-bed acute mental health unit in Dora Nginza Hospital, Nelson Mandela Bay Metro. Participants were mostly involuntary admissions under the Mental Health Care Act of 2002 and between the ages of 13 and 65 years. Results Of the 135 participants, 16 (11.9%) had catatonia. The majority (92 [68.1%]) were between 16 and 35 years old, with 126 (93.3%) of them being Black and 89 (66.4%) being male. The BFCRS (complete 23-item scale) had the greatest level of inter-rater agreement with α = 0.798, while the DSM-5 had the lowest level of inter-rater agreement with α = 0.565. The highest correlation coefficients were observed between the BFCRS and the BFCSI. Conclusion The prevalence rate of catatonia was 11.9%, with the BFCSI and BFCRS showing the highest pick-up rate and a high IRR with high correlation coefficients, while the DSM-5 had deficiencies in screening for catatonia with low IRR and the lowest correlation with the other two tools.


2020 ◽  
Vol 29 (10) ◽  
pp. 2550-2559
Author(s):  
Said Sadiqi ◽  
Sander P. J. Muijs ◽  
Jeroen J. M. Renkens ◽  
Marcel W. Post ◽  
Lorin M. Benneker ◽  
...  

Abstract Purpose To report on the development of AOSpine CROST (Clinician Reported Outcome Spine Trauma) and results of an initial reliability study. Methods The AOSpine CROST was developed using an iterative approach of multiple cycles of development, review, and revision including an expert clinician panel. Subsequently, a reliability study was performed among an expert panel who were provided with 20 spine trauma cases, administered twice with a 4-week interval. The results of the developmental process were analyzed using descriptive statistics, the reliability per parameter using Kappa statistics, inter-rater agreement using intraclass correlation coefficient (ICC), and internal consistency using Cronbach’s α. Results The AOSpine CROST was developed and consisted of 10 parameters, 2 of which are only applicable for surgically treated patients (‘Wound healing’ and ‘Implants’). A dichotomous scoring system (‘yes’ or ‘no’ response) was incorporated to express expected problems for the short term and long term. In the reliability study, 16 (84.2%) participated in the first round and 14 (73.7%) in the second. Intra-rater reliability was fair to good for both time points (κ = 0.40–0.80 and κ = 0.31–0.67). Results of inter-rater reliability were lower (κ = 0.18–0.60 and κ = 0.16–0.46). Inter-rater agreement for total scores showed moderate results (ICC = 0.52–0.60), and the internal consistency was acceptable (α = 0.76–0.82). Conclusions The AOSpine CROST, an outcome tool for the surgeons, was developed using an iterative process. An initial reliability analysis showed fair to moderate results and acceptable internal consistency. Further clinical validation studies will be performed to further validate the tool.


2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Oddbjørn Klomsten Andersen ◽  
Siobhan A. O’Halloran ◽  
Elin Kolle ◽  
Nanna Lien ◽  
Jeroen Lakerveld ◽  
...  

Abstract Background Physical inactivity and unhealthy diet are key behavioral determinants underlying obesity. The neighborhood environment represents an important arena for modifying these behaviors, and hence reliable and valid tools to measure it are needed. Most existing virtual audit tools have been designed to assess either food or activity environments deemed relevant for adults. Thus, there is a need for a tool that combines the assessment of food and activity environments, and which focuses on aspects of the environment relevant for youth. Objective The aims of the present study were: (a) to adapt the SPOTLIGHT Virtual Audit Tool (S-VAT) developed to assess characteristics of the built environment deemed relevant for adults for use in an adolescent population, (b) to assess the tool’s inter- and intra-rater reliability, and (c) to assess its criterion validity by comparing the virtual audit to a field audit. Methods The tool adaptation was based on literature review and on results of a qualitative survey investigating how adolescents perceived the influence of the environment on dietary and physical activity behaviors. Sixty streets (148 street segments) in six neighborhoods were randomly selected as the study sample. Two raters assessed the inter- and intra-rater reliability and criterion validity, comparing the virtual audit tool to a field audit. The results were presented as percentage agreement and Cohen’s kappa (κ). Results Intra-rater agreement was found to be moderate to almost perfect (κ = 0.44–0.96) in all categories, except in the category aesthetics (κ = 0.40). Inter-rater agreement between auditors ranged from fair to substantial for all categories (κ = 0.24–0.80). Criterion validity was found to be moderate to almost perfect (κ = 0.56–0.82) for most categories, except aesthetics and grocery stores (κ = 0.26–0.35). Conclusion The adapted version of the S-VAT can be used to provide reliable and valid data on built environment characteristics deemed relevant for physical activity and dietary behavior among adolescents.


2021 ◽  
Vol 10 (13) ◽  
pp. 2990
Author(s):  
Min Cheol Chang ◽  
Changbae Lee ◽  
Donghwi Park

Background: The Videofluoroscopic Dysphagia Scale (VDS) is used to interpret and predict the long-term prognosis of patients with dysphagia. However, the inter-rater agreement of the VDS was shown to be lower in a previous study. To overcome this limitation of the VDS, a modified version (mVDS) was created and applied clinically. We aimed to validate its usefulness in determining the appropriate feeding method and predicting the prognosis of dysphagia. Methods: The videofluoroscopic swallowing study (VFSS) data of 50 patients with dysphagia were collected retrospectively. The VFSS data were evaluated using the mVDS, and the inter-rater reliability was calculated. We also evaluated the association between the mVDS and type of feeding method selected, and between the mVDS and presence of aspiration pneumonia in patients with dysphagia. Results: Among the different parameters of mVDS, “aspiration” showed the highest reliability (k = 0.767), followed by “mastication” and “lip closure” (k = 0.648 and k = 0.634, respectively). Conversely, “triggering pharyngeal swallow” and “pyriformis residue” demonstrated the lowest reliabilities (k = 0.312 and k = 0.324, respectively). The intraclass correlation coefficient (ICC), which is used as a measure of the reliability of the total mVDS score, was 0.876. In all patients with dysphagia, the mVDS score correlated significantly with the type of feeding method selected (p < 0.05), and the presence of aspiration pneumonia (p < 0.05). Conclusion: The ICC of the total mVDS score was 0.876. Therefore, the mVDS could be a useful tool for quantifying the severity of dysphagia. It could be helpful in the analysis of the VFSS findings among patients with dysphagia in clinical settings and research.

