metric selection
Recently Published Documents


TOTAL DOCUMENTS: 75 (five years: 18)
H-INDEX: 12 (five years: 2)

2021 ◽  
Author(s):  
Deho Oscar Blessed

With the widespread use of learning analytics (LA), ethical concerns about fairness have been raised. Research shows that LA models may be biased against students of certain demographic groups. Although fairness has gained significant attention in the broader machine learning (ML) community in the last decade, it is only recently that attention has been paid to fairness in LA. Furthermore, how to decide which unfairness mitigation algorithm or metric to use in a particular context remains largely an open question. On this premise, we performed a comparative evaluation of selected unfairness mitigation algorithms regarded in the fair ML community as having shown promising results. Using three years of program dropout data from an Australian university, we comparatively evaluated how the unfairness mitigation algorithms contribute to ethical LA by testing several hypotheses across fairness and performance metrics. Interestingly, our results show that data bias does not necessarily result in predictive bias. Perhaps not surprisingly, our test of the fairness-utility tradeoff shows that ensuring fairness does not always lead to a drop in utility; indeed, our results show that ensuring fairness can lead to enhanced utility under specific circumstances. Our findings may, to some extent, guide fairness algorithm and metric selection for a given context.
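To make concrete the sort of quantities group-fairness metrics capture (this is a generic sketch, not the metric or algorithm set used in the study), two common measures can be computed from predictions and a protected attribute; the dropout labels, predictions, and group values below are hypothetical.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in positive-prediction rates between two demographic groups."""
    groups = np.unique(group)
    rates = [y_pred[group == g].mean() for g in groups]
    return abs(rates[0] - rates[1])

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute difference in true-positive rates (recall) between two demographic groups."""
    groups = np.unique(group)
    tprs = []
    for g in groups:
        mask = (group == g) & (y_true == 1)
        tprs.append(y_pred[mask].mean())
    return abs(tprs[0] - tprs[1])

# Hypothetical binary dropout labels and predictions for two demographic groups.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)
group = rng.choice(["A", "B"], size=200)

print(demographic_parity_difference(y_pred, group))
print(equal_opportunity_difference(y_true, y_pred, group))
```

Values near zero indicate similar treatment of the two groups under each measure; a mitigation algorithm would aim to shrink these gaps without unduly harming predictive utility.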


SLEEP ◽  
2021 ◽  
Author(s):  
Erika M Yamazaki ◽  
Courtney E Casale ◽  
Tess E Brieva ◽  
Caroline A Antler ◽  
Namni Goel

Abstract Study Objectives Sleep restriction (SR) and total sleep deprivation (TSD) reveal well-established individual differences in Psychomotor Vigilance Test (PVT) performance. While prior studies have used different methods to categorize such resiliency/vulnerability, none have systematically investigated whether these methods categorize individuals similarly. Methods Forty-one adults participated in a 13-day laboratory study consisting of 2 baseline nights, 5 SR nights, 4 recovery nights, and 36 h of TSD. The PVT was administered every 2 h during wakefulness. Three approaches (Raw Score [average SR performance], Change from Baseline [average SR minus average baseline performance], and Variance [intraindividual variance of SR performance]) and, within each approach, six thresholds (±1 standard deviation and the best/worst performing 12.5%, 20%, 25%, 33%, and 50%) classified Resilient/Vulnerable groups. Kendall’s tau-b correlations examined the concordance of group categorizations of approaches within and between PVT lapses and 1/reaction time (RT). Bias-corrected and accelerated bootstrapped t-tests compared group performance. Results Correlations comparing the approaches ranged from moderate to perfect for lapses and zero to moderate for 1/RT. Defined by all approaches, the Resilient groups had significantly fewer lapses on nearly all study days. Defined by the Raw Score approach only, the Resilient groups had significantly faster 1/RT on all study days. Between-measures comparisons revealed significant correlations between the Raw Score approach for 1/RT and all approaches for lapses. Conclusion The three approaches defining vigilant attention resiliency/vulnerability to sleep loss resulted in groups composed of similar individuals for PVT lapses but not for 1/RT. Thus, both method and metric selection for defining vigilant attention resiliency/vulnerability to sleep loss are critical.
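A minimal sketch of how threshold-based Resilient/Vulnerable classifications from two approaches could be compared with Kendall's tau-b; the performance values, the 25% cutoff, and the group labels below are hypothetical and are not the study's data or code.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)
n = 41  # participants
raw_score = rng.normal(10, 3, n)          # e.g., mean PVT lapses during SR
change = raw_score - rng.normal(4, 1, n)  # SR performance minus baseline performance

def classify(values, frac=0.25):
    """Label the best-performing fraction Resilient (1) and the worst Vulnerable (0);
    here lower values (fewer lapses) are better, and the middle of the distribution is unlabeled."""
    cutoff_low, cutoff_high = np.quantile(values, [frac, 1 - frac])
    labels = np.full(len(values), np.nan)
    labels[values <= cutoff_low] = 1   # Resilient
    labels[values >= cutoff_high] = 0  # Vulnerable
    return labels

groups_raw = classify(raw_score)
groups_change = classify(change)

# Kendall's tau-b concordance over participants categorized by both approaches.
both = ~np.isnan(groups_raw) & ~np.isnan(groups_change)
tau, p = kendalltau(groups_raw[both], groups_change[both])
print(f"tau-b = {tau:.2f}, p = {p:.3f}")
```

A tau-b near 1 would indicate that the two approaches place essentially the same individuals in the Resilient and Vulnerable groups; values near zero would indicate little agreement, as the abstract reports for 1/RT.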


2021 ◽  
Author(s):  
Scott Graham ◽  
Trisha Ghotra

Background Recent advances in artificial intelligence (AI) have the potential to substantially improve healthcare across clinical areas. However, there are concerns that health AI research may overstate the utility of newly developed systems and that certain metrics for measuring AI system performance may lead to an overly optimistic interpretation of research results. The current study aims to evaluate the relationship between researchers' choice of AI performance metric and the use of promotional language in published abstracts. Methods and findings This cross-sectional study evaluated the relationship between promotional language and the use of composite performance metrics (AUC or F1). A total of 1200 randomly sampled health AI abstracts drawn from PubMed were evaluated for metric selection and promotional language rates. Promotional language was evaluated with a customized machine learning system that identifies promotional claims in abstracts describing the results of health AI system development. The language classification system was trained on an annotated dataset of 922 sentences. Collected sentences were annotated by two raters for evidence of promotional language; the annotators achieved 94.5% agreement (κ = 0.825). Several candidate models were evaluated, and a bagged classification and regression tree (CART) achieved the highest performance (precision = 0.92, recall = 0.89). The final model was used to classify individual sentences in a sample of 1200 abstracts, and a quasi-Poisson framework was used to assess the relationship between metric selection and promotional language rates. The results indicate that use of AUC predicts a 12% increase (95% CI: 5% to 19%, p = 0.00104) in abstract promotional language rates and that use of F1 predicts a 16% increase (95% CI: 4% to 30%, p = 0.00996). Conclusions Prior trials evaluating spin, hype, and overstatement have found that increases of the magnitude observed here are sufficient to induce misinterpretation of findings by researchers and clinicians. These results suggest that efforts to address hype in health AI need to attend to both underlying research methods and language choice.
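As a rough illustration of the quasi-Poisson rate model described above (not the authors' code), a Poisson GLM with a Pearson chi-square dispersion estimate and a sentence-count offset can relate metric choice to promotional-language rates; all variable names and data below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1200  # abstracts

# Hypothetical data: whether an abstract reports AUC or F1, its sentence count,
# and how many of those sentences a classifier flagged as promotional.
df = pd.DataFrame({
    "uses_auc": rng.integers(0, 2, n),
    "uses_f1": rng.integers(0, 2, n),
    "n_sentences": rng.integers(8, 16, n),
})
base_rate = 0.15 * np.exp(0.12 * df.uses_auc + 0.16 * df.uses_f1)
df["promo_sentences"] = rng.poisson(base_rate * df.n_sentences)

# Quasi-Poisson: a Poisson GLM whose dispersion is estimated from Pearson chi-square,
# with log(sentence count) as an offset so coefficients act on per-sentence rates.
X = sm.add_constant(df[["uses_auc", "uses_f1"]])
model = sm.GLM(df["promo_sentences"], X,
               family=sm.families.Poisson(),
               offset=np.log(df["n_sentences"]))
result = model.fit(scale="X2")
print(result.summary())

# Exponentiated coefficients give multiplicative changes in the promotional-language rate,
# e.g., exp(beta) = 1.12 corresponds to a 12% increase.
print(np.exp(result.params))
```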


2021 ◽  
Author(s):  
Alexander Nauels ◽  
Carl-Friedrich Schleussner ◽  
Joeri Rogelj

The treatment of non-CO₂ greenhouse gases is central to scientific assessments of effective climate change mitigation and to climate policy. The radiative forcing of a unit of emitted short-lived gases decays quickly, on the order of a decade for methane as opposed to centuries for CO₂. Metric selection for comparing the climate effect of these emissions with CO₂ therefore involves choices regarding short- versus long-term mitigation priorities. The global nature of the well-mixed atmosphere also has implications for the transferability of concepts such as global warming potentials from the global to the national scale.

Here we present the implications of metric choice for global emissions balance and net zero, with a particular emphasis on consistency with the wider context of the Paris Agreement at both the global and national levels. Stylized scenarios show that interpreting the Paris Agreement emissions goals with metrics different from those of the IPCC AR5 can lead to inconsistencies with the Agreement’s temperature goal. Furthermore, we illustrate that introducing metrics that depend on historical emissions in a national context raises profound questions of equity and fairness, thereby questioning the applicability of non-constant global warming potentials at any but the global level. We provide suggestions for adequately approaching these issues in the context of the Paris Agreement and national policy making.
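To illustrate how strongly metric choice shifts CO₂-equivalent accounting for a short-lived gas such as methane, a toy sketch using approximate IPCC AR5 GWP100 and GWP20 values (without climate-carbon feedbacks); the emission figure is hypothetical.

```python
# Toy illustration of metric choice in CO2-equivalent accounting for methane.
# GWP values are approximate AR5 figures; the emission number is hypothetical.
GWP_CH4 = {"GWP100": 28, "GWP20": 84}

annual_ch4_emissions_mt = 10.0  # hypothetical national CH4 emissions, Mt/yr

for metric, gwp in GWP_CH4.items():
    co2e = annual_ch4_emissions_mt * gwp
    print(f"{metric}: {annual_ch4_emissions_mt} Mt CH4 = {co2e:.0f} Mt CO2-eq")
```

The threefold difference between the two conversions is one concrete way the short- versus long-term prioritization embedded in a metric surfaces in national accounting.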

