Metrics for Discrete Student Models: Chance Levels, Comparisons, and Use Cases

2018 ◽  
Vol 5 (2) ◽  
Author(s):  
Nigel Bosch ◽  
Luc Paquette

Metrics including Cohen’s kappa, precision, recall, and F1 are common measures of performance for models of discrete student states, such as a student’s affect or behaviour. This study examined discrete model metrics for previously published student model examples to identify situations where metrics provided differing perspectives on model performance. Simulations were also used to systematically show the effects of imbalanced class distributions, in both data and predictions, on the values of metrics and on the chance levels (values obtained by making random predictions) for those metrics. A chance level for F1 was also established and evaluated. Results for example student models showed that over-prediction of the class of interest (positive class) was relatively common. Chance-level F1 was inflated by over-prediction; conversely, maximum possible values for F1 and kappa were negatively impacted by over-prediction of the positive class. Additionally, normalization methods for F1 relative to chance are discussed and compared to kappa, demonstrating an equivalence between kappa and normalized F1. Finally, implications of the results for the choice of metrics are discussed in the context of common student modelling goals, such as avoiding false negatives for student states that are negatively related to learning.
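The inflation of chance-level F1 by over-prediction can be illustrated with a short simulation (a hypothetical sketch for intuition, not the authors' code): random predictions with base rate p and positive-prediction rate q yield an expected F1 of roughly 2pq/(p + q), which grows as q increases.

```python
import random

def f1_score(y_true, y_pred):
    """Plain-Python F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def chance_f1(base_rate, pred_rate, n=100_000, seed=0):
    """Estimate chance-level F1: labels drawn at `base_rate`,
    predictions made at random with positive rate `pred_rate`."""
    rng = random.Random(seed)
    y_true = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    y_pred = [1 if rng.random() < pred_rate else 0 for _ in range(n)]
    return f1_score(y_true, y_pred)

# With a 10% base rate, over-predicting the positive class inflates chance F1:
balanced = chance_f1(0.10, 0.10)  # predictions match the base rate (~0.10)
inflated = chance_f1(0.10, 0.50)  # heavy over-prediction (~0.17)
```

This matches the paper's observation: a model that over-predicts the positive class competes against a higher chance-level F1, which is why chance-aware normalization (or kappa) matters.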

2020 ◽  
Vol 8 (2) ◽  
pp. 89-93 ◽  
Author(s):  
Hairani Hairani ◽  
Khurniawan Eko Saputro ◽  
Sofiansyah Fadli

The occurrence of an imbalanced class distribution in a dataset causes classification results to tend toward the class with the largest amount of data (the majority class). A sampling method is needed to augment the minority class (positive class) so that the class distribution becomes balanced, leading to better classification results. This study was conducted to overcome the class imbalance problem in the Pima Indians diabetes dataset using k-means-SMOTE. The dataset has 268 instances of the positive class (minority class) and 500 instances of the negative class (majority class). The classification was done by comparing C4.5, SVM, and naïve Bayes while implementing k-means-SMOTE in data sampling. Using k-means-SMOTE, the SVM classification method achieved the highest accuracy and sensitivity of 82% and 77% respectively, while the naïve Bayes method produced the highest specificity of 89%.
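k-means-SMOTE clusters the data with k-means and then applies SMOTE within minority-dense clusters. The interpolation step at SMOTE's core can be sketched in plain Python (a simplified illustration of the idea, not the study's implementation and without the clustering step):

```python
import random
import math

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a point
    and one of its k nearest minority-class neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment between x and nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy 2-feature minority class of 4 points, doubled with 4 synthetic samples:
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=4, k=3)
```

In practice one would use a maintained implementation such as imbalanced-learn's `KMeansSMOTE` rather than hand-rolling this.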


Author(s):  
Jill-Jenn Vie ◽  
Hisashi Kashima

Knowledge tracing is a sequence prediction problem where the goal is to predict the outcomes of students on questions as they interact with a learning platform. By tracking the evolution of a student's knowledge, one can optimize instruction. Existing methods are based either on temporal latent variable models or on factor analysis with temporal features. We show here that factorization machines (FMs), a model for regression or classification, encompass several existing models in the educational literature as special cases, notably the additive factor model, the performance factor model, and multidimensional item response theory. We show, using several real datasets of tens of thousands of users and items, that FMs can estimate student knowledge accurately and quickly even when student data are sparsely observed, and can handle side information such as multiple knowledge components and the number of attempts at the item or skill level. Our approach allows fitting student models of higher dimension than existing models and provides a testbed for trying new combinations of features in order to improve existing models.
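A second-order factorization machine scores a feature vector x as y(x) = w0 + Σᵢ wᵢxᵢ + Σ_{i<j} ⟨vᵢ, vⱼ⟩ xᵢxⱼ. A minimal NumPy sketch of this prediction rule, using Rendle's O(k·n) reformulation of the pairwise term (parameter shapes and the one-hot encoding are illustrative, not the authors' code):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score:
    y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    with the pairwise term computed in O(k*n) as
    0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]."""
    linear = w0 + w @ x
    s = V.T @ x  # length-k vector: sum_i V[i, f] * x_i
    pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise

# Toy example: one-hot encoding of (student, item), as in knowledge tracing.
rng = np.random.default_rng(0)
n_features, k = 6, 3            # e.g. 3 students + 3 items, latent dimension 3
w0, w = 0.1, rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))
x = np.zeros(n_features)
x[[0, 4]] = 1.0                 # student 0 answering item 1
score = fm_predict(x, w0, w, V)
```

With a one-hot (student, item) pair active, the pairwise term reduces to the inner product of the two embeddings, which is how FMs recover models such as multidimensional item response theory as special cases.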


2019 ◽  
Vol 28 (3S) ◽  
pp. 802-805 ◽  
Author(s):  
Marieke Pronk ◽  
Janine F. J. Meijerink ◽  
Sophia E. Kramer ◽  
Martijn W. Heymans ◽  
Jana Besser

Purpose The current study aimed to identify factors that distinguish between older (50+ years) hearing aid (HA) candidates who do and do not purchase HAs after having gone through an HA evaluation period (HAEP). Method Secondary data analysis of the SUpport PRogram trial was performed (n = 267 older, first-time HA candidates). All SUpport PRogram participants started an HAEP shortly after study enrollment. Decision to purchase an HA by the end of the HAEP was the outcome of interest of the current study. Participants' baseline covariates (22 in total) were included as candidate predictors. Multivariable logistic regression modeling (backward selection and reclassification tables) was used. Results Of all candidate predictors, only pure-tone average (average of 1, 2, and 4 kHz) hearing loss emerged as a significant predictor (odds ratio = 1.03, 95% confidence interval [1.03, 1.17]). Model performance was weak (Nagelkerke R² = .04, area under the curve = 0.61). Conclusions These data suggest that, once HA candidates have decided to enter an HAEP, factors measured early in the help-seeking journey do not predict well who will and will not purchase an HA. Instead, factors that act during the HAEP may hold this predictive value. This should be examined.


2020 ◽  
Vol 29 (4) ◽  
pp. 1944-1955 ◽  
Author(s):  
Maria Schwarz ◽  
Elizabeth C. Ward ◽  
Petrea Cornwell ◽  
Anne Coccetti ◽  
Pamela D'Netto ◽  
...  

Purpose The purpose of this study was to examine (a) the agreement between allied health assistants (AHAs) and speech-language pathologists (SLPs) when completing dysphagia screening for low-risk referrals and at-risk patients under a delegation model and (b) the operational impact of this delegation model. Method All AHAs worked in adult acute inpatient settings across three hospitals and completed training and competency evaluation prior to conducting independent screening. Screening (pass/fail) was based on results from pre-screening exclusionary questions in combination with a water swallow test and the Eating Assessment Tool. To examine the agreement of AHAs' decision making with SLPs, AHAs (n = 7) and SLPs (n = 8) conducted independent, simultaneous dysphagia screenings on 51 adult inpatients classified as low-risk/at-risk referrals. To examine operational impact, AHAs independently completed screening on 48 low-risk/at-risk patients, with a subsequent clinical swallow evaluation conducted by an SLP for patients who failed screening. Results Exact agreement between AHAs and SLPs on overall pass/fail screening criteria for the first 51 patients was 100%. Exact agreement for the two tools was 100% for the Eating Assessment Tool and 96% for the water swallow test. In the operational impact phase (n = 48), 58% of patients failed AHA screening, with only 10% false positives on subjective SLP assessment and no false negatives identified. Conclusion AHAs demonstrated the ability to reliably conduct dysphagia screening on a cohort of low-risk patients, with a low rate of false negatives. The data support a high level of agreement and a positive operational impact of using trained AHAs to perform dysphagia screening in low-risk patients.


1997 ◽  
Vol 40 (4) ◽  
pp. 900-911 ◽  
Author(s):  
Marilyn E. Demorest ◽  
Lynne E. Bernstein

Ninety-six participants with normal hearing and 63 with severe-to-profound hearing impairment viewed 100 CID Sentences (Davis & Silverman, 1970) and 100 B-E Sentences (Bernstein & Eberhardt, 1986b). Objective measures included words correct, phonemes correct, and visual-phonetic distance between the stimulus and response. Subjective ratings were made on a 7-point confidence scale. Magnitude of validity coefficients ranged from .34 to .76 across materials, measures, and groups. Participants with hearing impairment had higher levels of objective performance, higher subjective ratings, and higher validity coefficients, although there were large individual differences. Regression analyses revealed that subjective ratings are predictable from stimulus length, response length, and objective performance. The ability of speechreaders to make valid performance evaluations was interpreted in terms of contemporary word recognition models.


Author(s):  
Charles A. Doan ◽  
Ronaldo Vigo

Abstract. Several empirical investigations have explored whether observers prefer to sort sets of multidimensional stimuli into groups by employing one-dimensional or family-resemblance strategies. Although one-dimensional sorting strategies have been the prevalent finding for these unsupervised classification paradigms, several researchers have provided evidence that the choice of strategy may depend on the particular demands of the task. To account for this disparity, we propose that observers extract relational patterns from stimulus sets that facilitate the development of optimal classification strategies for determining category membership. We conducted a novel constrained categorization experiment to empirically test this hypothesis by instructing participants to either add or remove objects from presented categorical stimuli. We employed generalized representational information theory (GRIT; Vigo, 2011b, 2013a, 2014) and its associated formal models to predict and explain how human beings chose to modify these categorical stimuli. Additionally, we compared model performance to predictions made by a leading prototypicality measure in the literature.


Methodology ◽  
2019 ◽  
Vol 15 (3) ◽  
pp. 97-105
Author(s):  
Rodrigo Ferrer ◽  
Antonio Pardo

Abstract. In a recent paper, Ferrer and Pardo (2014) tested several distribution-based methods designed to assess when test scores obtained before and after an intervention reflect a statistically reliable change. However, we still do not know how these methods perform from the point of view of false negatives. For this purpose, we simulated change scenarios (different effect sizes in a pre-post-test design) with distributions of different shapes and with different sample sizes. For each simulated scenario, we generated 1,000 samples. In each sample, we recorded the false-negative rate of the five distribution-based methods with the best performance from the point of view of false positives. Our results revealed unacceptable rates of false negatives even with effects of very large size, ranging from 31.8% in an optimistic scenario (effect size of 2.0 and a normal distribution) to 99.9% in the worst scenario (effect size of 0.2 and a highly skewed distribution). Therefore, our results suggest that the widely used distribution-based methods must be applied with caution in a clinical context, because they need huge effect sizes to detect a true change. However, we offer some considerations regarding effect sizes and the commonly used cut-off points that allow our estimates to be more precise.
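The mechanism behind these false-negative rates can be seen with a minimal simulation of one classic distribution-based method, the Jacobson–Truax reliable change index (RCI = (post − pre) / S_diff, flagged reliable when |RCI| > 1.96). This is an illustrative sketch with assumed parameter values (SD = 1, reliability = .80), not the authors' simulation code, so its exact rates differ from those reported above:

```python
import random
import math

def rci_false_negative_rate(effect_size, reliability=0.8, n=10_000, seed=0):
    """Simulate pre/post scores with a true change of `effect_size` SDs and
    count how often the Jacobson-Truax RCI fails to flag a reliable change."""
    rng = random.Random(seed)
    sem = math.sqrt(1 - reliability)   # standard error of measurement (SD = 1)
    s_diff = math.sqrt(2) * sem        # standard error of the difference score
    misses = 0
    for _ in range(n):
        pre = rng.gauss(0, 1)
        # observed change = true change + measurement error of the difference
        post = pre + effect_size + rng.gauss(0, s_diff)
        rci = (post - pre) / s_diff
        if abs(rci) <= 1.96:           # change not classified as reliable
            misses += 1
    return misses / n

large = rci_false_negative_rate(effect_size=2.0)  # misses even huge effects
small = rci_false_negative_rate(effect_size=0.2)  # misses most small effects
```

Even under these favourable assumptions (normal scores, good reliability), a nontrivial share of genuinely large changes is missed, and small true changes are missed most of the time, consistent with the pattern the study reports.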

