Metrics for Discrete Student Models: Chance Levels, Comparisons, and Use Cases

2018 ◽  
Vol 5 (2) ◽  
Author(s):  
Nigel Bosch ◽  
Luc Paquette

Metrics including Cohen’s kappa, precision, recall, and F1 are common measures of performance for models of discrete student states, such as a student’s affect or behaviour. This study examined discrete model metrics for previously published student model examples to identify situations where metrics provided differing perspectives on model performance. Simulations were also used to systematically show the effects of imbalanced class distributions, in both data and predictions, on the values of metrics and on the chance levels (values obtained by making random predictions) for those metrics. A chance level for F1 was also established and evaluated. Results for example student models showed that over-prediction of the class of interest (positive class) was relatively common. Chance-level F1 was inflated by over-prediction; conversely, maximum possible values for F1 and kappa were negatively impacted by over-prediction of the positive class. Additionally, normalization methods for F1 relative to chance are discussed and compared to kappa, demonstrating an equivalence between kappa and normalized F1. Finally, implications of the results for the choice of metrics are discussed in the context of common student modelling goals, such as avoiding false negatives for student states that are negatively related to learning.
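The inflation of chance-level F1 by over-prediction can be illustrated with a short simulation (a hypothetical sketch for intuition, not the authors' code): random predictions with base rate p and positive-prediction rate q yield an expected F1 of roughly 2pq/(p + q), which grows as q increases.

```python
import random

def f1_score(y_true, y_pred):
    """Plain-Python F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def chance_f1(base_rate, pred_rate, n=100_000, seed=0):
    """Estimate chance-level F1: labels drawn at `base_rate`,
    predictions made at random with positive rate `pred_rate`."""
    rng = random.Random(seed)
    y_true = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    y_pred = [1 if rng.random() < pred_rate else 0 for _ in range(n)]
    return f1_score(y_true, y_pred)

# With a 10% base rate, over-predicting the positive class inflates chance F1:
balanced = chance_f1(0.10, 0.10)  # predictions match the base rate (~0.10)
inflated = chance_f1(0.10, 0.50)  # heavy over-prediction (~0.17)
```

This matches the paper's observation: a model that over-predicts the positive class competes against a higher chance-level F1, which is why chance-aware normalization (or kappa) matters.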

2020 ◽  
Vol 8 (2) ◽  
pp. 89-93 ◽  
Author(s):  
Hairani Hairani ◽  
Khurniawan Eko Saputro ◽  
Sofiansyah Fadli

The occurrence of an imbalanced class distribution in a dataset causes classification results to tend toward the class with the largest amount of data (the majority class). A sampling method is needed to augment the minority class (positive class) so that the class distribution becomes balanced, leading to better classification results. This study was conducted to overcome the class imbalance problem in the Pima Indians diabetes dataset using k-means-SMOTE. The dataset has 268 instances of the positive class (minority class) and 500 instances of the negative class (majority class). The classification was done by comparing C4.5, SVM, and naïve Bayes while implementing k-means-SMOTE in data sampling. Using k-means-SMOTE, the SVM classification method achieved the highest accuracy and sensitivity of 82% and 77% respectively, while the naïve Bayes method produced the highest specificity of 89%.
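k-means-SMOTE clusters the data with k-means and then applies SMOTE within minority-dense clusters. The interpolation step at SMOTE's core can be sketched in plain Python (a simplified illustration of the idea, not the study's implementation and without the clustering step):

```python
import random
import math

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a point
    and one of its k nearest minority-class neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment between x and nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy 2-feature minority class of 4 points, doubled with 4 synthetic samples:
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=4, k=3)
```

In practice one would use a maintained implementation such as imbalanced-learn's `KMeansSMOTE` rather than hand-rolling this.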


Author(s):  
Jill-Jenn Vie ◽  
Hisashi Kashima

Knowledge tracing is a sequence prediction problem where the goal is to predict the outcomes of students on questions as they interact with a learning platform. By tracking the evolution of a student's knowledge, one can optimize instruction. Existing methods are based either on temporal latent variable models or on factor analysis with temporal features. We show here that factorization machines (FMs), a model for regression or classification, encompass several existing models in the educational literature as special cases, notably the additive factor model, the performance factor model, and multidimensional item response theory. We show, using several real datasets of tens of thousands of users and items, that FMs can estimate student knowledge accurately and quickly even when student data are sparsely observed, and can handle side information such as multiple knowledge components and the number of attempts at the item or skill level. Our approach allows fitting student models of higher dimension than existing models and provides a testbed for trying new combinations of features in order to improve existing models.
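A second-order factorization machine scores a feature vector x as y(x) = w0 + Σᵢ wᵢxᵢ + Σ_{i<j} ⟨vᵢ, vⱼ⟩ xᵢxⱼ. A minimal NumPy sketch of this prediction rule, using Rendle's O(k·n) reformulation of the pairwise term (parameter shapes and the one-hot encoding are illustrative, not the authors' code):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score:
    y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    with the pairwise term computed in O(k*n) as
    0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]."""
    linear = w0 + w @ x
    s = V.T @ x  # length-k vector: sum_i V[i, f] * x_i
    pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise

# Toy example: one-hot encoding of (student, item), as in knowledge tracing.
rng = np.random.default_rng(0)
n_features, k = 6, 3            # e.g. 3 students + 3 items, latent dimension 3
w0, w = 0.1, rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))
x = np.zeros(n_features)
x[[0, 4]] = 1.0                 # student 0 answering item 1
score = fm_predict(x, w0, w, V)
```

With a one-hot (student, item) pair active, the pairwise term reduces to the inner product of the two embeddings, which is how FMs recover models such as multidimensional item response theory as special cases.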


2019 ◽  
Vol 28 (3S) ◽  
pp. 802-805 ◽  
Author(s):  
Marieke Pronk ◽  
Janine F. J. Meijerink ◽  
Sophia E. Kramer ◽  
Martijn W. Heymans ◽  
Jana Besser

Purpose The current study aimed to identify factors that distinguish between older (50+ years) hearing aid (HA) candidates who do and do not purchase HAs after having gone through an HA evaluation period (HAEP). Method Secondary data analysis of the SUpport PRogram trial was performed (n = 267 older, first-time HA candidates). All SUpport PRogram participants started an HAEP shortly after study enrollment. Decision to purchase an HA by the end of the HAEP was the outcome of interest of the current study. Participants' baseline covariates (22 in total) were included as candidate predictors. Multivariable logistic regression modeling (backward selection and reclassification tables) was used. Results Of all candidate predictors, only pure-tone average (average of 1, 2, and 4 kHz) hearing loss emerged as a significant predictor (odds ratio = 1.03, 95% confidence interval [1.03, 1.17]). Model performance was weak (Nagelkerke R² = .04, area under the curve = 0.61). Conclusions These data suggest that, once HA candidates have decided to enter an HAEP, factors measured early in the help-seeking journey do not predict well who will and will not purchase an HA. Instead, factors that act during the HAEP may hold this predictive value. This should be examined.


2020 ◽  
Vol 29 (4) ◽  
pp. 1944-1955 ◽  
Author(s):  
Maria Schwarz ◽  
Elizabeth C. Ward ◽  
Petrea Cornwell ◽  
Anne Coccetti ◽  
Pamela D'Netto ◽  
...  

Purpose The purpose of this study was to examine (a) the agreement between allied health assistants (AHAs) and speech-language pathologists (SLPs) when completing dysphagia screening for low-risk referrals and at-risk patients under a delegation model and (b) the operational impact of this delegation model. Method All AHAs worked in adult acute inpatient settings across three hospitals and completed training and competency evaluation prior to conducting independent screening. Screening (pass/fail) was based on results from pre-screening exclusionary questions in combination with a water swallow test and the Eating Assessment Tool. To examine the agreement of AHAs' decision making with SLPs, AHAs (n = 7) and SLPs (n = 8) conducted independent, simultaneous dysphagia screenings on 51 adult inpatients classified as low-risk/at-risk referrals. To examine operational impact, AHAs independently completed screening on 48 low-risk/at-risk patients, with a subsequent clinical swallow evaluation conducted by an SLP for patients who failed screening. Results Exact agreement between AHAs and SLPs on overall pass/fail screening criteria for the first 51 patients was 100%. Exact agreement for the two tools was 100% for the Eating Assessment Tool and 96% for the water swallow test. In the operational impact phase (n = 48), 58% of patients failed AHA screening, with only 10% false positives on subjective SLP assessment and no false negatives identified. Conclusion AHAs demonstrated the ability to reliably conduct dysphagia screening on a cohort of low-risk patients, with a low rate of false negatives. The data support a high level of agreement and a positive operational impact of using trained AHAs to perform dysphagia screening in low-risk patients.


1997 ◽  
Vol 40 (4) ◽  
pp. 900-911 ◽  
Author(s):  
Marilyn E. Demorest ◽  
Lynne E. Bernstein

Ninety-six participants with normal hearing and 63 with severe-to-profound hearing impairment viewed 100 CID Sentences (Davis & Silverman, 1970) and 100 B-E Sentences (Bernstein & Eberhardt, 1986b). Objective measures included words correct, phonemes correct, and visual-phonetic distance between the stimulus and response. Subjective ratings were made on a 7-point confidence scale. Magnitude of validity coefficients ranged from .34 to .76 across materials, measures, and groups. Participants with hearing impairment had higher levels of objective performance, higher subjective ratings, and higher validity coefficients, although there were large individual differences. Regression analyses revealed that subjective ratings are predictable from stimulus length, response length, and objective performance. The ability of speechreaders to make valid performance evaluations was interpreted in terms of contemporary word recognition models.


Author(s):  
Charles A. Doan ◽  
Ronaldo Vigo

Abstract. Several empirical investigations have explored whether observers prefer to sort sets of multidimensional stimuli into groups by employing one-dimensional or family-resemblance strategies. Although one-dimensional sorting strategies have been the prevalent finding for these unsupervised classification paradigms, several researchers have provided evidence that the choice of strategy may depend on the particular demands of the task. To account for this disparity, we propose that observers extract relational patterns from stimulus sets that facilitate the development of optimal classification strategies for determining category membership. We conducted a novel constrained categorization experiment to empirically test this hypothesis by instructing participants to either add or remove objects from presented categorical stimuli. We employed generalized representational information theory (GRIT; Vigo, 2011b, 2013a, 2014) and its associated formal models to predict and explain how human beings chose to modify these categorical stimuli. Additionally, we compared model performance to predictions made by a leading prototypicality measure in the literature.


Methodology ◽  
2019 ◽  
Vol 15 (3) ◽  
pp. 97-105
Author(s):  
Rodrigo Ferrer ◽  
Antonio Pardo

Abstract. In a recent paper, Ferrer and Pardo (2014) tested several distribution-based methods designed to assess when test scores obtained before and after an intervention reflect a statistically reliable change. However, we still do not know how these methods perform from the point of view of false negatives. For this purpose, we simulated change scenarios (different effect sizes in a pre-post-test design) with distributions of different shapes and with different sample sizes. For each simulated scenario, we generated 1,000 samples. In each sample, we recorded the false-negative rate of the five distribution-based methods with the best performance from the point of view of false positives. Our results revealed unacceptable rates of false negatives even with effects of very large size, ranging from 31.8% in an optimistic scenario (effect size of 2.0 and a normal distribution) to 99.9% in the worst scenario (effect size of 0.2 and a highly skewed distribution). Therefore, our results suggest that the widely used distribution-based methods must be applied with caution in a clinical context, because they need huge effect sizes to detect a true change. However, we offer some considerations regarding effect sizes and the commonly used cut-off points that allow our estimates to be more precise.
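The mechanism behind these false-negative rates can be seen with a minimal simulation of one classic distribution-based method, the Jacobson–Truax reliable change index (RCI = (post − pre) / S_diff, flagged reliable when |RCI| > 1.96). This is an illustrative sketch with assumed parameter values (SD = 1, reliability = .80), not the authors' simulation code, so its exact rates differ from those reported above:

```python
import random
import math

def rci_false_negative_rate(effect_size, reliability=0.8, n=10_000, seed=0):
    """Simulate pre/post scores with a true change of `effect_size` SDs and
    count how often the Jacobson-Truax RCI fails to flag a reliable change."""
    rng = random.Random(seed)
    sem = math.sqrt(1 - reliability)   # standard error of measurement (SD = 1)
    s_diff = math.sqrt(2) * sem        # standard error of the difference score
    misses = 0
    for _ in range(n):
        pre = rng.gauss(0, 1)
        # observed change = true change + measurement error of the difference
        post = pre + effect_size + rng.gauss(0, s_diff)
        rci = (post - pre) / s_diff
        if abs(rci) <= 1.96:           # change not classified as reliable
            misses += 1
    return misses / n

large = rci_false_negative_rate(effect_size=2.0)  # misses even huge effects
small = rci_false_negative_rate(effect_size=0.2)  # misses most small effects
```

Even under these favourable assumptions (normal scores, good reliability), a nontrivial share of genuinely large changes is missed, and small true changes are missed most of the time, consistent with the pattern the study reports.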

