On Developing Generic Models for Predicting Student Outcomes in Educational Data Mining

2022 ◽  
Vol 6 (1) ◽  
pp. 6
Author(s):  
Gomathy Ramaswami ◽  
Teo Susnjak ◽  
Anuradha Mathrani

Poor academic performance of students is a concern in the educational sector, especially if it leads to students being unable to meet minimum course requirements. With timely prediction of students’ performance, educators can detect at-risk students, thereby enabling early interventions that support these students in overcoming their learning difficulties. However, the majority of studies have developed individual prediction models that each target a single course. These models are tailored to specific attributes of each course amongst a very diverse set of possibilities. While this approach can yield accurate models in some instances, it has limitations. In many cases, overfitting can take place when course data is small or when new courses are devised. Additionally, maintaining a large suite of models per course is a significant overhead. This issue can be tackled by developing a generic and course-agnostic predictive model that captures more abstract patterns and is able to operate across all courses, irrespective of their differences. This study demonstrates how a generic predictive model can be developed that identifies at-risk students across a wide variety of courses. Experiments were conducted using a range of algorithms, with the generic model producing effective accuracy. The findings showed that the CatBoost algorithm performed the best on our dataset across the F-measure, ROC (receiver operating characteristic) curve and AUC scores; therefore, it is an excellent candidate algorithm for providing solutions in this domain given its capability to seamlessly handle categorical and missing data, which are frequently features of educational datasets.
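
As an illustration of the two evaluation measures named in this abstract, the following is a minimal sketch of how an F-measure and an AUC can be computed for a binary at-risk classifier. It is not the authors' evaluation code; the function names, the label encoding (1 = at-risk), and the toy scores are assumptions made for this example only.

```python
def f_measure(y_true, y_pred):
    """F1 score for binary labels, where 1 marks the positive (at-risk) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(f_measure(labels, preds), auc(labels, scores))
```

The pairwise form of the AUC is quadratic in the number of samples, which is fine for a sketch but would normally be replaced by a rank-based computation on larger datasets.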

2020 ◽  
Vol 10 (11) ◽  
pp. 3998 ◽  
Author(s):  
Emanuel Marques Queiroga ◽  
João Ladislau Lopes ◽  
Kristofer Kappel ◽  
Marilton Aguiar ◽  
Ricardo Matsumura Araújo ◽  
...  

Contemporary education is a vast field that is concerned with the performance of education systems. In a formal e-learning context, student dropout is considered one of the main problems and has received much attention from the learning analytics research community, which has reported several approaches to the development of models for the early prediction of at-risk students. However, maximizing the results obtained by predictions is a considerable challenge. In this work, we developed a solution using only students’ interactions with the virtual learning environment and its derivative features for the early prediction of at-risk students in a Brazilian distance technical high school course that is 103 weeks in duration. To maximize results, we developed an elitist genetic algorithm based on Darwin’s theory of natural selection for hyperparameter tuning. With the application of the proposed technique, we predicted students at risk with an Area Under the Receiver Operating Characteristic Curve (AUROC) above 0.75 in the initial weeks of a course. The results demonstrate the viability of applying interaction counts and derivative features to generate prediction models in contexts where access to demographic data is restricted. The application of a genetic algorithm to the tuning of classifier hyperparameters can increase classifier performance in comparison with other techniques.
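
The general idea of an elitist genetic algorithm for hyperparameter tuning can be sketched as follows. This is not the authors' implementation: the search space, the operators (uniform crossover, resampling mutation, truncation selection), and `toy_fitness`, which stands in for a cross-validated AUROC, are all hypothetical choices made for illustration.

```python
import random

# Hypothetical search space: parameter names and ranges are illustrative only.
SPACE = {"max_depth": (2, 12), "n_estimators": (10, 300)}


def toy_fitness(ind):
    """Stand-in for a cross-validated AUROC; peaks at max_depth=6, n_estimators=150."""
    return -((ind["max_depth"] - 6) ** 2) - ((ind["n_estimators"] - 150) / 10) ** 2


def random_individual(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}


def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one of the two parents.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def mutate(ind, rng, rate=0.2):
    # With probability `rate`, resample a gene uniformly within its range.
    return {k: (rng.randint(*SPACE[k]) if rng.random() < rate else v) for k, v in ind.items()}


def elitist_ga(fitness, pop_size=20, generations=30, n_elite=2, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        elite = pop[:n_elite]           # elitism: the best individuals survive unchanged
        parents = pop[: pop_size // 2]  # truncation selection: top half may reproduce
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - n_elite)
        ]
        pop = elite + children
    pop.sort(key=fitness, reverse=True)
    history.append(fitness(pop[0]))
    return pop[0], history


best, history = elitist_ga(toy_fitness)
print(best, history[-1])
```

Because the elite are copied into the next generation unmodified, the best fitness in `history` can never decrease, which is the property that distinguishes an elitist scheme from a plain generational one. In a real setting, `toy_fitness` would be replaced by a function that trains and scores a classifier, which is far more expensive per evaluation.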


Author(s):  
Victor Alfonso Rodriguez ◽  
Shreyas Bhave ◽  
Ruijun Chen ◽  
Chao Pang ◽  
George Hripcsak ◽  
...  

Abstract Objective Coronavirus disease 2019 (COVID-19) patients are at risk for resource-intensive outcomes including mechanical ventilation (MV), renal replacement therapy (RRT), and readmission. Accurate outcome prognostication could facilitate hospital resource allocation. We develop and validate predictive models for each outcome using retrospective electronic health record data for COVID-19 patients treated between March 2 and May 6, 2020. Materials and Methods For each outcome, we trained 3 classes of prediction models using clinical data for a cohort of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2)–positive patients (n = 2256). Cross-validation was used to select the best-performing models per the areas under the receiver-operating characteristic and precision-recall curves. Models were validated using a held-out cohort (n = 855). We measured each model’s calibration and evaluated feature importances to interpret model output. Results The predictive performance for our selected models on the held-out cohort was as follows: area under the receiver-operating characteristic curve—MV 0.743 (95% CI, 0.682-0.812), RRT 0.847 (95% CI, 0.772-0.936), readmission 0.871 (95% CI, 0.830-0.917); area under the precision-recall curve—MV 0.137 (95% CI, 0.047-0.175), RRT 0.325 (95% CI, 0.117-0.497), readmission 0.504 (95% CI, 0.388-0.604). Predictions were well calibrated, and the most important features within each model were consistent with clinical intuition. Discussion Our models produce performant, well-calibrated, and interpretable predictions for COVID-19 patients at risk for the target outcomes. They demonstrate the potential to accurately estimate outcome prognosis in resource-constrained care sites managing COVID-19 patients. Conclusions We develop and validate prognostic models targeting MV, RRT, and readmission for hospitalized COVID-19 patients which produce accurate, interpretable predictions. 
Additional external validation studies are needed to further verify the generalizability of our results.


2020 ◽  
pp. 105477382098527
Author(s):  
Jane Flanagan ◽  
Marie Boltz ◽  
Ming Ji

We aimed to build a predictive model with intrinsic factors measured upon admission to skilled nursing facilities (SNFs) post-acute care (PAC) to identify older adults transferred from SNFs to long-term care (LTC) instead of home. We analyzed data from Massachusetts on 23,662 persons admitted to SNFs from PAC in 2013. Explanatory logistic regression analysis identified single “intrinsic predictors” related to LTC placement. To assess overfitting, the logistic regression predictive model was cross-validated and evaluated by its receiver operating characteristic (ROC) curve. A 12-variable predictive model with “intrinsic predictors” demonstrated both high in-sample and out-of-sample predictive accuracy, as measured by the ROC curve and the area under the ROC curve, among patients at risk of LTC placement. This predictive model may be used for early identification of patients at risk for LTC after hospitalization in order to support targeted rehabilitative approaches and resource planning.


2020 ◽  
Author(s):  
Jun Ke ◽  
Yiwei Chen ◽  
Xiaoping Wang ◽  
Zhiyong Wu ◽  
Qiongyao Zhang ◽  
...  

Abstract Background The purpose of this study is to identify the risk factors of in-hospital mortality in patients with acute coronary syndrome (ACS) and to evaluate the performance of traditional regression and machine learning prediction models. Methods The data of ACS patients who entered the emergency department of Fujian Provincial Hospital from January 1, 2017 to March 31, 2020 for chest pain were retrospectively collected. The study used univariate and multivariate logistic regression analysis to identify risk factors for in-hospital mortality of ACS patients. The traditional regression and machine learning algorithms were used to develop predictive models, and the sensitivity, specificity, and receiver operating characteristic curve were used to evaluate the performance of each model. Results A total of 7810 ACS patients were included in the study, and the in-hospital mortality rate was 1.75%. Multivariate logistic regression analysis found that age and levels of D-dimer, cardiac troponin I, N-terminal pro-B-type natriuretic peptide (NT-proBNP), lactate dehydrogenase (LDH), high-density lipoprotein (HDL) cholesterol, and calcium channel blockers were independent predictors of in-hospital mortality. The study found that the areas under the receiver operating characteristic curve of the models developed by logistic regression, gradient boosting decision tree (GBDT), random forest, and support vector machine (SVM) for predicting the risk of in-hospital mortality were 0.963, 0.960, 0.963, and 0.959, respectively. Feature importance evaluation found that NT-proBNP, LDH, and HDL cholesterol were the top three variables contributing the most to the prediction performance of the GBDT model and random forest model. Conclusions The predictive models developed using logistic regression, GBDT, random forest, and SVM algorithms can be used to predict the risk of in-hospital death of ACS patients. 
Based on our findings, we recommend that clinicians focus on monitoring the changes of NT-proBNP, LDH, and HDL cholesterol, as this may improve the clinical outcomes of ACS patients.


2019 ◽  
Author(s):  
Karen-Inge Karstoft ◽  
Ioannis Tsamardinos ◽  
Kasper Eskelund ◽  
Søren Bo Andersen ◽  
Lars Ravnborg Nissen

BACKGROUND Posttraumatic stress disorder (PTSD) is a relatively common consequence of deployment to war zones. Early postdeployment screening with the aim of identifying those at risk for PTSD in the years following deployment will help deliver interventions to those in need but has so far proved unsuccessful. OBJECTIVE This study aimed to test the applicability of automated model selection and the ability of automated machine learning prediction models to transfer across cohorts and predict screening-level PTSD 2.5 years and 6.5 years after deployment. METHODS Automated machine learning was applied to data routinely collected 6-8 months after return from deployment from 3 different cohorts of Danish soldiers deployed to Afghanistan in 2009 (cohort 1, N=287 or N=261 depending on the timing of the outcome assessment), 2010 (cohort 2, N=352), and 2013 (cohort 3, N=232). RESULTS Models transferred well between cohorts. For screening-level PTSD 2.5 and 6.5 years after deployment, random forest models provided the highest accuracy as measured by area under the receiver operating characteristic curve (AUC): 2.5 years, AUC=0.77, 95% CI 0.71-0.83; 6.5 years, AUC=0.78, 95% CI 0.73-0.83. Linear models performed equally well. Military rank, hyperarousal symptoms, and total level of PTSD symptoms were highly predictive. CONCLUSIONS Automated machine learning provided validated models that can be readily implemented in future deployment cohorts in the Danish Defense with the aim of targeting postdeployment support interventions to those at highest risk for developing PTSD, provided the cohorts are deployed on similar missions.


2019 ◽  
Vol 112 (3) ◽  
pp. 256-265 ◽  
Author(s):  
Yan Chen ◽  
Eric J Chow ◽  
Kevin C Oeffinger ◽  
William L Border ◽  
Wendy M Leisenring ◽  
...  

Abstract Background Childhood cancer survivors have an increased risk of heart failure, ischemic heart disease, and stroke. They may benefit from prediction models that account for cardiotoxic cancer treatment exposures combined with information on traditional cardiovascular risk factors such as hypertension, dyslipidemia, and diabetes. Methods Childhood Cancer Survivor Study participants (n = 22 643) were followed through age 50 years for incident heart failure, ischemic heart disease, and stroke. Siblings (n = 5056) served as a comparator. Participants were assessed longitudinally for hypertension, dyslipidemia, and diabetes based on self-reported prescription medication use. Half the cohort was used for discovery; the remainder for replication. Models for each outcome were created for survivors ages 20, 25, 30, and 35 years at the time of prediction (n = 12 models). Results For discovery, risk scores based on demographic, cancer treatment, hypertension, dyslipidemia, and diabetes information achieved areas under the receiver operating characteristic curve and concordance statistics 0.70 or greater in 9 and 10 of the 12 models, respectively. For replication, achieved areas under the receiver operating characteristic curve and concordance statistics 0.70 or greater were observed in 7 and 9 of the models, respectively. Across outcomes, the most influential exposures were anthracycline chemotherapy, radiotherapy, diabetes, and hypertension. Survivors were then assigned to statistically distinct risk groups corresponding to cumulative incidences at age 50 years of each target outcome of less than 3% (moderate-risk) or approximately 10% or greater (high-risk). Cumulative incidence of all outcomes was 1% or less among siblings. Conclusions Traditional cardiovascular risk factors remain important for predicting risk of cardiovascular disease among adult-age survivors of childhood cancer. 
These prediction models provide a framework on which to base future surveillance strategies and interventions.


Stroke ◽  
2021 ◽  
Vol 52 (1) ◽  
pp. 325-330
Author(s):  
Benjamin Hotter ◽  
Sarah Hoffmann ◽  
Lena Ulm ◽  
Christian Meisel ◽  
Alejandro Bustamante ◽  
...  

Background and Purpose: Several clinical scoring systems as well as biomarkers have been proposed to predict stroke-associated pneumonia (SAP). We aimed to externally and competitively validate SAP scores and hypothesized that 5 selected biomarkers would improve performance of these scores. Methods: We pooled the clinical data of 2 acute stroke studies with identical data assessment: STRAWINSKI and PREDICT. Biomarkers (ultrasensitive procalcitonin; mid-regional proadrenomedullin; mid-regional proatrionatriuretic peptide; ultrasensitive copeptin; C-terminal proendothelin) were measured from hospital admission serum samples. A literature search was performed to identify SAP prediction scores. We then calculated multivariate regression models with the individual scores and the biomarkers. Areas under receiver operating characteristic curves were used to compare discrimination of these scores and models. Results: The combined cohort consisted of 683 cases, of which 573 had available backup samples to perform the biomarker analysis. Literature search identified 9 SAP prediction scores. Our data set enabled us to calculate 5 of these scores. The scores had area under receiver operating characteristic curve of 0.543 to 0.651 for physician determined SAP, 0.574 to 0.685 for probable and 0.689 to 0.811 for definite SAP according to Pneumonia in Stroke Consensus group criteria. Multivariate models of the scores with biomarkers improved virtually all predictions, but mostly in the range of an area under receiver operating characteristic curve delta of 0.05. Conclusions: All SAP prediction scores identified patients who would develop SAP with fair to strong capabilities, with better discrimination when stricter criteria for SAP diagnosis were applied. The selected biomarkers provided only limited added predictive value, currently not warranting addition of these markers to prediction models. Registration: URL: https://www.clinicaltrials.gov . 
Unique identifier: NCT01264549 and NCT01079728.


2020 ◽  
Vol 10 (13) ◽  
pp. 4427 ◽  
Author(s):  
David Bañeres ◽  
M. Elena Rodríguez ◽  
Ana Elena Guerrero-Roldán ◽  
Abdulkadir Karadeniz

Artificial intelligence has impacted education in recent years. Datafication of education has allowed the development of automated methods that detect patterns in extensive collections of educational data to estimate unknown information and behavior about the students. This research has focused on finding accurate predictive models to identify at-risk students. Meeting this challenge may reduce the students’ risk of failure or disengagement by decreasing the time lag between identification and the real at-risk state. The contribution of this paper is threefold. First, an in-depth analysis of a predictive model to detect at-risk students is performed. This model has been tested using data available in an institutional data mart where curated data from six semesters are available, and a method to obtain the best classifier and training set is proposed. Second, a method to determine a threshold for evaluating the quality of the predictive model is established. Third, an early warning system has been developed and tested in a real educational setting, proving accurate and useful for its purpose of detecting at-risk students in online higher education. The stakeholders (i.e., students and teachers) can analyze the information through different dashboards, and teachers can also send early feedback as an intervention mechanism to mitigate at-risk situations. The system has been evaluated on two undergraduate courses, where results showed high accuracy in correctly detecting at-risk students.


2016 ◽  
Vol 101 (10) ◽  
pp. 3747-3754 ◽  
Author(s):  
Antonio León-Justel ◽  
Ainara Madrazo-Atutxa ◽  
Ana I. Alvarez-Rios ◽  
Rocio Infantes-Fontán ◽  
Juan A. Garcia-Arnés ◽  
...  

Context: Cushing’s syndrome (CS) is challenging to diagnose. Increased prevalence of CS in specific patient populations has been reported, but routine screening for CS remains questionable. To decrease the diagnostic delay and improve disease outcomes, simple new screening methods for CS in at-risk populations are needed. Objective: To develop and validate a simple scoring system to predict CS based on clinical signs and an easy-to-use biochemical test. Design: Observational, prospective, multicenter. Setting: Referral hospital. Patients: A cohort of 353 patients attending endocrinology units for outpatient visits. Interventions: All patients were evaluated with late-night salivary cortisol (LNSC) and a low-dose dexamethasone suppression test for CS. Main Outcome Measures: Diagnosis or exclusion of CS. Results: Twenty-six cases of CS were diagnosed in the cohort. A risk scoring system was developed by logistic regression analysis, and cutoff values were derived from a receiver operating characteristic curve. This risk score included clinical signs and symptoms (muscular atrophy, osteoporosis, and dorsocervical fat pad) and LNSC levels. The estimated area under the receiver operating characteristic curve was 0.93, with a sensitivity of 96.2% and specificity of 82.9%. Conclusions: We developed a risk score to predict CS in an at-risk population. This score may help to identify at-risk patients in non-endocrinological settings such as primary care, but external validation is warranted.
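
To make the step of deriving a cutoff from an ROC curve concrete, here is a minimal sketch that scans candidate cutoffs on a risk score and reports sensitivity and specificity at the chosen one. The abstract does not state which criterion the authors used; maximizing Youden's J (sensitivity + specificity - 1) is an assumption made here, and the function names and toy scores are hypothetical.

```python
def sens_spec(y_true, scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive.

    Assumes both classes are present in y_true; otherwise a division by zero occurs.
    """
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= cutoff)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < cutoff)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < cutoff)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(y_true, scores):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    Each distinct observed score is tried as a cutoff, which corresponds to
    visiting every point on the empirical ROC curve. Ties keep the lowest cutoff.
    """
    best = None
    for c in sorted(set(scores)):
        sens, spec = sens_spec(y_true, scores, c)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]


# Made-up toy risk scores, for illustration only (1 = disease present).
labels = [0, 0, 0, 1, 1]
scores = [1, 2, 3, 4, 5]
print(best_cutoff(labels, scores))
```

Other criteria, such as fixing a minimum sensitivity for a screening test and then maximizing specificity, would select a different point on the same curve.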


2016 ◽  
Vol 34 (20) ◽  
pp. 2366-2371 ◽  
Author(s):  
Arti Hurria ◽  
Supriya Mohile ◽  
Ajeet Gajra ◽  
Heidi Klepin ◽  
Hyman Muss ◽  
...  

Purpose Older adults are at increased risk for chemotherapy toxicity, and standard oncology assessment measures cannot identify those at risk. A predictive model for chemotherapy toxicity was developed (N = 500) that consisted of geriatric assessment questions and other clinical variables. This study aims to externally validate this model in an independent cohort (N = 250). Patients and Methods Patients age ≥ 65 years with a solid tumor, fluent in English, and who were scheduled to receive a new chemotherapy regimen were recruited from eight institutions. Risk of chemotherapy toxicity was calculated (low, medium, or high risk) on the basis of the prediction model before the start of chemotherapy. Chemotherapy-related toxicity was captured (grade 3 [hospitalization indicated], grade 4 [life threatening], and grade 5 [treatment-related death]). Validation of the prediction model was performed by calculating the area under the receiver-operating characteristic curve. Results The study sample (N = 250) had a mean age of 73 years (range, 65 to 94 [standard deviation, 5.8]). More than one half of patients (58%) experienced grade ≥ 3 toxicity. Risk of toxicity increased with increasing risk score (36.7% low, 62.4% medium, 70.2% high risk; P < .001). The area under the curve of the receiver-operating characteristic curve was 0.65 (95% CI, 0.58 to 0.71), which was not statistically different from the development cohort (0.72; 95% CI, 0.68 to 0.77; P = .09). There was no association between Karnofsky Performance Status and chemotherapy toxicity (P = .25). Conclusion This study externally validated a chemotherapy toxicity predictive model for older adults with cancer. This predictive model should be considered when discussing the risks and benefits of chemotherapy with older adults.

