Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities

Student dropout, defined as the abandonment of a high education program before obtaining the degree without reincorporation, is a problem that affects every higher education institution in the world. This study uses machine learning models over two Chilean universities to predict first-year engineering student dropout over enrolled students, and to analyze the variables that affect the probability of dropout. The results show that instead of combining the datasets into a single dataset, it is better to apply a model per university. Moreover, among the eight machine learning models tested over the datasets, gradient-boosting decision trees reports the best model. Further analyses of the interpretative models show that a higher score in almost any entrance university test decreases the probability of dropout, the most important variable being the mathematical test. One exception is the language test, where a higher score increases the probability of dropout.

Download Full-text

Prognostic Machine Learning Models for First-Year Mortality in Incident Hemodialysis Patients: Development and Validation Study (Preprint)

10.2196/preprints.20578 ◽

2020 ◽

Author(s):

Kaixiang Sheng ◽

Ping Zhang ◽

Xi Yao ◽

Jiawei Li ◽

Yongchun He ◽

...

Keyword(s):

Machine Learning ◽

High Risk ◽

Dialysis Initiation ◽

Gradient Boosting ◽

First Year ◽

Learning Models ◽

High Risk Patients ◽

Extreme Gradient Boosting ◽

Risk Patients ◽

Machine Learning Models

BACKGROUND The first-year survival rate among patients undergoing hemodialysis remains poor. Current mortality risk scores for patients undergoing hemodialysis employ regression techniques and have limited applicability and robustness. OBJECTIVE We aimed to develop a machine learning model utilizing clinical factors to predict first-year mortality in patients undergoing hemodialysis that could assist physicians in classifying high-risk patients. METHODS Training and testing cohorts consisted of 5351 patients from a single center and 5828 patients from 97 renal centers undergoing hemodialysis (incident only). The outcome was all-cause mortality during the first year of dialysis. Extreme gradient boosting was used for algorithm training and validation. Two models were established based on the data obtained at dialysis initiation (model 1) and data 0-3 months after dialysis initiation (model 2), and 10-fold cross-validation was applied to each model. The area under the curve (AUC), sensitivity (recall), specificity, precision, balanced accuracy, and F1 score were used to assess the predictive ability of the models. RESULTS In the training and testing cohorts, 585 (10.93%) and 764 (13.11%) patients, respectively, died during the first-year follow-up. Of 42 candidate features, the 15 most important features were selected. The performance of model 1 (AUC 0.83, 95% CI 0.78-0.84) was similar to that of model 2 (AUC 0.85, 95% CI 0.81-0.86). CONCLUSIONS We developed and validated 2 machine learning models to predict first-year mortality in patients undergoing hemodialysis. Both models could be used to stratify high-risk patients at the early stages of dialysis.

Download Full-text

Prognostic Machine Learning Models for First-Year Mortality in Incident Hemodialysis Patients: Development and Validation Study

JMIR Medical Informatics ◽

10.2196/20578 ◽

2020 ◽

Vol 8 (10) ◽

pp. e20578

Author(s):

Kaixiang Sheng ◽

Ping Zhang ◽

Xi Yao ◽

Jiawei Li ◽

Yongchun He ◽

...

Keyword(s):

Machine Learning ◽

High Risk ◽

Dialysis Initiation ◽

Gradient Boosting ◽

First Year ◽

Learning Models ◽

High Risk Patients ◽

Extreme Gradient Boosting ◽

Risk Patients ◽

Machine Learning Models

Background The first-year survival rate among patients undergoing hemodialysis remains poor. Current mortality risk scores for patients undergoing hemodialysis employ regression techniques and have limited applicability and robustness. Objective We aimed to develop a machine learning model utilizing clinical factors to predict first-year mortality in patients undergoing hemodialysis that could assist physicians in classifying high-risk patients. Methods Training and testing cohorts consisted of 5351 patients from a single center and 5828 patients from 97 renal centers undergoing hemodialysis (incident only). The outcome was all-cause mortality during the first year of dialysis. Extreme gradient boosting was used for algorithm training and validation. Two models were established based on the data obtained at dialysis initiation (model 1) and data 0-3 months after dialysis initiation (model 2), and 10-fold cross-validation was applied to each model. The area under the curve (AUC), sensitivity (recall), specificity, precision, balanced accuracy, and F1 score were used to assess the predictive ability of the models. Results In the training and testing cohorts, 585 (10.93%) and 764 (13.11%) patients, respectively, died during the first-year follow-up. Of 42 candidate features, the 15 most important features were selected. The performance of model 1 (AUC 0.83, 95% CI 0.78-0.84) was similar to that of model 2 (AUC 0.85, 95% CI 0.81-0.86). Conclusions We developed and validated 2 machine learning models to predict first-year mortality in patients undergoing hemodialysis. Both models could be used to stratify high-risk patients at the early stages of dialysis.

Download Full-text

Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-01925-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Moojung Kim ◽

Young Jae Kim ◽

Sung Jin Park ◽

Kwang Gi Kim ◽

Pyung Chun Oh ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Influenza Vaccination ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Age Group ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

Abstract Background Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination Methods Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions The machine leaning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.

Download Full-text

Predicting Electric Vehicle Charging Station Availability Using Ensemble Machine Learning

Energies ◽

10.3390/en14237834 ◽

2021 ◽

Vol 14 (23) ◽

pp. 7834

Author(s):

Christopher Hecht ◽

Jan Figgener ◽

Dirk Uwe Sauer

Keyword(s):

Machine Learning ◽

Binary Data ◽

Training Data ◽

Gradient Boosting ◽

Traffic Density ◽

Learning Models ◽

Charging Infrastructure ◽

Ensemble Models ◽

Charging Station ◽

Machine Learning Models

Electric vehicles may reduce greenhouse gas emissions from individual mobility. Due to the long charging times, accurate planning is necessary, for which the availability of charging infrastructure must be known. In this paper, we show how the occupation status of charging infrastructure can be predicted for the next day using machine learning models— Gradient Boosting Classifier and Random Forest Classifier. Since both are ensemble models, binary training data (occupied vs. available) can be used to provide a certainty measure for predictions. The prediction may be used to adapt prices in a high-load scenario, predict grid stress, or forecast available power for smart or bidirectional charging. The models were chosen based on an evaluation of 13 different, typically used machine learning models. We show that it is necessary to know past charging station usage in order to predict future usage. Other features such as traffic density or weather have a limited effect. We show that a Gradient Boosting Classifier achieves 94.8% accuracy and a Matthews correlation coefficient of 0.838, making ensemble models a suitable tool. We further demonstrate how a model trained on binary data can perform non-binary predictions to give predictions in the categories “low likelihood” to “high likelihood”.

Download Full-text

Estimation of Chlorophyll-a Concentrations in Small Water Bodies: Comparison of Fused Gaofen-6 and Sentinel-2 Sensors

Remote Sensing ◽

10.3390/rs14010229 ◽

2022 ◽

Vol 14 (1) ◽

pp. 229

Author(s):

Jiarui Shi ◽

Qian Shen ◽

Yue Yao ◽

Junsheng Li ◽

Fu Chen ◽

...

Keyword(s):

Machine Learning ◽

Chlorophyll A ◽

Water Bodies ◽

Gradient Boosting ◽

Learning Models ◽

Small Water ◽

Extreme Gradient Boosting ◽

Machine Learning Models ◽

Sentinel 2 ◽

Small Water Bodies

Chlorophyll-a concentrations in water bodies are one of the most important environmental evaluation indicators in monitoring the water environment. Small water bodies include headwater streams, springs, ditches, flushes, small lakes, and ponds, which represent important freshwater resources. However, the relatively narrow and fragmented nature of small water bodies makes it difficult to monitor chlorophyll-a via medium-resolution remote sensing. In the present study, we first fused Gaofen-6 (a new Chinese satellite) images to obtain 2 m resolution images with 8 bands, which was approved as a good data source for Chlorophyll-a monitoring in small water bodies as Sentinel-2. Further, we compared five semi-empirical and four machine learning models to estimate chlorophyll-a concentrations via simulated reflectance using fused Gaofen-6 and Sentinel-2 spectral response function. The results showed that the extreme gradient boosting tree model (one of the machine learning models) is the most accurate. The mean relative error (MRE) was 9.03%, and the root-mean-square error (RMSE) was 4.5 mg/m3 for the Sentinel-2 sensor, while for the fused Gaofen-6 image, MRE was 6.73%, and RMSE was 3.26 mg/m3. Thus, both fused Gaofen-6 and Sentinel-2 could estimate the chlorophyll-a concentrations in small water bodies. Since the fused Gaofen-6 exhibited a higher spatial resolution and Sentinel-2 exhibited a higher temporal resolution.

Download Full-text

Benchmarking of Machine Learning Models to Assist the Prognosis of Tuberculosis

10.20944/preprints202103.0284.v2 ◽

2021 ◽

Author(s):

Maicon Herverton Lino Ferreira da Silva Barros ◽

Geovanne Oliveira Alves ◽

Lubnnia Morais Florêncio Souza ◽

Élisson da Silva Rocha ◽

João Fausto Lorenzato de Oliveira ◽

...

Keyword(s):

Machine Learning ◽

Clinical Symptoms ◽

Treatment Decision ◽

Gradient Boosting ◽

Original Form ◽

Learning Models ◽

Data Set ◽

Risk Of Death ◽

Increased Risk ◽

Machine Learning Models

Tuberculosis (TB) is an airborne infectious disease caused by organisms in the Mycobacterium tuberculosis (Mtb) complex. In many low and middle-income countries, TB remains a major cause of morbidity and mortality. Once a patient has been diagnosed with TB, it is critical that healthcare workers make the most appropriate treatment decision given the individual conditions of the patient and the likely course of the disease based on medical experience. Depending on the prognosis, delayed or inappropriate treatment can result in unsatisfactory results including the exacerbation of clinical symptoms, poor quality of life, and increased risk of death. This work benchmarks machine learning models to aid TB prognosis using a Brazilian health database of confirmed cases and deaths related to TB in the State of Amazonas. The goal is to predict the probability of death by TB thus aiding the prognosis of TB and associated treatment decision making process. In its original form, the data set comprised 36,228 records and 130 fields but suffered from missing, incomplete, or incorrect data. Following data cleaning and preprocessing, a revised data set was generated comprising 24,015 records and 38 fields, including 22,876 reported cured TB patients and 1,139 deaths by TB. To explore how the data imbalance impacts model performance, two controlled experiments were designed using (1) imbalanced and (2) balanced data sets. The best result is achieved by the Gradient Boosting (GB) model using the balanced data set to predict TB-mortality, and the ensemble model composed by the Random Forest (RF), GB and Multi-layer Perceptron (MLP) models is the best model to predict the cure class.

Download Full-text

Learning to Identify At-Risk Students in Distance Education Using Interaction Counts

Revista de Informática Teórica e Aplicada ◽

10.22456/2175-2745.62211 ◽

2016 ◽

Vol 23 (2) ◽

pp. 124 ◽

Cited By ~ 2

Author(s):

Douglas Detoni ◽

Cristian Cechinel ◽

Ricardo Araujo Matsumura ◽

Daniela Francisco Brauner

Keyword(s):

Machine Learning ◽

At Risk ◽

At Risk Students ◽

Drop Out ◽

Support Vector ◽

Learning Models ◽

Data Set ◽

Student Dropout ◽

Vector Machines ◽

Machine Learning Models

Student dropout is one of the main problems faced by distance learning courses. One of the major challenges for researchers is to develop methods to predict the behavior of students so that teachers and tutors are able to identify at-risk students as early as possible and provide assistance before they drop out or fail in their courses. Machine Learning models have been used to predict or classify students in these settings. However, while these models have shown promising results in several settings, they usually attain these results using attributes that are not immediately transferable to other courses or platforms. In this paper, we provide a methodology to classify students using only interaction counts from each student. We evaluate this methodology on a data set from two majors based on the Moodle platform. We run experiments consisting of training and evaluating three machine learning models (Support Vector Machines, Naive Bayes and Adaboost decision trees) under different scenarios. We provide evidences that patterns from interaction counts can provide useful information for classifying at-risk students. This classification allows the customization of the activities presented to at-risk students (automatically or through tutors) as an attempt to avoid students drop out.

Download Full-text

Active learning for the power factor prediction in diamond-like thermoelectric materials

npj Computational Materials ◽

10.1038/s41524-020-00439-8 ◽

2020 ◽

Vol 6 (1) ◽

Author(s):

Ye Sheng ◽

Yasong Wu ◽

Jiong Yang ◽

Wencong Lu ◽

Pierre Villars ◽

...

Keyword(s):

Machine Learning ◽

Active Learning ◽

Transport Properties ◽

Gradient Boosting ◽

Learning Models ◽

Material Development ◽

Materials Genome ◽

P Type ◽

Type Power ◽

Machine Learning Models

Abstract The Materials Genome Initiative requires the crossing of material calculations, machine learning, and experiments to accelerate the material development process. In recent years, data-based methods have been applied to the thermoelectric field, mostly on the transport properties. In this work, we combined data-driven machine learning and first-principles automated calculations into an active learning loop, in order to predict the p-type power factors (PFs) of diamond-like pnictides and chalcogenides. Our active learning loop contains two procedures (1) based on a high-throughput theoretical database, machine learning methods are employed to select potential candidates and (2) computational verification is applied to these candidates about their transport properties. The verification data will be added into the database to improve the extrapolation abilities of the machine learning models. Different strategies of selecting candidates have been tested, finally the Gradient Boosting Regression model of Query by Committee strategy has the highest extrapolation accuracy (the Pearson R = 0.95 on untrained systems). Based on the prediction from the machine learning models, binary pnictides, vacancy, and small atom-containing chalcogenides are predicted to have large PFs. The bonding analysis reveals that the alterations of anionic bonding networks due to small atoms are beneficial to the PFs in these compounds.

Download Full-text

Time-domain Feature and Ensemble Model based Classification of EMG Signals for Hand Gesture Recognition

10.21203/rs.3.rs-605286/v1 ◽

2021 ◽

Author(s):

Debarati Bhattacharjee ◽

Munesh Singh

Keyword(s):

Machine Learning ◽

Time Domain ◽

Electrical Current ◽

Clinical Diagnostics ◽

Gradient Boosting ◽

Learning Models ◽

Absolute Value ◽

Feature Pair ◽

Emg Signal ◽

Machine Learning Models

Abstract The electromyography (EMG) signal is the electrical current generated in muscles due to the inter-change of ions during their contractions. It has many applications in clinical diagnostics and the biomedical field. This paper has experimented with various ensemble algorithms and time-domain features to classify eight types of hand gestures. To train and test the machine learning models, we have extracted eight types of time-domain features from the raw EMG signals, such as integrated EMG (IEMG), variance, mean absolute value (MAV), modified mean absolute value type 1, waveform length, root mean square, average amplitude change, and difference absolute standard deviation value. The ensemble machine learning models are based on stacking, bagging, and gradient boosting. We have used four different-sized training sets to evaluate the performance of these classifiers. From the performance evaluation, we have identified the XG-Boost (gblinear) classifier with the IEMG feature as the best classifier-feature pair. The proposed classifier-feature pair has given better performance with a classification accuracy of 98.33% and a processing time of 5.67 μs for one vector than the existing extended associative memory-MAV classifier-feature pair.

Download Full-text

Machine Learning Approach to Reduce Alert Fatigue Using a Disease Medication–Related Clinical Decision Support System: Model Development and Validation (Preprint)

10.2196/preprints.19489 ◽

2020 ◽

Author(s):

Tahmina Nasrin Poly ◽

Md.Mohaimenul Islam ◽

Muhammad Solihuddin Muhtar ◽

Hsuan-Chia Yang ◽

Phung Anh (Alex) Nguyen ◽

...

Keyword(s):

Machine Learning ◽

Decision Support ◽

Clinical Decision Support ◽

Prediction Models ◽

Clinical Decision ◽

Gradient Boosting ◽

Support Vector ◽

Learning Models ◽

Alert Fatigue ◽

Machine Learning Models

BACKGROUND Computerized physician order entry (CPOE) systems are incorporated into clinical decision support systems (CDSSs) to reduce medication errors and improve patient safety. Automatic alerts generated from CDSSs can directly assist physicians in making useful clinical decisions and can help shape prescribing behavior. Multiple studies reported that approximately 90%-96% of alerts are overridden by physicians, which raises questions about the effectiveness of CDSSs. There is intense interest in developing sophisticated methods to combat alert fatigue, but there is no consensus on the optimal approaches so far. OBJECTIVE Our objective was to develop machine learning prediction models to predict physicians’ responses in order to reduce alert fatigue from disease medication–related CDSSs. METHODS We collected data from a disease medication–related CDSS from a university teaching hospital in Taiwan. We considered prescriptions that triggered alerts in the CDSS between August 2018 and May 2019. Machine learning models, such as artificial neural network (ANN), random forest (RF), naïve Bayes (NB), gradient boosting (GB), and support vector machine (SVM), were used to develop prediction models. The data were randomly split into training (80%) and testing (20%) datasets. RESULTS A total of 6453 prescriptions were used in our model. The ANN machine learning prediction model demonstrated excellent discrimination (area under the receiver operating characteristic curve [AUROC] 0.94; accuracy 0.85), whereas the RF, NB, GB, and SVM models had AUROCs of 0.93, 0.91, 0.91, and 0.80, respectively. The sensitivity and specificity of the ANN model were 0.87 and 0.83, respectively. CONCLUSIONS In this study, ANN showed substantially better performance in predicting individual physician responses to an alert from a disease medication–related CDSS, as compared to the other models. To our knowledge, this is the first study to use machine learning models to predict physician responses to alerts; furthermore, it can help to develop sophisticated CDSSs in real-world clinical settings.

Download Full-text