Towards Predicting Student’s Dropout in University Courses Using Different Machine Learning Techniques

2021 ◽  
Vol 11 (7) ◽  
pp. 3130
Author(s):  
Janka Kabathova ◽  
Martin Drlik

Predicting student dropout early and precisely from available educational data is a widespread research topic in the learning analytics field. Despite the volume of existing research, progress remains limited at all levels of educational data. Although various features have already been studied, it is still an open question which features are appropriate for different machine learning classifiers applied to the typically scarce educational data available at the e-learning course level. The main goal of this research is therefore to emphasize the importance of the data understanding and data gathering phases, stress the limitations of the available educational datasets, compare the performance of several machine learning classifiers, and show that even a limited set of features available to teachers in an e-learning course can predict student dropout with sufficient accuracy if the performance metrics are considered thoroughly. Data collected over four academic years were analyzed. The features selected in this study proved applicable for predicting course completers and non-completers. Prediction accuracy varied between 77% and 93% on unseen data from the following academic year. In addition to the frequently used performance metrics, the homogeneity of the machine learning classifiers was analyzed to offset the impact of the limited dataset size on the high values of the performance metrics. The results showed that several machine learning algorithms can be successfully applied to a scarce educational dataset. At the same time, classification performance metrics should be considered thoroughly before deciding to deploy the best-performing classification model to predict potential dropout cases and to design beneficial intervention mechanisms.
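
A minimal sketch of the cross-year evaluation idea described above: train several classifiers on data from earlier academic years and score them on the following, unseen year with more than one metric. The file names and feature columns are hypothetical, not the authors' actual dataset.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, f1_score

    train = pd.read_csv("years_2016_2018.csv")   # hypothetical file: earlier academic years
    test = pd.read_csv("year_2019.csv")          # hypothetical file: the next, unseen year
    features = ["logins", "quiz_score", "assignments_submitted"]  # hypothetical course-level features
    X_tr, y_tr = train[features], train["dropout"]
    X_te, y_te = test[features], test["dropout"]

    for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                      ("tree", DecisionTreeClassifier()),
                      ("forest", RandomForestClassifier(n_estimators=200))]:
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        # report more than accuracy: on small, imbalanced course data
        # accuracy alone can be misleading
        print(name, accuracy_score(y_te, pred), f1_score(y_te, pred))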

2021 ◽  
Vol 25 (5) ◽  
pp. 1073-1098
Author(s):  
Nor Hamizah Miswan ◽  
Chee Seng Chan ◽  
Chong Guan Ng

Hospital readmission is a major cost for healthcare systems worldwide. If patients with a higher risk of readmission could be identified early, existing resources could be used more efficiently and appropriate plans could be implemented to reduce that risk. It is therefore important to identify the right target patients. Medical data are usually noisy, incomplete, and inconsistent, so before developing a prediction model it is crucial to set up the predictive pipeline carefully to achieve improved predictive performance. The current study analyses the impact of different preprocessing methods on the performance of different machine learning classifiers. The preprocessing steps applied in previous hospital readmission studies were compared, and the most common approaches were highlighted: missing value imputation, feature selection, data balancing, and feature scaling. Hyperparameters were selected using Bayesian optimisation. The different preprocessing pipelines were assessed in terms of various performance metrics and computational cost. The results indicate that the preprocessing approaches help improve the model’s prediction of hospital readmission.
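
A minimal sketch of comparing preprocessing pipelines for a readmission classifier, under assumptions: the data here is a synthetic stand-in for patient features and readmission labels, and a plain grid search stands in for the Bayesian optimisation used in the study.

    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Synthetic stand-in for patient features and 30-day readmission labels.
    X, y = make_classification(n_samples=500, n_features=30, weights=[0.9, 0.1],
                               random_state=0)

    pipelines = {
        "impute_only": Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("clf", LogisticRegression(max_iter=1000))]),
        "impute_scale_select": Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
            ("select", SelectKBest(f_classif, k=20)),
            ("clf", LogisticRegression(max_iter=1000))]),
    }

    for name, pipe in pipelines.items():
        # The study tunes hyperparameters with Bayesian optimisation; a plain
        # grid search stands in here to keep the sketch dependency-light.
        search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]},
                              scoring="roc_auc", cv=5)
        search.fit(X, y)
        print(name, round(search.best_score_, 3), search.best_params_)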


2020 ◽  
Vol 10 (2) ◽  
pp. 1-26
Author(s):  
Naghmeh Moradpoor Sheykhkanloo ◽  
Adam Hall

An insider threat can take many forms and fall under different categories, including the malicious insider, the careless, unaware, or naïve employee, and the third-party contractor. Machine learning techniques have been studied in the published literature as a promising solution for such threats; however, they can be biased and/or inaccurate when the associated dataset is heavily imbalanced. This article therefore addresses insider threat detection on an extremely imbalanced dataset, employing a popular balancing technique known as spread subsampling. The results show that although balancing the dataset with this technique did not improve the performance metrics, it did reduce the time taken to build the model and the time taken to test it. Additionally, the authors found that running the chosen classifiers with parameters other than the defaults has an impact in both the balanced and imbalanced scenarios, but the impact is significantly stronger on the imbalanced dataset.
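
A minimal sketch of a spread-subsample-style balancing step: cap the majority-to-minority class ratio by randomly discarding majority-class rows, analogous in spirit to Weka's SpreadSubsample filter. The function and data below are illustrative, not the article's implementation.

    import numpy as np

    def spread_subsample(X, y, max_ratio=1.0, seed=0):
        """Keep at most max_ratio * (minority count) samples of each larger class."""
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        minority_count = counts.min()
        keep = []
        for c, n in zip(classes, counts):
            idx = np.flatnonzero(y == c)
            cap = int(max_ratio * minority_count) if n > minority_count else n
            keep.append(rng.choice(idx, size=min(n, cap), replace=False))
        keep = np.concatenate(keep)
        return X[keep], y[keep]

    X = np.random.rand(10000, 5)
    y = np.array([0] * 9900 + [1] * 100)      # extreme imbalance, as in the study
    Xb, yb = spread_subsample(X, y, max_ratio=1.0)
    print(np.bincount(yb))                     # roughly equal class counts after balancing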


2019 ◽  
Vol 492 (2) ◽  
pp. 2897-2909 ◽  
Author(s):  
L Zorich ◽  
K Pichara ◽  
P Protopapas

ABSTRACT In recent years, automatic classification of variable stars has received substantial attention, and machine learning techniques have proven quite useful for this task. Typically, the machine learning classifiers used require a fixed training set, and the training process is performed offline. Upcoming surveys such as the Large Synoptic Survey Telescope will generate new observations daily, making an automatic classification system able to create alerts online mandatory. A system with those characteristics must be able to update itself incrementally. Unfortunately, after training, most machine learning classifiers do not support the inclusion of new observations in light curves; they need to be re-trained from scratch. Naively re-training from scratch is not an option in streaming settings, mainly because of the expensive pre-processing routines required to obtain a vector representation of light curves (features) each time new observations are included. In this work, we propose a streaming probabilistic classification model that uses a set of newly designed features which can be computed incrementally. With this model, we obtain a machine learning classifier that updates itself in real time as new observations arrive. To test our approach, we simulate a streaming scenario with light curves from the Convection, Rotation and planetary Transits (CoRoT), Optical Gravitational Lensing Experiment (OGLE), and Massive Compact Halo Object (MACHO) catalogues. Results show that our model achieves high classification performance while remaining an order of magnitude faster than traditional classification approaches.
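
A minimal sketch of the incremental-feature idea: summary statistics of a light curve that update in O(1) per new observation (Welford's method), instead of recomputing features from the full curve each time. The paper's actual feature set differs; this only illustrates the pattern.

    class IncrementalLightCurveStats:
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0        # running sum of squared deviations from the mean

        def update(self, magnitude):
            # Welford's online update: no need to revisit past observations
            self.n += 1
            delta = magnitude - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (magnitude - self.mean)

        @property
        def variance(self):
            return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    stats = IncrementalLightCurveStats()
    for mag in [14.2, 14.5, 14.1, 14.8]:      # new observations arriving in a stream
        stats.update(mag)
        print(stats.mean, stats.variance)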


Sales forecasting is important for companies engaged in retailing, logistics, manufacturing, marketing, and wholesaling. It allows companies to allocate resources efficiently, estimate sales revenue, and plan better strategies for the company’s future. In this paper, the sales of products from a particular store are predicted in a way that produces better performance than existing machine learning approaches. The dataset used for this project is the Big Mart Sales data from 2013. Nowadays, shopping malls and supermarkets keep track of the sales data of each individual item in order to predict future customer demand; these records contain a large amount of customer data and item attributes. Frequent patterns are detected by mining the data from the data warehouse, and the data can then be used to predict future sales with the help of several machine learning techniques for companies such as Big Mart. In this project, we propose a model using the XGBoost algorithm for predicting the sales of companies like Big Mart and found that it produces better performance than other existing models; an analysis of this model against other models in terms of their performance metrics is also provided. Big Mart is an online marketplace where people can buy, sell, or advertise merchandise at low cost. The goal of the paper is to make Big Mart a shopping paradise for buyers and a marketing solution for sellers, with the ultimate aim of complete customer satisfaction. The project “SUPERMARKET SALES PREDICTION” builds a predictive model and estimates the sales of each product at a particular store. Big Mart can use this model to understand the properties of the products that play a major role in increasing sales. The prediction can also be guided by hypotheses formulated before examining the data.
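
A minimal sketch (not the project's exact code) of fitting an XGBoost regressor to item and outlet sales data and comparing it with a simple baseline on root-mean-squared error. The file path is hypothetical and the column names are assumed from the public Big Mart 2013 dataset.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor

    df = pd.read_csv("big_mart_sales_2013.csv")            # hypothetical path
    features = ["Item_MRP", "Item_Visibility", "Item_Weight",
                "Outlet_Establishment_Year"]                # assumed numeric columns
    df = df.dropna(subset=features + ["Item_Outlet_Sales"])
    X_tr, X_te, y_tr, y_te = train_test_split(df[features], df["Item_Outlet_Sales"],
                                              test_size=0.2, random_state=42)

    for name, model in [("linear", LinearRegression()),
                        ("xgboost", XGBRegressor(n_estimators=300, max_depth=5,
                                                 learning_rate=0.05))]:
        model.fit(X_tr, y_tr)
        rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
        print(name, rmse)                                   # lower RMSE means better sales forecasts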


Author(s):  
Mohsen Kamyab ◽  
Stephen Remias ◽  
Erfan Najmi ◽  
Kerrick Hood ◽  
Mustafa Al-Akshar ◽  
...  

According to the Federal Highway Administration (FHWA), US work zones on freeways account for nearly 24% of nonrecurring freeway delays and 10% of overall congestion. Historically, there have been few scalable datasets for investigating the specific causes of work zone congestion or for improving work zone planning processes by characterizing that congestion’s impact. In recent years, third-party data vendors have provided scalable speed data from Global Positioning System (GPS) devices and cell phones, which can be used to characterize mobility on all roadways. Each work zone has unique characteristics and mobility impacts which are predicted during the planning and design phases but can realistically be quite different from what is ultimately experienced by the traveling public. This paper uses these datasets to introduce a scalable Work Zone Mobility Audit (WZMA) template. Additionally, the paper uses metrics developed for individual work zones to characterize the impact of more than 250 work zones of varying length and duration from Southeast Michigan. The authors make recommendations to work zone engineers on useful data to collect for improving the WZMA. As more systematic work zone data are collected, improved analytical assessment techniques, such as machine learning processes, can be used to identify the factors that predict future work zone impacts. The paper concludes by demonstrating two machine learning algorithms, Random Forest and XGBoost, which show that historical speed variation is a critical component when predicting the mobility impact of work zones.
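
A minimal sketch of the concluding experiment: fit Random Forest and XGBoost models on per-work-zone features and inspect their feature importances. The feature names, toy data, and target are illustrative assumptions, not the paper's actual variables.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    n = 250                                   # roughly the number of work zones studied
    df = pd.DataFrame({
        "historical_speed_std": rng.uniform(2, 12, n),     # assumed feature names
        "work_zone_length_mi": rng.uniform(0.5, 10, n),
        "duration_days": rng.integers(1, 120, n),
        "aadt": rng.integers(10_000, 120_000, n),
    })
    # Toy target in which speed variation dominates, mirroring the paper's finding.
    df["delay_hours"] = 3 * df["historical_speed_std"] + rng.normal(0, 5, n)

    features = ["historical_speed_std", "work_zone_length_mi", "duration_days", "aadt"]
    X, y = df[features], df["delay_hours"]

    for name, model in [("random_forest", RandomForestRegressor(n_estimators=500,
                                                                random_state=0)),
                        ("xgboost", XGBRegressor(n_estimators=500, max_depth=4))]:
        model.fit(X, y)
        importance = dict(zip(features, np.round(model.feature_importances_, 3)))
        print(name, importance)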


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Kanggeun Lee ◽  
Hyoung-oh Jeong ◽  
Semin Lee ◽  
Won-Ki Jeong

Abstract With recent advances in DNA sequencing technologies, fast acquisition of large-scale genomic data has become commonplace. For cancer studies, in particular, there is an increasing need for the classification of cancer type based on somatic alterations detected from sequencing analyses. However, the ever-increasing size and complexity of the data make the classification task extremely challenging. In this study, we evaluate the contributions of various input features, such as mutation profiles, mutation rates, mutation spectra and signatures, and somatic copy number alterations that can be derived from genomic data, and further utilize them for accurate cancer type classification. We introduce a novel ensemble of machine learning classifiers, called CPEM (Cancer Predictor using an Ensemble Model), which is tested on 7,002 samples representing over 31 different cancer types collected from The Cancer Genome Atlas (TCGA) database. We first systematically examined the impact of the input features. Features known to be associated with specific cancers had relatively high importance in our initial prediction model. We further investigated various machine learning classifiers and feature selection methods to derive the ensemble-based cancer type prediction model achieving up to 84% classification accuracy in the nested 10-fold cross-validation. Finally, we narrowed down the target cancers to the six most common types and achieved up to 94% accuracy.
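
A minimal sketch of the nested cross-validation used to report accuracy: an inner loop tunes the model, and an outer 10-fold loop estimates performance on data never seen during tuning. The classifier below is a generic stand-in rather than the CPEM ensemble, and the features and labels are synthetic.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

    # Synthetic stand-in for per-sample genomic features and cancer-type labels.
    X, y = make_classification(n_samples=1000, n_features=50, n_informative=20,
                               n_classes=5, random_state=0)

    # Inner loop: hyperparameter selection on the training folds only.
    inner = GridSearchCV(RandomForestClassifier(random_state=0),
                         {"n_estimators": [200, 500], "max_depth": [None, 20]},
                         cv=3)
    # Outer loop: 10-fold estimate of generalization accuracy.
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(inner, X, y, cv=outer)
    print(scores.mean(), scores.std())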


2021 ◽  
Vol 3 (4) ◽  
pp. 32-37
Author(s):  
J. Adassuriya ◽  
J. A. N. S. S. Jayasinghe ◽  
K. P. S. C. Jayaratne

Machine learning algorithms play an impressive role in modern technology and address automation problems in many fields, as these techniques can identify features with a sensitivity that humans or other programming techniques cannot match. In addition, the growing availability of data demands faster, more accurate, and more reliable automated methods for extracting, reformatting, preprocessing, and analyzing information in science. Developing machine learning techniques to automate complex manual procedures is timely research in astrophysics, a field in which experts deal with large sets of data every day. In this study, an automated classifier was built for six star classes with widely varying properties: Beta Cephei, Delta Scuti, Gamma Doradus, Red Giants, RR Lyrae, and RV Tauri, using features extracted from a training dataset of stellar light curves obtained from the Kepler mission. A Random Forest classifier was used as the machine learning model, and both periodic and non-periodic features extracted from the light curves were used as inputs. Our implementation achieved an accuracy of 86.5%, an average precision of 0.86, an average recall of 0.87, and an average F1-score of 0.86 on the testing dataset obtained from the Kepler mission.
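
A minimal sketch of the classification step: a Random Forest trained on periodic and non-periodic light-curve features and scored with the same metrics reported above. The feature extraction itself is omitted, and the feature matrix here is synthetic rather than Kepler data.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    classes = ["Beta Cephei", "Delta Scuti", "Gamma Doradus",
               "Red Giant", "RR Lyrae", "RV Tauri"]
    # Synthetic stand-in for extracted light-curve features (e.g., period, amplitude).
    X, y = make_classification(n_samples=1200, n_features=12, n_informative=8,
                               n_classes=len(classes), random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    clf = RandomForestClassifier(n_estimators=300, random_state=1)
    clf.fit(X_tr, y_tr)
    # Per-class precision, recall, and F1, plus overall accuracy.
    print(classification_report(y_te, clf.predict(X_te), target_names=classes))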


2021 ◽  
Vol 2021 ◽  
pp. 1-35
Author(s):  
Thomas Rincy N ◽  
Roopam Gupta

Today’s internet is made up of nearly half a million different networks. In any network connection, identifying attacks by type is a difficult task, as different attacks may involve various connections, and their number may vary from a few to hundreds of network connections. To solve this problem, a novel hybrid network IDS called NID-Shield is proposed in this manuscript that classifies the dataset according to different attack types. Furthermore, the attack names found within each attack type are classified individually, which helps considerably in predicting the vulnerability of individual attacks in various networks. The hybrid NID-Shield NIDS applies an efficient feature subset selection technique called CAPPER together with distinct machine learning methods. The UNSW-NB15 and NSL-KDD datasets are used for evaluation. Machine learning algorithms are trained on the reduced, accurate, and high-merit feature subsets obtained from CAPPER and then assessed with cross-validation on the reduced attributes. Various performance metrics show that the hybrid NID-Shield NIDS combined with the CAPPER approach achieves a good accuracy rate and low false positive rate (FPR) on the UNSW-NB15 and NSL-KDD datasets, and shows good performance when compared with various approaches found in the existing literature.
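
A minimal sketch of the evaluation pattern: select a reduced feature subset, train a classifier on it, and cross-validate while also computing the false positive rate. A generic filter method stands in for CAPPER here, and the data is synthetic rather than NSL-KDD or UNSW-NB15.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import confusion_matrix

    # Synthetic stand-in for flow features with a normal/attack label.
    X, y = make_classification(n_samples=2000, n_features=40, n_informative=15,
                               weights=[0.8, 0.2], random_state=0)
    # Reduced feature subset (a stand-in for the CAPPER selection step).
    X_sel = SelectKBest(mutual_info_classif, k=15).fit_transform(X, y)

    fprs, accs = [], []
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                               random_state=0).split(X_sel, y):
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X_sel[train_idx], y[train_idx])
        pred = clf.predict(X_sel[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
        fprs.append(fp / (fp + tn))
        accs.append((tp + tn) / (tp + tn + fp + fn))
    print("accuracy:", np.mean(accs), "FPR:", np.mean(fprs))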

