A Machine Learning Approach to Predictive Modelling of Student Performance

F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 1144
Author(s):  
Hu Ng ◽  
Azmin Alias bin Mohd Azha ◽  
Timothy Tzen Vun Yap ◽  
Vik Tor Goh

Background: Many factors affect student performance, such as the individual’s background, habits, absenteeism and social activities. Using these factors, corrective actions can be determined to improve performance. This study presents a data mining approach to identify significant factors and predict student performance, based on two datasets collected from two secondary schools in Portugal.
Methods: First, the two datasets are merged to increase the sample size. Following that, data pre-processing is performed and the features are normalized with linear scaling to avoid bias toward heavily weighted attributes. The selected features are then assigned to four groups: student background, lifestyle, history of grades, and all features. Next, Boruta feature selection is performed to remove irrelevant features. Finally, classification models based on Support Vector Machine (SVM), Naïve Bayes (NB), and Multilayer Perceptron (MLP) are designed and their performance is evaluated.
Results: The models were trained and evaluated on an integrated dataset comprising 1044 student records with 33 features, after feature selection. Classification was performed with SVM, NB and MLP using 60-40 and 50-50 train-test splits and 10-fold cross-validation. GridSearchCV was applied for hyperparameter tuning. The performance metrics were accuracy, precision, recall and F1-score. SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on the background, lifestyle, history of grades and all-features groups respectively, in the 50-50 train-test split for binary classification (pass or fail). SVM also obtained the highest accuracy for five-class classification (grades A, B, C, D and F), with 39%, 38%, 73% and 71% for the four feature groups respectively.
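The four metrics named in the results can be illustrated with a minimal sketch for the binary pass/fail case. The function name `binary_metrics` and the toy labels are assumptions made for illustration; they are not taken from the study, and the numbers below do not reproduce its results.

```python
def binary_metrics(y_true, y_pred, positive="pass"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells for the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical, not from the Portuguese datasets).
y_true = ["pass", "pass", "fail", "pass", "fail", "fail"]
y_pred = ["pass", "fail", "fail", "pass", "pass", "fail"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For the five-class setting, the same counts would be computed once per grade and then averaged; which averaging scheme the study used is not stated in the abstract.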

Educational Data Mining (EDM) has attracted many researchers in recent years. Many data mining techniques have been formulated to extract the knowledge hidden within educational data. The extracted knowledge helps educational institutions enhance their teaching processes and learning methods. These improvements enhance student performance and overall outcomes. In EDM, Feature Selection (FS) plays a significant role in improving the quality of the models used for prediction on educational datasets. Single feature selection algorithms do not yield improved prediction results. In this work, Ensemble Swarm based Feature Selection (ESFS) and Ensemble Three Classifiers (ETCs) are formulated to classify student performance based on the selected features. ESFS selects the important and intrinsic features before classification, for which ETCs are proposed. The samples are drawn from the knowledge repository and are first pre-processed with the Min Max Normalization (MMN) and Z Score Normalization (ZCN) methods. The attributes selected by ESFS are then combined with the learners’ communication with the e-learning management system. The ESFS algorithm fuses the Fuzzy Membership Genetic Algorithm (FMGA) and Improved Clonal Selection Algorithms (ICSAs). ETCs predict students’ performance by combining classifiers such as the Adaptive Neuro Fuzzy Inference System (ANFIS), the Support Vector Machine (SVM) classifier and the Decision Tree (DT). A widespread ensemble approach, namely Bagging, is used to combine the results of the three classifiers. The results show a strong relationship between learners’ behaviors and their academic achievement.
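The two pre-processing steps named above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the sample `absences` column are hypothetical, and the population standard deviation is an assumed choice, since the abstract does not say which variant was used.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Centre values on zero mean and scale to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column, for illustration only.
absences = [0, 2, 4, 6, 8]
scaled = min_max_normalize(absences)
standardized = z_score_normalize(absences)
print(scaled)
print(standardized)
```

Min-max scaling keeps the original spacing of the values within a fixed range, while z-score scaling expresses each value as a distance from the mean in standard deviations.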


Author(s):  
Lavinia Chiara Tagliabue ◽  
Stefano Rinaldi ◽  
Mario Favalli Ragusini ◽  
Giovanni Tardioli ◽  
Angelo Luigi Camillo Ciribini

Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2133
Author(s):  
Francisco O. Cortés-Ibañez ◽  
Sunil Belur Nagaraj ◽  
Ludo Cornelissen ◽  
Gerjan J. Navis ◽  
Bert van der Vegt ◽  
...  

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models that could predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) <0.75 was obtained; the highest AUC value was for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machines AUC = 0.83 (95% CI 0.78–0.88)). Age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance of incident cancers in the Lifelines cohort.
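As a reading aid for the AUC values reported above, the following sketch computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy scores are assumptions for illustration; they are unrelated to the Lifelines data, and the confidence intervals in the abstract are not reproduced here.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy predicted risks (hypothetical): 1 = incident case, 0 = no case.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of cases, so for a cohort of this size a rank-based implementation would be the practical choice; the value it computes is the same.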


Author(s):  
B. Venkatesh ◽  
J. Anuradha

In microarray data, it is difficult to achieve high classification accuracy because of high dimensionality and irrelevant, noisy data. Such data also contain many more gene expression features than samples. To increase the classification accuracy and the processing speed of the model, an optimal subset of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases: a filter phase and a wrapper phase. In the filter phase, an ensemble technique is used to aggregate the feature ranks of the Relief, minimum redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods. This paper uses Fuzzy Gaussian membership function ordering to aggregate the ranks. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used to select the optimal features, and an RBF kernel-based Support Vector Machine (SVM) classifier is used as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets. For evaluation, performance metrics such as accuracy, recall, precision, and F1-score are used. The experimental results show that the proposed method outperforms the other feature selection methods.
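The idea of aggregating the ranks produced by several filter methods can be sketched as below. Note the substitution: this sketch uses a plain mean-rank aggregation, not the Fuzzy Gaussian membership function ordering the paper describes, whose details are not given in the abstract. The function name `aggregate_ranks`, the gene names and the three orderings are all hypothetical.

```python
def aggregate_ranks(rankings):
    """Combine several best-first feature orderings by mean rank.

    rankings: list of lists, each containing the same feature names
    ordered from most to least relevant by one filter method.
    """
    features = rankings[0]
    totals = {f: 0 for f in features}
    for ranking in rankings:
        if set(ranking) != set(features):
            raise ValueError("every ranking must cover the same features")
        for position, feature in enumerate(ranking, start=1):
            totals[feature] += position

    mean_rank = {f: totals[f] / len(rankings) for f in features}
    # Lower mean rank is better; ties are broken by feature name.
    return sorted(features, key=lambda f: (mean_rank[f], f))


# Hypothetical orderings from three filter methods.
relief_order = ["g3", "g1", "g2", "g4"]
mrmr_order = ["g1", "g3", "g4", "g2"]
fc_order = ["g3", "g2", "g1", "g4"]
combined = aggregate_ranks([relief_order, mrmr_order, fc_order])
print(combined)
```

In a two-phase scheme of the kind described, the top of such a combined ordering would be passed to the wrapper phase as the candidate feature pool.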


Author(s):  
VLADIMIR NIKULIN ◽  
TIAN-HSIANG HUANG ◽  
GEOFFREY J. MCLACHLAN

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
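The leave-one-out (LOO) evaluation mentioned above can be sketched in a few lines: each sample is held out once, the classifier is fitted on the rest, and the held-out prediction is scored. The 1-nearest-neighbour classifier here is only a stand-in chosen for brevity, not one of the classifiers the authors report using, and the data and function names are hypothetical.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Predict the label of x from its closest training sample."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]


def leave_one_out_accuracy(X, y, predict):
    """Hold out each sample once and score the prediction made without it."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Two well-separated toy clusters (hypothetical data).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
     [1.0, 0.9], [0.9, 1.1], [1.2, 1.0]]
y = [0, 0, 0, 1, 1, 1]
loo_acc = leave_one_out_accuracy(X, y, nearest_neighbor_predict)
print(loo_acc)
```

When feature selection is part of the pipeline, it would need to be repeated inside each fold on the training portion only, otherwise the held-out sample influences the selected features.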


2021 ◽  
Vol 15 (6) ◽  
pp. 1812-1819
Author(s):  
Azita Yazdani ◽  
Ramin Ravangard ◽  
Roxana Sharifian

The new coronavirus has been spreading since the beginning of 2020, and many efforts have been made to develop vaccines to help patients recover. It is now clear that the world needs a rapid solution to curb the spread of COVID-19 through non-clinical approaches such as data mining, enhanced intelligence, and other artificial intelligence techniques. These approaches can reduce the burden on the health care system by providing the best possible way to diagnose and predict the COVID-19 epidemic. In this study, data mining models for early detection of COVID-19 in patients were developed using an epidemiological dataset of patients and individuals suspected of having COVID-19 in Iran. C4.5, support vector machine, Naive Bayes, logistic regression, Random Forest, and k-nearest neighbor algorithms were applied directly to the dataset using RapidMiner to develop the models. Given clinical signs as input, the model diagnoses the risk of contracting the COVID-19 virus. Examination of the models in this study showed that the support vector machine, with 93.41% accuracy, was the most efficient in diagnosing patients with COVID-19 and was the best of the developed models. Keywords: COVID-19, Data mining, Machine Learning, Artificial Intelligence, Classification


In this era of the Internet, the issue of information security is at its peak. One of the main threats in the cyber world is phishing, an email or website fraud method that targets a genuine webpage or email and compromises it without the consent of the end user. Various techniques help to classify whether a website or an email is legitimate or fake. The major contributors to the detection of phishing fraud include the classification algorithms, the feature selection techniques or dataset preparation methods, and the feature extraction, which plays an important role in both detection and prevention of these attacks. This survey paper studies the effect of all these contributors and the approaches applied in recent papers. The classification algorithms implemented include Decision Tree, Random Forest, Support Vector Machines, Logistic Regression, Lazy K Star, Naive Bayes and J48.


2019 ◽  
Vol 123 (1267) ◽  
pp. 1415-1436 ◽  
Author(s):  
A. B. A. Anderson ◽  
A. J. Sanjeev Kumar ◽  
A. B. Arockia Christopher

ABSTRACT: Data mining is a process of finding correlations and of collecting and analysing a huge amount of data in a database to discover patterns or relationships. Flight delay creates significant problems in the present aviation system. Data mining techniques are well suited to analysing how micro-level causes propagate into system-level patterns of delay. Analysing flight delays is very difficult, both when looking at historical data and when estimating delays under forecast demand. This paper proposes using Decision Tree (DT), Support Vector Machine (SVM), Naive Bayesian (NB), K-nearest neighbour (KNN) and Artificial Neural Network (ANN) to study and analyse delays among aircraft. The performance of these classifiers is assessed on different regions of the updated datasets. The results show significant variation in performance across the data mining methods and feature selectors for this problem. This paper aims to show how data mining techniques can be used to understand complex system-level delays in aviation. Our aim is to develop a classification model for studying and reducing delay using different data mining methods and, in this manner, to show that DT has the greater classification accuracy. Different feature selectors are used in this study to reduce the number of initial attributes. Our results clearly demonstrate the value of DT for analysing and visualising how system-level effects arise from subsystem-level causes.

