Bachelor Thesis Analytics: Using Machine Learning to Predict Dropout and Identify Performance Factors

Author(s):  
Jalal Nouri ◽  
Ken Larsson ◽  
Mohammed Saqr

The bachelor thesis is commonly a necessary final step towards the first degree in higher education and is a key to both further studies and employment that requires a higher education degree. Completion of the thesis is therefore a desirable outcome for individual students, academic institutions and society, and non-completion is a significant cost. Unfortunately, many academic institutions around the world find that a large share of thesis projects are never completed and that students struggle with the thesis process. This paper addresses this issue with two aims: to identify and explain why thesis projects are or are not completed, and to predict completion and non-completion of thesis projects using machine learning algorithms. The sample for this study consisted of bachelor students' thesis projects (n=2436) started between 2010 and 2017. Data were extracted from two systems used to record information about thesis projects, including variables related to both students and supervisors. Traditional statistical analysis (correlation tests, t-tests and factor analysis) was conducted to identify factors that influence completion and non-completion of thesis projects, and several machine learning algorithms were applied to create a model that predicts both outcomes. Taking all of these analyses into account, it can be concluded with confidence that supervisors' ability and experience play a significant role in determining the success of thesis projects, which corroborates previous research. This study also extends previous research by identifying additional specific factors, such as the time supervisors take to complete thesis projects and their ratio of previously unfinished thesis projects. The academic title of the supervisor, one of the variables studied, was not a factor in completion. One of the more novel contributions of this study is the application of machine learning algorithms to predict thesis completion/non-completion with reasonable accuracy. Such predictive models offer the opportunity to support a more optimal matching of students and supervisors.
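The kind of completion/non-completion classifier the abstract describes could be sketched as follows. This is a minimal illustration under assumed supervisor-related features and synthetic data; it is not the authors' actual pipeline or dataset.

```python
# Minimal sketch of a thesis-completion classifier. Feature names and the
# synthetic data are illustrative assumptions, not the study's real dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 2436  # sample size reported in the abstract

# Hypothetical supervisor-related predictors suggested by the findings:
X = np.column_stack([
    rng.normal(12, 4, n),    # avg. months per supervised thesis
    rng.uniform(0, 1, n),    # ratio of previously unfinished theses
    rng.integers(0, 30, n),  # supervision experience (theses supervised)
])
# Synthetic label loosely tied to the two supervisor factors above.
y = ((X[:, 0] < 14) & (X[:, 1] < 0.4)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```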

Genes ◽  
2020 ◽  
Vol 11 (9) ◽  
pp. 985 ◽  
Author(s):  
Thomas Vanhaeren ◽  
Federico Divina ◽  
Miguel García-Torres ◽  
Francisco Gómez-Vela ◽  
Wim Vanhoof ◽  
...  

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental chromatin interaction data are not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning strategy they rely on, making it challenging to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task, and highlights cell-type-specific binding of transcription factors at the anchors as an important determinant of chromatin wiring mediated by cohesin.
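The evaluation setup described — a gradient boosting classifier trained on 1D signal features around interaction anchors — might look roughly like the sketch below. The feature matrix here is synthetic; the paper uses real 1D sequencing signals (e.g. binding profiles of architectural proteins and transcription factors).

```python
# Illustrative sketch of the gradient boosting evaluation on 1D chromatin
# features. Sizes and the synthetic data are assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_pairs, n_features = 2000, 50  # anchor pairs x 1D signal features (assumed)
X = rng.normal(size=(n_pairs, n_features))
y = (X[:, :5].sum(axis=1) > 0).astype(int)  # interacting vs non-interacting

gb = GradientBoostingClassifier(n_estimators=300, max_depth=3, random_state=0)
print("CV accuracy:", cross_val_score(gb, X, y, cv=5).mean())
```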


2020 ◽  
Vol 84 (4) ◽  
pp. 305-314
Author(s):  
Daniel Vietze ◽  
Michael Hein ◽  
Karsten Stahl

Most vehicle gearboxes in operation today are designed for a limited service life. On the one hand, this creates significant potential for reducing cost, mass and carbon footprint. On the other hand, it raises the risk of failure as the machine's operating time increases. Especially where a failure can cause high economic loss, this creates a conflict of goals: the machine should only be maintained or replaced when necessary, yet the probability of failure rises with longer operating times. A method is therefore desirable that makes it possible to predict the remaining service life and state of health with as little effort as possible.

The centerpiece of a gearbox is its gears; a failure of these components usually causes the whole gearbox to fail. Fatigue life analysis deals with dimensioning gears according to the expected loads and the required service life. Unfortunately, there is currently very little possibility to validate the technical design during operation. Hence, the goal of this paper is to present a method enabling the prediction of the remaining service life and state of health of gears during operation. The method uses big data and machine learning approaches, and is designed so that it can easily be transferred to other machine elements and kinds of machinery.
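A generic version of the remaining-service-life idea can be sketched as a regression on operating-condition features. The features, synthetic data and model choice below are illustrative assumptions, not the paper's actual method.

```python
# Minimal remaining-useful-life sketch: regress remaining life on assumed
# operating-condition features. Data and feature set are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 5000
torque = rng.uniform(50, 500, n)   # Nm
speed = rng.uniform(500, 3000, n)  # rpm
hours = rng.uniform(0, 10000, n)   # accumulated operating hours
# Synthetic remaining life: decreases with load and accumulated hours.
rul = np.clip(20000 - hours - 5 * torque + rng.normal(0, 500, n), 0, None)

X = np.column_stack([torque, speed, hours])
X_tr, X_te, y_tr, y_te = train_test_split(X, rul, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("MAE (hours):", mean_absolute_error(y_te, model.predict(X_te)))
```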


2020 ◽  
Vol 12 (7) ◽  
pp. 1135 ◽  
Author(s):  
Swapan Talukdar ◽  
Pankaj Singha ◽  
Susanta Mahato ◽  
Shahfahad ◽  
Swades Pal ◽  
...  

Rapid and uncontrolled population growth, along with economic and industrial development, especially in developing countries during the late twentieth and early twenty-first centuries, has greatly increased the rate of land-use/land-cover (LULC) change. Since quantitative assessment of changes in LULC is one of the most efficient means to understand and manage land transformation, the accuracy of different algorithms for LULC mapping needs to be examined in order to identify the best classifier for further applications of earth observation. In this article, six machine learning algorithms, namely random forest (RF), support vector machine (SVM), artificial neural network (ANN), fuzzy adaptive resonance theory-supervised predictive mapping (Fuzzy ARTMAP), spectral angle mapper (SAM) and Mahalanobis distance (MD), were examined. Accuracy assessment was performed using the Kappa coefficient, the receiver operating characteristic (ROC) curve, index-based validation and root mean square error (RMSE). The Kappa coefficients show that all the classifiers reach a similar accuracy level with minor variation, but the RF algorithm has the highest accuracy of 0.89 while the MD algorithm (a parametric classifier) has the lowest accuracy of 0.82. In addition, the index-based LULC and visual cross-validation show that the RF algorithm has the highest accuracy level among the classifiers adopted (correlations between RF and the normalised difference water index, normalised difference vegetation index and normalised difference built-up index are 0.96, 0.99 and 1, respectively, at the 0.05 level of significance). Findings from the literature also indicate that the ANN and RF algorithms are the best LULC classifiers, although a non-parametric classifier like SAM (Kappa coefficient 0.84; area under the curve (AUC) 0.85) has a better and more consistent accuracy level than the other machine learning algorithms. Finally, this study concludes that the RF algorithm is the best machine learning LULC classifier among the six examined, although it should be further tested under different morphoclimatic conditions in the future.
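The core of the comparison — training several classifiers on pixel spectral features and scoring them with the Kappa coefficient — can be sketched as below. The synthetic bands and labels stand in for the actual satellite imagery, and the three scikit-learn models are stand-ins for the classifiers the article evaluates.

```python
# Sketch of a Kappa-based classifier comparison on synthetic spectral data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(7)
n, n_bands, n_classes = 3000, 6, 5  # pixels, spectral bands, LULC classes
X = rng.normal(size=(n, n_bands))
# Weakly separable synthetic labels derived from two bands.
y = (X[:, 0] * 2 + X[:, 1]).astype(int) % n_classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC()),
                  ("ANN", MLPClassifier(max_iter=500, random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, "kappa:", round(cohen_kappa_score(y_te, clf.predict(X_te)), 3))
```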


2017 ◽  
Vol 7 (1.2) ◽  
pp. 43 ◽  
Author(s):  
K. Sreenivasa Rao ◽  
N. Swapna ◽  
P. Praveen Kumar

Data mining is the process of extracting useful information from large sets of data. It enables users to gain insights into the data and make useful decisions from the knowledge mined from databases. The purpose of higher education organizations is to offer superior opportunities to their students. Nowadays, Educational Data Mining (EDM) is also considered a powerful tool in the field of education. It offers an effective method for mining students' performance based on various parameters to predict and analyze whether a student will be recruited in campus placement. Predictions are made using the machine learning algorithms J48, Naïve Bayes, Random Forest and Random Tree in the Weka tool, and Multiple Linear Regression, binomial logistic regression, Recursive Partitioning and Regression Tree (rpart), conditional inference tree (ctree) and Neural Network (nnet) algorithms in RStudio. The results obtained from each approach are then compared with respect to their performance and accuracy levels through graphical analysis. Based on the results, higher education organizations can offer superior training to their students.
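A placement-prediction comparison of this kind might look like the sketch below, using scikit-learn stand-ins for the Weka/R algorithms mentioned (a decision tree for J48, Gaussian naive Bayes, a random forest). The features and data are assumptions, not the study's dataset.

```python
# Illustrative placement-prediction sketch with assumed student features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([
    rng.uniform(5, 10, n),   # CGPA
    rng.integers(0, 10, n),  # number of backlogs
    rng.uniform(0, 100, n),  # aptitude test score
])
y = ((X[:, 0] > 7) & (X[:, 2] > 60)).astype(int)  # placed / not placed

for name, clf in [("DecisionTree (J48)", DecisionTreeClassifier(random_state=0)),
                  ("NaiveBayes", GaussianNB()),
                  ("RandomForest", RandomForestClassifier(random_state=0))]:
    print(name, "accuracy:", round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```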


2021 ◽  
pp. 1-15
Author(s):  
Mohammed Ayub ◽  
El-Sayed M. El-Alfy

Web technology has become an indispensable part of almost all human activities. At the same time, the trend of cyberattacks is on the rise in today's Web-driven world. Effective countermeasures for the analysis and detection of malicious websites are therefore crucial to combat the rising threats to cyber security. In this paper, we systematically reviewed the state-of-the-art techniques and identified a total of about 230 features of malicious websites, which are classified as internal and external features. We also developed a toolkit for the analysis and modeling of malicious websites. The toolkit implements several types of feature extraction methods and machine learning algorithms, which can be used to analyze and compare different approaches to detecting malicious URLs. It also incorporates options such as feature selection and imbalanced learning, with the flexibility to be extended with more functionality and generalization capabilities. Finally, some use cases are demonstrated on different datasets.
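The internal (URL-string) feature idea can be illustrated with a few simple lexical features. The features and toy URLs below are assumptions for illustration, not the toolkit's actual feature set or API.

```python
# A small sketch of lexical URL features for malicious-URL detection.
import numpy as np
from sklearn.linear_model import LogisticRegression

def url_features(url: str) -> list:
    """Simple lexical features commonly used for malicious-URL detection."""
    return [
        len(url),                       # overall length
        url.count("."),                 # number of dots
        url.count("-"),                 # number of hyphens
        int("@" in url),                # presence of '@'
        int(url.startswith("https")),   # uses HTTPS
        sum(c.isdigit() for c in url),  # digit count
    ]

urls = ["https://example.com/login",
        "http://192.168.0.1/paypal-secure-update@evil",
        "https://docs.python.org/3/",
        "http://free-gift-cards-now.xyz/win"]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = malicious (toy labels)

X = np.array([url_features(u) for u in urls])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```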


2020 ◽  
Author(s):  
Thomas Vanhaeren ◽  
Federico Divina ◽  
Miguel García-Torres ◽  
Francisco Gómez-Vela ◽  
Wim Vanhoof ◽  
...  

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental chromatin interaction data are not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning strategy they rely on, making it challenging to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model chromatin interactions in two human cell lines and evaluate the prediction performance of five popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines and multi-layer perceptron. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other four algorithms, yielding accuracies of ~95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as important determinants of chromatin wiring.


2017 ◽  
Vol 4 (3) ◽  
pp. 123-128
Author(s):  
Siddhartha Vadlamudi

This literature review discusses machine learning algorithms that can be used to predict the stock market. Stock market prediction is a challenging task that must be handled carefully. The paper discusses how machine learning algorithms can be used to predict stock values, identifies attributes that can be used to train the algorithms for this purpose, and discusses other factors that can affect stock value.
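A generic version of the approach such reviews survey — predicting next-day price direction from lagged returns — could be sketched as below. The data and feature choices are synthetic placeholders, not any specific study's setup.

```python
# Sketch: classify next-day price direction from lagged returns, using a
# time-series-aware cross-validation split. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(5)
returns = rng.normal(0, 0.01, 1000)  # synthetic daily returns

# Lagged returns as features; next-day direction as the target.
lags = 5
X = np.column_stack([returns[i:len(returns) - lags + i] for i in range(lags)])
y = (returns[lags:] > 0).astype(int)

cv = TimeSeriesSplit(n_splits=5)     # respect temporal ordering
clf = RandomForestClassifier(random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=cv).mean())
```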


2021 ◽  
Vol 11 (4) ◽  
pp. 286-290
Author(s):  
Md. Golam Kibria ◽  
Mehmet Sevkli

The increase in credit card defaulters has forced companies to think carefully before approving credit applications. Credit card companies usually use their own judgment to determine whether a credit card should be issued to a customer satisfying certain criteria, and some machine learning algorithms have also been used to support the decision. The main objective of this paper is to build a deep learning model, based on UCI (University of California, Irvine) datasets, that can support the credit card approval decision. Secondly, the performance of this model is compared with two traditional machine learning algorithms: logistic regression (LR) and support vector machine (SVM). Our results show that the overall performance of our deep learning model is slightly better than that of the other two models.
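The comparison described might be set up roughly as follows, with a small neural network (here scikit-learn's MLPClassifier, standing in for the paper's deep learning model) against LR and SVM. The UCI data itself is not bundled here; the feature matrix below is a synthetic stand-in with sizes echoing the UCI Credit Approval dataset.

```python
# Sketch: compare a small neural network against LR and SVM on synthetic
# credit-approval-style features. Data and model details are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
n, d = 690, 15  # sizes echoing the UCI Credit Approval dataset
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)

models = {
    "DNN (MLP)": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000,
                               random_state=0),
    "LR": LogisticRegression(),
    "SVM": SVC(),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    print(name, "accuracy:", round(cross_val_score(pipe, X, y, cv=5).mean(), 3))
```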

