An Introduction to Machine Learning for Panel Data: Decision Trees, Random Forests, and Other Dendrological Methods

Prediction of cancer survivability with machine learning models has been studied extensively, yet it remains a challenging task. This project performs a comprehensive evaluation of machine learning models across multiple cancer cohorts to find the models with better prediction capability. Class balancing techniques, namely oversampling and undersampling, were applied to improve the performance of cancer survival prediction. The SEER cancer dataset (1973-2015) was used for this project. After preprocessing, we included a total of 21 independent variables and one dependent variable. Multiple machine learning models were implemented, including Decision Trees, Logistic Regression, Naive Bayes, Support Vector Machine, Random Forests, and Multi-Layer Perceptron. Bias between training and testing data was eliminated by using stratified 10-fold cross-validation. In the experimental design, every machine learning model was applied to seven cancer cohorts, once using all eligible records in each cohort and once with each of the two sampling techniques for class balancing. The performance of the models was compared on Sensitivity, Accuracy, Specificity, Precision, F1 score, and AUC. A total of 168 experimental models were designed and implemented. Comparison between the predictive models showed that Random Forests predicted cancer survivability best, followed by Support Vector Machine, Logistic Regression, Decision Trees, and Multi-Layer Perceptron, with Naive Bayes performing worst. The results clearly indicated that class balancing techniques also improved the performance of the models significantly.
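The combination of stratified cross-validation and class balancing described in this abstract can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names `stratified_folds` and `random_oversample`, the toy labels, and the choice of random oversampling by duplication are all assumptions made for the example. It balances only the training part of each split, so the held-out fold keeps its natural class ratio.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy labels (hypothetical): 80 negative (0) and 20 positive (1) records.
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Balance only the training part; the test fold is left untouched.
    train_balanced = random_oversample(train_idx, labels)
    train_counts = Counter(labels[i] for i in train_balanced)
    test_counts = Counter(labels[i] for i in test_idx)
    # A model would be fitted on train_balanced and scored on test_idx here.
```

Undersampling would follow the same pattern, dropping majority-class indices from the training part instead of duplicating minority-class ones.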


Predictors of remission from body dysmorphic disorder after internet-delivered cognitive behavior therapy: a machine learning approach

10.31234/osf.io/eqcdx ◽

2019 ◽

Author(s):

Oskar Flygare ◽

Jesper Enander ◽

Erik Andersson ◽

Brjánn Ljótsson ◽

Volen Z Ivanov ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forests ◽

Clinical Utility ◽

Body Dysmorphic Disorder ◽

Prediction Models ◽

Behavioral Therapy ◽

Learning Approach ◽

Learning Approaches ◽

Machine Learning Approach

**Background:** Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. **Methods:** This study used a random forests machine learning approach to test whether remission from BDD can be reliably predicted in a sample of 88 individuals who had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. **Results:** Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower in subsequent follow-ups (68%, 66% and 61% correctly classified at 3-, 12- and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. **Conclusions:** The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. **Trial registration:** ClinicalTrials.gov ID: NCT02010619.


Predictors of progression through the cascade of care to a cure for hepatitis C patients using decision trees and random forests

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2021.104461 ◽

2021 ◽

pp. 104461

Author(s):

Jasmine Ye Nakayama ◽

Joyce Ho ◽

Emily Cartwright ◽

Roy Simpson ◽

Vicki Stover Hertzberg

Keyword(s):

Hepatitis C ◽

Decision Trees ◽

Random Forests ◽

Cascade Of Care


Multi-Class Assessment Based on Random Forests

Education Sciences ◽

10.3390/educsci11030092 ◽

2021 ◽

Vol 11 (3) ◽

pp. 92

Author(s):

Mehdi Berriri ◽

Sofiane Djema ◽

Gaëtan Rey ◽

Christel Dartigues-Pallez

Keyword(s):

Higher Education ◽

Machine Learning ◽

Random Forests ◽

Learning Algorithm ◽

Teaching Staff ◽

Machine Learning Algorithm ◽

Process Data ◽

Training Courses ◽

Education Courses

Today, many students are moving towards higher education courses that do not suit them and end up failing. The purpose of this study is to give counselors better knowledge so that they can offer future students courses corresponding to their profile. The second objective is to allow the teaching staff to propose training courses adapted to students by anticipating their possible difficulties. This is possible thanks to a machine learning algorithm called Random Forest, which classifies students according to their results. We processed the data, generated models with the algorithm, and combined the results obtained to reach a better final prediction. We tested our method on different use cases, from two classes to five classes. Each set of classes corresponds to intervals of the students' average, which ranges from 0 to 20. An accuracy of 75% was achieved with a set of five classes and up to 85% for sets of two and three classes.
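To make the idea of class sets over a 0 to 20 average concrete, here is a small sketch of how an average could be mapped to a class label. The abstract does not state how the intervals were chosen, so the equal-width bins and the helper name `grade_class` are assumptions for illustration only.

```python
def grade_class(average, n_classes, low=0.0, high=20.0):
    """Map an average in [low, high] to a class index in 0..n_classes-1.

    Assumes equal-width intervals, which the source does not confirm.
    """
    if not low <= average <= high:
        raise ValueError("average outside the grading scale")
    width = (high - low) / n_classes
    # The top of the scale belongs to the last class rather than to a new one.
    return min(int((average - low) / width), n_classes - 1)


# Two classes split the scale at 10; five classes use intervals of width 4.
two_class_label = grade_class(9.9, n_classes=2)
five_class_label = grade_class(7.5, n_classes=5)
```

With these assumptions, the class index becomes the target that a classifier such as a random forest would be trained to predict.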


Development of Machine Learning Models to Predict Probabilities and Types of Stroke at Prehospital Stage: the Japan Urgent Stroke Triage Score Using Machine Learning (JUST-ML)

Translational Stroke Research ◽

10.1007/s12975-021-00937-x ◽

2021 ◽

Author(s):

Kazutaka Uchida ◽

Junichi Kouno ◽

Shinichi Yoshimura ◽

Norito Kinjo ◽

Fumihiro Sakakibara ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forests ◽

Prediction Models ◽

Characteristic Curve ◽

Predictive Performance ◽

Vessel Occlusion ◽

Predictive Values ◽

Training Cohort ◽

Sensitivity Specificity

In conjunction with recent advancements in machine learning (ML), such technologies have been applied in various fields owing to their high predictive performance. We sought to develop a prehospital stroke scale with ML. We conducted a multi-center retrospective and prospective cohort study. The training cohort comprised eight centers in Japan from June 2015 to March 2018, and the test cohort comprised 13 centers from April 2019 to March 2020. We used three different ML algorithms (logistic regression, random forests, XGBoost) to develop models. Main outcomes were large vessel occlusion (LVO), intracranial hemorrhage (ICH), subarachnoid hemorrhage (SAH), and cerebral infarction (CI) other than LVO. The predictive abilities were validated in the test cohort with accuracy, positive predictive value, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and F score. The training cohort included 3178 patients with 337 LVO, 487 ICH, 131 SAH, and 676 CI cases, and the test cohort included 3127 patients with 183 LVO, 372 ICH, 90 SAH, and 577 CI cases. The overall accuracies were 0.65, and the positive predictive values, sensitivities, specificities, AUCs, and F scores were stable in the test cohort. The classification abilities were also fair for all ML models. The AUCs for LVO of logistic regression, random forests, and XGBoost were 0.89, 0.89, and 0.88, respectively, in the test cohort, and these values were higher than those of previously reported prediction models for LVO. The ML models developed to predict the probability and types of stroke at the prehospital stage had superior predictive abilities.
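Most of the validation metrics named in this abstract can be computed per outcome by treating one class as positive and all others as negative. The sketch below shows that one-vs-rest calculation with the standard library; the function name `one_vs_rest_metrics` and the ten toy labels are invented for illustration and are not the study's data. AUC is left out because it needs predicted scores rather than hard labels.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity, PPV and F score for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    def ratio(numerator, denominator):
        # Report 0.0 instead of failing when a class never occurs.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)
    f_score = ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f_score": f_score,
    }


# Toy labels (hypothetical), using the outcome names from the abstract.
y_true = ["LVO", "LVO", "LVO", "ICH", "ICH", "SAH", "CI", "CI", "CI", "CI"]
y_pred = ["LVO", "LVO", "CI", "ICH", "LVO", "SAH", "CI", "CI", "ICH", "CI"]

lvo = one_vs_rest_metrics(y_true, y_pred, positive="LVO")
overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Overall accuracy is computed across all classes at once, while the other four values are reported separately for each outcome.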


Performance Improvement of Decision Tree: A Robust Classifier Using Tabu Search Algorithm

Applied Sciences ◽

10.3390/app11156728 ◽

2021 ◽

Vol 11 (15) ◽

pp. 6728

Author(s):

Muhammad Asfand Hafeez ◽

Muhammad Rashid ◽

Hassan Tariq ◽

Zain Ul Abideen ◽

Saud S. Alotaibi ◽

...

Keyword(s):

Machine Learning ◽

Tabu Search ◽

Decision Tree ◽

Decision Trees ◽

Search Algorithm ◽

Learning Algorithms ◽

Performance Comparison ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Tabu Search Algorithm

Classification and regression are the major applications of machine learning algorithms, which are widely used to solve problems in numerous domains of engineering and computer science. Different classifiers based on the optimization of the decision tree have been proposed; however, this area is still evolving. This paper presents a novel and robust classifier based on decision tree and tabu search algorithms. To improve performance, the proposed algorithm constructs multiple decision trees while employing a tabu search algorithm to consistently monitor the leaf and decision nodes in the corresponding decision trees. Additionally, the tabu search algorithm is responsible for balancing the entropy of the corresponding decision trees. For training the model, we used the clinical data of COVID-19 patients to predict whether a patient is suffering. The experimental results were obtained using our proposed classifier, built on the scikit-learn library in Python. An extensive performance comparison against conventional supervised machine learning algorithms is presented using Big O and statistical analysis. Moreover, a performance comparison with optimized state-of-the-art classifiers is also presented. The achieved accuracy of 98%, the required execution time of 55.6 ms, and the area under the receiver operating characteristic curve (AUROC) of 0.95 for the proposed method reveal that the proposed classifier is convenient for large datasets.
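The entropy mentioned in this abstract is the usual impurity measure for decision tree nodes. The sketch below shows the standard Shannon entropy of a node and the information gain of a split; it does not reproduce the paper's tabu search procedure, and the function names and toy labels are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels reaching a node."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` into `children`."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy node (hypothetical): two samples of each class.
parent = [0, 0, 1, 1]
perfect_split = information_gain(parent, [[0, 0], [1, 1]])
useless_split = information_gain(parent, [[0, 1], [0, 1]])
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves each child as mixed as the parent gains nothing.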


An Optimized Stacking Ensemble Model for Phishing Websites Detection

Electronics ◽

10.3390/electronics10111285 ◽

2021 ◽

Vol 10 (11) ◽

pp. 1285

Author(s):

Mohammed Al-Sarem ◽

Faisal Saeed ◽

Zeyad Ghaleb Al-Mekhlafi ◽

Badiea Abdulkarem Mohammed ◽

Tawfik Al-Hadhrami ◽

...

Keyword(s):

Machine Learning ◽

Random Forests ◽

Ensemble Method ◽

Detection Methods ◽

Detection Accuracy ◽

Ensemble Model ◽

Security Attacks ◽

Data Set ◽

Machine Learning Methods ◽

Ensemble Machine Learning

Security attacks on legitimate websites to steal users’ information, known as phishing attacks, have been increasing. This kind of attack does not just affect individuals’ or organisations’ websites. Although several detection methods for phishing websites have been proposed using machine learning, deep learning, and other approaches, their detection accuracy still needs to be enhanced. This paper proposes an optimized stacking ensemble method for phishing website detection. The optimisation was carried out using a genetic algorithm (GA) to tune the parameters of several ensemble machine learning methods, including random forests, AdaBoost, XGBoost, Bagging, GradientBoost, and LightGBM. The optimized classifiers were then ranked, and the best three models were chosen as base classifiers of a stacking ensemble method. The experiments were conducted on three phishing website datasets that consisted of both phishing websites and legitimate websites: the Phishing Websites Data Set from UCI (Dataset 1), the Phishing Dataset for Machine Learning from Mendeley (Dataset 2), and the Datasets for Phishing Websites Detection from Mendeley (Dataset 3). The experimental results showed an improvement using the optimized stacking ensemble method, where the detection accuracy reached 97.16%, 98.58%, and 97.39% for Dataset 1, Dataset 2, and Dataset 3, respectively.
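The rank-then-stack step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the held-out predictions are invented, genetic-algorithm tuning is not shown, and the meta-learner is assumed to be a small logistic regression fitted by gradient descent. The names `top_k`, `fit_meta_learner`, and `meta_predict` are hypothetical.

```python
import math

# Hypothetical held-out predictions from six already-tuned classifiers
# (1 = phishing, 0 = legitimate); the values are invented for illustration.
targets = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "random_forest": [1, 1, 1, 0, 0, 0, 0, 0],
    "adaboost": [1, 1, 0, 1, 0, 0, 1, 0],
    "xgboost": [1, 1, 1, 1, 0, 0, 0, 1],
    "bagging": [1, 0, 0, 1, 0, 1, 1, 0],
    "gradient_boost": [1, 1, 0, 1, 1, 0, 0, 0],
    "lightgbm": [0, 1, 1, 1, 0, 0, 0, 0],
}


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def top_k(scores, k=3):
    """Return the k model names with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fit_meta_learner(rows, labels, epochs=5000, lr=0.5):
    """Fit a logistic-regression meta-learner with batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def meta_predict(weights, bias, row):
    return 1 if bias + sum(w * x for w, x in zip(weights, row)) > 0 else 0


# Rank the candidates and keep the best three as base classifiers.
scores = {name: accuracy(pred, targets) for name, pred in predictions.items()}
base_models = top_k(scores, k=3)

# The meta-learner sees one column per base classifier.
meta_features = [
    [predictions[name][i] for name in base_models] for i in range(len(targets))
]
weights, bias = fit_meta_learner(meta_features, targets)
stacked = [meta_predict(weights, bias, row) for row in meta_features]
```

In this toy example the stacked output is checked on the same samples the meta-learner was fitted on, so it only shows the mechanics; a real evaluation would score the ensemble on data that neither level has seen.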


Improving External Validity of Machine Learning, Reduced Form, and Structural Macroeconomic Models using Panel Data

SSRN Electronic Journal ◽

10.2139/ssrn.3839863 ◽

2021 ◽

Author(s):

Cameron Fen ◽

Samir Undavia

Keyword(s):

Machine Learning ◽

Panel Data ◽

External Validity ◽

Reduced Form ◽

Macroeconomic Models
