Application of Ensemble Learning Using Weight Voting Protocol in the Prediction of Pile Bearing Capacity

Accurate prediction of pile bearing capacity is an important part of foundation engineering. Notably, the determination of pile bearing capacity through an in situ load test is costly and time-consuming. Therefore, this study focused on developing a machine learning algorithm, namely, Ensemble Learning (EL), using weight voting protocol of three base machine learning algorithms, gradient boosting (GB), random forest (RF), and classic linear regression (LR), to predict the bearing capacity of the pile. Data includes 108 pile load tests under different conditions used for model training and testing. Performance evaluation indicators such as R-square (R2), root mean square error (RMSE), and MAE (mean absolute error) were used to evaluate the performance of models showing the efficiency of predicting pile bearing capacity with outstanding performance compared to other models. The results also showed that the EL model with a weight combination of w 1 = 0.482, w 2 = 0.338, and w 3 = 0.18 corresponding to the models GB, RF, and LR gave the best performance and achieved the best balance on all data sets. In addition, the global sensitivity analysis technique was used to detect the most important input features in determining the bearing capacity of the pile. This study provides an effective tool to predict pile load capacity with expert performance.

Download Full-text

Windows PE Malware Detection Using Ensemble Learning

Informatics ◽

10.3390/informatics8010010 ◽

2021 ◽

Vol 8 (1) ◽

pp. 10

Author(s):

Nureni Ayofe Azeez ◽

Oluwanifise Ebunoluwa Odufuwa ◽

Sanjay Misra ◽

Jonathan Oluranti ◽

Robertas Damaševičius

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Ensemble Learning ◽

Learning Algorithm ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Internet Age ◽

Stage Classification ◽

End Stage

In this Internet age, there are increasingly many threats to the security and safety of users daily. One of such threats is malicious software otherwise known as malware (ransomware, Trojans, viruses, etc.). The effect of this threat can lead to loss or malicious replacement of important information (such as bank account details, etc.). Malware creators have been able to bypass traditional methods of malware detection, which can be time-consuming and unreliable for unknown malware. This motivates the need for intelligent ways to detect malware, especially new malware which have not been evaluated or studied before. Machine learning provides an intelligent way to detect malware and comprises two stages: feature extraction and classification. This study suggests an ensemble learning-based method for malware detection. The base stage classification is done by a stacked ensemble of fully-connected and one-dimensional convolutional neural networks (CNNs), whereas the end-stage classification is done by a machine learning algorithm. For a meta-learner, we analyzed and compared 15 machine learning classifiers. For comparison, five machine learning algorithms were used: naïve Bayes, decision tree, random forest, gradient boosting, and AdaBoosting. The results of experiments made on the Windows Portable Executable (PE) malware dataset are presented. The best results were obtained by an ensemble of seven neural networks and the ExtraTrees classifier as a final-stage classifier.

Download Full-text

Validation and Generalizability of Machine Learning Prediction Models on Attrition in Longitudinal Studies

10.31234/osf.io/mzhvx ◽

2021 ◽

Author(s):

Kristin Jankowsky ◽

Ulrich Schroeders

Keyword(s):

Machine Learning ◽

Longitudinal Studies ◽

Prediction Models ◽

Learning Algorithm ◽

Missing At Random ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Unrealistic Assumption ◽

Nationally Representative ◽

Survey Attrition

Attrition in longitudinal studies is a major threat to the representativeness of the data and the generalizability of the findings. Typical approaches to address systematic nonresponse are either expensive and unsatisfactory (e.g., oversampling) or rely on the unrealistic assumption of data missing at random (e.g., multiple imputation). Thus, models that effectively predict who most likely drops out in subsequent occasions might offer the opportunity to take countermeasures (e.g., incentives). With the current study, we introduce a longitudinal model validation approach and examine whether attrition in two nationally representative longitudinal panel studies can be predicted accurately. We compare the performance of a basic logistic regression model to a more flexible, data-driven machine learning algorithm––Gradient Boosting Machines. Our results show almost no difference in accuracies for both modeling approaches, which contradicts claims of similar studies on survey attrition. Prediction models could not be generalized across surveys and were less accurate when tested at a later survey wave. We discuss the implications of these findings for survey retention, the use of complex machine learning algorithms, and give some recommendations to deal with study attrition.

Download Full-text

Classification of hazelnut cultivars: comparison of DL4J and ensemble learning algorithms

Notulae Botanicae Horti Agrobotanici Cluj-Napoca ◽

10.15835/nbha48412041 ◽

2020 ◽

Vol 48 (4) ◽

pp. 2316-2327

Author(s):

Caner KOC ◽

Dilara GERDAN ◽

Maksut B. EMİNOĞLU ◽

Uğur YEGÜL ◽

Bulent KOC ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Random Forest ◽

Ensemble Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Performance Criteria ◽

Gradient Boosting ◽

Data Set

Classification of hazelnuts is one of the values adding processes that increase the marketability and profitability of its production. While traditional classification methods are used commonly, machine learning and deep learning can be implemented to enhance the hazelnut classification processes. This paper presents the results of a comparative study of machine learning frameworks to classify hazelnut (Corylus avellana L.) cultivars (‘Sivri’, ‘Kara’, ‘Tombul’) using DL4J and ensemble learning algorithms. For each cultivar, 50 samples were used for evaluations. Maximum length, width, compression strength, and weight of hazelnuts were measured using a caliper and a force transducer. Gradient boosting machine (Boosting), random forest (Bagging), and DL4J feedforward (Deep Learning) algorithms were applied in traditional machine learning algorithms. The data set was partitioned into a 10-fold-cross validation method. The classifier performance criteria of accuracy (%), error percentage (%), F-Measure, Cohen’s Kappa, recall, precision, true positive (TP), false positive (FP), true negative (TN), false negative (FN) values are provided in the results section. The results showed classification accuracies of 94% for Gradient Boosting, 100% for Random Forest, and 94% for DL4J Feedforward algorithms.

Download Full-text

Performance of Machine Learning Algorithms with Different K Values in K-fold CrossValidation

International Journal of Information Technology and Computer Science ◽

10.5815/ijitcs.2021.06.05 ◽

2021 ◽

Vol 13 (6) ◽

pp. 61-71

Author(s):

Isaac Kofi Nti ◽

◽

Owusu N yarko-Boateng ◽

Justice Aning

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Experimental Studies ◽

Area Under The Curve ◽

Essential Element ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Machine Learning Algorithm ◽

K Value

The numerical value of k in a k-fold cross-validation training technique of machine learning predictive models is an essential element that impacts the model’s performance. A right choice of k results in better accuracy, while a poorly chosen value for k might affect the model’s performance. In literature, the most commonly used values of k are five (5) or ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor very high variance. However, there is no formal rule. To the best of our knowledge, few experimental studies attempted to investigate the effect of diverse k values in training different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms (Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN)). It was observed that the value of k and model validation performance differ from one machine-learning algorithm to another for the same classification task. However, our empirical suggest that k = 7 offers a slight increase in validations accuracy and area under the curve measure with lesser computational complexity than k = 10 across most MLA. We discuss in detail the study outcomes and outline some guidelines for beginners in the machine learning field in selecting the best k value and machine learning algorithm for a given task.

Download Full-text

Predicting Peritoneal Metastasis of Gastric Cancer Patients Based on Machine Learning

Cancer Control ◽

10.1177/1073274820968900 ◽

2020 ◽

Vol 27 (1) ◽

pp. 107327482096890

Author(s):

Chengmao Zhou ◽

Ying Wang ◽

Mu-Huo Ji ◽

Jianhua Tong ◽

Jian-Jun Yang ◽

...

Keyword(s):

Machine Learning ◽

Gastric Cancer ◽

Peritoneal Metastasis ◽

Learning Algorithm ◽

Test Group ◽

Training Group ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Depth Of Invasion ◽

Light Gradient

Objective: The aim is to explore the prediction effect of 5 machine learning algorithms on peritoneal metastasis of gastric cancer. Methods: 1080 patients with postoperative gastric cancer were divided into a training group and test group according to the ratio of 7:3. The model of peritoneal metastasis was established by using 5 machine learning (gbm(Light Gradient Boosting Machine), GradientBoosting, forest, Logistic and DecisionTree). Python pair was used to analyze the machine learning algorithm. Gbm algorithm is used to show the weight proportion of each variable to the result. Result: Correlation analysis showed that tumor size and depth of invasion were positively correlated with the recurrence of patients after gastric cancer surgery. The results of the gbm algorithm showed that the top 5 important factors were albumin, platelet count, depth of infiltration, preoperative hemoglobin and weight, respectively. In training group: Among the 5 algorithm models, the accuracy of GradientBoosting and gbm was the highest (0.909); the AUC values of the 5 algorithms are gbm (0.938), GradientBoosting (0.861), forest (0.796), Logistic(0.741) and DecisionTree(0.712) from high to low. In the test group: among the 5 algorithm models, the accuracy of forest, DecisionTree and gbm was the highest (0.907); AUC values ranged from high to low to gbm (0.745), GradientBoosting (0.725), forest (0.696), Logistic (0.680) and DecisionTree (0.657). Conclusion: Machine learning can predict the peritoneal metastasis in patients with gastric cancer.

Download Full-text

Forecasting US movies box office performances in Turkey using machine learning algorithms

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189120 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6579-6590

Author(s):

Sandy Çağlıyor ◽

Başar Öztayşi ◽

Selime Sezgin

Keyword(s):

Machine Learning ◽

Global Economy ◽

Learning Algorithms ◽

Forecast Model ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

High Stakes ◽

Box Office ◽

Industry Forecast ◽

The Impact

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performances and the industry still struggles to predict box office performances in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. From various sources, a dataset of 1559 movies is constructed. Firstly, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristic. The number of attendances is discretized into three classes. Four popular machine learning algorithms, artificial neural networks, decision tree regression and gradient boosting tree and random forest are employed, and the impact of each group is observed by compared by the performance models. Then the number of target classes is increased into five and eight and results are compared with the previously developed models in the literature.

Download Full-text

Intelligent system of English composition scoring model based on improved machine learning algorithm

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189235 ◽

2020 ◽

pp. 1-11

Author(s):

Jie Liu ◽

Lin Lin ◽

Xiufang Liang

Keyword(s):

Machine Learning ◽

Evaluation System ◽

Intelligent System ◽

Learning Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Assessment System ◽

English Composition ◽

Region Extraction ◽

Constraint Model

The online English teaching system has certain requirements for the intelligent scoring system, and the most difficult stage of intelligent scoring in the English test is to score the English composition through the intelligent model. In order to improve the intelligence of English composition scoring, based on machine learning algorithms, this study combines intelligent image recognition technology to improve machine learning algorithms, and proposes an improved MSER-based character candidate region extraction algorithm and a convolutional neural network-based pseudo-character region filtering algorithm. In addition, in order to verify whether the algorithm model proposed in this paper meets the requirements of the group text, that is, to verify the feasibility of the algorithm, the performance of the model proposed in this study is analyzed through design experiments. Moreover, the basic conditions for composition scoring are input into the model as a constraint model. The research results show that the algorithm proposed in this paper has a certain practical effect, and it can be applied to the English assessment system and the online assessment system of the homework evaluation system algorithm system.

Download Full-text

Predicting Undesired Treatment Outcome in Mental Healthcare: Machine Learning Study (Preprint)

10.2196/preprints.17235 ◽

2019 ◽

Author(s):

Kasper Van Mens ◽

Joran Lokkerbol ◽

Richard Janssen ◽

Robert de Lange ◽

Bea Tiemens

Keyword(s):

Machine Learning ◽

Treatment Outcome ◽

Mental Health Treatment ◽

Mental Healthcare ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Trade Off ◽

Trade Offs ◽

Outcome Monitoring ◽

Extreme Gradient Boosting

BACKGROUND It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE In this study we compare machine algorithms to predict during treatment which patients will not benefit from brief mental health treatment and present trade-offs that must be considered before an algorithm can be used in clinical practice. METHODS Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS The performance of the three algorithms did not significantly differ on the test set. With a default classification cut-off at 0.5 predicted probability, the extreme gradient boosting algorithm showed the highest positive predictive value (ppv) of 0.71(0.61 – 0.77) with a sensitivity of 0.35 (0.29 – 0.41) and area under the curve of 0.78. A trade-off can be made between ppv and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the ppv increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off of at 0.38, the ppv decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS Machine learning can be used to predict treatment outcomes based on routine monitoring data.This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.

Download Full-text

Potassium Ferro cyanide electrochemically detected by Differential Pulse and Square Wave Voltammetry in a Competition using Gradient Boosting as Machine Learning Algorithm

2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) ◽

10.1109/ismsit50672.2020.9254514 ◽

2020 ◽

Author(s):

Nemah Abu Shama ◽

Suleyman Asir ◽

Kamil Dimililer ◽

Mehmet Ozsoz

Keyword(s):

Machine Learning ◽

Square Wave Voltammetry ◽

Learning Algorithm ◽

Differential Pulse ◽

Gradient Boosting ◽

Machine Learning Algorithm ◽

Square Wave ◽

Wave Voltammetry

Download Full-text

Efficient detection of hacker community based on twitter data using complex networks and machine learning algorithm

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-210458 ◽

2021 ◽

pp. 1-17

Author(s):

Ahmed Al-Tarawneh ◽

Ja’afer Al-Saraireh

Keyword(s):

Machine Learning ◽

Complex Networks ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

K Nearest Neighbor ◽

Efficient Detection ◽

Suggested Keywords

Twitter is one of the most popular platforms used to share and post ideas. Hackers and anonymous attackers use these platforms maliciously, and their behavior can be used to predict the risk of future attacks, by gathering and classifying hackers’ tweets using machine-learning techniques. Previous approaches for detecting infected tweets are based on human efforts or text analysis, thus they are limited to capturing the hidden text between tweet lines. The main aim of this research paper is to enhance the efficiency of hacker detection for the Twitter platform using the complex networks technique with adapted machine learning algorithms. This work presents a methodology that collects a list of users with their followers who are sharing their posts that have similar interests from a hackers’ community on Twitter. The list is built based on a set of suggested keywords that are the commonly used terms by hackers in their tweets. After that, a complex network is generated for all users to find relations among them in terms of network centrality, closeness, and betweenness. After extracting these values, a dataset of the most influential users in the hacker community is assembled. Subsequently, tweets belonging to users in the extracted dataset are gathered and classified into positive and negative classes. The output of this process is utilized with a machine learning process by applying different algorithms. This research build and investigate an accurate dataset containing real users who belong to a hackers’ community. Correctly, classified instances were measured for accuracy using the average values of K-nearest neighbor, Naive Bayes, Random Tree, and the support vector machine techniques, demonstrating about 90% and 88% accuracy for cross-validation and percentage split respectively. Consequently, the proposed network cyber Twitter model is able to detect hackers, and determine if tweets pose a risk to future institutions and individuals to provide early warning of possible attacks.

Download Full-text