Windows PE Malware Detection Using Ensemble Learning

Informatics, 2021, Vol 8 (1), pp. 10
Author(s): Nureni Ayofe Azeez, Oluwanifise Ebunoluwa Odufuwa, Sanjay Misra, Jonathan Oluranti, Robertas Damaševičius

In this Internet age, there are increasingly many threats to the security and safety of users daily. One such threat is malicious software, otherwise known as malware (ransomware, Trojans, viruses, etc.). Such threats can lead to the loss or malicious replacement of important information, such as bank account details. Malware creators have been able to bypass traditional methods of malware detection, which can be time-consuming and unreliable for unknown malware. This motivates the need for intelligent ways to detect malware, especially new malware that has not been evaluated or studied before. Machine learning provides an intelligent way to detect malware and comprises two stages: feature extraction and classification. This study suggests an ensemble learning-based method for malware detection. The base-stage classification is done by a stacked ensemble of fully connected and one-dimensional convolutional neural networks (CNNs), whereas the end-stage classification is done by a machine learning algorithm. For the meta-learner, we analyzed and compared 15 machine learning classifiers. For comparison, five machine learning algorithms were used: naïve Bayes, decision tree, random forest, gradient boosting, and AdaBoost. The results of experiments on the Windows Portable Executable (PE) malware dataset are presented. The best results were obtained by an ensemble of seven neural networks with the ExtraTrees classifier as the final-stage classifier.
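As a rough illustration of the stacking architecture this abstract describes, the following Python sketch (scikit-learn assumed) stacks several neural-network base learners under an ExtraTrees meta-learner. MLPClassifier is only a stand-in for the paper's fully connected/1D-CNN base models, and the synthetic data and all parameters are placeholders, not the authors' configuration.

```python
# Minimal stacking sketch: seven neural-network base learners, ExtraTrees meta-learner.
from sklearn.ensemble import StackingClassifier, ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic stand-in for extracted PE features
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_learners = [
    (f"nn_{i}", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=i))
    for i in range(7)  # the best-performing ensemble reportedly used seven networks
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=ExtraTreesClassifier(n_estimators=200, random_state=0),
    stack_method="predict_proba",  # meta-learner sees base-model probabilities
)
stack.fit(X_train, y_train)
print("hold-out accuracy:", stack.score(X_test, y_test))
```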

2021, Vol 2021, pp. 1-14
Author(s): Tuan Anh Pham, Huong-Lan Thi Vu

Accurate prediction of pile bearing capacity is an important part of foundation engineering. Notably, determining pile bearing capacity through an in situ load test is costly and time-consuming. Therefore, this study focused on developing a machine learning algorithm, namely Ensemble Learning (EL), using a weighted voting protocol over three base machine learning algorithms, gradient boosting (GB), random forest (RF), and classic linear regression (LR), to predict the bearing capacity of the pile. The data comprise 108 pile load tests performed under different conditions, used for model training and testing. Performance indicators, namely the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE), were used to evaluate the models; the EL model predicted pile bearing capacity with outstanding performance compared to the other models. The results also showed that the EL model with a weight combination of w1 = 0.482, w2 = 0.338, and w3 = 0.180, corresponding to the GB, RF, and LR models, gave the best performance and achieved the best balance across all data sets. In addition, a global sensitivity analysis technique was used to identify the most important input features in determining the bearing capacity of the pile. This study provides an effective tool to predict pile load capacity with expert-level performance.
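A minimal sketch, assuming scikit-learn, of the weighted-voting idea described above: a VotingRegressor combining GB, RF, and LR with the reported weights. The synthetic data, hyperparameters, and train/test split are illustrative placeholders.

```python
# Weighted-voting ensemble of GB, RF, and LR for a regression target.
import numpy as np
from sklearn.ensemble import VotingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Synthetic stand-in for the 108 pile load tests
X, y = make_regression(n_samples=108, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ensemble = VotingRegressor(
    estimators=[
        ("gb", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("lr", LinearRegression()),
    ],
    weights=[0.482, 0.338, 0.180],  # weights reported in the abstract
)
ensemble.fit(X_train, y_train)
pred = ensemble.predict(X_test)
print("R2  :", r2_score(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("MAE :", mean_absolute_error(y_test, pred))
```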


Author(s): Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to malware detection, such as packet content analysis, are inefficient when dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and similar metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machines, random forests, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. The random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
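A hedged sketch of the pipeline the abstract outlines: feature selection followed by SVM, random forest, and XGBoost classifiers evaluated by ROC AUC. It assumes scikit-learn and the xgboost package; the synthetic flow-metadata features, the choice of mutual information as the selection criterion, and all parameters are illustrative, not the paper's setup.

```python
# Feature selection + SVM / RF / XGBoost comparison with AUC evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier  # assumes xgboost is installed

X, y = make_classification(n_samples=5000, n_features=80, n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "svm": make_pipeline(SelectKBest(mutual_info_classif, k=20), StandardScaler(), SVC(probability=True)),
    "random_forest": make_pipeline(SelectKBest(mutual_info_classif, k=20), RandomForestClassifier(n_estimators=300, random_state=0)),
    "xgboost": make_pipeline(SelectKBest(mutual_info_classif, k=20), XGBClassifier(eval_metric="logloss")),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    print(name, "AUC:", round(roc_auc_score(y_test, scores), 4))
```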


Electronics, 2020, Vol 9 (11), pp. 1777
Author(s): Muhammad Ali, Stavros Shiaeles, Gueltoum Bendiab, Bogdan Ghita

Detection and mitigation of modern malware are critical for the normal operation of an organisation. Traditional defence mechanisms are becoming increasingly ineffective due to the techniques used by attackers, such as code obfuscation, metamorphism, and polymorphism, which strengthen the resilience of malware. In this context, the development of adaptive, more effective malware detection methods has been identified as an urgent requirement for protecting the IT infrastructure against such threats and for ensuring security. In this paper, we investigate an alternative method for malware detection that is based on N-grams and machine learning. We use a dynamic analysis technique to extract an Indicator of Compromise (IOC) for malicious files, which are represented using N-grams. The paper also proposes TF-IDF as a novel alternative for identifying the most significant N-gram features for training a machine learning algorithm. Finally, the paper evaluates the proposed technique using various supervised machine learning algorithms. The results show that Logistic Regression, with a score of 98.4%, provides the best classification accuracy when compared to the other classifiers used.
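The core combination described here, TF-IDF weighting over N-grams feeding a Logistic Regression classifier, can be sketched as follows (scikit-learn assumed). The toy "behaviour trace" strings and labels are invented placeholders for the dynamically extracted IOCs, not the paper's data.

```python
# TF-IDF over word N-grams + Logistic Regression for malicious/benign classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "createfile writefile regsetvalue connect",     # hypothetical malicious trace
    "openprocess virtualalloc writeprocessmemory",  # hypothetical malicious trace
    "createfile readfile closehandle",              # hypothetical benign trace
    "openfile readfile closehandle sleep",          # hypothetical benign trace
]
labels = [1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), analyzer="word"),  # uni-, bi- and tri-grams weighted by TF-IDF
    LogisticRegression(max_iter=1000),
)
model.fit(reports, labels)
print(model.predict(["virtualalloc writeprocessmemory connect"]))
```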


2020, Vol 21 (1)
Author(s): Zvi Segal, Dan Kalifa, Kira Radinsky, Bar Ehrenberg, Guy Elad, ...

Abstract Background: End-stage renal disease (ESRD) describes the most severe stage of chronic kidney disease (CKD), when patients need dialysis or a renal transplant. There is often a delay in recognizing, diagnosing, and treating the various etiologies of CKD. The objective of the present study was to employ machine learning algorithms to develop a prediction model for progression to ESRD based on a large-scale multidimensional database. Methods: This study analyzed 10,000,000 medical insurance claims from 550,000 patient records using a commercial health insurance database. Inclusion criteria were patients over the age of 18 diagnosed with CKD stages 1–4. We compiled 240 predictor candidates, divided into six feature groups: demographics, chronic conditions, diagnosis and procedure features, medication features, medical costs, and episode counts. We used a feature embedding method based on an implementation of the Word2Vec algorithm to further capture temporal information for the three main components of the data: diagnoses, procedures, and medications. For the analysis, we used the gradient boosting tree algorithm (XGBoost implementation). Results: The C-statistic for the model was 0.93 (95% confidence interval 0.916–0.943), with a sensitivity of 0.715 and a specificity of 0.958. The positive predictive value (PPV) was 0.517, and the negative predictive value (NPV) was 0.981. For the top 1 percentile of patients identified by our model, the PPV was 1.0; for the top 5 percentile, the PPV was 0.71. All the results above were obtained on the test data only, and the threshold used to obtain them was 0.1. Notable features contributing to the model were chronic heart and ischemic heart disease as comorbidities, patient age, and the number of hypertensive crisis events. Conclusions: When a patient is approaching the threshold of ESRD risk, a warning message can be sent electronically to the physician, who can initiate a referral for a nephrology consultation to hasten the establishment of a diagnosis and initiate management and therapy when appropriate.
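A hypothetical sketch of the feature-embedding idea in the Methods: learn Word2Vec vectors over sequences of medical codes, mean-pool them per patient, and feed the result to a gradient boosting model. It assumes gensim (version 4 or later) and xgboost; the codes, sequences, labels, and vector sizes are invented for illustration and have no relation to the study's data.

```python
# Word2Vec embeddings of medical codes + XGBoost classifier (toy example).
import numpy as np
from gensim.models import Word2Vec
from xgboost import XGBClassifier

patient_code_sequences = [
    ["N18.3", "I10", "E11.9"],     # hypothetical claim history, patient 1
    ["N18.4", "I25.10", "Z99.2"],  # patient 2
    ["N18.2", "E78.5"],            # patient 3
    ["N18.4", "I50.9", "I10"],     # patient 4
]
progressed_to_esrd = np.array([0, 1, 0, 1])

# Learn code embeddings from the ordered sequences (captures co-occurrence/temporal context)
w2v = Word2Vec(patient_code_sequences, vector_size=16, window=5, min_count=1, seed=0)

def embed(seq):
    # Mean-pool code vectors into one fixed-length patient feature vector
    return np.mean([w2v.wv[c] for c in seq], axis=0)

X = np.vstack([embed(s) for s in patient_code_sequences])
clf = XGBClassifier(eval_metric="logloss").fit(X, progressed_to_esrd)
print(clf.predict_proba(X)[:, 1])
```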


2021
Author(s): Kristin Jankowsky, Ulrich Schroeders

Attrition in longitudinal studies is a major threat to the representativeness of the data and the generalizability of the findings. Typical approaches to address systematic nonresponse are either expensive and unsatisfactory (e.g., oversampling) or rely on the unrealistic assumption of data missing at random (e.g., multiple imputation). Thus, models that effectively predict who is most likely to drop out on subsequent occasions might offer the opportunity to take countermeasures (e.g., incentives). With the current study, we introduce a longitudinal model validation approach and examine whether attrition in two nationally representative longitudinal panel studies can be predicted accurately. We compare the performance of a basic logistic regression model to that of a more flexible, data-driven machine learning algorithm: Gradient Boosting Machines. Our results show almost no difference in accuracy between the two modeling approaches, which contradicts claims of similar studies on survey attrition. Prediction models could not be generalized across surveys and were less accurate when tested at a later survey wave. We discuss the implications of these findings for survey retention and the use of complex machine learning algorithms, and give some recommendations for dealing with study attrition.
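An illustrative comparison of the two modeling approaches named above, a logistic regression baseline versus a gradient boosting classifier for dropout prediction, assuming scikit-learn. The simulated data stand in for panel-survey features; nothing here reproduces the study's models or results.

```python
# Baseline logistic regression vs. gradient boosting for predicting dropout.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced synthetic data: ~30% "dropout" cases
X, y = make_classification(n_samples=3000, n_features=40, weights=[0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("gradient boosting", GradientBoostingClassifier(random_state=1))]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: acc={accuracy_score(y_test, proba > 0.5):.3f}, auc={roc_auc_score(y_test, proba):.3f}")
```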


2021, Vol 71 (4), pp. 302-317
Author(s): Jelena Đuriš, Ivana Kurćubić, Svetlana Ibrić

Machine learning algorithms, and artificial intelligence in general, have a wide range of applications in the field of pharmaceutical technology. Starting from formulation development, and with great potential for integration within the Quality by Design framework, these data science tools provide a better understanding of pharmaceutical formulations and their processing. Machine learning algorithms can be especially helpful in the analysis of the large volumes of data generated by process analytical technologies. This paper provides a brief explanation of artificial neural networks, one of the most frequently used machine learning algorithms. The process of network training and testing is described and accompanied by illustrative examples of machine learning tools applied in the context of pharmaceutical formulation development and related technologies, as well as an overview of future trends. Recently published studies on more sophisticated methods, such as deep neural networks and the light gradient boosting machine algorithm, are also described. The interested reader is referred to several official documents (guidelines) that pave the way for a more structured representation of machine learning models in prospective submissions to regulatory bodies.


Author(s): Alae Chouiekh, El Hassane Ibn El Haj

Several machine learning models have been proposed to address customer churn problems. In this work, the authors used a novel method by applying deep convolutional neural networks to a labeled dataset of 18,000 prepaid subscribers to classify/identify customer churn. The learning technique was based on call detail records (CDR) describing customers' activity over two months of traffic from a real telecommunication provider. The authors use this method to address a new business use case by considering each subscriber as a single input image describing the churning state. Different experiments were performed to evaluate the performance of the method. The authors found that deep convolutional neural networks (DCNN) outperformed other traditional machine learning algorithms (support vector machines, random forest, and gradient boosting classifier) with an F1 score of 91%. Thus, the use of this approach can reduce the cost related to customer loss and better fits the churn prediction business use case.
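A hedged sketch (TensorFlow/Keras assumed) of the core idea: treat each subscriber's CDR activity as a small single-channel "image" and classify churn with a compact CNN. The 60 x 24 shape, random data, and network layout are assumptions for illustration only, not the authors' DCNN architecture.

```python
# Compact CNN over per-subscriber CDR "images" for churn classification (toy setup).
import numpy as np
import tensorflow as tf

# e.g. 60 days x 24 hours of aggregated activity per subscriber (synthetic)
X = np.random.rand(256, 60, 24, 1).astype("float32")
y = np.random.randint(0, 2, size=256)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(60, 24, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # churn probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```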


2020, Vol 48 (4), pp. 2316-2327
Author(s): Caner KOC, Dilara GERDAN, Maksut B. EMİNOĞLU, Uğur YEGÜL, Bulent KOC, ...

Classification of hazelnuts is one of the value-adding processes that increase the marketability and profitability of hazelnut production. While traditional classification methods are commonly used, machine learning and deep learning can be implemented to enhance hazelnut classification processes. This paper presents the results of a comparative study of machine learning frameworks to classify hazelnut (Corylus avellana L.) cultivars (‘Sivri’, ‘Kara’, ‘Tombul’) using DL4J and ensemble learning algorithms. For each cultivar, 50 samples were used for evaluation. The maximum length, width, compression strength, and weight of the hazelnuts were measured using a caliper and a force transducer. Gradient boosting machine (boosting), random forest (bagging), and DL4J feedforward (deep learning) algorithms were applied. The data set was evaluated using 10-fold cross-validation. Classifier performance criteria, namely accuracy (%), error percentage (%), F-measure, Cohen’s kappa, recall, precision, and true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values, are provided in the results section. The results showed classification accuracies of 94% for gradient boosting, 100% for random forest, and 94% for the DL4J feedforward algorithm.
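A minimal sketch of the comparison setup under 10-fold cross-validation, assuming scikit-learn. An MLPClassifier stands in for the DL4J feedforward network, and the four feature columns (length, width, compression strength, weight) are filled with random placeholder values rather than the measured data.

```python
# GBM vs. random forest vs. feedforward net under 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# 150 nuts x 4 features: length, width, compression strength, weight (synthetic)
X = rng.normal(size=(150, 4))
y = np.repeat([0, 1, 2], 50)  # three cultivars, 50 samples each

for name, clf in [("gradient boosting", GradientBoostingClassifier(random_state=0)),
                  ("random forest", RandomForestClassifier(random_state=0)),
                  ("feedforward net", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.2%}")
```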


Author(s): Isaac Kofi Nti, Owusu Nyarko-Boateng, Justice Aning

The numerical value of k in the k-fold cross-validation training technique for machine learning predictive models is an essential element that impacts the model’s performance. A right choice of k results in better accuracy, while a poorly chosen value for k might affect the model’s performance. In the literature, the most commonly used values of k are five (5) or ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor from very high variance; however, there is no formal rule. To the best of our knowledge, few experimental studies have attempted to investigate the effect of diverse k values on the training of different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15, and 20) on the validation performance of four well-known machine learning algorithms: Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT), and K-Nearest Neighbours (KNN). It was observed that the value of k and the model validation performance differ from one machine learning algorithm to another for the same classification task. However, our empirical results suggest that k = 7 offers a slight increase in validation accuracy and area under the curve, with lower computational complexity than k = 10, across most of the machine learning algorithms. We discuss the study outcomes in detail and outline some guidelines for beginners in the machine learning field for selecting the best k value and machine learning algorithm for a given task.
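An illustrative sketch of how the different k values can be compared across the four algorithms named above, assuming scikit-learn. The dataset is synthetic and the printed accuracies are not the paper's results.

```python
# Compare validation accuracy of GBM, LR, DT, and KNN across several k values.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
models = {
    "GBM": GradientBoostingClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}
for k in (3, 5, 7, 10, 15, 20):
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=k).mean()
        print(f"k={k:2d}  {name:3s}  accuracy={acc:.3f}")
```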


2021, Vol 5 (4 (113)), pp. 55-63
Author(s): Beimbet Daribayev, Aksultan Mukhanbet, Yedil Nurakhov, Timur Imankulov

The problem of oil displacement was solved using neural networks and machine learning classifiers. The Buckley-Leverett model, which describes the process of oil displacement by water, is selected. It consists of the continuity equations for the oil and water phases and Darcy’s law. The challenge is to optimize the oil displacement problem. Optimization is performed at three levels: vectorization of calculations, implementation of classical algorithms, and implementation of the algorithm using neural networks. A feature of the proposed method is the identification of the approach with the highest accuracy and smallest errors by comparing the results of machine learning classifiers and different types of neural networks. The paper is also one of the first in which machine learning classifiers are compared with neural networks and recurrent neural networks. The classification was carried out with three algorithms: decision tree, support vector machine (SVM), and gradient boosting. As a result of the study, the gradient boosting classifier and the neural network showed high accuracy, 99.99% and 97.4%, respectively. The recurrent neural network trained faster than the others, while the SVM classifier had the lowest accuracy score. To achieve this goal, a dataset was created containing over 67,000 data points for 10 classes. These data are important for problems of oil displacement in porous media. The proposed methodology provides a simple and elegant way to instill oil-displacement domain knowledge into machine learning algorithms. This removes two of the most significant drawbacks of machine learning algorithms: the need for large datasets and the limited robustness of extrapolation. The presented principles can be generalized in countless ways in the future and should lead to a new class of algorithms for solving both forward and inverse oil problems.
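A hedged sketch of the classifier comparison described above (decision tree, SVM, gradient boosting over a 10-class target), assuming scikit-learn. The synthetic features are placeholders for the Buckley-Leverett simulation outputs; the neural and recurrent network variants are omitted here.

```python
# Decision tree vs. SVM vs. gradient boosting on a synthetic 10-class problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=10, n_classes=10,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("SVM", SVC()),
                  ("gradient boosting", GradientBoostingClassifier(random_state=0))]:
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```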

