Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

Abstract: Since the advent of encryption, there has been a steady increase in malware transmitted over encrypted networks. Traditional malware detection approaches such as packet content analysis are ineffective on encrypted data. In the absence of actual packet contents, we can use other features such as packet size, arrival time, source and destination addresses, and similar metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. The machine learning algorithms are trained on the training set, and the resulting models are evaluated against the testing set to assess their respective performances. We further tune the hyperparameters of the algorithms to achieve better results. Random forest and extreme gradient boosting performed exceptionally well in our experiments, with area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.

Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection on data with high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) were evaluated in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGBtree and XGBdart) and linear (XGBlin) classifiers. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (nRMSE of 18.4% and bias of 0.046 m) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (nRMSE of 15.8% and bias of −0.244 m) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection; it reduced the data by 95%, selecting the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.

Abstract 15895: Machine Learning Algorithms to Predict Major Adverse Cardiovascular Events in Patients Undergoing Orthotopic Liver Transplantation: A Retrospective Cohort Study

Circulation ◽

10.1161/circ.142.suppl_3.15895 ◽

2020 ◽

Vol 142 (Suppl_3) ◽

Author(s):

Vardhmaan Jain ◽

Vikram Sharma ◽

Agam Bansal ◽

Cerise Kleb ◽

Chirag Sheth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Cardiovascular Events ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Major Adverse Cardiovascular Events ◽

Support Vector ◽

Post Transplant ◽

Extreme Gradient Boosting ◽

All Cause Mortality

Background: Post-transplant major adverse cardiovascular events (MACE) are among the leading causes of death among orthotopic liver transplant (OLT) recipients. Despite years of guideline-directed therapy, there are limited data on predictors of post-OLT MACE. We assessed whether machine learning algorithms (MLA) can predict MACE and all-cause mortality in patients undergoing OLT. Methods: We tested three MLA (support vector machine, extreme gradient boosting (XG-Boost), and random forest) alongside traditional logistic regression for prediction of MACE and all-cause mortality on a cohort of consecutive patients undergoing OLT at our center between 2008 and 2019. The cohort was randomly split into a training (80%) and a testing (20%) cohort. Model performance was assessed using the c-statistic or AUC. Results: We included 1,459 consecutive patients who underwent OLT (mean ± SD age 54.2 ± 13.8 years, 32% female). There were 199 (13.6%) MACE and 289 (20%) deaths at a mean follow-up of 4.56 ± 3.3 years. The random forest MLA was the best performing model for predicting MACE [AUC: 0.78, 95% CI: 0.70-0.85] as well as mortality [AUC: 0.69, 95% CI: 0.61-0.76], with all models performing better when predicting MACE than when predicting mortality. See Table and Figure. Conclusion: Random forest machine learning algorithms were more predictive and discriminative than traditional regression models for predicting major adverse cardiovascular events and all-cause mortality in patients undergoing OLT. Validation and subsequent incorporation of MLA in clinical decision making for OLT candidacy could help risk stratify patients for post-transplant adverse cardiovascular events.
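As a reading aid for the AUC (c-statistic) values quoted in these abstracts, the following minimal sketch computes AUC as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. It is not the authors' code; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels: iterable of 0/1 class labels (1 = event, e.g. MACE).
    scores: iterable of model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from any of the studies listed here.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; rank-based formulations give the same value more efficiently on large cohorts.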

Systematic Evaluation of Machine Learning Algorithms for Neuroanatomically-Based Age Prediction in Youth

10.1101/2021.11.24.469888 ◽

2021 ◽

Author(s):

Mandana Modabbernia ◽

Heather C Whalley ◽

David Glahn ◽

Paul M. Thompson ◽

Rene S. Kahn ◽

...

Keyword(s):

Machine Learning ◽

Computational Efficiency ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Sensitivity Analyses ◽

Gradient Boosting ◽

Support Vector ◽

Age Related ◽

Extreme Gradient Boosting ◽

Brain Age

Application of machine learning algorithms to structural magnetic resonance imaging (sMRI) data has yielded behaviorally meaningful estimates of the biological age of the brain (brain-age). The choice of the machine learning approach in estimating brain-age in children and adolescents is important because age-related brain changes in these age groups are dynamic. However, the comparative performance of the multiple machine learning algorithms available has not been systematically appraised. To address this gap, the present study evaluated the accuracy (Mean Absolute Error; MAE) and computational efficiency of 21 machine learning algorithms using sMRI data from 2,105 typically developing individuals aged 5 to 22 years from five cohorts. The trained models were then tested on an independent holdout dataset comprising 4,078 pre-adolescents (aged 9-10 years). The algorithms encompassed parametric and nonparametric, Bayesian, linear and nonlinear, tree-based, and kernel-based models. Sensitivity analyses were performed for parcellation scheme, number of neuroimaging input features, number of cross-validation folds, and sample size. The best performing algorithms were Extreme Gradient Boosting (MAE of 1.25 years for females and 1.57 years for males), Random Forest Regression (MAE of 1.23 years for females and 1.65 years for males) and Support Vector Regression with Radial Basis Function Kernel (MAE of 1.47 years for females and 1.72 years for males), all of which had acceptable and comparable computational efficiency. The findings of the present study could be used as a guide for optimizing methodology when quantifying age-related changes during development.

Research on dairy products detection based on machine learning algorithm

MATEC Web of Conferences ◽

10.1051/matecconf/202235503008 ◽

2022 ◽

Vol 355 ◽

pp. 03008

Author(s):

Yang Zhang ◽

Lei Zhang ◽

Yabin Ma ◽

Jinsen Guan ◽

Zhaoxia Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Electronic Nose ◽

Milk Fat ◽

Dairy Products ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting

In this study, an electronic nose model composed of seven kinds of metal oxide semiconductor sensors was developed to distinguish the milk source (the dairy farm to which the milk belongs), to estimate the content of milk fat and protein in milk, and to identify the authenticity and evaluate the quality of milk. The developed electronic nose is a low-cost, non-destructive testing device. (1) For the identification of milk sources, this paper combines the electronic nose odor characteristics of milk with its component characteristics to distinguish different milk sources, uses Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for dimensionality reduction, and finally uses three machine learning algorithms, Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF), to build milk source (dairy farm) identification models and to evaluate and compare their classification performance. The experimental results show that the SVM-LDA model based on the electronic nose odor characteristics classifies better than the other single-feature models, with a test set accuracy of 91.5%. The RF-LDA and SVM-LDA models based on the fusion of the two feature types perform best, with a test set accuracy as high as 96%. (2) Three algorithms, Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost) and Random Forest (RF), are used to build models that estimate milk fat and protein content from the electronic nose odor data. The results show that the RF model has the best estimation performance (R² = 0.9399 for milk fat; R² = 0.9301 for milk protein). This indicates that the method proposed in this study can improve the estimation accuracy of milk fat and protein, which provides a technical basis for predicting the quality of dairy products.

Bionic Electronic Nose Based on MOS Sensors Array and Machine Learning Algorithms Used for Wine Properties Detection

Sensors ◽

10.3390/s19010045 ◽

2018 ◽

Vol 19 (1) ◽

pp. 45 ◽

Cited By ~ 19

Author(s):

Huixiang Liu ◽

Qing Li ◽

Bin Yan ◽

Lei Zhang ◽

Yu Gu

Keyword(s):

Machine Learning ◽

Electronic Nose ◽

Optimal Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Oxide Semiconductor ◽

Gradient Boosting ◽

Support Vector ◽

Fermentation Processes ◽

Extreme Gradient Boosting

In this study, a portable electronic nose (E-nose) prototype is developed using metal oxide semiconductor (MOS) sensors to detect odors of different wines. Odor detection facilitates the distinction of wines with different properties, including areas of production, vintage years, fermentation processes, and varietals. Four popular machine learning algorithms—extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and backpropagation neural network (BPNN)—were used to build identification models for different classification tasks. Experimental results show that BPNN achieved the best performance, with accuracies of 94% and 92.5% in identifying production areas and varietals, respectively; and SVM achieved the best performance in identifying vintages and fermentation processes, with accuracies of 67.3% and 60.5%, respectively. Results demonstrate the effectiveness of the developed E-nose, which could be used to distinguish different wines based on their properties following selection of an optimal algorithm.

Predicting and Mapping of Soil Organic Carbon Using Machine Learning Algorithms in Northern Iran

Remote Sensing ◽

10.3390/rs12142234 ◽

2020 ◽

Vol 12 (14) ◽

pp. 2234 ◽

Cited By ~ 6

Author(s):

Mostafa Emadi ◽

Ruhollah Taghizadeh-Mehrjardi ◽

Ali Cherati ◽

Majid Danesh ◽

Amir Mosavi ◽

...

Keyword(s):

Machine Learning ◽

Organic Carbon ◽

Soil Organic Carbon ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Composite Surface ◽

Auxiliary Data ◽

Extreme Gradient Boosting

Estimation of the soil organic carbon (SOC) content is of utmost importance in understanding the chemical, physical, and biological functions of the soil. This study applies the machine learning algorithms of support vector machines (SVM), artificial neural networks (ANN), regression tree, random forest (RF), extreme gradient boosting (XGBoost), and conventional deep neural network (DNN) to advance prediction models of SOC. Models are trained with 1879 composite surface soil samples, with 105 auxiliary data as predictors. The genetic algorithm is used as a feature selection approach to identify effective variables. The results indicate that precipitation is the most important predictor, driving 14.9% of SOC spatial variability, followed by the normalized difference vegetation index (12.5%), the day temperature index of the moderate resolution imaging spectroradiometer (10.6%), multiresolution valley bottom flatness (8.7%) and land use (8.2%). Based on 10-fold cross-validation, the DNN model was reported as the superior algorithm, with the lowest prediction error and uncertainty. In terms of accuracy, DNN yielded a mean absolute error of 0.59%, a root mean squared error of 0.75%, a coefficient of determination of 0.65, and Lin’s concordance correlation coefficient of 0.83. The SOC content was the highest in the udic soil moisture regime class, with a mean value of 3.71%, followed by the aquic (2.45%) and xeric (2.10%) classes. Soils in dense forestlands had the highest SOC contents, whereas soils of younger geological age and alluvial fans had lower SOC. The proposed DNN (hidden layers = 7, and size = 50) is a promising algorithm for handling large numbers of auxiliary data at a province scale, and due to its flexible structure and its ability to extract more information from the auxiliary data surrounding the sampled observations, it had high accuracy for the prediction of the SOC baseline map and minimal uncertainty.
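The four accuracy measures named in the abstract above (mean absolute error, root mean squared error, coefficient of determination, and Lin's concordance correlation coefficient) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `regression_metrics` and the toy values are hypothetical, the coefficient of determination is taken as 1 − SS_res/SS_tot (the paper may instead use the squared correlation), and Lin's coefficient uses population (1/n) variances.

```python
import math


def regression_metrics(observed, predicted):
    """Return MAE, RMSE, R2 and Lin's CCC for paired observed/predicted values."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal length")

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # Assumption: R2 defined as 1 - SS_res / SS_tot.
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot

    # Lin's concordance correlation coefficient with 1/n variances.
    var_o = ss_tot / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    ccc = 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)

    return {"mae": mae, "rmse": rmse, "r2": r2, "ccc": ccc}


# Hypothetical toy values: predictions offset from observations by a constant 1.
toy = regression_metrics([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(toy["mae"], toy["rmse"])  # 1.0 1.0
```

In the toy case the predictions are perfectly correlated with the observations yet systematically biased, so the concordance coefficient falls well below 1; this sensitivity to bias is why it is often reported alongside the coefficient of determination.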

Soil Temperature Dynamics at Hillslope Scale—Field Observation and Machine Learning-Based Approach

Water ◽

10.3390/w12030713 ◽

2020 ◽

Vol 12 (3) ◽

pp. 713 ◽

Cited By ~ 2

Author(s):

Aliva Nanda ◽

Sumit Sen ◽

Awshesh Nath Sharma ◽

K. P. Sudheer

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Soil Temperature ◽

Land Surface ◽

Learning Algorithms ◽

Temperature Drop ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting

Soil temperature plays an important role in understanding hydrological, ecological, meteorological, and land surface processes. However, studies of soil temperature variability are very scarce in various parts of the world, especially in the Indian Himalayan Region (IHR). Thus, this study aims to analyze the spatio-temporal variability of soil temperature in two nested hillslopes of the lesser Himalaya and to assess the efficiency of different machine learning algorithms in estimating soil temperature in this data-scarce region. To accomplish this goal, grassed (GA) and agro-forested (AgF) hillslopes were instrumented with Odyssey water level and Decagon soil moisture and temperature sensors. The average soil temperature of the south aspect hillslope (i.e., GA hillslope) was higher than that of the north aspect hillslope (i.e., AgF hillslope). After analyzing 40 rainfall events from both hillslopes, it was observed that a rainfall duration greater than 7.5 h or an event with an average rainfall intensity greater than 7.5 mm/h results in a soil temperature drop of more than 2 °C. Further, a soil temperature drop of less than 1 °C was observed during very high-intensity rainfall with a very short event duration. During the rainy season, the soil temperature drop of the GA hillslope is higher than that of the AgF hillslope, as the former infiltrates more water. This observation indicates a significant correlation between soil moisture rise and soil temperature drop. The potential of four machine learning algorithms was also explored in predicting soil temperature under data-scarce conditions. Among the four machine learning algorithms, extreme gradient boosting (XGBoost) performed best for both hillslopes, followed by random forests (RF), multilayer perceptron (MLP), and support vector machines (SVMs). The addition of rainfall to the meteorological and meteorological + soil moisture datasets did not improve the models considerably. However, the addition of soil moisture to the meteorological parameters improved the model significantly.

RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction

F1000Research ◽

10.12688/f1000research.52350.2 ◽

2021 ◽

Vol 10 ◽

pp. 323

Author(s):

Thaís A.R. Ramos ◽

Nilbson R.O. Galindo ◽

Raúl Arias-Carrasco ◽

Cecília F. da Silva ◽

Vinicius Maracaja-Coutinho ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Web Server ◽

Machine Learning Algorithms ◽

Model Organisms ◽

Gradient Boosting ◽

Sequence Length ◽

Support Vector ◽

Coding Sequences ◽

Extreme Gradient Boosting

Non-coding RNAs (ncRNAs) are important players in the cellular regulation of organisms from different kingdoms. One of the key steps in ncRNA research is the ability to distinguish coding from non-coding sequences. We applied seven machine learning algorithms (Naive Bayes, Support Vector Machine, K-Nearest Neighbors, Random Forest, Extreme Gradient Boosting, Neural Networks and Deep Learning) across model organisms from different evolutionary branches to create a stand-alone and web server tool (RNAmining) to distinguish coding and non-coding sequences. First, we used coding/non-coding sequences downloaded from Ensembl (April 14th, 2020). Then, the coding/non-coding sequences were balanced, their trinucleotide counts were computed (64 features) and normalized by sequence length, resulting in a total of 180 models. The machine learning algorithms were validated using 10-fold cross-validation, and we selected the algorithm with the best results (eXtreme Gradient Boosting) for implementation in RNAmining. Best F1-scores ranged from 97.56% to 99.57% depending on the organism. Moreover, we benchmarked RNAmining against other tools already in the literature (CPAT, CPC2, RNAcon and TransDecoder), and our results outperformed them. Both stand-alone and web server versions of RNAmining are freely available at https://rnamining.integrativebioinformatics.me/.
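The feature construction described above (trinucleotide counts, 64 features, normalized by sequence length) can be illustrated with a short sketch. This is not RNAmining's code: the function name `trinucleotide_features`, the use of overlapping windows, the mapping of U to T, and the skipping of windows containing ambiguous bases are all assumptions made for the example.

```python
from itertools import product

# All 4^3 = 64 trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_features(sequence):
    """Return 64 trinucleotide counts, each divided by the sequence length.

    Assumptions (not confirmed by the source): windows overlap (step of 1),
    RNA input is mapped to the DNA alphabet, and windows containing
    characters outside A/C/G/T are skipped.
    """
    seq = sequence.upper().replace("U", "T")
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if window in counts:
            counts[window] += 1
    return [counts[t] / len(seq) for t in TRINUCLEOTIDES]


# Hypothetical toy sequence, for illustration only.
features = trinucleotide_features("ATGATG")
print(len(features))  # 64
```

Each sequence then becomes a fixed-length vector of 64 values regardless of its original length, which is what allows a single classifier to be trained on transcripts of very different sizes.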

IgA Nephropathy Prediction in Children with Machine Learning Algorithms

Future Internet ◽

10.3390/fi12120230 ◽

2020 ◽

Vol 12 (12) ◽

pp. 230

Author(s):

Ping Zhang ◽

Rongqin Wang ◽

Nianfeng Shi

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Learning Algorithms ◽

Immunoglobulin A ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor ◽

Chi Square ◽

Extreme Gradient Boosting

Immunoglobulin A nephropathy (IgAN) is the most common primary glomerular disease worldwide and a major cause of renal failure. IgAN prediction in children with machine learning algorithms has rarely been studied. We retrospectively analyzed electronic medical records from the Nanjing Eastern War Zone Hospital. We chose eXtreme Gradient Boosting (XGBoost), random forest (RF), CatBoost, support vector machines (SVM), k-nearest neighbor (KNN), and extreme learning machine (ELM) models to predict the probability that a patient would or would not reach end-stage renal disease (ESRD) within five years, used the chi-square test to select the 16 most relevant features as the input of the model, and designed a decision-making system (DMS) for IgAN prediction in children based on XGBoost and the Django framework. The receiver operating characteristic (ROC) curve was used to evaluate the performance of the models, and XGBoost performed best by comparison. The AUC value, accuracy, precision, recall, and F1-score of XGBoost were 85.11%, 78.60%, 75.96%, 76.70%, and 76.33%, respectively. The XGBoost model is useful for physicians and pediatric patients in providing predictions regarding IgAN. As a further advantage, a DMS can be designed based on the XGBoost model to help physicians treat IgAN in children effectively and prevent deterioration.
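The F1-score is the harmonic mean of precision and recall, so the three figures reported for XGBoost above can be cross-checked with a few lines of arithmetic. The helper name `f1_score` is hypothetical; the inputs are the precision and recall percentages quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f1_score(75.96, 76.70)
print(round(f1, 2))  # 76.33
```

The result agrees with the reported F1-score of 76.33%, which suggests the three values are internally consistent.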

Machine Learning-Enabled 30-Day Readmission Model for Stroke Patients

Frontiers in Neurology ◽

10.3389/fneur.2021.638267 ◽

2021 ◽

Vol 12 ◽

Author(s):

Negar Darabi ◽

Niyousha Hosseinichimeh ◽

Anthony Noto ◽

Ramin Zand ◽

Vida Abedi

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Ischemic Stroke ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Percutaneous Gastrostomy ◽

Targeted Interventions ◽

Clinical Variables ◽

Extreme Gradient Boosting

Background and Purpose: Hospital readmissions impose a substantial burden on the healthcare system. Reducing readmissions after stroke could lead to improved quality of care, especially since stroke is associated with a high rate of readmission. The goal of this study is to enhance our understanding of the predictors of 30-day readmission after ischemic stroke and to develop models to identify high-risk individuals for targeted interventions. Methods: We used patient-level data from electronic health records (EHR), five machine learning algorithms (random forest, gradient boosting machine, extreme gradient boosting (XGBoost), support vector machine, and logistic regression (LR)), a data-driven feature selection strategy, and adaptive sampling to develop 15 models of 30-day readmission after ischemic stroke. We further identified important clinical variables. Results: We included 3,184 patients with ischemic stroke (mean age: 71 ± 13.90 years, men: 51.06%). Among the 61 clinical variables included in the model, a National Institutes of Health Stroke Scale score above 24, insertion of an indwelling urinary catheter, hypercoagulable state, and percutaneous gastrostomy had the highest importance scores. The model's AUC (area under the curve) for predicting 30-day readmission was 0.74 (95% CI: 0.64–0.78) with a PPV of 0.43 when the XGBoost algorithm was used with ROSE-sampling. The balance between specificity and sensitivity improved through the sampling strategy. The best sensitivity was achieved with LR when optimized with feature selection and ROSE-sampling (AUC: 0.64, sensitivity: 0.53, specificity: 0.69). Conclusions: Machine learning-based models can be designed to predict 30-day readmission after stroke using structured data from EHR. Among the algorithms analyzed, XGBoost with ROSE-sampling had the best performance in terms of AUC, while LR with ROSE-sampling and feature selection had the best sensitivity. Clinical variables highly associated with 30-day readmission could be targeted for personalized interventions. Depending on healthcare systems' resources and criteria, models with optimized performance metrics can be implemented to improve outcomes.
