Cyberattacks Detection and Analysis in a Network Log System Using XGBoost with ELK Stack

Abstract The usage of artificial intelligence and machine learning methods on cyberattacks increasing significantly recently. For the defense method of cyberattacks, it is possible to detect and identify the attack event by observing the log data and analyzing whether it has abnormal behavior or not. This paper implemented the ELK Stack network log system (NetFlow Log) to visually analyze log data and present several network attack behavior characteristics for further analysis. Additionally, this system evaluated the extreme gradient enhancement (XGBoost), Recurrent Neural Network (RNN), and Deep Neural Network (DNN) model for machine learning methods. Keras was used as a deep learning framework for building a model to detect the attack event. From the experiments, it can be confirmed that the XGBoost model has an accuracy rate of 96.01% for potential threats. The full attack data set can achieve 96.26% accuracy, which is better than RNN and DNN models.

Download Full-text

Possibility of Autonomous Estimation of Shiba Goat’s Estrus and Non-Estrus Behavior by Machine Learning Methods

Animals ◽

10.3390/ani10050771 ◽

2020 ◽

Vol 10 (5) ◽

pp. 771

Author(s):

Toshiya Arakawa

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Markov Models ◽

Tracking System ◽

Video Tracking ◽

Training Data ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods

Mammalian behavior is typically monitored by observation. However, direct observation requires a substantial amount of effort and time, if the number of mammals to be observed is sufficiently large or if the observation is conducted for a prolonged period. In this study, machine learning methods as hidden Markov models (HMMs), random forests, support vector machines (SVMs), and neural networks, were applied to detect and estimate whether a goat is in estrus based on the goat’s behavior; thus, the adequacy of the method was verified. Goat’s tracking data was obtained using a video tracking system and used to estimate whether they, which are in “estrus” or “non-estrus”, were in either states: “approaching the male”, or “standing near the male”. Totally, the PC of random forest seems to be the highest. However, The percentage concordance (PC) value besides the goats whose data were used for training data sets is relatively low. It is suggested that random forest tend to over-fit to training data. Besides random forest, the PC of HMMs and SVMs is high. However, considering the calculation time and HMM’s advantage in that it is a time series model, HMM is better method. The PC of neural network is totally low, however, if the more goat’s data were acquired, neural network would be an adequate method for estimation.

Download Full-text

Landslide susceptibility mapping based on convolutional neural network and conventional machine learning methods

10.21203/rs.3.rs-190195/v1 ◽

2021 ◽

Author(s):

Rui Liu ◽

Xin Yang ◽

Chong Xu ◽

Luyao Li ◽

Xiangqiang Zeng

Keyword(s):

Neural Network ◽

Machine Learning ◽

Convolutional Neural Network ◽

Landslide Susceptibility ◽

Susceptibility Mapping ◽

Landslide Susceptibility Mapping ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods ◽

Conventional Machine

Abstract Landslide susceptibility mapping (LSM) is a useful tool to estimate the probability of landslide occurrence, providing a scientific basis for natural hazards prevention, land use planning, and economic development in landslide-prone areas. To date, a large number of machine learning methods have been applied to LSM, and recently the advanced Convolutional Neural Network (CNN) has been gradually adopted to enhance the prediction accuracy of LSM. The objective of this study is to introduce a CNN based model in LSM and systematically compare its overall performance with the conventional machine learning models of random forest, logistic regression, and support vector machine. Herein, we selected the Jiuzhaigou region in Sichuan Province, China as the study area. A total number of 710 landslides and 12 predisposing factors were stacked to form spatial datasets for LSM. The ROC analysis and several statistical metrics, such as accuracy, root mean square error (RMSE), Kappa coefficient, sensitivity, and specificity were used to evaluate the performance of the models in the training and validation datasets. Finally, the trained models were calculated and the landslide susceptibility zones were mapped. Results suggest that both CNN and conventional machine-learning based models have a satisfactory performance (AUC: 85.72% − 90.17%). The CNN based model exhibits excellent good-of-fit and prediction capability, and achieves the highest performance (AUC: 90.17%) but also significantly reduces the salt-of-pepper effect, which indicates its great potential of application to LSM.

Download Full-text

Detecting Items with the Biggest Weight Based on Neural Network and Machine Learning Methods

Communications in Computer and Information Science - Data Stream Mining & Processing ◽

10.1007/978-3-030-61656-4_26 ◽

2020 ◽

pp. 383-396

Author(s):

Vitaliy Danylyk ◽

Victoria Vysotska ◽

Vasyl Lytvyn ◽

Svitlana Vyshemyrska ◽

Iryna Lurie ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Learning Methods ◽

Machine Learning Methods

Download Full-text

Convolutional Neural Network Model in Machine Learning Methods and Computer Vision for Image Recognition: A Review

Journal of Applied Sciences Research ◽

10.22587/jasr.2018.14.6.5 ◽

2018 ◽

Keyword(s):

Neural Network ◽

Machine Learning ◽

Computer Vision ◽

Convolutional Neural Network ◽

Network Model ◽

Image Recognition ◽

Neural Network Model ◽

Learning Methods ◽

Machine Learning Methods

Download Full-text

Automatic Misinformation Detection About COVID-19 in Brazilian Portuguese WhatsApp Messages

10.5753/sbbd_estendido.2021.18173 ◽

2021 ◽

Author(s):

Antônio Diogo Forte Martins ◽

José Maria Monteiro ◽

Javam Machado

Keyword(s):

Machine Learning ◽

Social Networks ◽

Brazilian Portuguese ◽

Primary Sources ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods

During the coronavirus pandemic, the problem of misinformation arose once again, quite intensely, through social networks. In Brazil, one of the primary sources of misinformation is the messaging application WhatsApp. However, due to WhatsApp's private messaging nature, there still few methods of misinformation detection developed specifically for this platform. In this context, the automatic misinformation detection (MID) about COVID-19 in Brazilian Portuguese WhatsApp messages becomes a crucial challenge. In this work, we present the COVID-19.BR, a data set of WhatsApp messages about coronavirus in Brazilian Portuguese, collected from Brazilian public groups and manually labeled. Then, we are investigating different machine learning methods in order to build an efficient MID for WhatsApp messages. So far, our best result achieved an F1 score of 0.774 due to the predominance of short texts. However, when texts with less than 50 words are filtered, the F1 score rises to 0.85.

Download Full-text

Toward Optimal Heparin Dosing by Comparing Multiple Machine Learning Methods: Retrospective Study (Preprint)

10.2196/preprints.17648 ◽

2019 ◽

Author(s):

Longxiang Su ◽

Chun Liu ◽

Dongkai Li ◽

Jie He ◽

Fanglan Zheng ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Intensive Care Unit ◽

Intensive Care ◽

Data Set ◽

Machine Learning Methods ◽

Heparin Treatment ◽

Neural Network Algorithm ◽

Mimic Iii ◽

Machine Learning Models

BACKGROUND Heparin is one of the most commonly used medications in intensive care units. In clinical practice, the use of a weight-based heparin dosing nomogram is standard practice for the treatment of thrombosis. Recently, machine learning techniques have dramatically improved the ability of computers to provide clinical decision support and have allowed for the possibility of computer generated, algorithm-based heparin dosing recommendations. OBJECTIVE The objective of this study was to predict the effects of heparin treatment using machine learning methods to optimize heparin dosing in intensive care units based on the predictions. Patient state predictions were based upon activated partial thromboplastin time in 3 different ranges: subtherapeutic, normal therapeutic, and supratherapeutic, respectively. METHODS Retrospective data from 2 intensive care unit research databases (Multiparameter Intelligent Monitoring in Intensive Care III, MIMIC-III; e–Intensive Care Unit Collaborative Research Database, eICU) were used for the analysis. Candidate machine learning models (random forest, support vector machine, adaptive boosting, extreme gradient boosting, and shallow neural network) were compared in 3 patient groups to evaluate the classification performance for predicting the subtherapeutic, normal therapeutic, and supratherapeutic patient states. The model results were evaluated using precision, recall, F1 score, and accuracy. RESULTS Data from the MIMIC-III database (n=2789 patients) and from the eICU database (n=575 patients) were used. In 3-class classification, the shallow neural network algorithm performed the best (F1 scores of 87.26%, 85.98%, and 87.55% for data set 1, 2, and 3, respectively). The shallow neural network algorithm achieved the highest F1 scores within the patient therapeutic state groups: subtherapeutic (data set 1: 79.35%; data set 2: 83.67%; data set 3: 83.33%), normal therapeutic (data set 1: 93.15%; data set 2: 87.76%; data set 3: 84.62%), and supratherapeutic (data set 1: 88.00%; data set 2: 86.54%; data set 3: 95.45%) therapeutic ranges, respectively. CONCLUSIONS The most appropriate model for predicting the effects of heparin treatment was found by comparing multiple machine learning models and can be used to further guide optimal heparin dosing. Using multicenter intensive care unit data, our study demonstrates the feasibility of predicting the outcomes of heparin treatment using data-driven methods, and thus, how machine learning–based models can be used to optimize and personalize heparin dosing to improve patient safety. Manual analysis and validation suggested that the model outperformed standard practice heparin treatment dosing.

Download Full-text

Comparison of machine learning methods for crack localization

Acta et Commentationes Universitatis Tartuensis de Mathematica ◽

10.12697/acutm.2019.23.13 ◽

2019 ◽

Vol 23 (1) ◽

pp. 125-142

Author(s):

Helle Hein ◽

Ljubov Jaanuska

Keyword(s):

Machine Learning ◽

Random Forests ◽

Crack Depth ◽

Haar Wavelet ◽

Extensive Investigation ◽

Learning Methods ◽

Data Set ◽

Crack Location ◽

Machine Learning Methods ◽

Discrete Transform

In this paper, the Haar wavelet discrete transform, the artificial neural networks (ANNs), and the random forests (RFs) are applied to predict the location and severity of a crack in an Euler–Bernoulli cantilever subjected to the transverse free vibration. An extensive investigation into two data collection sets and machine learning methods showed that the depth of a crack is more difficult to predict than its location. The data set of eight natural frequency parameters produces more accurate predictions on the crack depth; meanwhile, the data set of eight Haar wavelet coefficients produces more precise predictions on the crack location. Furthermore, the analysis of the results showed that the ensemble of 50 ANN trained by Bayesian regularization and Levenberg–Marquardt algorithms slightly outperforms RF.

Download Full-text

Modelling of diesel engine performance using advanced machine learning methods under scarce and exponential data set

Applied Soft Computing ◽

10.1016/j.asoc.2013.06.006 ◽

2013 ◽

Vol 13 (11) ◽

pp. 4428-4441 ◽

Cited By ~ 25

Author(s):

Ka In Wong ◽

Pak Kin Wong ◽

Chun Shun Cheung ◽

Chi Man Vong

Keyword(s):

Machine Learning ◽

Diesel Engine ◽

Engine Performance ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods

Download Full-text

A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem

Business Systems Research Journal ◽

10.2478/bsrj-2014-0021 ◽

2014 ◽

Vol 5 (3) ◽

pp. 82-96 ◽

Cited By ~ 3

Author(s):

Marijana Zekić-Sušac ◽

Sanja Pfeifer ◽

Nataša Šarlija

Keyword(s):

Neural Network ◽

Machine Learning ◽

Classification Accuracy ◽

Classification Problem ◽

High Dimensional ◽

Nearest Neighbour ◽

Learning Methods ◽

Machine Learning Methods ◽

Dimensional Classification ◽

Artificial Neural

Abstract Background: Large-dimensional data modelling often relies on variable reduction methods in the pre-processing and in the post-processing stage. However, such a reduction usually provides less information and yields a lower accuracy of the model. Objectives: The aim of this paper is to assess the high-dimensional classification problem of recognizing entrepreneurial intentions of students by machine learning methods. Methods/Approach: Four methods were tested: artificial neural networks, CART classification trees, support vector machines, and k-nearest neighbour on the same dataset in order to compare their efficiency in the sense of classification accuracy. The performance of each method was compared on ten subsamples in a 10-fold cross-validation procedure in order to assess computing sensitivity and specificity of each model. Results: The artificial neural network model based on multilayer perceptron yielded a higher classification rate than the models produced by other methods. The pairwise t-test showed a statistical significance between the artificial neural network and the k-nearest neighbour model, while the difference among other methods was not statistically significant. Conclusions: Tested machine learning methods are able to learn fast and achieve high classification accuracy. However, further advancement can be assured by testing a few additional methodological refinements in machine learning methods.

Download Full-text

Monitoring the Variation of Vegetation Water Content with Machine Learning Methods: Point–Surface Fusion of MODIS Products and GNSS-IR Observations

Remote Sensing ◽

10.3390/rs11121440 ◽

2019 ◽

Vol 11 (12) ◽

pp. 1440 ◽

Cited By ~ 1

Author(s):

Qiangqiang Yuan ◽

Shuwen Li ◽

Linwei Yue ◽

Tongwen Li ◽

Huanfeng Shen ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Water Content ◽

Linear Regression Method ◽

Learning Methods ◽

Machine Learning Methods ◽

Vegetation Water Content ◽

Drought Prediction ◽

Vegetation Water ◽

Surface Fusion

Vegetation water content (VWC) is recognized as an important parameter in vegetation growth studies, natural disasters such as forest fires, and drought prediction. Recently, the Global Navigation Satellite System Interferometric Reflectometry (GNSS-IR) has emerged as an important technique for monitoring vegetation information. The normalized microwave reflection index (NMRI) was developed to reflect the change of VWC based on this fact. However, NMRI uses local site-based data, and the sparse distribution hinders the application of NMRI. In this study, we obtained a 500 m spatially continuous NMRI product by integrating GNSS-IR site data with other VWC-related products using the point–surface fusion technique. The auxiliary data in the fusion process include the normalized difference vegetation index (NDVI), gross primary productivity (GPP), and precipitation. Meanwhile, the fusion performance of three machine learning methods, i.e., the back-propagation neural network (BPNN), generalized regression neural network (GRNN), and random forest (RF) are compared and analyzed. The machine learning methods achieve satisfactory results, with cross-validation R values of 0.71–0.83 and RMSEs of 0.025–0.037. The results show a clear improvement over the traditional multiple linear regression method, which achieves R (RMSE) values of only about 0.4 (0.045). It indicates that the machine learning methods can better learn the complex nonlinear relationship between NMRI and the input VWC-related index. Among the machine learning methods, the RF model obtained the best results. Long time-series NMRI images with a 500 m spatial resolution in the western part of the continental U.S. were then obtained. The results show that the spatial distribution of the NMRI product is consistent with a drought situation from 2012 to 2014 in the U.S., which verifies the feasibility of analyzing and predicting drought times and distribution ranges by using the 500 m fusion product.

Download Full-text