Optimizing the Performance of Machine Learning Based Traffic Classification

2013 ◽  
Vol 756-759 ◽  
pp. 3506-3510
Author(s):  
Qiu Chen Wang ◽  
Lei Wang ◽  
Ji Xiang

Traffic classification is a critical technology for network management and security monitoring. Traditional port-based and payload-based classification are no longer effective, because many applications use unpredictable port numbers and encrypt their packets. Researchers have therefore turned to machine learning (ML) techniques that identify traffic flows from their statistical features. A review of the related work, however, shows that most ML-based classification algorithms achieve similar performance, so the practical question is how to optimize these techniques. In this paper, we analyze two critical issues in ML classification, feature selection and parameter configuration, and present viable methods for optimizing the classification model. We also report an experimental evaluation of the performance improvements introduced by our optimization methods; results on real-life datasets and network traffic show that the optimized classification model achieves a significant improvement in accuracy.
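As an illustrative sketch of the two issues this abstract names (feature selection and parameter configuration), the example below tunes both jointly with scikit-learn on synthetic stand-in flow data; the estimator, feature scorer, and grid values are assumptions, not the paper's actual setup.

```python
# Sketch: joint feature selection + parameter configuration for an
# ML traffic classifier, on synthetic "flow statistics" data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for per-flow statistical features (packet sizes, timings, ...).
X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif)),    # feature selection
    ("clf", DecisionTreeClassifier(random_state=0)),  # base classifier
])

# Parameter configuration: search jointly over feature count and tree depth.
search = GridSearchCV(pipe, {"select__k": [5, 10], "clf__max_depth": [3, 6]},
                      cv=3)
search.fit(X, y)
best_acc = search.best_score_
```

Searching the feature count and the classifier's parameters together, rather than separately, is what lets the two optimizations interact.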

Author(s):  
Triyas Hevianto Saputro ◽  
Arief Hermawan

Sentiment analysis is a branch of text mining used to extract information from a sentence or document. This study focuses on text classification for sentiment analysis of hospital reviews posted by customers as criticism and suggestions on Google Maps Review. The collected texts contain many nonstandard words, which cause problems in the preprocessing stage; the selection and combination of preprocessing techniques is therefore crucial for improving the accuracy of the machine-learning computation. However, not every preprocessing technique contributes to the classifier's accuracy. The objective of this study is to improve the accuracy of a classification model of customers' hospital reviews for sentiment analysis modeling; with the right combination of preprocessing techniques, a highly accurate classification model can be produced. This study experimented with several preprocessing techniques: (1) tokenization, (2) case folding, (3) stop-word removal, (4) stemming, and (5) removal of punctuation and numbers. The experiment was extended with two further preprocessing methods: (1) spelling correction and (2) slang-word normalization. The results show that the spelling-correction and slang methods help improve accuracy. Furthermore, selecting a suitable combination of preprocessing techniques speeds up training and produces a better text classification model.
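The preprocessing chain described above can be sketched as follows; this minimal illustration covers tokenization, case folding, stop-word removal, number/punctuation removal, and slang normalization (stemming and spelling correction are omitted, and the stop-word and slang dictionaries are toy assumptions, not the study's resources).

```python
import re

STOP_WORDS = {"the", "is", "a", "at"}          # assumed toy stop-word list
SLANG = {"gr8": "great", "thx": "thanks"}      # assumed slang dictionary

def preprocess(text):
    text = text.lower()                               # case folding
    tokens = re.findall(r"[a-z0-9]+", text)           # tokenization (drops punctuation)
    tokens = [SLANG.get(t, t) for t in tokens]        # slang normalization
    tokens = [t for t in tokens if not t.isdigit()]   # remove pure numbers
    return [t for t in tokens if t not in STOP_WORDS] # stop-word removal

tokens = preprocess("The service at Ward 3 is gr8, thx!!")
# -> ['service', 'ward', 'great', 'thanks']
```

Note that slang normalization must run before number removal here, since a token like "gr8" contains a digit.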


2021 ◽  
Author(s):  
Qianru Zhou ◽  
Rongzhen Li ◽  
Lei Xu ◽  
Hongyi Zhu ◽  
Wanli Liu

<div> <div> <div> <p>Detecting zero-day intrusions has long been a goal of cybersecurity, and of intrusion detection in particular. Machine learning is widely believed to be a promising methodology for this problem, and numerous models have been proposed, but a practical solution is still lacking, mainly because the available open datasets are out of date. In this paper, we propose a machine-learning approach to zero-day intrusion detection that uses flow-based statistical data generated by CICFlowMeter as the training dataset. The classification model is selected from the eight most popular classification models according to their cross-validation results, in terms of precision, recall, F1 score, area under the curve (AUC), and time overhead. Finally, the proposed system is evaluated on a testing dataset. To assess the feasibility and efficiency of the tested models, the testing datasets are designed to contain novel types of intrusions, i.e., intrusions not seen during training. The normal data in the datasets are generated from real-life traffic flows produced by daily use. Promising results are obtained, with accuracy as high as almost 100%, a false positive rate as low as nearly 0%, and a reasonable time overhead. We argue that, with properly selected flow-based statistical data, certain machine learning models, such as the MLP classifier, quadratic discriminant analysis, and the K-neighbors classifier, perform satisfactorily in detecting zero-day attacks. </p> </div> </div> </div>
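The model-selection step described above can be sketched roughly as follows: cross-validate candidate classifiers and keep the best. This is a generic scikit-learn illustration on synthetic data, covering three of the classifiers the authors mention (the paper uses CICFlowMeter flow statistics and eight candidate models, and also scores recall, F1, AUC, and time overhead).

```python
# Sketch: pick a classifier by cross-validation score on flow-like features.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for flow-based statistical features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "mlp": MLPClassifier(max_iter=500, random_state=0),
    "qda": QuadraticDiscriminantAnalysis(),
    "knn": KNeighborsClassifier(),
}
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
best_model = max(scores, key=scores.get)
```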


T-Comm ◽  
2021 ◽  
Vol 15 (9) ◽  
pp. 24-35
Author(s):  
Irina A. Krasnova

The paper analyzes the impact of the parameter settings of machine learning algorithms on the results of real-time traffic classification. The Random Forest and XGBoost algorithms are considered. A brief description of both methods and of the metrics used to evaluate classification results is given. Experimental studies are conducted on a database collected on a real network, separately for TCP and UDP flows. So that the results can be used in real time, a special feature matrix is built from the first 15 packets of each flow. The main parameters of the Random Forest (RF) algorithm to configure are the number of trees, the partition criterion used, the maximum number of features for constructing the partition function, the depth of the tree, and the minimum number of samples in a node and in a leaf. For XGBoost, the tuned parameters are the number of trees, the depth of the tree, the minimum number of samples in a leaf, and the fractions of features and of samples used to build each tree. Increasing the number of trees increases accuracy up to a certain value, but, as shown in the article, it is important to make sure that the model is not overfitted; the remaining tree parameters are used to combat overfitting. In the dataset under study, eliminating overfitting made it possible to increase classification accuracy for individual applications by 11-12% for Random Forest and by 12-19% for XGBoost. The results show that parameter setting is a very important step in building a traffic classification model, because it helps combat overfitting and significantly increases the accuracy of the algorithm's predictions. In addition, it is shown that, when its parameters are properly configured, XGBoost, which is not very popular in traffic classification work, becomes a competitive algorithm and shows better results than the widespread Random Forest.
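A rough illustration of this kind of parameter search for Random Forest, using scikit-learn on synthetic data; the grid values below are assumptions for the sketch, not the paper's settings.

```python
# Sketch: tune the RF parameters listed above (number of trees, partition
# criterion, tree depth, min samples per leaf) with cross-validated search,
# which also guards against overfitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)

grid = {
    "n_estimators": [50, 200],         # number of trees
    "criterion": ["gini", "entropy"],  # partition criterion
    "max_depth": [4, None],            # tree depth (None = unbounded)
    "min_samples_leaf": [1, 5],        # regularizes against overfitting
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
```

Cross-validated scores, rather than training scores, are what reveal the overfitting the article warns about.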


Energies ◽  
2020 ◽  
Vol 13 (3) ◽  
pp. 584
Author(s):  
Lars Maaløe ◽  
Ole Winther ◽  
Sergiu Spataru ◽  
Dezso Sera

With the rapid increase in photovoltaic energy production, there is a need for smart condition monitoring systems ensuring maximum throughput. Complex methods such as drone inspections are costly and labor intensive; hence, condition monitoring that utilizes sensor data is attractive. Recognizing meaningful patterns in the sensor data requires expressive machine learning models. However, supervised machine learning models, e.g., regression models, suffer from the cumbersome process of annotating data. By utilizing a recent state-of-the-art semi-supervised machine learning method based on probabilistic modeling, we were able to perform condition monitoring in a photovoltaic system with high accuracy and only a small fraction of annotated data. The modeling approach utilizes all the unsupervised data by jointly learning a low-dimensional feature representation and a classification model in an end-to-end fashion. By analyzing the feature representation, new internal condition monitoring states can be detected, providing a practical way of updating the model for better monitoring. We present (i) an analysis that compares the proposed model to corresponding purely supervised approaches, (ii) a study of the semi-supervised capabilities of the model, and (iii) an experiment in which we simulated a real-life condition monitoring system.
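The paper's model is a deep probabilistic semi-supervised classifier; as a generic stand-in for the core idea of learning from a small labeled fraction, here is a sketch using scikit-learn's self-training wrapper on synthetic sensor-like data. The method and data are illustrative assumptions, not the authors' model.

```python
# Sketch: train with labels on only ~10% of samples; the rest are marked
# unlabeled (-1) and exploited by self-training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

rng = np.random.default_rng(0)
y_semi = y.copy()
y_semi[rng.random(len(y)) > 0.1] = -1   # hide ~90% of the labels

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_semi)
labeled_fraction = (y_semi != -1).mean()
```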



Energies ◽  
2021 ◽  
Vol 14 (7) ◽  
pp. 1809
Author(s):  
Mohammed El Amine Senoussaoui ◽  
Mostefa Brahami ◽  
Issouf Fofana

Machine learning is widely used as a panacea in many engineering applications, including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on several aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: the J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: oils that can be kept in service, oils that should be reconditioned or filtered, oils that should be reclaimed, and oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. The good performance was achieved not only through the proposed algorithm but also through the data preprocessing approach. Before being fed to the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were retransformed by principal component analysis and passed through the CFsSubset filter again. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the reduction in the amount of data required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.
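The preprocessing chain described above (k-means transformation, a feature filter, PCA, then random forest) might be sketched with scikit-learn as follows; `SelectKBest` stands in for Weka's CFsSubset filter, and the data are synthetic, so treat this as an illustration of the pipeline shape only.

```python
# Sketch: k-means distances as transformed features -> filter -> PCA -> RF.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for oil-aging indicators, four condition classes.
X, y = make_classification(n_samples=200, n_features=10, n_classes=4,
                           n_informative=6, random_state=0)

pipe = Pipeline([
    ("kmeans", KMeans(n_clusters=8, n_init=10, random_state=0)),  # distances to centroids
    ("filter", SelectKBest(f_classif, k=5)),   # stand-in for CFsSubset
    ("pca", PCA(n_components=3)),
    ("rf", RandomForestClassifier(random_state=0)),
])
acc = cross_val_score(pipe, X, y, cv=3).mean()
```

Inside a `Pipeline`, `KMeans` acts as a transformer: its output is each sample's distances to the cluster centroids, which the downstream steps then filter and project.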


2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, differ from those of these other application areas. A common form of IR involves ranking documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling the relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions in a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018].
Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017] which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model that enables large-scale precomputation and the use of inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
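The core of the Duet principle, combining exact-term-match evidence with similarity between latent representations, can be caricatured in a few lines. The embeddings below are random stand-ins for learned representations, and the fixed linear combination is an illustrative simplification of the actual learned model.

```python
# Sketch: score a query-document pair from an exact-match signal plus a
# latent-representation similarity, as in the Duet principle.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"neural": 0, "ranking": 1, "model": 2, "index": 3}
embed = rng.normal(size=(len(vocab), 4))   # stand-in for learned embeddings

def duet_score(query, doc, alpha=0.5):
    q, d = set(query) & vocab.keys(), set(doc) & vocab.keys()
    exact = len(q & d) / max(len(q), 1)            # exact-term-match evidence
    qv = embed[[vocab[t] for t in q]].mean(axis=0) # mean latent query vector
    dv = embed[[vocab[t] for t in d]].mean(axis=0) # mean latent doc vector
    latent = qv @ dv / (np.linalg.norm(qv) * np.linalg.norm(dv))
    return alpha * exact + (1 - alpha) * latent    # combine both signals

score = duet_score(["neural", "ranking"], ["ranking", "model"])
```

The exact-match term rewards rare, unseen-in-training tokens that embeddings handle poorly, while the latent term bridges vocabulary mismatch.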


2021 ◽  
Vol 14 (3) ◽  
pp. 1-21
Author(s):  
Roy Abitbol ◽  
Ilan Shimshoni ◽  
Jonathan Ben-Dov

The task of assembling fragments in a puzzle-like manner into a composite picture plays a significant role in the field of archaeology, as it supports researchers in their attempts to reconstruct historic artifacts. In this article, we propose a method for matching and assembling pairs of ancient papyrus fragments containing mostly unknown scriptures. Papyrus paper is manufactured from papyrus plants and therefore displays the typical thread patterns of the plant's stems. The proposed algorithm is founded on the hypothesis that these thread patterns contain unique local attributes, such that nearby fragments show similar patterns reflecting the continuations of the threads. We posit that these patterns can be exploited using image processing and machine learning techniques to identify matching fragments. The algorithm and system that we present support the quick and automated classification of matching pairs of papyrus fragments, as well as the geometric alignment of the pairs against each other. The algorithm consists of a series of steps and is based on deep-learning and machine learning methods. The first step is to deconstruct the problem of matching fragments into the smaller problem of finding thread-continuation matches in local edge areas (squares) between pairs of fragments. This phase is solved using a convolutional neural network that ingests raw images of the edge areas and produces local matching scores. This stage yields very high recall but low precision. We therefore use these scores to decide whether entire fragment pairs match, by establishing an elaborate voting mechanism. We enhance this voting with geometric alignment techniques, from which we extract additional spatial information. Eventually, we feed all the data collected in these steps into a Random Forest classifier to produce a higher-order classifier capable of predicting whether a pair of fragments is a match.
Our algorithm was trained on a batch of fragments excavated from the Dead Sea caves and dated to circa the 1st century BCE. The algorithm shows excellent results on a validation set of similar origin and condition. We then ran the algorithm against a real-life set of fragments for which we have no prior knowledge or labeling of matches. This test batch is considered extremely challenging due to its poor condition and the small size of its fragments; indeed, numerous researchers have sought matches within this batch with very little success. Our algorithm's performance on this batch was suboptimal, returning a relatively large ratio of false positives. However, the algorithm was still quite useful: it eliminated 98% of the possible matches, greatly reducing the amount of work needed for manual inspection. Indeed, experts who reviewed the results identified some positive matches as potentially true and referred them for further investigation.
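The voting-plus-Random-Forest aggregation stage described above might be sketched as follows; the local scores, the voting statistics, and the 0.9 threshold are all illustrative assumptions rather than details from the article.

```python
# Sketch: pool local edge-square match scores (random stand-ins for CNN
# outputs) into per-pair "voting" features, then classify the pair with a
# Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def pair_features(local_scores):
    # Simple voting statistics over one pair's local match scores.
    return [local_scores.max(), local_scores.mean(),
            (local_scores > 0.9).sum()]   # votes above a confidence threshold

# Synthetic training pairs: matching pairs get a few strong local scores,
# mimicking thread continuations along a shared edge.
X, y = [], []
for is_match in rng.integers(0, 2, size=200):
    scores = rng.random(30)
    if is_match:
        scores[:3] += 0.5
    X.append(pair_features(np.clip(scores, 0, 1)))
    y.append(int(is_match))

clf = RandomForestClassifier(random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```

Pooling into a few summary statistics is what converts a high-recall, low-precision local signal into a usable pair-level decision.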


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 187
Author(s):  
Aaron Barbosa ◽  
Elijah Pelofske ◽  
Georg Hahn ◽  
Hristo N. Djidjev

Quantum annealers, such as the device built by D-Wave Systems, Inc., offer a way to compute solutions of NP-hard problems that can be expressed in Ising or quadratic unconstrained binary optimization (QUBO) form. Although such solutions are typically of very high quality, problem instances are usually not solved to optimality due to imperfections of the current generation of quantum annealers. In this contribution, we aim to understand some of the factors contributing to the hardness of a problem instance and to use machine learning models to predict the accuracy of the D-Wave 2000Q annealer on specific problems. We focus on the maximum clique problem, a classic NP-hard problem with important applications in network analysis, bioinformatics, and computational chemistry. By training a machine learning classification model on basic problem characteristics, such as the number of edges in the graph, and on annealing parameters, such as the D-Wave chain strength, we are able to rank certain features by their contribution to solution hardness, and we present a simple decision tree that predicts whether a problem will be solvable to optimality with the D-Wave 2000Q. We extend these results by training a machine learning regression model that predicts the clique size found by the D-Wave.
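The decision-tree idea above can be sketched as follows; the features, the synthetic "hardness" rule, and the labels are invented for illustration and are not the paper's D-Wave data.

```python
# Sketch: fit a shallow decision tree on problem/annealing features to
# predict whether an instance is solved to optimality, then inspect which
# features drive the prediction.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_edges = rng.integers(10, 500, size=300)          # graph edge counts
chain_strength = rng.uniform(0.5, 3.0, size=300)   # annealing parameter
X = np.column_stack([n_edges, chain_strength])

# Invented rule: dense graphs with weak chains are "hard" (not optimal).
solved_optimally = ((n_edges < 250) | (chain_strength > 1.5)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, solved_optimally)
importances = tree.feature_importances_   # ranks features by contribution
```

A depth-limited tree keeps the predictor interpretable, matching the paper's goal of a simple, human-readable solvability rule.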

