Identifying Botnet on IoT by Using Supervised Learning Techniques

Oriental journal of computer science and technology ◽

10.13005/ojcst12.04.04 ◽

2019 ◽

Vol 12 (4) ◽

pp. 185-193

Author(s):

Amirhossein Rezaei

Keyword(s):

Supervised Learning ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbors ◽

Malicious Software ◽

Learning Techniques ◽

Security Challenges ◽

Learning Technique ◽

Security Challenge ◽

The Moment

Security in the IoT (Internet of Things) is one of the most pressing topics at the moment, and it spans several distinct challenges. Botnets are among the challenges with the greatest impact. A botnet is a network of private computers that are infected by malicious software and controlled as a group without their owners' knowledge, each running one or more bots. Botnets are normally used for sending spam, stealing data, and performing DDoS attacks. One technique that has been used for detecting botnets is supervised learning. This study examines several supervised learning methods, namely Linear Regression, Logistic Regression, Decision Tree, Naive Bayes, k-Nearest Neighbors, Random Forest, Gradient Boosting Machines, and Support Vector Machine, for identifying botnets in IoT, with the aim of finding which technique achieves the highest accuracy and the fastest detection while minimizing the dependent variable.
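The two comparison criteria named in the abstract, accuracy and detection speed, can be measured with a small harness like the one below. This is a minimal sketch in plain Python, not the study's pipeline: the traffic features, the labels, and the helper names (`nearest_centroid_fit`, `one_nn_fit`, `evaluate`) are all assumptions made for illustration.

```python
import math
import time

# Hypothetical toy records: (packets_per_sec, mean_packet_size) -> 1 = botnet, 0 = benign.
TRAIN = [((900.0, 60.0), 1), ((850.0, 64.0), 1), ((40.0, 512.0), 0), ((55.0, 480.0), 0)]
TEST = [((880.0, 62.0), 1), ((50.0, 500.0), 0), ((910.0, 58.0), 1), ((35.0, 530.0), 0)]


def distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def nearest_centroid_fit(train):
    """Return a predictor that assigns the label of the closest class mean."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}
    return lambda x: min(centroids, key=lambda y: distance(x, centroids[y]))


def one_nn_fit(train):
    """Return a predictor that copies the label of the nearest training point."""
    return lambda x: min(train, key=lambda pair: distance(x, pair[0]))[1]


def evaluate(fit, train, test):
    """Train, then report (accuracy, seconds spent predicting the test set)."""
    predict = fit(train)
    start = time.perf_counter()
    predictions = [predict(x) for x, _ in test]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, (_, y) in zip(predictions, test))
    return correct / len(test), elapsed


for name, fit in [("nearest centroid", nearest_centroid_fit), ("1-NN", one_nn_fit)]:
    accuracy, seconds = evaluate(fit, TRAIN, TEST)
    print(f"{name}: accuracy={accuracy:.2f}, predict_time={seconds:.6f}s")
```

Any of the classifiers listed in the abstract could be dropped into the same `evaluate` call, provided it is wrapped as a function that takes training data and returns a predictor.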

Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

South African Computer Journal ◽

10.18489/sacj.v32i2.847 ◽

2020 ◽

Vol 32 (2) ◽

Author(s):

Oluwafemi Oriola ◽

Eduan Kotzé

Keyword(s):

Logistic Regression ◽

Supervised Learning ◽

South African ◽

Learning Curves ◽

Training Data ◽

Support Vector ◽

Learning Techniques ◽

Learning Technique ◽

Unlabelled Data ◽

Language Detection

Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, existing semi-supervised learning methods have been skewed towards small amounts of labelled data with a small feature space. This paper therefore presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks is evaluated. The proposed technique, with an accuracy of 0.97 and an F1-score of 0.95, outperforms existing semi-supervised learning techniques. The learning curves show that the proposed technique used the training data more efficiently than existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
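One way to read the labelling step is that each feature set independently proposes a label for an unlabelled tweet, and the tweet is pseudo-labelled only when a majority of the feature sets agree. The sketch below illustrates that reading in plain Python. It uses a nearest-centroid voter per feature set as a stand-in for the paper's cluster-based assignment, and the view names, toy vectors and function names are assumptions, not the authors' implementation.

```python
from collections import Counter


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def vote_in_view(labelled, vector):
    """Label proposed by one feature view: class of the nearest labelled centroid."""
    by_label = {}
    for vec, label in labelled:
        by_label.setdefault(label, []).append(vec)
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    return min(centroids, key=lambda label: squared_distance(vector, centroids[label]))


def pseudo_label(labelled_views, unlabelled_views):
    """Return the majority label across views, or None without a strict majority."""
    votes = [vote_in_view(labelled_views[name], unlabelled_views[name])
             for name in labelled_views]
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical views of the same two labelled tweets ("abusive" / "clean").
labelled_views = {
    "char_ngrams": [([0.9, 0.1], "abusive"), ([0.1, 0.8], "clean")],
    "word_ngrams": [([0.7, 0.2], "abusive"), ([0.2, 0.9], "clean")],
    "embeddings": [([0.2, 0.6], "abusive"), ([0.6, 0.3], "clean")],
}
# One unlabelled tweet, represented once per view.
unlabelled_views = {
    "char_ngrams": [0.8, 0.2],
    "word_ngrams": [0.6, 0.3],
    "embeddings": [0.5, 0.4],
}

print(pseudo_label(labelled_views, unlabelled_views))
```

Returning `None` when the views split evenly keeps ambiguous tweets out of the training data, which is a design choice of this sketch rather than something the abstract specifies.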

Recognizing Indonesian Acronym and Expansion Pairs with Supervised Learning and MapReduce

Information ◽

10.3390/info11040210 ◽

2020 ◽

Vol 11 (4) ◽

pp. 210

Author(s):

Taufik Fuadi Abidin ◽

Amir Mahazir ◽

Muhammad Subianto ◽

Khairul Munadi ◽

Ridha Ferdhiana

Keyword(s):

Supervised Learning ◽

Polynomial Model ◽

Support Vector ◽

Entity Extraction ◽

K Nearest Neighbors ◽

Research Attention ◽

Considerable Research ◽

Learning Techniques ◽

Hadoop Cluster ◽

Candidate Identification

Over the past decades, intelligent identification of acronym and expansion pairs from a large corpus has garnered considerable research attention, particularly in the fields of text mining, entity extraction, and information retrieval. Herein, we present an improved approach to accurately recognize acronym and expansion pairs from a large Indonesian corpus. Generally, an acronym can be either a combination of uppercase letters or a sequence of speech sounds (syllables). Our proposed approach can be computationally divided into four steps: (1) acronym candidate identification; (2) acronym and expansion pair collection; (3) feature generation; and (4) acronym and expansion pair recognition using supervised learning techniques. Further, we introduce eight numerical features and evaluate their effectiveness in representing the acronym and expansion pairs based on precision, recall, and F-measure. Furthermore, we compare the k-nearest neighbors (K-NN), support vector machine (SVM), and bidirectional encoder representations from transformers (BERT) algorithms in terms of accurate acronym and expansion pair classification. The experimental results indicate that the SVM polynomial model that considers eight features exhibits the highest accuracy (97.93%), surpassing those of the SVM polynomial model that considers five features (90.45%), the K-NN algorithm with k = 3 that considers eight features (96.82%), the K-NN algorithm with k = 3 that considers five features (95.66%), the BERT-Base model (81.64%), and the BERT-Base Multilingual Cased model (88.10%). Moreover, we analyze the performance of the Hadoop technology using various numbers of data nodes to identify the acronym and expansion pairs and obtain their feature vectors. The results reveal that a Hadoop cluster with more data nodes is faster than one with fewer data nodes when processing from ten million to one hundred million pairs of acronyms and expansions.
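The first and third steps, together with the evaluation measures, can be illustrated with a short plain-Python sketch. It covers only uppercase-letter acronyms, not syllable-based ones, and the regular expression, the single `initials_match` feature, the example sentence and all function names are assumptions for illustration; the paper's eight features are not reproduced here.

```python
import re

# Assumption: a candidate is a standalone token of 2 to 6 uppercase letters.
CANDIDATE = re.compile(r"\b[A-Z]{2,6}\b")


def find_candidates(text):
    """Step 1 (sketch): collect uppercase tokens that may be acronyms."""
    return CANDIDATE.findall(text)


def initials_match(acronym, expansion):
    """One illustrative numerical feature: share of acronym letters equal to
    the initial of the expansion word in the same position."""
    words = expansion.split()
    hits = sum(1 for letter, word in zip(acronym, words)
               if word[0].upper() == letter)
    return hits / max(len(acronym), len(words))


def precision_recall_f(predicted, actual):
    """Precision, recall and F-measure over sets of (acronym, expansion) pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up example sentence containing one uppercase acronym.
text = "Badan Pusat Statistik (BPS) merilis data baru."
print(find_candidates(text))
print(initials_match("BPS", "Badan Pusat Statistik"))
```

A feature such as `initials_match` would be one column of the vector passed to the classifier in step 4; a real system would need further features to handle syllable-based acronyms.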

Evaluation of Machine Learning Algorithms for Classification of Primary Biological Aerosol using a new UV-LIF spectrometer

10.5194/amt-2016-214 ◽

2016 ◽

Cited By ~ 1

Author(s):

Simon Ruske ◽

David O. Topping ◽

Virginia E. Foot ◽

Paul H. Kaye ◽

Warren R. Stanley ◽

...

Keyword(s):

Supervised Learning ◽

Fungal Spores ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Agglomerative Clustering ◽

Real World Data ◽

Linear Discriminant ◽

Accuracy Of Measurements

Abstract. Characterisation of bio-aerosols has important implications within the Environment and Public Health sectors. Recent developments in Ultra-Violet Light Induced Fluorescence (UV-LIF) detectors such as the Wideband Integrated bio-aerosol Spectrometer (WIBS) and the newly introduced Multiparameter bio-aerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen. This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real-world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification. For unsupervised learning we test Hierarchical Agglomerative Clustering with various linkages. For supervised learning, ten methods were tested, including decision trees; ensemble methods (Random Forests, Gradient Boosting and AdaBoost); two implementations of support vector machines (libsvm and liblinear); Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis); and finally the k-nearest neighbours algorithm. The methods were applied to two different data sets measured using a new Multiparameter bio-aerosol Spectrometer which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 72.7 and 91.1 percent for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 88.1 and 97.8 percent of the testing data respectively across the two data sets.
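To show what "various linkages" means for hierarchical agglomerative clustering, the sketch below merges the two closest clusters repeatedly under single, complete or average linkage. It is a naive plain-Python illustration under stated assumptions: the two-dimensional particle measurements and the function names are invented, and the study's own software and linkage set are not given in the abstract.

```python
import math


def distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices) under a linkage rule."""
    pairwise = [distance(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain (naive cubic loop)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical (fluorescence, size) measurements for six particles.
particles = [(0.1, 1.0), (0.2, 1.1), (0.15, 0.9), (0.9, 3.0), (1.0, 3.2), (0.95, 2.9)]
for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(particles, 2, linkage))
```

On well-separated toy data the three linkages agree; they tend to differ on elongated or overlapping groups, which is why comparing them on real measurements is informative.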

User Classification on Online Social Networks by Post Frequency

10.5753/sbsi.2017.6076 ◽

2017 ◽

Cited By ~ 2

Author(s):

Gabriel Tavares ◽

Saulo Mastelini ◽

Sylvio Jr.

Keyword(s):

Social Networks ◽

Language Processing ◽

Online Social Networks ◽

Computational Cost ◽

Real Life ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbors ◽

User Classification ◽

Extreme Gradient Boosting

This paper proposes a technique for classifying user accounts to detect fraud in Online Social Networks (OSN). The main purpose of our classification is to distinguish the behaviour patterns of Human, Bot and Cyborg users. Classic and consolidated Text Mining approaches employ textual features from Natural Language Processing (NLP) for classification, but drawbacks such as computational cost and the huge amount of data can arise in real-life scenarios. This work uses an approach based on statistical frequency parameters of user posting to distinguish the types of users without textual content. We performed the experiment on a Twitter dataset and, for the classification task, compared the learning-based algorithms Random Forest (RF), Support Vector Machine (SVM), k-nearest Neighbors (k-NN), Gradient Boosting Machine (GBM) and Extreme Gradient Boosting (XGBoost). Using the standard parameters of each algorithm, we achieved accuracies of 88% and 84% with RF and XGBoost, respectively.
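Frequency-based features of this kind can be derived from post timestamps alone. The following plain-Python sketch computes a few such statistics; the specific features, the timestamps and the name `posting_features` are assumptions for illustration, since the abstract does not list the exact parameters used.

```python
import statistics


def posting_features(timestamps):
    """Summarise posting behaviour from post times in seconds, ignoring text."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two posts")
    span_hours = (ordered[-1] - ordered[0]) / 3600
    return {
        "mean_gap_s": statistics.mean(gaps),
        "stdev_gap_s": statistics.pstdev(gaps),
        "min_gap_s": min(gaps),
        "posts_per_hour": len(ordered) / span_hours if span_hours else float("inf"),
    }


# Hypothetical accounts: one posting on a fixed 10-minute schedule, one irregular.
scheduled = [0, 600, 1200, 1800, 2400, 3000, 3600]
irregular = [0, 45, 2000, 2100, 9000, 9050, 14400]

print(posting_features(scheduled))
print(posting_features(irregular))
```

A zero spread between posts, as in the `scheduled` account, is the kind of signal such features might expose to a classifier; whether it indicates a bot is for the trained model to decide.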

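Perbandingan Algoritma k-Nearest Neighbors (k-NN) dan Support Vector Machines (SVM) untuk Klasifikasi Pengenalan Citra Wajah [Comparison of the k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM) Algorithms for Facial Image Recognition Classification]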

Jurnal ICT : Information Communication & Technology ◽

10.36054/jict-ikmi.v20i1.354 ◽

2021 ◽

Vol 20 (1) ◽

pp. 186-191

Author(s):

Parasian DP Silitonga ◽

Romanus Damanik

Keyword(s):

Image Recognition ◽

Machine Learning Algorithms ◽

Public Image ◽

Support Vector ◽

Facial Image ◽

K Nearest Neighbors ◽

Svm Algorithm ◽

Vector Machines ◽

Significant Research ◽

The Moment

Face recognition is an area of computer vision that currently requires significant research. Numerous researchers have studied facial image recognition using a variety of techniques to achieve the highest possible accuracy when recognizing a person's face from existing images. However, recognizing the image of a human face is not easy for a computer, so several approaches have been taken to address the problem. This study compares two machine learning algorithms for facial image recognition to determine which has the highest accuracy, precision, recall, and AUC. The comparison is carried out in the following steps: image acquisition, preprocessing, feature extraction, face classification, training, and testing. Based on the stages and experiments conducted on public image datasets, it is concluded that the SVM algorithm, on average, has higher accuracy, precision, and recall than the k-NN algorithm when the dataset proportion is 90:10. The k-NN algorithm, by contrast, scores highest at dataset proportions of 80:20 and 70:30, where its accuracy, precision, and recall are all 99.20. For AUC, the k-NN algorithm outperforms SVM, reaching 100% at a dataset proportion of 80:20.
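The four measures compared here can be computed as in the sketch below, which also shows why AUC can reach 100% while accuracy stays lower: AUC depends only on how scores are ranked, whereas accuracy depends on a decision threshold. The example is a binary simplification in plain Python (face recognition is normally multi-class), and the labels, scores and function names are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def accuracy_precision_recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.4, 0.3, 0.1]        # hypothetical classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # fixed 0.5 threshold

print(accuracy_precision_recall(y_true, y_pred))
print(auc(y_true, scores))
```

In this toy case every positive outranks every negative, so the ranking is perfect, yet the third positive falls below the 0.5 threshold and is misclassified.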

Classification of Individual Finger Movements from Right Hand Using fNIRS Signals

Sensors ◽

10.3390/s21237943 ◽

2021 ◽

Vol 21 (23) ◽

pp. 7943

Author(s):

Haroon Khan ◽

Farzan M. Noori ◽

Anis Yazidi ◽

Md Zia Uddin ◽

M. N. Afzal Khan ◽

...

Keyword(s):

Nearest Neighbors ◽

Machine Learning Algorithms ◽

Finger Tapping ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbors ◽

Physiological Noise ◽

Individual Finger ◽

Extreme Gradient Boosting ◽

Depth Analysis

Functional near-infrared spectroscopy (fNIRS) is a comparatively new noninvasive, portable, and easy-to-use brain imaging modality. However, complicated dexterous tasks such as individual finger-tapping, particularly using one hand, have not been investigated using fNIRS technology. Twenty-four healthy volunteers participated in the individual finger-tapping experiment. Data were acquired from the motor cortex using sixteen sources and sixteen detectors. In this preliminary study, we applied a standard fNIRS data processing pipeline, i.e. optical density conversion, signal processing, feature extraction, and classification algorithm implementation. Physiological and non-physiological noise was removed using 4th order band-pass Butterworth and 3rd order Savitzky–Golay filters. Eight spatial statistical features were selected: signal mean, peak, minimum, skewness, kurtosis, variance, median, and peak-to-peak, computed from data on oxygenated haemoglobin changes. Sophisticated machine learning algorithms were applied, such as support vector machine (SVM), random forests (RF), decision trees (DT), AdaBoost, quadratic discriminant analysis (QDA), artificial neural networks (ANN), k-nearest neighbors (kNN), and extreme gradient boosting (XGBoost). The average classification accuracies achieved were 0.75±0.04, 0.75±0.05, and 0.77±0.06 using kNN, RF and XGBoost, respectively. The kNN, RF and XGBoost classifiers performed exceptionally well on a problem with so many classes. The results need to be further investigated. In the future, a more in-depth analysis of the signal in both temporal and spatial domains will be conducted to investigate the underlying facts. The accuracies achieved are promising and could open up a new research direction leading to richer control command generation for fNIRS-based brain-computer interface applications.
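The eight features named above are standard summary statistics and can be computed per channel as sketched below in plain Python. The sample values and the function name are assumptions; skewness and kurtosis are taken here as standardised population moments (kurtosis is not the excess form), which may differ from the convention the authors used.

```python
import statistics


def signal_features(samples):
    """Eight summary statistics of one channel's signal (population moments)."""
    n = len(samples)
    mean = statistics.mean(samples)
    variance = statistics.pvariance(samples, mu=mean)
    std = variance ** 0.5
    # Standardised third and fourth central moments; set to 0.0 for a flat signal.
    skewness = sum((x - mean) ** 3 for x in samples) / (n * std ** 3) if std else 0.0
    kurtosis = sum((x - mean) ** 4 for x in samples) / (n * std ** 4) if std else 0.0
    return {
        "mean": mean,
        "peak": max(samples),
        "minimum": min(samples),
        "skewness": skewness,
        "kurtosis": kurtosis,
        "variance": variance,
        "median": statistics.median(samples),
        "peak_to_peak": max(samples) - min(samples),
    }


# Hypothetical oxygenated-haemoglobin change samples for one channel and one trial.
hbo = [0.0, 0.2, 0.5, 0.9, 0.5, 0.2, 0.0]
for name, value in signal_features(hbo).items():
    print(f"{name}: {value:.4f}")
```

Concatenating these eight values across channels would give one feature vector per trial for the classifiers listed in the abstract.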

Prediction of Liver Diseases by Using Few Machine Learning Based Approaches

Australian Journal of Engineering and Innovative Technology ◽

10.34104/ajeit.020.085090 ◽

2020 ◽

pp. 85-90

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Liver Diseases ◽

Model Building ◽

Medical Science ◽

Machine Learning Techniques ◽

Support Vector ◽

Classification Algorithms ◽

K Nearest Neighbors ◽

Learning Techniques

Advancement in medical science has always been vital to humanity. As technology progresses, modern techniques and equipment are increasingly applied to treatment. Machine learning techniques are now widely used in medical science to improve accuracy. In this work, we build computational models for accurate liver disease prediction. We used several efficient classification algorithms, namely Random Forest, Perceptron, Decision Tree, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), to predict liver diseases. Our work provides an implementation of hybrid model construction and a comparative analysis aimed at improving prediction performance. First, the classification algorithms are applied to the original liver patient datasets collected from the UCI repository. We then analyzed and tweaked the features to improve the performance of our predictor and compared the classifiers. We found that the KNN algorithm outperformed all other techniques when feature selection was used.
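Why feature selection can matter so much for KNN is easy to see in a small example: distances are dominated by whichever column has the largest spread, so an uninformative column can drown out a useful one. The plain-Python sketch below illustrates this with invented two-column records; the columns, labels and the name `knn_predict` are assumptions and do not correspond to the UCI liver data.

```python
from collections import Counter


def knn_predict(train, query, k=3, features=None):
    """Majority label among the k nearest training rows, using only the column
    indices in `features` when that list is given."""
    columns = features if features is not None else range(len(query))

    def squared_distance(row):
        return sum((row[i] - query[i]) ** 2 for i in columns)

    nearest = sorted(train, key=lambda pair: squared_distance(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical records: column 0 is an informative marker, column 1 is an
# unscaled measurement unrelated to the label.
train = [
    ((1.0, 50.0), "disease"), ((1.2, 10.0), "disease"), ((1.1, 90.0), "disease"),
    ((0.2, 52.0), "healthy"), ((0.3, 48.0), "healthy"), ((0.1, 51.0), "healthy"),
]
query = (1.05, 50.0)

print(knn_predict(train, query, k=3))                # all columns
print(knn_predict(train, query, k=3, features=[0]))  # informative column only
```

Scaling the columns would be another way to reduce the influence of the noisy measurement; the sketch only shows the effect of dropping it.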

Evaluation of machine learning algorithms for classification of primary biological aerosol using a new UV-LIF spectrometer

Atmospheric Measurement Techniques ◽

10.5194/amt-10-695-2017 ◽

2017 ◽

Vol 10 (2) ◽

pp. 695-708 ◽

Cited By ~ 25

Author(s):

Simon Ruske ◽

David O. Topping ◽

Virginia E. Foot ◽

Paul H. Kaye ◽

Warren R. Stanley ◽

...

Keyword(s):

Neural Networks ◽

Decision Trees ◽

Supervised Learning ◽

Ensemble Methods ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Data Set ◽

Shape Information ◽

Accuracy Of Measurements

Abstract. Characterisation of bioaerosols has important implications within environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen. This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification. For unsupervised learning we tested hierarchical agglomerative clustering with various different linkages. For supervised learning, 11 methods were tested, including decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations for support vector machines (libsvm and liblinear) and Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis, the k-nearest neighbours algorithm and artificial neural networks). The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol. Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67.6 and 91.1 % for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82.8 and 98.27 % of the testing data, respectively, across the two data sets. A possible alternative to gradient boosting is neural networks. We do however note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially using parallelised hardware such as the GPU, which would allow for larger networks to be trained, which could possibly yield better results. We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.

The Effect of Weather in Soccer Results: An Approach Using Machine Learning Techniques

Applied Sciences ◽

10.3390/app10196750 ◽

2020 ◽

Vol 10 (19) ◽

pp. 6750

Author(s):

Ditsuhi Iskandaryan ◽

Francisco Ramos ◽

Denny Asarias Palinggi ◽

Sergio Trilles

Keyword(s):

Support Vector Machine ◽

Nearest Neighbors ◽

Research Community ◽

Machine Learning Techniques ◽

Weather Data ◽

Support Vector ◽

K Nearest Neighbors ◽

Task Support ◽

Extremely Randomized Trees ◽

Learning Techniques

The growing popularity of soccer has led to the prediction of match results becoming of interest to the research community. The aim of this research is to detect the effects of weather on the result of matches by implementing Random Forest, Support Vector Machine, K-Nearest Neighbors Algorithm, and Extremely Randomized Trees Classifier. The analysis was executed using the Spanish La Liga and Segunda division from the seasons 2013–2014 to 2017–2018 in combination with weather data. Two tasks were proposed as part of this study: the first was to find out whether the game will end in a draw, a win by the hosts or a victory by the guests, and the second was to determine whether the match will end in a draw or if one of the teams will win. The results show that, for the first task, Extremely Randomized Trees Classifier is a better method, with an accuracy of 65.9%, and, for the second task, Support Vector Machine yielded better results with an accuracy of 79.3%. Moreover, it is possible to predict whether the game will end in a draw or not with 0.85 AUC-ROC. Additionally, for comparative purposes, the analysis was also performed without weather data.
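The two tasks differ only in how the match result is turned into a class label, which the short plain-Python sketch below makes explicit. The match rows, the weather columns and the function names are assumptions for illustration; the study's actual feature set is not listed in the abstract.

```python
def three_way_label(home_goals, away_goals):
    """Task 1 target: home win, draw, or away win."""
    if home_goals > away_goals:
        return "home"
    if home_goals < away_goals:
        return "away"
    return "draw"


def draw_label(home_goals, away_goals):
    """Task 2 target: draw versus any decisive result."""
    return "draw" if home_goals == away_goals else "win"


# Hypothetical match rows: (home goals, away goals, temperature in C, rain in mm).
matches = [(2, 1, 18.5, 0.0), (0, 0, 9.0, 4.2), (1, 3, 25.1, 0.0)]

for home, away, temperature, rain in matches:
    features = (temperature, rain)  # weather inputs a classifier would see
    print(features, three_way_label(home, away), draw_label(home, away))
```

Collapsing three classes into two makes the second task easier by construction, so the accuracies reported for the two tasks are not directly comparable.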

Detection of Loss Zones while Drilling Using Different Machine Learning Techniques

Journal of Energy Resources Technology ◽

10.1115/1.4051553 ◽

2021 ◽

pp. 1-29

Author(s):

Ahmed Alsaihati ◽

Mahmoud Abughaban ◽

Salaheldin Elkatatny ◽

Abdulazeez Abdulraheem

Keyword(s):

Machine Learning ◽

Support Vector Machines ◽

Random Forests ◽

Nearest Neighbors ◽

Machine Learning Techniques ◽

Support Vector ◽

K Nearest Neighbors ◽

Learning Techniques ◽

Vector Machines ◽

Testing Set

Fluid loss into formations is a common operational issue that is frequently encountered when drilling across naturally fractured or induced-fracture formations. This can pose significant operational risks, such as well-control problems, stuck pipe, and wellbore instability, which in turn increase well time and cost. This research aims to use and evaluate different machine learning techniques, namely support vector machines, random forests, and K-nearest neighbors, in detecting loss circulation occurrences while drilling using solely drilling surface parameters. Actual field data from seven wells, which had suffered partial or severe loss circulation, were used to build predictive models, while Well-8 was used to compare the performance of the developed models. Different performance metrics were used to evaluate the developed models. Recall, precision, and F1-score were used to evaluate the ability of each model to detect loss circulation occurrences. The results showed that the K-nearest neighbors classifier achieved a high F1-score of 0.912 in detecting loss circulation occurrences in the testing set, while the random forests classifier was second best with almost the same F1-score of 0.910. The support vector machines achieved an F1-score of 0.83 in predicting loss circulation occurrences in the testing set. The K-nearest neighbors classifier outperformed the other models in detecting loss circulation occurrences in Well-8, with an F1-score of 0.80. The main contribution of this research compared with previous studies is that it identifies loss events based on real-time measurements of the active pit volume.
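Two elements of this evaluation can be sketched in a few lines of plain Python: the F1-score itself, and holding out an entire well so that the comparison is made on data the models never saw. The row layout, the `flow_out` parameter, the well names other than Well-8 and the function names are assumptions for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (loss) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split_by_well(rows, held_out):
    """Keep every row of the held-out well out of the model-building data."""
    build = [r for r in rows if r["well"] != held_out]
    unseen = [r for r in rows if r["well"] == held_out]
    return build, unseen


# Hypothetical rows: well name, one surface parameter, loss flag (1 = loss).
rows = [
    {"well": "Well-1", "flow_out": 0.98, "loss": 0},
    {"well": "Well-1", "flow_out": 0.71, "loss": 1},
    {"well": "Well-2", "flow_out": 0.95, "loss": 0},
    {"well": "Well-8", "flow_out": 0.69, "loss": 1},
    {"well": "Well-8", "flow_out": 0.97, "loss": 0},
]
build, unseen = split_by_well(rows, "Well-8")
print(len(build), len(unseen))
print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Splitting by well rather than by row avoids leaking samples from the same well into both sets, which is one plausible reason the Well-8 score is lower than the testing-set score.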
