A Hadoop Based Framework Integrating Machine Learning Classifiers for Anomaly Detection in the Internet of Things

In recent years, different botnet variants have targeted government and private organizations, and there is a crucial need to develop a robust framework for securing the Internet of Things (IoT) network. In this paper, a Hadoop-based framework is proposed to identify malicious IoT traffic using modified Tomek-link under-sampling integrated with automated hyper-parameter tuning of machine learning classifiers. The novelty of this paper is the use of a big data platform on benchmark IoT datasets to minimize computational time. The IoT benchmark datasets are loaded into the Hadoop Distributed File System (HDFS) environment. Three machine learning approaches, namely naive Bayes (NB), K-nearest neighbor (KNN), and support vector machine (SVM), are used for categorizing IoT traffic. Artificial immune network optimization is deployed during cross-validation to obtain the best classifier parameters. Experimental analysis is performed on the Hadoop platform. Average accuracies of 99% and 90% are obtained for the BoT-IoT and ToN-IoT datasets, respectively. The accuracy difference in the ToN-IoT dataset is due to the huge number of data samples captured at the edge layer and fog layer. However, in the BoT-IoT dataset only 5% of the training and test samples from the complete dataset are considered for experimental analysis, as released by the dataset developers. The overall accuracy is improved by 19% in comparison with state-of-the-art techniques. The computational times for the huge datasets are reduced by 3–4 hours through MapReduce in HDFS.
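
The under-sampling step mentioned in this abstract can be illustrated with the textbook definition of a Tomek link: two samples of different classes that are each other's nearest neighbor. The sketch below is a minimal pure-Python version of that standard definition, not the authors' modified variant, and it does not touch Hadoop or hyper-parameter tuning. The function names and the toy points are assumptions made for illustration.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    best, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)  # Euclidean distance (Python 3.8+)
        if d < best_d:
            best, best_d = j, d
    return best

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def undersample(points, labels, majority):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] == majority}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Toy data (hypothetical, not from BoT-IoT or ToN-IoT).
pts = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labs = ["normal", "normal", "normal", "attack", "attack", "attack"]
print(tomek_links(pts, labs))  # [(2, 3)]
```

Only the pair at indices 2 and 3 straddles the class boundary, so `undersample(pts, labs, "normal")` removes point 2 and leaves the other five samples untouched.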

Machine Learning Approaches Applied to GC-FID Fatty Acid Profiles to Discriminate Wild from Farmed Salmon

Foods ◽

10.3390/foods9111622 ◽

2020 ◽

Vol 9 (11) ◽

pp. 1622

Author(s):

Liliana Grazina ◽

P. J. Rodrigues ◽

Getúlio Igrejas ◽

Maria A. Nunes ◽

Isabel Mafra ◽

...

Keyword(s):

Machine Learning ◽

Fatty Acid ◽

Random Forest ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Learning Approaches ◽

Machine Learning Classifiers ◽

Farmed Salmon ◽

Learning Classifiers

In the last decade, there has been an increasing demand for wild-captured fish, which attains higher prices compared to farmed species, thus being prone to mislabeling practices. In this work, fatty acid composition coupled to advanced chemometrics was used to discriminate wild from farmed salmon. The lipids extracted from salmon muscles of different production methods and origins (26 wild from Canada, 25 farmed from Canada, 24 farmed from Chile and 25 farmed from Norway) were analyzed by gas chromatography with flame ionization detector (GC-FID). All the tested chemometric approaches, namely principal components analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and seven machine learning classifiers, namely k-nearest neighbors (kNN), decision tree, support vector machine (SVM), random forest, artificial neural networks (ANN), naïve Bayes and AdaBoost, allowed for differentiation between farmed and wild salmons using the 17 features obtained from chemical analysis. PCA did not allow clear distinguishing between salmon geographical origin since farmed samples from Canada and Chile overlapped. Nevertheless, using the 17 features in the models, six out of the seven tested machine learning classifiers allowed a classification accuracy of ≥99%, with ANN, naïve Bayes, random forest, SVM and kNN presenting 100% accuracy on the test dataset. The classification models were also assayed using only the best features selected by a reduction algorithm and the best input features mapped by t-SNE. The classifier kNN provided the best discrimination results because it correctly classified all samples according to production method and origin, ultimately using only the three most important features (16:0, 18:2n6c and 20:3n3 + 20:4n6). In general, the classifiers presented good generalization with the herein proposed approach being simple and presenting the advantage of requiring only common equipment existing in most labs.
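
As a rough illustration of how a k-nearest-neighbors classifier assigns a label from a handful of numeric features, the following sketch uses Euclidean distance and a majority vote. It is not the authors' pipeline: the three-component vectors stand in for the three selected fatty-acid features, and every number in the toy training set is invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points closest to x."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors; the values are made up, not measured data.
train_X = [(1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.9, 1.1, 1.0),
           (3.0, 3.0, 3.0), (3.1, 2.9, 3.2), (2.9, 3.1, 3.0)]
train_y = ["wild", "wild", "wild", "farmed", "farmed", "farmed"]

print(knn_predict(train_X, train_y, (1.1, 1.0, 1.0)))  # wild
print(knn_predict(train_X, train_y, (3.0, 3.0, 3.1)))  # farmed
```

In practice the features would normally be scaled before computing distances, since kNN is sensitive to differences in magnitude between features.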

Towards Improving Offline Signature Verification based Authentication Using Machine Learning Classifiers

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j9910.0981119 ◽

2019 ◽

Vol 8 (11) ◽

pp. 3393-3401

Keyword(s):

Machine Learning ◽

Sample Size ◽

Nearest Neighbor ◽

Turnaround Time ◽

Signature Verification ◽

Support Vector ◽

Paper Machine ◽

K Nearest Neighbor ◽

Machine Learning Classifiers ◽

Learning Classifiers

Signatures have been accepted in commercial transactions as a method of authentication. Digitizing credentials reduces the storage space required for the same information from a few cubic inches to a number of bytes on a server. The most frequent use of offline signature authentication is to reduce the turnaround time for cheque clearance. In this paper, machine learning classifiers are used to verify the signature using four image-based features. The BHsig260 dataset (Bangla and Hindi) has been used. We used signatures of 55 users each for Hindi and Bangla. Six classifiers, i.e., Boosted Tree, Random Forest classifier (RFC), K-nearest neighbor (KNN), Multilayer Perceptron (MLP), Support Vector Machine (SVM) and Naive Bayes, are used in the work. The results of the writer-independent model show that the accuracy of Hindi offline signature verification is 72.3% using MLP with a signature sample size of 20, and that of Bangla is 79% using RFC with a signature sample size of 23. In the user-dependent model, for some users, we achieved an accuracy of more than 92% using KNN and SVM.

Application of Machine Learning Approaches for the Design and Study of Anticancer Drugs

Current Drug Targets ◽

10.2174/1389450119666180809122244 ◽

2019 ◽

Vol 20 (5) ◽

pp. 488-500 ◽

Cited By ~ 6

Author(s):

Yan Hu ◽

Yi Lu ◽

Shuo Wang ◽

Mengying Zhang ◽

Xiaosheng Qu ◽

...

Keyword(s):

Machine Learning ◽

Drug Design ◽

Anticancer Drugs ◽

Nearest Neighbor ◽

Cost Effective ◽

Support Vector ◽

Learning Approaches ◽

K Nearest Neighbor ◽

Activity Prediction ◽

Linear Discriminant

Background: Globally, the number of cancer patients and deaths continues to increase yearly, and cancer has therefore become one of the world's leading causes of morbidity and mortality. In recent years, the study of anticancer drugs has become one of the most popular medical topics. Objective: In this review, to study the application of machine learning in predicting anticancer drug activity, machine learning approaches such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB) were selected, and examples of their applications in anticancer drug design are listed. Results: Machine learning contributes substantially to anticancer drug design and helps researchers by saving time and cost. However, it can only be an assisting tool for drug design. Conclusion: This paper introduces the application of machine learning approaches in anticancer drug design. Many successful examples of identification and prediction in anticancer drug activity prediction are discussed, and anticancer drug research is still in active progress. Moreover, the merits of some web servers related to anticancer drugs are mentioned.

Classifying Lensed Gravitational Waves in the Geometrical Optics Limit with Machine Learning

American Journal of Undergraduate Research ◽

10.33697/ajur.2019.019 ◽

2019 ◽

Vol 16 (2) ◽

pp. 5-16

Author(s):

Amit Singh ◽

Ivan Li ◽

Otto Hannuksela ◽

Tjonnie Li ◽

Kyungmin Kim

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Gravitational Wave ◽

Gravitational Waves ◽

Geometrical Optics ◽

Supervised Machine Learning ◽

Support Vector ◽

Multi Layer Perceptron ◽

Machine Learning Classifiers ◽

Learning Classifiers

Gravitational waves are theorized to be gravitationally lensed when they propagate near massive objects. Such lensing effects cause potentially detectable repeated gravitational wave patterns in ground- and space-based gravitational wave detectors. These effects are difficult to discriminate when the lens is small and the repeated patterns superpose. Traditionally, matched filtering techniques are used to identify gravitational-wave signals, but we instead aim to utilize machine learning techniques to achieve this. In this work, we implement supervised machine learning classifiers (support vector machine, random forest, multi-layer perceptron) to discriminate such lensing patterns in gravitational wave data. We train classifiers with spectrograms of both lensed and unlensed waves using both point-mass and singular isothermal sphere lens models. As a result, classifiers return F1 scores ranging from 0.852 to 0.996, with precisions from 0.917 to 0.992 and recalls ranging from 0.796 to 1.000, depending on the type of classifier and lensing model used. This supports the idea that machine learning classifiers are able to correctly identify lensed gravitational wave signals. This also suggests that in the future, machine learning classifiers may be used as a possible alternative to identify lensed gravitational wave events and to allow us to study gravitational wave sources and massive astronomical objects through further analysis. KEYWORDS: Gravitational Waves; Gravitational Lensing; Geometrical Optics; Machine Learning; Classification; Support Vector Machine; Random Tree Forest; Multi-layer Perceptron

Comparative Study of Machine Learning Classifiers for Modelling Road Traffic Accidents

Applied Sciences ◽

10.3390/app12020828 ◽

2022 ◽

Vol 12 (2) ◽

pp. 828

Author(s):

Tebogo Bokaba ◽

Wesley Doorsamy ◽

Babu Sena Paul

Keyword(s):

Machine Learning ◽

Traffic Accidents ◽

Road Traffic ◽

Real Life ◽

Support Vector ◽

Road Traffic Accidents ◽

Machine Learning Classifiers ◽

Reduction Techniques ◽

Learning Classifiers ◽

Accident Data

Road traffic accidents (RTAs) are a major cause of injuries and fatalities worldwide. In recent years, there has been a growing global interest in analysing RTAs, specifically in analysing and modelling accident data to better understand and assess the causes and effects of accidents. This study analysed the performance of widely used machine learning classifiers using a real-life RTA dataset from Gauteng, South Africa. The study aimed to assess prediction model designs for RTAs to assist transport authorities and policymakers. It considered classifiers such as naïve Bayes, logistic regression, k-nearest neighbour, AdaBoost, support vector machine, and random forest (RF), together with five missing data methods. These classifiers were evaluated using five evaluation metrics: accuracy, root-mean-square error, precision, recall, and receiver operating characteristic curves. Furthermore, the assessment involved parameter adjustment and incorporated dimensionality reduction techniques. The empirical results and analyses show that the RF classifier, combined with multiple imputation by chained equations, yielded the best performance when compared with the other combinations.

Different Machine Learning Classifiers for Music Emotion Recognition

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7833.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 2187-2191

Keyword(s):

Machine Learning ◽

Emotion Recognition ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Bayes Classifier ◽

Promising Alternative ◽

Machine Learning Classifiers ◽

Learning Classifiers ◽

Statistical Metrics

Music is an essential part of life, and the emotion carried by it is key to its perception and usage. Music Emotion Recognition (MER) is the task of identifying the emotion in musical tracks and classifying them accordingly. The objective of this research paper is to check the effectiveness of popular machine learning classifiers such as XGBoost, Random Forest, Decision Trees, Support Vector Machine (SVM), K-Nearest-Neighbour (KNN) and Gaussian Naive Bayes on the task of MER. Using the MIREX-like dataset [17] to test these classifiers, the effects of oversampling algorithms such as Synthetic Minority Oversampling Technique (SMOTE) [22] and Random Oversampling (ROS) were also verified. Overall, the Gaussian Naive Bayes classifier gave the maximum accuracy of 40.33%. The other classifiers gave accuracies between 20.44% and 38.67%. Thus, a limit on the classification accuracy has been reached using these classifiers and using traditional musical or statistical metrics derived from the music as input features. In view of this, deep learning-based approaches using Convolutional Neural Networks (CNNs) [13] and spectrograms of the music clips are a promising alternative for MER.
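
Random Oversampling (ROS), one of the two oversampling methods named in this abstract, can be sketched in a few lines: each minority class is padded with randomly chosen duplicates of its own samples until it matches the largest class. The function name, the seed parameter and the toy labels below are assumptions for illustration, not the authors' code. SMOTE differs in that it synthesizes new points by interpolating between neighboring minority samples instead of duplicating existing ones.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random minority samples until every class matches the largest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Hypothetical feature vectors and emotion labels, invented for the example.
X = [(0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (0.9,), (1.0,)]
y = ["happy"] * 5 + ["sad"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 5 samples
```

Oversampling should be applied to the training split only; duplicating samples before the train/test split would leak copies of training points into the test set.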

Investigating the Physics of Tokamak Global Stability with Interpretable Machine Learning Tools

Applied Sciences ◽

10.3390/app10196683 ◽

2020 ◽

Vol 10 (19) ◽

pp. 6683

Author(s):

Andrea Murari ◽

Emmanuele Peluso ◽

Michele Lungaroni ◽

Riccardo Rossi ◽

Michela Gelfusa ◽

...

Keyword(s):

Machine Learning ◽

Data Mining ◽

Independent Learning ◽

Support Vector ◽

Learning Tools ◽

Feedback Systems ◽

Theoretical Understanding ◽

Machine Learning Classifiers ◽

Learning Classifiers ◽

Mining Tools

The inadequacies of basic physics models for disruption prediction have induced the community to increasingly rely on data mining tools. In the last decade, it has been shown how machine learning predictors can achieve a much better performance than those obtained with manually identified thresholds or empirical descriptions of the plasma stability limits. The main criticisms of these techniques focus therefore on two different but interrelated issues: poor “physics fidelity” and limited interpretability. Insufficient “physics fidelity” refers to the fact that the mathematical models of most data mining tools do not reflect the physics of the underlying phenomena. Moreover, they implement a black box approach to learning, which results in very poor interpretability of their outputs. To overcome or at least mitigate these limitations, a general methodology has been devised and tested, with the objective of combining the predictive capability of machine learning tools with the expression of the operational boundary in terms of traditional equations more suited to understanding the underlying physics. The proposed approach relies on the application of machine learning classifiers (such as Support Vector Machines or Classification Trees) and Symbolic Regression via Genetic Programming directly to experimental databases. The results are very encouraging. The obtained equations of the boundary between the safe and disruptive regions of the operational space present almost the same performance as the machine learning classifiers, based on completely independent learning techniques. Moreover, these models possess significantly better predictive power than traditional representations, such as the Hugill or the beta limit. More importantly, they are realistic and intuitive mathematical formulas, which are well suited to supporting theoretical understanding and to benchmarking empirical models. They can also be deployed easily and efficiently in real-time feedback systems.

Linear SVM-Based Android Malware Detection for Reliable IoT Services

Journal of Applied Mathematics ◽

10.1155/2014/594501 ◽

2014 ◽

Vol 2014 ◽

pp. 1-10 ◽

Cited By ~ 35

Author(s):

Hyo-Sik Ham ◽

Hwan-Hee Kim ◽

Myung-Sup Kim ◽

Mi-Jung Choi

Keyword(s):

Machine Learning ◽

Mobile Devices ◽

Malware Detection ◽

Information Leakage ◽

Support Vector ◽

Android Malware ◽

Machine Learning Classifiers ◽

Android Malware Detection ◽

Learning Classifiers ◽

Linear Svm

Currently, many Internet of Things (IoT) services are monitored and controlled through smartphone applications. By combining IoT with smartphones, many convenient IoT services have been provided to users. However, such services have adverse underlying effects, including invasion of privacy and information leakage. In most cases, mobile devices have become cluttered with important personal user information, as various services and contents are provided through them. Accordingly, attackers are expanding the scope of their attacks beyond the existing PC and Internet environment into mobile devices. In this paper, we apply a linear support vector machine (SVM) to detect Android malware and compare the malware detection performance of SVM with that of other machine learning classifiers. Through experimental validation, we show that the SVM outperforms other machine learning classifiers.

Diagnosis of Problems in Truck Ore Transport Operations in Underground Mines Using Various Machine Learning Models and Data Collected by Internet of Things Systems

Minerals ◽

10.3390/min11101128 ◽

2021 ◽

Vol 11 (10) ◽

pp. 1128

Author(s):

Sebeom Park ◽

Dahee Jung ◽

Hoang Nguyen ◽

Yosoon Choi

Keyword(s):

Machine Learning ◽

Internet Of Things ◽

Production Management ◽

Classification And Regression Tree ◽

Underground Mines ◽

Validation Dataset ◽

Support Vector ◽

Learning Models ◽

K Nearest Neighbor ◽

Machine Learning Models

This study proposes a method for diagnosing problems in truck ore transport operations in underground mines using four machine learning models (i.e., Gaussian naïve Bayes (GNB), k-nearest neighbor (kNN), support vector machine (SVM), and classification and regression tree (CART)) and data collected by an Internet of Things system. A limestone underground mine with an applied mine production management system (using a tablet computer and Bluetooth beacon) is selected as the research area, and log data related to the truck travel time are collected. The machine learning models are trained and verified using the collected data, and grid search through 5-fold cross-validation is performed to improve the prediction accuracy of the models. The accuracy of CART is highest when the parameters leaf and split are set to 1 and 4, respectively (94.1%). In the validation of the machine learning models performed using the validation dataset (1500), the accuracy of the CART was 94.6%, and the precision and recall were 93.5% and 95.7%, respectively. In addition, it is confirmed that the F1 score reaches values as high as 94.6%. Through field application and analysis, it is confirmed that the proposed CART model can be utilized as a tool for monitoring and diagnosing the status of truck ore transport operations.
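
The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean, F1 = 2PR / (P + R). The short sketch below redoes that arithmetic with the validation figures quoted in the abstract; the function name is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Validation figures quoted in the abstract for the CART model.
f1 = f1_score(0.935, 0.957)
print(f"{f1:.3f}")  # 0.946
```

The result rounds to 94.6%, which matches the F1 score stated in the abstract.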

Event classification from the Urdu language text on social media

PeerJ Computer Science ◽

10.7717/peerj-cs.775 ◽

2021 ◽

Vol 7 ◽

pp. e775

Author(s):

Malik Daler Ali Awan ◽

Nadeem Iqbal Kajla ◽

Amnah Firdous ◽

Mujtaba Husnain ◽

Malik Muhammad Saad Missen

Keyword(s):

Machine Learning ◽

Social Media ◽

Nearest Neighbor ◽

Event Extraction ◽

K Nearest Neighbor ◽

Event Classification ◽

Machine Learning Classifiers ◽

Learning Classifiers ◽

Multilingual Data ◽

Language Text

The real-time availability of the Internet has engaged millions of users around the world. Regional languages are preferred for effective and easy communication, which produces multilingual data on social networks and news channels. People share ideas, opinions, and events happening globally (e.g., sports, inflation, protests, explosions, and sexual assault) in regional (local) languages on social media. Extraction and classification of events from multilingual data have become bottlenecks because of a lack of resources. In this research paper, we present the event classification task for Urdu-language text on social media and news channels using machine learning classifiers. The dataset contains more than 0.1 million (102,962) labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. Term Frequency-Inverse Document Frequency (tf-idf) showed the best results as a feature vector for evaluating the performance of six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty lies in the fact that, to the best of our knowledge, the aforementioned features have not previously been applied to event extraction from text written in the Urdu language.
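
Two of the feature ideas in this abstract, the last four words of a sentence and tf-idf weighting, can be sketched without any library. The version below uses the textbook weight tf × log(N / df); common libraries apply smoothed variants, and the abstract does not say which formula the authors used. The function names are assumptions, and English placeholder tokens stand in for Urdu text.

```python
import math
from collections import Counter

def last_words(sentence, k=4):
    """Last k whitespace-separated tokens of a sentence."""
    return sentence.split()[-k:]

def tfidf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

# Placeholder documents (hypothetical, not from the Urdu dataset).
docs = [["match", "won", "today"], ["blast", "reported", "today"]]
vecs = tfidf(docs)
print(vecs[0]["today"])                   # 0.0, the term occurs in every document
print(last_words("a b c d e f"))          # ['c', 'd', 'e', 'f']
```

A term that appears in every document gets weight zero, which is why tf-idf tends to emphasize the words that distinguish one event type from another.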
