Analysis of Machine Learning Techniques for Anomaly-Based Intrusion Detection

2020 ◽  
Vol 12 (1) ◽  
pp. 20-38
Author(s):  
Winfred Yaokumah ◽  
Isaac Wiafe

Determining the machine learning (ML) technique that performs best on new datasets is an important factor in the design of effective anomaly-based intrusion detection systems. This study therefore evaluated four machine learning algorithms (naïve Bayes, k-nearest neighbors, decision tree, and random forest) on the UNSW-NB15 dataset for intrusion detection. The experimental results showed that the random forest and decision tree classifiers are effective for detecting intrusions. Random forest had the highest weighted average accuracy of 89.66% and a mean absolute error (MAE) of 0.0252, whereas decision tree recorded 89.20% and 0.0242, respectively. The naïve Bayes classifier had the worst results on the dataset, with 56.43% accuracy and an MAE of 0.0867. However, contrary to existing knowledge, naïve Bayes was observed to be potent in classifying backdoor attacks. Notably, naïve Bayes performed relatively well in classes where tree-based classifiers demonstrated abysmal performance.
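
The per-class view mentioned above (a classifier doing well on one attack class and poorly on others) can be illustrated with a small sketch. It assumes that "weighted average accuracy" means per-class recall weighted by class support, which is one common reading and is not confirmed by the abstract; the labels below are hypothetical and not drawn from UNSW-NB15.

```python
from collections import Counter

def weighted_average_recall(y_true, y_pred):
    """Return per-class recall and its support-weighted average."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    per_class = {c: hits[c] / n for c, n in support.items()}
    weighted = sum(per_class[c] * support[c] / total for c in support)
    return per_class, weighted

# Hypothetical labels for illustration only.
y_true = ["normal", "normal", "normal", "dos", "dos", "backdoor"]
y_pred = ["normal", "normal", "dos", "dos", "dos", "normal"]
per_class, weighted = weighted_average_recall(y_true, y_pred)
print(per_class, round(weighted, 4))
```

Under this reading the weighted average equals the overall fraction of correct predictions, so a class with few samples (here "backdoor") can score zero without moving the headline figure much.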

2019 ◽  
Vol 9 (14) ◽  
pp. 2789 ◽  
Author(s):  
Sadaf Malik ◽  
Nadia Kanwal ◽  
Mamoona Naveed Asghar ◽  
Mohammad Ali A. Sadiq ◽  
Irfan Karamat ◽  
...  

Medical health systems have been concentrating on artificial intelligence techniques for speedy diagnosis. However, the recording of health data in a standard form still requires attention so that machine learning can be more accurate and reliable by considering multiple features. The aim of this study is to develop a general framework for recording diagnostic data in an international standard format to facilitate prediction of disease diagnosis based on symptoms using machine learning algorithms. Efforts were made to ensure error-free data entry by developing a user-friendly interface. Furthermore, multiple machine learning algorithms including Decision Tree, Random Forest, Naive Bayes and Neural Network algorithms were used to analyze patient data based on multiple features, including age, illness history and clinical observations. This data was formatted according to structured hierarchies designed by medical experts, whereas diagnosis was made as per the ICD-10 coding developed by the American Academy of Ophthalmology. Furthermore, the system is designed to evolve through self-learning by adding new classifications for both diagnosis and symptoms. The classification results from tree-based methods demonstrated that the proposed framework performs satisfactorily, given a sufficient amount of data. Owing to a structured data arrangement, the random forest and decision tree algorithms’ prediction rate is more than 90% as compared to more complex methods such as neural networks and the naïve Bayes algorithm.


The scope of this research work is to identify the most efficient machine learning algorithm for predicting the behavior of a student from the student performance dataset. We applied Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree, and Naïve Bayes algorithms to predict the grade of a student and compared their prediction results in terms of various performance metrics. Students who visited many resources for reference, took part in academic discussions and interactions in the classroom, were absent for the fewest days, and were cared for by their parents showed great improvement in the final grade. Among the machine learning techniques we used, SVM showed the highest accuracy in terms of four important attributes. The accuracy of SVM after tuning is 0.80. KNN and decision tree achieve accuracies of 0.64 and 0.65, respectively, whereas Naïve Bayes achieves 0.77.


Author(s):  
Anirudh Reddy Cingireddy ◽  
Robin Ghosh ◽  
Supratik Kar ◽  
Venkata Melapu ◽  
Sravanthi Joginipeli ◽  
...  

Frequent testing of the entire population would help to identify individuals with active COVID-19 and allow us to identify concealed carriers. Molecular tests, antigen tests, and antibody tests are being widely used to confirm COVID-19 in the population. Molecular tests such as the real-time reverse transcription-polymerase chain reaction (rRT-PCR) test take a minimum of 3 hours and a maximum of 4 days to return results. The authors suggest using machine learning and data mining tools to filter large populations at a preliminary level to overcome this issue. The ML tools could reduce the testing population size by 20 to 30%. In this study, they used a subset of features from full blood profiles drawn from patients at the Israelita Albert Einstein hospital in Brazil. They used classification models, namely KNN, logistic regression, XGBoost, naïve Bayes, decision tree, random forest, support vector machine, and multilayer perceptron, and validated them with k-fold cross-validation. Naïve Bayes, KNN, and random forest stand out as the most predictive ones, with 88% accuracy each.


Diabetes is one of the most common diseases today. Machine learning techniques are proposed for predicting it. With this approach, the risk factors of the disease can be identified and the disease can be prevented from progressing. Early prediction can help control the disease and save lives. For early prediction, we collected a dataset with 8 attributes from 200 diabetic patients. The patients' sugar level is assessed from the glucose content in the body and from age. The main machine learning algorithms are support vector machine (SVM), naïve Bayes (NB), k-nearest neighbor (KNN), and decision tree (DT). In the existing system, naïve Bayes reaches an accuracy of 66% and the decision tree 70 to 71%, which is not adequate. With XGBoost classifiers, however, naïve Bayes reaches 74% and the decision tree 89 to 90%. The proposed system reports accuracy in an acceptable range. A dataset of 729 patients is stored in MongoDB; the reports of 129 patients are used for prediction and the remainder for training. The model trained on this data is then used for prediction.


Electronics ◽  
2021 ◽  
Vol 10 (14) ◽  
pp. 1677
Author(s):  
Ersin Elbasi ◽  
Ahmet E. Topcu ◽  
Shinu Mathew

COVID-19 is a community-acquired infection with symptoms that resemble those of influenza and bacterial pneumonia. Creating an infection control policy involving isolation, disinfection of surfaces, and identification of contagions is crucial in eradicating such pandemics. Incorporating social distancing could also help stop the spread of community-acquired infections like COVID-19. Social distancing entails maintaining certain distances between people and reducing the frequency of contact between people. Meanwhile, a significant increase in the development of different Internet of Things (IoT) devices has been seen together with cyber-physical systems that connect with physical environments. Machine learning is strengthening current technologies by adding new approaches to quickly and correctly solve problems utilizing this surge of available IoT devices. We propose a new approach using machine learning algorithms for monitoring the risk of COVID-19 in public areas. Extracted features from IoT sensors are used as input for several machine learning algorithms such as decision tree, neural network, naïve Bayes classifier, support vector machine, and random forest to predict the risks of the COVID-19 pandemic and calculate the risk probability of public places. This research aims to find vulnerable populations and reduce the impact of the disease on certain groups using machine learning models. We build a model to calculate and predict the risk factors of populated areas. This model generates automated alerts for security authorities in the case of any abnormal detection. Experimental results show that we have high accuracy with random forest of 97.32%, with decision tree of 94.50%, and with the naïve Bayes classifier of 99.37%. These algorithms indicate great potential for crowd risk prediction in public areas.


2019 ◽  
Author(s):  
Thomas M. Kaiser ◽  
Pieter B. Burger

Machine learning continues to make rapid advances in the prediction of desired properties concerning drug development. Problematically, the efficacy of machine learning in these arenas is reliant upon highly accurate and abundant data. These two limitations, high accuracy and abundance, are often taken together; however, insight into the dataset accuracy limitation of contemporary machine learning algorithms may yield insight into whether non-bench experimental sources of data may be used to generate useful machine learning models where there is a paucity of experimental data. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest Model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored the ability of a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) to generate machine learning models with useful retrospective capabilities. The categorical error tolerance was quite high for the Naïve Bayes Network algorithm, which on average required 39% error in the training set to lose predictivity on the test set. Additionally, the Random Forest tolerated a significant degree of categorical error introduced into the training set, with an average error of 29% required to lose predictivity. However, we found that the Probabilistic Neural Network algorithm did not tolerate as much categorical error, requiring an average of 20% error to lose predictivity. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods of known error distribution like FEP+ may be useful in generating machine learning models not based on extensive and expensive in vitro-generated datasets.
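
The idea of introducing categorical error into a training set at a chosen proportion can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `inject_label_error`, the class names, and the uniform choice of which labels to corrupt are all assumptions.

```python
import random

def inject_label_error(labels, proportion, classes, seed=0):
    """Return a copy of labels with a given proportion reassigned to a wrong class."""
    rng = random.Random(seed)  # fixed seed so the corruption is reproducible
    noisy = list(labels)
    n_flip = round(proportion * len(noisy))
    # Pick distinct positions, then replace each label with a different class.
    for i in rng.sample(range(len(noisy)), n_flip):
        alternatives = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

# Hypothetical activity labels for one target.
labels = ["active"] * 50 + ["inactive"] * 50
noisy = inject_label_error(labels, 0.20, ["active", "inactive"])
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed)
```

Training a model on `noisy` for increasing values of `proportion` and scoring it on an untouched test set would trace the kind of accuracy decay curve the abstract describes.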


2020 ◽  
Vol 12 (1) ◽  
pp. 12 ◽  
Author(s):  
You Guo ◽  
Hector Marco-Gisbert ◽  
Paul Keir

A webshell is a command execution environment in the form of web pages. It is often used by attackers as a backdoor tool for web server operations. Accurately detecting webshells is of great significance to web server protection. Most security products detect webshells based on feature-matching methods—matching input scripts against pre-built malicious code collections. The feature-matching method has a low detection rate for obfuscated webshells. However, with the help of machine learning algorithms, webshells can be detected more efficiently and accurately. In this paper, we propose a new PHP webshell detection model, the NB-Opcode (naïve Bayes and opcode sequence) model, which is a combination of naïve Bayes classifiers and opcode sequences. Through experiments and analysis on a large number of samples, the experimental results show that the proposed method could effectively detect a range of webshells. Compared with the traditional webshell detection methods, this method improves the efficiency and accuracy of webshell detection.
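
To make the combination of naïve Bayes and opcodes concrete, here is a minimal sketch of a multinomial naïve Bayes classifier over opcode tokens. It is a simplification under stated assumptions, not the NB-Opcode model itself: opcodes are treated as a bag of tokens rather than as ordered sequences, the class `OpcodeNaiveBayes` is hypothetical, and the tiny training set is invented for illustration.

```python
import math
from collections import Counter

class OpcodeNaiveBayes:
    """Multinomial naive Bayes over opcode tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant

    def fit(self, sequences, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for seq, label in zip(sequences, labels):
            self.token_counts[label].update(seq)
        self.vocab = {t for c in self.classes for t in self.token_counts[c]}
        return self

    def predict(self, seq):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus summed log likelihoods of the known opcodes.
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.token_counts[c].values()) + self.alpha * len(self.vocab)
            for tok in seq:
                if tok in self.vocab:  # opcodes unseen in training are skipped
                    score += math.log((self.token_counts[c][tok] + self.alpha) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

# Invented opcode listings, for illustration only.
train = [
    (["ASSIGN", "ECHO", "RETURN"], "benign"),
    (["ASSIGN", "CONCAT", "ECHO", "RETURN"], "benign"),
    (["FETCH_R", "INCLUDE_OR_EVAL", "RETURN"], "webshell"),
    (["FETCH_R", "INIT_FCALL", "SEND_VAR", "DO_FCALL", "RETURN"], "webshell"),
]
model = OpcodeNaiveBayes().fit([s for s, _ in train], [l for _, l in train])
suspicious = model.predict(["FETCH_R", "INCLUDE_OR_EVAL"])
ordinary = model.predict(["ASSIGN", "ECHO"])
print(suspicious, ordinary)
```

Working on opcodes rather than source text is what makes this kind of approach less sensitive to obfuscation, since differently written scripts can compile to similar instruction streams.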


Author(s):  
Jothikumar R. ◽  
Vijay Anand R. ◽  
Visu P. ◽  
Kumar R. ◽  
Susi S. ◽  
...  

Sentiment analysis refers to extracting sentiments from natural language and identifying the attitude toward a specific topic. The novel coronavirus, a harmful disease, is spreading unexpectedly across regions, causing respiratory tract infections that range from mild to severe. Because of its rapid spread and the absence of a known cure, it has ushered in a sense of stress and anxiety. In this chapter, a machine learning-based approach is used to identify the sentiments of tweets related to COVID and its effect, lockdown. The chapter examines tweets carrying hashtags related to coronavirus and lockdown. The tweets were labeled positive, negative, or neutral, and a set of classifiers was used to assess accuracy and performance. The classifiers fall under four models: decision tree, regression, support vector machine, and naïve Bayes.


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-24 ◽  
Author(s):  
Amirhessam Tahmassebi ◽  
Amir H. Gandomi ◽  
Mieke H. J. Schulte ◽  
Anna E. Goudriaan ◽  
Simon Y. Foo ◽  
...  

This paper aims at developing new theory-driven biomarkers by implementing and evaluating novel techniques from resting-state scans that can be used in relapse prediction for nicotine-dependent patients and future treatment efficacy. Two classes of patients were studied. One class took the drug N-acetylcysteine and the other class took a placebo. Then, the patients underwent a double-blind smoking cessation treatment, and the resting-state fMRI scans of their brains before and after treatment were recorded. The scientific research goal of this study was to interpret the fMRI connectivity maps based on machine learning algorithms to predict which patients will relapse and which will not. In this regard, the feature matrix was extracted from the image slices of the brain employing voxel selection schemes and data reduction algorithms. Then, the feature matrix was fed into the machine learning classifiers, including an optimized CART decision tree and a naïve Bayes classifier with standard and optimized implementations, employing 10-fold cross-validation. Out of all the data reduction techniques and the machine learning algorithms employed, the best accuracy was obtained using the singular value decomposition along with the optimized naïve Bayes classifier. This gave an accuracy of 93% with sensitivity-specificity of 99%, which suggests that relapse in nicotine-dependent patients can be predicted based on resting-state fMRI images. The use of these approaches may result in clinical applications in the future.
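
For readers less familiar with the reported metrics, the sketch below computes sensitivity and specificity from predicted and true labels. The labels are hypothetical and unrelated to the study's data, and the function assumes both the positive and the negative class occur at least once.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # relapse correctly predicted
            else:
                fn += 1  # relapse missed
        else:
            if pred == positive:
                fp += 1  # false alarm
            else:
                tn += 1  # non-relapse correctly predicted
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical outcomes for illustration only.
y_true = ["relapse", "relapse", "relapse", "abstain", "abstain", "abstain", "abstain"]
y_pred = ["relapse", "relapse", "abstain", "abstain", "abstain", "abstain", "relapse"]
sens, spec = sensitivity_specificity(y_true, y_pred, positive="relapse")
print(round(sens, 3), round(spec, 3))
```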


Cardiovascular diseases are one of the main causes of mortality in the world. A proper prediction system with reasonable cost can significantly reduce this death toll in low-income countries such as Bangladesh. For those countries we propose a machine learning-backed embedded system that can predict a possible cardiac attack effectively by excluding the high-cost angiogram and incorporating only twelve (12) low-cost features: age, sex, chest pain, blood pressure, cholesterol, blood sugar, ECG results, heart rate, exercise-induced angina, old peak, slope, and history of heart disease. Two heart disease datasets are used: our own dataset built from NICVD (National Institute of Cardiovascular Disease, Bangladesh) patients, and the UCI (University of California, Irvine) dataset. The overall process comprises four phases: a comprehensive literature review; collection of stable angina patients' data through survey questionnaires from NICVD; manual reduction of the feature vector dimensionality (from 14 to 12 dimensions); and feeding the reduced feature vector to machine learning-based classifiers to obtain a prediction model for heart disease. From the experiments, it is observed that, on the NICVD patients' data with 12 features and without angiographic disease status, the Artificial Neural Network (ANN) shows a better classification accuracy of 92.80% compared to the other classifiers, Decision Tree (82.50%), Naïve Bayes (85%), Support Vector Machine (SVM) (75%), Logistic Regression (77.50%), and Random Forest (75%), using 10-fold cross-validation. To accommodate the small-scale training and test data in our experimental environment, we also measured the accuracy of ANN, Decision Tree, Naïve Bayes, SVM, Logistic Regression, and Random Forest using the Jackknife method, obtaining 84.80%, 71%, 75.10%, 75%, 75.33%, and 71.42%, respectively. On the other hand, the classification accuracies of the corresponding classifiers are 91.7%, 76.90%, 86.50%, 76.3%, 67.0%, and 67.3%, respectively, for the UCI dataset with 12 attributes, whereas the same dataset with 14 attributes including angiographic status shows accuracies of 93.5%, 76.7%, 86.50%, 76.8%, 67.7%, and 69.6% for the respective classifiers.
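
The Jackknife evaluation mentioned above is sketched here as a leave-one-out test, which is how the term is commonly used for classifier evaluation on small datasets; whether the authors used exactly this variant is an assumption. The toy 1-nearest-neighbour rule and the (age, heart rate) records are hypothetical stand-ins, not the paper's classifiers or data.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Toy 1-nearest-neighbour rule using squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

def leave_one_out_accuracy(X, y, predict):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical (age, heart rate) records with a binary disease label.
X = [(45, 70), (50, 72), (48, 75), (62, 95), (65, 98), (60, 92)]
y = [0, 0, 0, 1, 1, 1]
acc = leave_one_out_accuracy(X, y, nearest_neighbor_predict)
print(acc)
```

Because every sample is used for testing exactly once, this scheme makes full use of a small dataset, at the cost of training the model as many times as there are samples.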

