An Opcode-Based Malware Detection Model Using Supervised Learning Algorithms

2021 ◽  
Vol 15 (4) ◽  
pp. 18-30
Author(s):  
Om Prakash Samantray ◽  
Satya Narayan Tripathy

There are several malware detection techniques available that are based on a signature-based approach. This approach can detect known malware very effectively but sometimes may fail to detect unknown or zero-day attacks. In this article, the authors have proposed a malware detection model that uses operation codes of malicious and benign executables as the feature. The proposed model uses opcode extract and count (OPEC) algorithm to prepare the opcode feature vector for the experiment. Most relevant features are selected using extra tree classifier feature selection technique and then passed through several supervised learning algorithms like support vector machine, naive bayes, decision tree, random forest, logistic regression, and k-nearest neighbour to build classification models for malware detection. The proposed model has achieved a detection accuracy of 98.7%, which makes this model better than many of the similar works discussed in the literature.

Author(s):  
G Manoharan ◽  
K Sivakumar

Outlier detection in data mining is an important arena where detection models are developed to discover the objects that do not confirm the expected behavior. The generation of huge data in real time applications makes the outlier detection process into more crucial and challenging. Traditional detection techniques based on mean and covariance are not suitable to handle large amount of data and the results are affected by outliers. So it is essential to develop an efficient outlier detection model to detect outliers in the large dataset. The objective of this research work is to develop an efficient outlier detection model for multivariate data employing the enhanced Hidden Semi-Markov Model (HSMM). It is an extension of conventional Hidden Markov Model (HMM) where the proposed model allows arbitrary time distribution in its states to detect outliers. Experimental results demonstrate the better performance of proposed model in terms of detection accuracy, detection rate. Compared to conventional Hidden Markov Model based outlier detection the detection accuracy of proposed model is obtained as 98.62% which is significantly better for large multivariate datasets.


Android malware have risen exponentially over the past few years, posing several serious threats such as system damage, financial loss, and mobile botnets. Various detection techniques have been proposed in the literature for Android malware detection. Some of the techniques analyze static parameters such as permissions, or intents, whereas, others focus on dynamic parameters such as network traffic or system calls. Static techniques are relatively easier to implement, however, stealthy recent malware evade static detection by virtue of update attacks. Dynamic detection can be used to detect such stealthy malware, however, it increases the computation overhead. Hence, both kinds of techniques have their own advantages and disadvantages. In this paper, we have proposed an innovative hybrid detection model that uses both static and dynamic features for malware analysis and detection. We first rank the static and dynamic parameters according to the information gain and then apply machine learning algorithms in the testing phase. The results indicate that hybrid approach is better than both static and dynamic approaches and the proposed model achieves 98.9% detection accuracy with Decision Tree classifier


2017 ◽  
Vol 20 (3) ◽  
pp. 985-994 ◽  
Author(s):  
Leili Shahriyari

Abstract Motivation: One of the main challenges in machine learning (ML) is choosing an appropriate normalization method. Here, we examine the effect of various normalization methods on analyzing FPKM upper quartile (FPKM-UQ) RNA sequencing data sets. We collect the HTSeq-FPKM-UQ files of patients with colon adenocarcinoma from TCGA-COAD project. We compare three most common normalization methods: scaling, standardizing using z-score and vector normalization by visualizing the normalized data set and evaluating the performance of 12 supervised learning algorithms on the normalized data set. Additionally, for each of these normalization methods, we use two different normalization strategies: normalizing samples (files) or normalizing features (genes). Results: Regardless of normalization methods, a support vector machine (SVM) model with the radial basis function kernel had the maximum accuracy (78%) in predicting the vital status of the patients. However, the fitting time of SVM depended on the normalization methods, and it reached its minimum fitting time when files were normalized to the unit length. Furthermore, among all 12 learning algorithms and 6 different normalization techniques, the Bernoulli naive Bayes model after standardizing files had the best performance in terms of maximizing the accuracy as well as minimizing the fitting time. We also investigated the effect of dimensionality reduction methods on the performance of the supervised ML algorithms. Reducing the dimension of the data set did not increase the maximum accuracy of 78%. However, it leaded to discovery of the 7SK RNA gene expression as a predictor of survival in patients with colon adenocarcinoma with accuracy of 78%.


2014 ◽  
Vol 14 (3) ◽  
pp. 5535-5542
Author(s):  
Sagri Sharma ◽  
Sanjay Kadam ◽  
Hemant Darbari

Analysis of diseases integrating multi-factors increases the complexity of the problem and therefore, development of frameworks for the analysis of diseases is an issue that is currently a topic of intense research. Due to the inter-dependence of the various parameters, the use of traditional methodologies has not been very effective. Consequently, newer methodologies are being sought to deal with the problem. Supervised Learning Algorithms are commonly used for performing the prediction on previously unseen data. These algorithms are commonly used for applications in fields ranging from image analysis to protein structure and function prediction and they get trained using a known dataset to come up with a predictor model that generates reasonable predictions for the response to new data. Gene expression profiles generated by DNA analysis experiments can be quite complex since these experiments can involve hypotheses involving entire genomes. The application of well-known machine learning algorithm - Support Vector Machine - to analyze the expression levels of thousands of genes simultaneously in a timely, automated and cost effective way is thus used. The objectives to undertake the presented work are development of a methodology to identify genes relevant to Hepatocellular Carcinoma (HCC) from gene expression dataset utilizing supervised learning algorithms & statistical evaluations along with development of a predictive framework that can perform classification tasks on new, unseen data


The present research explores the loyalty prediction problem of a brand through supervised learning algorithms of classifications: logistic regression, decision tree, support vector machine, bayes algorithm and K-nearest neighbors (KNN) algorithm. 265 customers’ FMCG loyalty sample data were taken and variables of the data set include; loyalty status, gender, family size, age, frequency of purchase, and FMCG purchase. Data have been analyzed with the help of Python packages such as Pandas (Data analysis), Numpy (Numerical calculation), Matplotlib (Visualization), and Sklearn (Modeling). Among the supervised classification algorithms, logistic regression has outperformed than other techniques


2019 ◽  
Vol 11 (10) ◽  
pp. 1220 ◽  
Author(s):  
Peibei Cao ◽  
Weijie Xia ◽  
Yi Li

Human identification based on radar signatures of individual heartbeats is crucial in various applications, including user authentication in mobile devices, identification of escaped criminals, etc. Usually, optical systems employed to recognize humans are sensitive to ambient light environments, while radar does not have such a drawback, since it has high penetration and all-weather capability. Meanwhile, since micro-Doppler characteristics from the heart of different people are distinct and not easy to fake, it can be used for identification. In this paper, we employed a deep convolutional neural network (DCNN) and conventional supervised learning methods to realize heartbeat-based identification. First, the heartbeat signals were acquired by a Doppler radar and processed by short-time Fourier transform. Then, predefined features were extracted for the conventional supervised learning algorithms, while time–frequency graphs were directly inputted to the DCNN since the network had its own feature extraction part. It is shown that the DCNN could achieve average accuracy of 98.5% for identifying four people, and higher than 80% when the number of people was less than ten. For conventional supervised learning algorithms when identifying four people, the accuracy of the support vector machine (SVM) was 88.75%, and the accuracy of SVM–Bayes was 91.25%, while naive Bayes had the lowest accuracy of 80.75%.


2021 ◽  
Vol 13 (15) ◽  
pp. 3024
Author(s):  
Huiqin Ma ◽  
Wenjiang Huang ◽  
Yingying Dong ◽  
Linyi Liu ◽  
Anting Guo

Fusarium head blight (FHB) is a major winter wheat disease in China. The accurate and timely detection of wheat FHB is vital to scientific field management. By combining three types of spectral features, namely, spectral bands (SBs), vegetation indices (VIs), and wavelet features (WFs), in this study, we explore the potential of using hyperspectral imagery obtained from an unmanned aerial vehicle (UAV), to detect wheat FHB. First, during the wheat filling period, two UAV-based hyperspectral images were acquired. SBs, VIs, and WFs that were sensitive to wheat FHB were extracted and optimized from the two images. Subsequently, a field-scale wheat FHB detection model was formulated, based on the optimal spectral feature combination of SBs, VIs, and WFs (SBs + VIs + WFs), using a support vector machine. Two commonly used data normalization algorithms were utilized before the construction of the model. The single WFs, and the spectral feature combination of optimal SBs and VIs (SBs + VIs), were respectively used to formulate models for comparison and testing. The results showed that the detection model based on the normalized SBs + VIs + WFs, using min–max normalization algorithm, achieved the highest R2 of 0.88 and the lowest RMSE of 2.68% among the three models. Our results suggest that UAV-based hyperspectral imaging technology is promising for the field-scale detection of wheat FHB. Combining traditional SBs and VIs with WFs can improve the detection accuracy of wheat FHB effectively.


Author(s):  
Leandro Skowronski ◽  
Paula Martin de Moraes ◽  
Mario Luiz Teixeira de Moraes ◽  
Wesley Nunes Gonçalves ◽  
Michel Constantino ◽  
...  

Sign in / Sign up

Export Citation Format

Share Document