Detecting Malware Based on Opcode N-Gram and Machine Learning

Author(s):  
Pengfei Li ◽  
Zhouguo Chen ◽  
Baojiang Cui
Author(s):  
Jia Luo ◽  
Dongwen Yu ◽  
Zong Dai

Manual methods cannot feasibly process the huge volumes of structured and semi-structured data now available. This study aims to solve that problem with machine learning algorithms. We collected text data on public opinion about companies using web crawlers, applied the Latent Dirichlet Allocation (LDA) algorithm to extract keywords from the text, and used fuzzy clustering to group the keywords into topics. The topic keywords then serve as a seed dictionary for new word discovery. To verify the efficiency of machine learning in new word discovery, algorithms based on association rules, N-grams, PMI, and Word2vec were compared. The experimental results show that the machine-learning-based Word2vec algorithm achieves the highest accuracy, recall, and F-value.
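
As a rough illustration of the pipeline described in this abstract, the sketch below runs LDA keyword extraction followed by a Word2vec similarity query, assuming scikit-learn and gensim; the documents, topic count, and query word are invented placeholders, not the study's data or parameters.

```python
# Sketch: extract topic keywords with LDA, then use Word2vec to surface
# candidate new words near a seed term. All data here is a toy placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import Word2Vec

docs = ["company stock price fell after the report",
        "new product launch boosts company reputation",
        "public opinion on the merger remains negative"]

# Bag-of-words counts feed the LDA topic model.
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# The top keywords per topic act as the seed dictionary.
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:]]
    print(f"topic {k}: {top}")

# Word2vec similarity around a seed term suggests new-word candidates.
w2v = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1, seed=0)
print(w2v.wv.most_similar("company", topn=3))
```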


Author(s):  
Saugata Bose ◽  
Ritambhra Korpal

In this chapter, an initiative is proposed in which natural language processing (NLP) techniques and supervised machine learning algorithms are combined to detect external plagiarism. The major emphasis is on constructing a framework that detects plagiarism in monolingual texts using an n-gram frequency comparison approach. The framework is based on 120 features extracted during pre-processing using simple NLP techniques. Filter metrics are then applied to select the most relevant features, and a supervised classification algorithm is used to assign documents to one of four levels of plagiarism. A confusion matrix is built to estimate the false positives and false negatives. Finally, the authors show that a C4.5 decision-tree classifier achieves higher accuracy than naive Bayes. The framework achieves 89% accuracy with low false positive and false negative rates, and shows higher precision and recall than the passage similarity, sentence similarity, and search space reduction methods.
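
The chapter's 120-feature set is not reproduced in the abstract, but the general idea can be sketched as below, assuming an n-gram Jaccard overlap as one stand-in feature and scikit-learn's entropy-criterion decision tree as an approximation of C4.5 (which scikit-learn does not implement); all data is toy.

```python
# Sketch: n-gram overlap as a plagiarism feature, classified with an
# entropy-based decision tree (a CART stand-in for C4.5).
from sklearn.tree import DecisionTreeClassifier

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-gram sets; one illustrative feature."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

pairs = [("the quick brown fox", "the quick brown fox"),   # copied
         ("the quick brown fox", "a slow green turtle")]   # original
X = [[ngram_overlap(s, t)] for s, t in pairs]
y = [1, 0]  # 1 = plagiarised, 0 = original (the chapter uses four levels)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(clf.predict([[0.9], [0.1]]))
```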


Author(s):  
Prayag Tiwari ◽  
Brojo Kishore Mishra ◽  
Sachin Kumar ◽  
Vivek Kumar

Sentiment analysis aims to identify the underlying opinion in content that carries a subjective view, such as an online review, comments on blog posts, or a film rating. These reviews and posts can be assigned to polarity groups, for example negative, positive, and neutral, in order to extract information from the input dataset. Supervised machine learning methods are used to classify these reviews. In this paper, three different machine learning algorithms, Support Vector Machine (SVM), Maximum Entropy (ME), and Naive Bayes (NB), are considered for the classification of human opinions. The performance of the different methods is examined on the basis of parameters such as precision, recall, f-measure, and accuracy.
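
A minimal sketch of such a comparison, assuming scikit-learn, where logistic regression stands in for the Maximum Entropy classifier (the two coincide for this purpose); the reviews and labels are invented.

```python
# Sketch: compare NB, SVM, and MaxEnt (logistic regression) on toy reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

reviews = ["great film, loved it", "terrible plot, awful acting",
           "wonderful performance", "boring and dull"]
labels = ["pos", "neg", "pos", "neg"]

X = TfidfVectorizer().fit_transform(reviews)
for clf in (MultinomialNB(), LinearSVC(), LogisticRegression()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```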


2016 ◽  
Vol 57 ◽  
pp. 117-126 ◽  
Author(s):  
Abinash Tripathy ◽  
Ankit Agrawal ◽  
Santanu Kumar Rath

Sci ◽  
2020 ◽  
Vol 2 (4) ◽  
pp. 92
Author(s):  
Ovidiu Calin

This paper presents a quantitative approach to poetry, based on several statistical measures (entropy, informational energy, N-grams, etc.) applied to a few characteristic English writings. We found that the entropy of English changes over time, and that entropy depends on both the language used and the author. To compare two similar texts, we introduce a statistical method to assess the information entropy between them. We also introduce a method for computing the average information that a group of letters conveys about the next letter in the text. We derive a formula for computing the Shannon language entropy and introduce the concept of the N-gram informational energy of a poem. Finally, we construct a neural network that generates Byron-style poetry and analyzes its informational proximity to genuine Byron poetry.
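
The paper's exact formulas are not given in the abstract, but the standard definitions of Shannon entropy, H = -Σ p_i log2 p_i, and informational energy, E = Σ p_i², computed over n-gram frequencies, can be sketched as follows; the sample line is illustrative only.

```python
# Sketch: Shannon entropy and informational energy over character n-grams.
# These are the textbook definitions; the paper's exact weighting may differ.
from collections import Counter
from math import log2

def ngram_probs(text: str, n: int = 1):
    """Empirical probabilities of the character n-grams in the text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def entropy(text: str, n: int = 1) -> float:
    return -sum(p * log2(p) for p in ngram_probs(text, n))

def informational_energy(text: str, n: int = 1) -> float:
    return sum(p * p for p in ngram_probs(text, n))

sample = "she walks in beauty like the night"
print(entropy(sample, 1), informational_energy(sample, 2))
```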


Electronics ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 1777
Author(s):  
Muhammad Ali ◽  
Stavros Shiaeles ◽  
Gueltoum Bendiab ◽  
Bogdan Ghita

Detection and mitigation of modern malware are critical for the normal operation of an organisation. Traditional defence mechanisms are becoming increasingly ineffective due to the techniques used by attackers, such as code obfuscation, metamorphism, and polymorphism, which strengthen the resilience of malware. In this context, the development of adaptive, more effective malware detection methods has been identified as an urgent requirement for protecting the IT infrastructure against such threats and for ensuring security. In this paper, we investigate an alternative method for malware detection based on N-grams and machine learning. We use a dynamic analysis technique to extract an Indicator of Compromise (IOC) for malicious files, which is represented using N-grams. The paper also proposes TF-IDF as a novel alternative for identifying the most significant N-gram features for training a machine learning algorithm. Finally, the paper evaluates the proposed technique using various supervised machine learning algorithms. The results show that Logistic Regression, with a score of 98.4%, provides the best classification accuracy compared to the other classifiers used.
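
A minimal sketch of the TF-IDF-over-N-grams plus logistic regression stage, assuming scikit-learn; the API-call strings standing in for IOCs, and their labels, are invented rather than drawn from the paper's dataset.

```python
# Sketch: TF-IDF over word n-grams from dynamic-analysis IOC strings,
# classified with logistic regression. All IOC strings below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

iocs = ["CreateRemoteThread WriteProcessMemory VirtualAllocEx",
        "RegOpenKey ReadFile CloseHandle",
        "VirtualAllocEx WriteProcessMemory SetThreadContext",
        "GetMessage DispatchMessage CloseHandle"]
labels = [1, 0, 1, 0]  # 1 = malicious, 0 = benign

vec = TfidfVectorizer(ngram_range=(1, 3))  # unigrams up to trigrams
X = vec.fit_transform(iocs)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["WriteProcessMemory VirtualAllocEx"])))
```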


2021 ◽  
Vol 15 (3) ◽  
pp. 265-290
Author(s):  
Saleh Abdulaziz Habtor ◽  
Ahmed Haidarah Hasan Dahah

The spread of ransomware has risen exponentially over the past decade, causing huge financial damage to many organizations. Various anti-ransomware firms have suggested methods for preventing malware threats, but the growing pace, scale, and sophistication of malware present the anti-malware industry with ever greater challenges. Recent literature indicates that academics and anti-virus organizations have begun to apply machine learning and fundamental modeling techniques to the analysis and identification of malware, since conventional signature-based anti-virus programs struggle to identify unfamiliar malware and to track new forms of malware. In this study, a machine-learning-based malware evaluation framework was adopted that consists of several modules: dataset compilation in two separate classes (malicious and benign software), file disassembly, data processing, decision making, and updated malware identification. The data processing module uses grey-scale images, import functions, and opcode n-grams to extract malware features. The decision-making module detects malware and recognizes suspected malware. Different classifiers were considered in the research methodology for the detection and classification of malware, and the framework's effectiveness was validated on the basis of the accuracy of the complete process.
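
Two of the feature extractors mentioned above, opcode n-grams and a grey-scale image rendering of raw bytes, can be sketched as follows; the opcode stream and byte blob are placeholders, and the real framework's disassembly output and image dimensions will differ.

```python
# Sketch: opcode n-grams and a grey-scale image from raw bytes, two of the
# feature types named in the abstract. The opcode stream is a placeholder.
from collections import Counter
import numpy as np

opcodes = ["push", "mov", "call", "xor", "mov", "call", "ret"]

def opcode_ngrams(ops, n=2):
    """Count sliding n-gram windows over a disassembled opcode stream."""
    return Counter(tuple(ops[i:i + n]) for i in range(len(ops) - n + 1))

print(opcode_ngrams(opcodes).most_common(3))

def bytes_to_grey_image(blob: bytes, width: int = 16) -> np.ndarray:
    """Pad and reshape raw bytes into a 2-D grey-scale array."""
    data = np.frombuffer(blob, dtype=np.uint8)
    pad = (-len(data)) % width
    data = np.pad(data, (0, pad))
    return data.reshape(-1, width)

print(bytes_to_grey_image(b"MZ\x90\x00" * 12).shape)
```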


2019 ◽  
Vol 9 (18) ◽  
pp. 3723
Author(s):  
Sharif ◽  
Mumtaz ◽  
Shafiq ◽  
Riaz ◽  
Ali ◽  
...  

The rise of social media has led to an escalating online cyber-war via hateful and violent comments, speeches, and even slick videos that promote extremism and radicalization. Sensing cyber-extreme content on microblogging sites, specifically Twitter, is a challenging and evolving research area because of the short, noisy, context-dependent, and dynamic nature of the content. Related tweets were crawled using query words and then carefully labelled into two classes: Extreme (with two sub-classes: pro-Afghanistan government and pro-Taliban) and Neutral. An Exploratory Data Analysis (EDA) using Principal Component Analysis (PCA) was performed on the tweet data (with Term Frequency-Inverse Document Frequency (TF-IDF) features) to reduce the high-dimensional data space to a low-dimensional (usually 2-D or 3-D) space. PCA-based visualization showed good cluster separation between the two classes (extreme and neutral), whereas separation within the sub-classes of the extreme class was not clear. The paper also discusses the pros and cons of applying PCA as an EDA technique to textual data, which is usually represented by a high-dimensional feature set. Furthermore, classification algorithms such as naïve Bayes, K Nearest Neighbors (KNN), random forest, Support Vector Machine (SVM), and ensemble classification methods (with bagging and boosting) were applied both with PCA-reduced features and with the complete feature set (TF-IDF features extracted from n-gram terms in the tweets). The analysis showed that SVM achieved an average accuracy of 84%, outperforming the other classification models. To the best of our knowledge, this is the first reported research work on Twitter content analysis in the context of the Afghanistan war zone using machine learning methods.
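
A minimal sketch of the PCA-based EDA and an SVM baseline on the full feature set, assuming scikit-learn; the tweets and labels are invented stand-ins, and PCA requires a dense matrix, hence the toarray() call.

```python
# Sketch: PCA on TF-IDF features for 2-D exploratory visualisation, then an
# SVM on the complete feature set. Tweets and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC

tweets = ["support the government forces", "praise for taliban fighters",
          "weather in kabul today", "cricket match results"]
labels = ["extreme", "extreme", "neutral", "neutral"]

X = TfidfVectorizer().fit_transform(tweets).toarray()  # dense for PCA
coords = PCA(n_components=2, random_state=0).fit_transform(X)
print(coords)  # plot these points to inspect class separation

clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```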


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 6558-6558
Author(s):  
Fernando Jose Suarez Saiz ◽  
Corey Sanders ◽  
Rick J Stevens ◽  
Robert Nielsen ◽  
Michael W Britt ◽  
...  

6558 Background: Finding high-quality science to support decisions for individual patients is challenging. Common approaches to assessing clinical literature quality and relevance rely on bibliometrics or expert knowledge. We describe a method to automatically identify clinically relevant, high-quality scientific citations using abstract content. Methods: We used machine learning trained on text from PubMed papers cited in 3 expert resources: NCCN, NCI-PDQ, and Hemonc.org. Balanced training data included text cited in at least two of these sources, forming an "on-topic" set (i.e., relevant and high quality), and an "off-topic" set not cited in any of the 3 sources and published in lower-ranked journals according to a citation-based score. Articles were part of an Oncology Clinical Trial corpus generated using a standard PubMed query. We used a gradient boosted-tree approach with binary logistic supervised learning classification. Briefly, 988 texts were processed to produce a term frequency-inverse document frequency (tf-idf) n-gram representation of both the training and the test set (70/30 split). Ideal parameters were determined using 1000-fold cross validation. Results: Our model classified papers in the test set with 0.93 accuracy (95% CI (0.09:0.96), p ≤ 0.0001), with sensitivity 0.95 and specificity 0.91. Some false positives contained language considered clinically relevant that may have been missed or not yet included in the expert resources. False negatives revealed a potential bias towards chemotherapy-focused research over radiation therapy or surgical approaches. Conclusions: Machine learning can be used to automatically identify relevant clinical publications from bibliographic databases, without relying on expert curation or bibliometric methods. The use of machine learning to identify relevant publications may reduce the time clinicians spend finding pertinent evidence for a patient. This approach is generalizable to cases where a corpus of high-quality publications exists to serve as a training set, or where document metadata is unreliable, as is the case for "grey" literature within oncology and beyond in other diseases. Future work will extend this approach and may integrate it into oncology clinical decision-support tools.
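
A rough sketch of the described pipeline, assuming scikit-learn's GradientBoostingClassifier as a stand-in for the paper's gradient boosted-tree implementation; the abstract texts, labels, and split are invented placeholders.

```python
# Sketch: tf-idf n-gram features with a gradient-boosted tree classifier and
# a 70/30 split, mirroring the pipeline above. All data here is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

abstracts = ["phase III trial of chemotherapy shows survival benefit",
             "case report of a rare dermatologic reaction",
             "randomized study of targeted therapy in lung cancer",
             "editorial on hospital administration costs",
             "meta-analysis of immunotherapy response rates",
             "letter regarding conference attendance"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = "on topic", 0 = "off topic"

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(abstracts).toarray()
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```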


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 1539-1539
Author(s):  
Shailendra Lakhanpal ◽  
Kailee Hawkins ◽  
Steven G. Dunder ◽  
Karri Donahue ◽  
Madeline Richey ◽  
...  

1539 Background: Clinical trial eligibility increasingly requires information found in NGS tests; the lack of structured NGS results hinders the automation of trial matching for this criterion, which may deter certain sites from opening biomarker-driven trials. We developed a machine learning tool that infers the presence of NGS results in the EHR, facilitating clinical trial matching. Methods: The Flatiron Health EHR-derived database contains patient-level pathology and genetic counseling reports from community oncology practices. An internal team of clinical experts reviewed a random sample of patients across this network to generate labels of whether each patient had been NGS tested. A supervised ML model was trained by scanning documents in the EHR and extracting n-gram features from text snippets surrounding relevant keywords (e.g., 'Lung biomarker', 'Biomarker negative'). Through k-fold cross-validation and L2 regularization, we found that a logistic regression was able to classify patients' NGS testing status. The model's offline performance on a 20% hold-out test set was measured with standard classification metrics: sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). In an online setting, we integrated the tool into Flatiron's clinical trial matching software OncoTrials by including in each patient's profile an indicator of "likely NGS tested" or "unlikely NGS tested" based on the classifier's prediction. For patients inferred as tested, the model linked users to a test report view in the EHR. In this online setting, we measured the sensitivity and specificity of the model after user review in two community oncology practices. Results: The NGS testing status inference model was characterized using a test sample of 15,175 patients. Model sensitivity and specificity (95% CI) were 91.3% (90.2, 92.3) and 96.2% (95.8, 96.5), respectively; PPV was 84.5% (83.2, 85.8) and NPV was 98.0% (97.7, 98.2). In the validation sample (N = 200, originating from 2 distinct care sites), users identified NGS testing status with a sensitivity of 95.2% (88.3%, 98.7%). Conclusions: This machine learning model facilitates screening for potential patient enrollment in biomarker-driven trials by automatically surfacing patients with NGS test results, at high sensitivity and specificity, into a trial matching application. This tool could mitigate a key barrier to participation in biomarker-driven trials for community clinics.
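
A minimal sketch of the classifier stage, assuming scikit-learn: n-gram counts from keyword-adjacent snippets, an L2-regularised logistic regression, and k-fold cross-validation; the snippets and labels are invented, not Flatiron data.

```python
# Sketch: n-gram features from text snippets around NGS-related keywords,
# L2-regularised logistic regression, k-fold CV. Snippets are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

snippets = ["lung biomarker panel ordered results pending",
            "biomarker negative no further testing",
            "comprehensive genomic profiling detected EGFR mutation",
            "no genetic testing documented in chart"]
labels = [1, 0, 1, 0]  # 1 = likely NGS tested, 0 = unlikely

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(snippets)
clf = LogisticRegression(penalty="l2", C=1.0)
print(cross_val_score(clf, X, labels, cv=2))  # k-fold cross-validation
```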

