Malware Classification Using Static Disassembly and Machine Learning

10.36227/techrxiv.17259806 ◽

2021 ◽

Author(s):

Zhenshuo Chen ◽

Eoin Brophy ◽

Tomas Ward

Keyword(s):

Machine Learning ◽

Static Analysis ◽

Learning Algorithm ◽

Small Scale ◽

The Novel ◽

Machine Code ◽

Code Coverage ◽

Malware Classification ◽

Training Samples ◽

Critical Issues

<div>Network and system security are incredibly critical issues now. Due to the rapid proliferation of malware, traditional analysis methods struggle with enormous samples.</div><div>In this paper, we propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content complexity, and import libraries, to classify malware families, and use automatic machine learning to search for the best model and hyper-parameters for each feature and their combinations. Compared with detailed behavior-related features like API sequences, proposed features provide macroscopic information about malware. The analysis is based on static disassembly scripts and hexadecimal machine code. Unlike dynamic behavior analysis, static analysis is resource-efficient and offers complete code coverage, but is vulnerable to code obfuscation and encryption.<br></div><div>The results demonstrate that features which work well in dynamic analysis are not necessarily effective when applied to static analysis. For instance, API 4-grams only achieve 57.96% accuracy and involve a relatively high dimensional feature set (5000 dimensions). In contrast, the novel proposed features together with a classical machine learning algorithm (Random Forest) presents very good accuracy at 99.40% and the feature vector is of much smaller dimension (40 dimensions). We demonstrate the effectiveness of this approach through integration in IDA Pro, which also facilitates the collection of new training samples and subsequent model retraining.<br></div>

Download Full-text

Extraction of Sea Ice Cover by Sentinel-1 SAR Based on SVM with Unsupervised Generation of Training Data

10.20944/preprints202005.0336.v1 ◽

2020 ◽

Author(s):

Xiaoming Li ◽

Yan Sun ◽

Qiang Zhang

Keyword(s):

Machine Learning ◽

Sea Ice ◽

Learning Algorithm ◽

Texture Features ◽

Open Water ◽

Ice Cover ◽

Training Data ◽

Support Vector ◽

Training Samples

In this paper, we focus on developing a novel method to extract sea ice cover (i.e., discrimination/classification of sea ice and open water) using Sentinel-1 (S1) cross-polarization (vertical-horizontal, VH or horizontal-vertical, HV) data in extra wide (EW) swath mode based on the machine learning algorithm support vector machine (SVM). The classification basis includes the S1 radar backscatter coefficients and texture features that are calculated from S1 data using the gray level co-occurrence matrix (GLCM). Different from previous methods where appropriate samples are manually selected to train the SVM to classify sea ice and open water, we proposed a method of unsupervised generation of the training samples based on two GLCM texture features, i.e. entropy and homogeneity, that have contrasting characteristics on sea ice and open water. We eliminate the most uncertainty of selecting training samples in machine learning and achieve automatic classification of sea ice and open water by using S1 EW data. The comparison shows good agreement between the SAR-derived sea ice cover using the proposed method and a visual inspection, of which the accuracy reaches approximately 90% - 95% based on a few cases. Besides this, compared with the analyzed sea ice cover data Ice Mapping System (IMS) based on 728 S1 EW images, the accuracy of extracted sea ice cover by using S1 data is more than 80%.

Download Full-text

EDR signatures observed by MMS : a statistical study of dayside events found with machine learning

10.5194/egusphere-egu21-2381 ◽

2021 ◽

Author(s):

Quentin Lenouvel ◽

Vincent Génot ◽

Philippe Garnier ◽

Benoit Lavraud ◽

Sergio Toledo

Keyword(s):

Machine Learning ◽

Statistical Study ◽

Learning Algorithm ◽

Small Scale ◽

Diffusion Region ◽

Machine Learning Algorithm ◽

Physical Processes ◽

The Core ◽

Machine Learning Methods ◽

Guide Field

<p>The understanding of magnetic reconnection's physical processes has considerably been improved thanks to the data of the Magnetopsheric Multiscale mission (MMS). However, a lot of work still has to be done to better characterize the core of the reconnection process : the electron diffusion region (EDR). We previously developed a machine learning algorithm to automatically detect EDR candidates, in order to increase the available list of events identified in the literature. However, identifying the parameters that are the most relevant to describe EDRs is complex, all the more that some of the small scale plasma/fields parameters show limitations in some configurations such as for low particle densities or large guide fields cases. In this study, we perform a statistical study of previously reported dayside EDRs as well as newly reported EDR candidates found using machine learning methods. We also show different single and multi-spacecraft parameters that can be used to better identify dayside EDRs in time series from MMS data recorded at the magnetopause. And finally we show an analysis of the link between the guide field and the strength of the energy conversion around each EDR.</p>

Download Full-text

A novel machine learning algorithm has the potential to reduce by 1/3 the quantity of ILR episodes needing review

European Heart Journal ◽

10.1093/eurheartj/ehab724.0316 ◽

2021 ◽

Vol 42 (Supplement_1) ◽

Author(s):

A Rosier ◽

E Crespin ◽

A Lazarus ◽

G Laurent ◽

A Menet ◽

...

Keyword(s):

Machine Learning ◽

False Positive ◽

Learning Algorithm ◽

False Positive Rate ◽

Machine Learning Algorithm ◽

The Novel ◽

Funding Sources ◽

High Workload ◽

Positive Rate ◽

The Impact

Abstract Background Implantable Loop Recorders (ILRs) are increasingly used and generate a high workload for timely adjudication of ECG recordings. In particular, the excessive false positive rate leads to a significant review burden. Purpose A novel machine learning algorithm was developed to reclassify ILR episodes in order to decrease by 80% the False Positive rate while maintaining 99% sensitivity. This study aims to evaluate the impact of this algorithm to reduce the number of abnormal episodes reported in Medtronic ILRs. Methods Among 20 European centers, all Medtronic ILR patients were enrolled during the 2nd semester of 2020. Using a remote monitoring platform, every ILR transmitted episode was collected and anonymised. For every ILR detected episode with a transmitted ECG, the new algorithm reclassified it applying the same labels as the ILR (asystole, brady, AT/AF, VT, artifact, normal). We measured the number of episodes identified as false positive and reclassified as normal by the algorithm, and their proportion among all episodes. Results In 370 patients, ILRs recorded 3755 episodes including 305 patient-triggered and 629 with no ECG transmitted. 2821 episodes were analyzed by the novel algorithm, which reclassified 1227 episodes as normal rhythm. These reclassified episodes accounted for 43% of analyzed episodes and 32.6% of all episodes recorded. Conclusion A novel machine learning algorithm significantly reduces the quantity of episodes flagged as abnormal and typically reviewed by healthcare professionals. FUNDunding Acknowledgement Type of funding sources: None. Figure 1. ILR episodes analysis

Download Full-text

A Hypercuboid-Based Machine Learning Algorithm for Malware Classification

10.1109/rivf51545.2021.9642093 ◽

2021 ◽

Author(s):

Thi Thu Trang Nguyen ◽

Dai Tho Nguyen ◽

Duy Loi Vu

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Malware Classification

Download Full-text

Is Machine Learning a Better Way to Identify COVID-19 Patients Who Might Benefit from Hydroxychloroquine Treatment?—The IDENTIFY Trial

Journal of Clinical Medicine ◽

10.3390/jcm9123834 ◽

2020 ◽

Vol 9 (12) ◽

pp. 3834

Author(s):

Hoyt Burdick ◽

Carson Lam ◽

Samson Mataraso ◽

Anna Siefkas ◽

Gregory Braden ◽

...

Keyword(s):

Machine Learning ◽

Clinical Trial ◽

Learning Algorithm ◽

Pragmatic Trial ◽

Therapeutic Agents ◽

Study Endpoint ◽

Machine Learning Algorithm ◽

The Novel ◽

Hazard Ratios ◽

Novel Coronavirus

Therapeutic agents for the novel coronavirus disease 2019 (COVID-19) have been proposed, but evidence supporting their use is limited. A machine learning algorithm was developed in order to identify a subpopulation of COVID-19 patients for whom hydroxychloroquine was associated with improved survival; this population might be relevant for study in a clinical trial. A pragmatic trial was conducted at six United States hospitals. We enrolled COVID-19 patients that were admitted between 10 March and 4 June 2020. Treatment was not randomized. The study endpoint was mortality; discharge was a competing event. Hazard ratios were obtained on the entire population, and on the subpopulation indicated by the algorithm as suitable for treatment. A total of 290 patients were enrolled. In the subpopulation that was identified by the algorithm, hydroxychloroquine was associated with a statistically significant (p = 0.011) increase in survival (adjusted hazard ratio 0.29, 95% confidence interval (CI) 0.11–0.75). Adjusted survival among the algorithm indicated patients was 82.6% in the treated arm and 51.2% in the arm not treated. No association between treatment and mortality was observed in the general population. A 31% increase in survival at the end of the study was observed in a population of COVID-19 patients that were identified by a machine learning algorithm as having a better outcome with hydroxychloroquine treatment. Precision medicine approaches may be useful in identifying a subpopulation of COVID-19 patients more likely to be proven to benefit from hydroxychloroquine treatment in a clinical trial.

Download Full-text

Inside the Black Box: A Physical Basis for the Effectiveness of Deep Generative Models of Amorphous Materials

10.26434/chemrxiv.14696595 ◽

2021 ◽

Author(s):

Michael Kilgour ◽

Lena Simine

Keyword(s):

Machine Learning ◽

Amorphous Materials ◽

Physical Simulation ◽

Generative Models ◽

Black Box ◽

Small Scale ◽

Learning Approach ◽

Computational Costs ◽

Training Samples ◽

Machine Learning Approach

<p>We have recently demonstrated an effective protocol for the simulation of amorphous molecular configurations using the PixelCNN generative model (J. Phys. Chem. Lett. 2020, 11, 20, 8532). The morphological sampling of amorphous materials via such an autoregressive generation protocol sidesteps the high computational costs associated with simulating amorphous materials at scale, enabling practically unlimited structural sampling based on only small-scale experimental or computational training samples. An important question raised but not rigorously addressed in that report was whether this machine learning approach could be considered a physical simulation in the conventional sense. Here we answer this question by detailing the inner workings of the underlying algorithm that we refer to as the Morphological Autoregression Protocol or MAP. <br></p>

Download Full-text

Energy Audit in Households using Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b4130.079220 ◽

2020 ◽

Vol 9 (2) ◽

pp. 1153-1160

Keyword(s):

Machine Learning ◽

Supply Chain ◽

Transmission Lines ◽

Power Transmission ◽

Learning Algorithm ◽

Electricity Consumption ◽

Machine Learning Algorithms ◽

Small Scale ◽

Usage Pattern ◽

To Come

Maintaining the energy usage with minimal power loss throughout the supply chain is of the major issues faced in many small-scale sectors or even in households of today’s world. Even though Power transmission can play a cardinal role in the supply chain, monitoring the transmission lines for energy leakage or any faulty connections is critically important. There have been several measures taken to come up with a better solution but, the problem of finding a consistent method for monitoring the power leakage is still at peril. There are actually many ways of saving the energy by mitigating the usage and preventing the loss of energy due to over usage and wastages, for this a thorough monitoring and study of the usage should be done. If the electricity usage pattern of the concerned is identified, then it will be facile to come up with a solution for the problem at hand. The electricity wastage constituted by all the countries aggregated is found out to be around 8.25%, which is considerately large given that many places around the world does not even have access to electricity. So, there is a need to find a better solution for this problem. After conducting a thorough study on the electricity usage pattern of several households we are proposing a method which is an ensemble of machine learning algorithms, Internet of Things, sensors, Embedded systems. Using an IoT device we’ve designed we monitoring and collecting electricity usage in households in a time based manner. These collected data is stored in the database and is processed and fed into machine learning algorithm to predict the upcoming month’s electricity usage. This predicted data is then fed into another algorithm to provide recommendations to the user to reduce the electricity consumption according to their usage interests. Thus reducing the cost significantly.

Download Full-text

A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v8i3.pp1854-1862 ◽

2018 ◽

Vol 8 (3) ◽

pp. 1854 ◽

Cited By ~ 1

Author(s):

Md. Armanur Rahman ◽

J. Hossen ◽

Venkataseshaiah C ◽

CK Ho ◽

Tan Kim Geok ◽

...

Keyword(s):

Machine Learning ◽

Big Data ◽

Learning Algorithm ◽

Machine Learning Techniques ◽

Apache Hadoop ◽

Deep Learning Algorithm ◽

Learning Techniques ◽

Critical Issues ◽

Self Tuning ◽

Hadoop System

The Apache Hadoop framework is an open source implementation of MapReduce for processing and storing big data. However, to get the best performance from this is a big challenge because of its large number configuration parameters. In this paper, the concept of critical issues of Hadoop system, big data and machine learning have been highlighted and an analysis of some machine learning techniques applied so far, for improving the Hadoop performance is presented. Then, a promising machine learning technique using deep learning algorithm is proposed for Hadoop system performance improvement.

Download Full-text

Light GBM Machine Learning Algorithm to Online Click Fraud Detection

Journal of Information Assurance & Cybersecurity ◽

10.5171/2019.263928 ◽

2019 ◽

pp. 1-12

Author(s):

Elena-Adriana MINASTIREANU ◽

Gabriela MESNITA

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Online Advertising ◽

Fraud Detection ◽

Online Marketing ◽

Machine Learning Algorithm ◽

Advertising Industry ◽

Click Fraud ◽

Web Advertising ◽

Critical Issues

In the current web advertising activities, the fraud increases the number of risks for online marketing, advertising industry and e-business. The click fraud is considered one of the most critical issues in online advertising. Even if the online advertisers make permanent efforts to improve the traffic filtering techniques, they are still looking for the best protection methods to detect click frauds.

Download Full-text