scholarly journals MIGAN: Malware Image Synthesis Using GANs

Author(s):  
Abhishek Singh ◽  
Debojyoti Dutta ◽  
Amit Saha

Majority of the advancement in Deep learning (DL) has occurred in domains such as computer vision, and natural language processing, where abundant training data is available. A major obstacle in leveraging DL techniques for malware analysis is the lack of sufficiently big, labeled datasets. In this paper, we take the first steps towards building a model which can synthesize labeled dataset of malware images using GAN. Such a model can be utilized to perform data augmentation for training a classifier. Furthermore, the model can be shared publicly for community to reap benefits of dataset without sharing the original dataset. First, we show the underlying idiosyncrasies of malware images and why existing data augmentation techniques as well as traditional GAN training fail to produce quality artificial samples. Next, we propose a new method for training GAN where we explicitly embed prior domain knowledge about the dataset into the training procedure. We show improvements in training stability and sample quality assessed on different metrics. Our experiments show substantial improvement on baselines and promise for using such a generative model for malware visualization systems.

2019 ◽  
Vol 28 (1) ◽  
pp. 3-12
Author(s):  
Jarosław Kurek ◽  
Joanna Aleksiejuk-Gawron ◽  
Izabella Antoniuk ◽  
Jarosław Górski ◽  
Albina Jegorowa ◽  
...  

This paper presents an improved method for recognizing the drill state on the basis of hole images drilled in a laminated chipboard, using convolutional neural network (CNN) and data augmentation techniques. Three classes were used to describe the drill state: red -- for drill that is worn out and should be replaced, yellow -- for state in which the system should send a warning to the operator, indicating that this element should be checked manually, and green -- denoting the drill that is still in good condition, which allows for further use in the production process. The presented method combines the advantages of transfer learning and data augmentation methods to improve the accuracy of the received evaluations. In contrast to the classical deep learning methods, transfer learning requires much smaller training data sets to achieve acceptable results. At the same time, data augmentation customized for drill wear recognition makes it possible to expand the original dataset and to improve the overall accuracy. The experiments performed have confirmed the suitability of the presented approach to accurate class recognition in the given problem, even while using a small original dataset.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Huu-Thanh Duong ◽  
Tram-Anh Nguyen-Thi

AbstractIn literature, the machine learning-based studies of sentiment analysis are usually supervised learning which must have pre-labeled datasets to be large enough in certain domains. Obviously, this task is tedious, expensive and time-consuming to build, and hard to handle unseen data. This paper has approached semi-supervised learning for Vietnamese sentiment analysis which has limited datasets. We have summarized many preprocessing techniques which were performed to clean and normalize data, negation handling, intensification handling to improve the performances. Moreover, data augmentation techniques, which generate new data from the original data to enrich training data without user intervention, have also been presented. In experiments, we have performed various aspects and obtained competitive results which may motivate the next propositions.


2022 ◽  
Vol 18 (1) ◽  
pp. 1-24
Author(s):  
Yi Zhang ◽  
Yue Zheng ◽  
Guidong Zhang ◽  
Kun Qian ◽  
Chen Qian ◽  
...  

Gait, the walking manner of a person, has been perceived as a physical and behavioral trait for human identification. Compared with cameras and wearable sensors, Wi-Fi-based gait recognition is more attractive because Wi-Fi infrastructure is almost available everywhere and is able to sense passively without the requirement of on-body devices. However, existing Wi-Fi sensing approaches impose strong assumptions of fixed user walking trajectories, sufficient training data, and identification of already known users. In this article, we present GaitSense , a Wi-Fi-based human identification system, to overcome the above unrealistic assumptions. To deal with various walking trajectories and speeds, GaitSense first extracts target specific features that best characterize gait patterns and applies novel normalization algorithms to eliminate gait irrelevant perturbation in signals. On this basis, GaitSense reduces the training efforts in new deployment scenarios by transfer learning and data augmentation techniques. GaitSense also enables a distinct feature of illegal user identification by anomaly detection, making the system readily available for real-world deployment. Our implementation and evaluation with commodity Wi-Fi devices demonstrate a consistent identification accuracy across various deployment scenarios with little training samples, pushing the limit of gait recognition with Wi-Fi signals.


Author(s):  
Peilian Zhao ◽  
Cunli Mao ◽  
Zhengtao Yu

Aspect-Based Sentiment Analysis (ABSA), a fine-grained task of opinion mining, which aims to extract sentiment of specific target from text, is an important task in many real-world applications, especially in the legal field. Therefore, in this paper, we study the problem of limitation of labeled training data required and ignorance of in-domain knowledge representation for End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) in legal field. We proposed a new method under deep learning framework, named Semi-ETEKGs, which applied E2E framework using knowledge graph (KG) embedding in legal field after data augmentation (DA). Specifically, we pre-trained the BERT embedding and in-domain KG embedding for unlabeled data and labeled data with case elements after DA, and then we put two embeddings into the E2E framework to classify the polarity of target-entity. Finally, we built a case-related dataset based on a popular benchmark for ABSA to prove the efficiency of Semi-ETEKGs, and experiments on case-related dataset from microblog comments show that our proposed model outperforms the other compared methods significantly.


GEOMATICA ◽  
2020 ◽  
Author(s):  
Qinjun Qiu ◽  
Zhong Xie ◽  
Liang Wu

Unlike English and other western languages, Chinese does not delimit words using white-spaces. Chinese Word Segmentation (CWS) is the crucial first step towards natural language processing. However, for the geoscience subject domain, the CWS problem remains unresolved with many challenges. Although traditional methods can be used to process geoscience documents, they lack the domain knowledge for massive geoscience documents. Considering the above challenges, this motivated us to build a segmenter specifically for the geoscience domain. Currently, most of the state-of-the-art methods for Chinese word segmentation are based on supervised learning, whose features are mostly extracted from a local context. In this paper, we proposed a framework for sequence learning by incorporating cyclic self-learning corpus training. Following this framework, we build the GeoSegmenter based on the Bi-directional Long Short-Term Memory (Bi-LSTM) network model to perform Chinese word segmentation. It can gain a great advantage through iterations of the training data. Empirical experimental results on geoscience documents and benchmark datasets showed that geological documents can be identified, and it can also recognize the generic documents.


Author(s):  
Juntao Li ◽  
Lisong Qiu ◽  
Bo Tang ◽  
Dongmin Chen ◽  
Dongyan Zhao ◽  
...  

Recent successes of open-domain dialogue generation mainly rely on the advances of deep neural networks. The effectiveness of deep neural network models depends on the amount of training data. As it is laboursome and expensive to acquire a huge amount of data in most scenarios, how to effectively utilize existing data is the crux of this issue. In this paper, we use data augmentation techniques to improve the performance of neural dialogue models on the condition of insufficient data. Specifically, we propose a novel generative model to augment existing data, where the conditional variational autoencoder (CVAE) is employed as the generator to output more training data with diversified expressions. To improve the correlation of each augmented training pair, we design a discriminator with adversarial training to supervise the augmentation process. Moreover, we thoroughly investigate various data augmentation schemes for neural dialogue system with generative models, both GAN and CVAE. Experimental results on two open corpora, Weibo and Twitter, demonstrate the superiority of our proposed data augmentation model.


2019 ◽  
Vol 5 (1) ◽  
pp. 239-244
Author(s):  
Jingrui Yu ◽  
Roman Seidel ◽  
Gangolf Hirtz

AbstractWe propose a one-step person detector for topview omnidirectional indoor scenes based on convolutional neural networks (CNNs). While state of the art person detectors reach competitive results on perspective images, missing CNN architectures as well as training data that follows the distortion of omnidirectional images makes current approaches not applicable to our data. The method predicts bounding boxes of multiple persons directly in omnidirectional images without perspective transformation, which reduces overhead of pre- and post-processing and enables realtime performance. The basic idea is to utilize transfer learning to fine-tune CNNs trained on perspective images with data augmentation techniques for detection in omnidirectional images. We fine-tune two variants of Single Shot MultiBox detectors (SSDs). The first one uses Mobilenet v1 FPN as feature extractor (moSSD). The second one uses ResNet50 v1 FPN (resSSD). Both models are pre-trained on Microsoft Common Objects in Context (COCO) dataset. We fine-tune both models on PASCAL VOC07 and VOC12 datasets, specifically on class person. Random 90-degree rotation and random vertical flipping are used for data augmentation in addition to the methods proposed by original SSD. We reach an average precision (AP) of 67.3%with moSSD and 74.9%with resSSD on the evaluation dataset. To enhance the fine-tuning process, we add a subset of HDA Person dataset and a subset of PIROPO database and reduce the number of perspective images to PASCAL VOC07. The AP rises to 83.2% for moSSD and 86.3% for resSSD, respectively. The average inference speed is 28 ms per image for moSSD and 38 ms per image for resSSD using Nvidia Quadro P6000. Our method is applicable to other CNN-based object detectors and can potentially generalize for detecting other objects in omnidirectional images.


2020 ◽  
Vol 10 (7) ◽  
pp. 1494-1505
Author(s):  
Hyo-Hun Kim ◽  
Byung-Woo Hong

In this work, we present an image segmentation algorithm based on the convolutional neural network framework where the scale space theory is incorporated in the course of training procedure. The construction of data augmentation is designed to apply the scale space to the training data in order to effectively deal with the variability of regions of interest in geometry and appearance such as shape and contrast. The proposed data augmentation algorithm via scale space is aimed to improve invariant features with respect to both geometry and appearance by taking into consideration of their diffusion process. We develop a segmentation algorithm based on the convolutional neural network framework where the network architecture consists of encoding and decoding substructures in combination with the data augmentation scheme via the scale space induced by the heat equation. The quantitative analysis using the cardiac MRI dataset indicates that the proposed algorithm achieves better accuracy in the delineation of the left ventricles, which demonstrates the potential of the algorithm in the application of the whole heart segmentation as a compute-aided diagnosis system for the cardiac diseases.


2021 ◽  
Vol 11 (9) ◽  
pp. 842
Author(s):  
Shruti Atul Mali ◽  
Abdalla Ibrahim ◽  
Henry C. Woodruff ◽  
Vincent Andrearczyk ◽  
Henning Müller ◽  
...  

Radiomics converts medical images into mineable data via a high-throughput extraction of quantitative features used for clinical decision support. However, these radiomic features are susceptible to variation across scanners, acquisition protocols, and reconstruction settings. Various investigations have assessed the reproducibility and validation of radiomic features across these discrepancies. In this narrative review, we combine systematic keyword searches with prior domain knowledge to discuss various harmonization solutions to make the radiomic features more reproducible across various scanners and protocol settings. Different harmonization solutions are discussed and divided into two main categories: image domain and feature domain. The image domain category comprises methods such as the standardization of image acquisition, post-processing of raw sensor-level image data, data augmentation techniques, and style transfer. The feature domain category consists of methods such as the identification of reproducible features and normalization techniques such as statistical normalization, intensity harmonization, ComBat and its derivatives, and normalization using deep learning. We also reflect upon the importance of deep learning solutions for addressing variability across multi-centric radiomic studies especially using generative adversarial networks (GANs), neural style transfer (NST) techniques, or a combination of both. We cover a broader range of methods especially GANs and NST methods in more detail than previous reviews.


Author(s):  
NGOC TAN LE ◽  
Fatiha Sadat

With the emergence of the neural networks-based approaches, research on information extraction has benefited from large-scale raw texts by leveraging them using pre-trained embeddings and other data augmentation techniques to deal with challenges and issues in Natural Language Processing tasks. In this paper, we propose an approach using sequence-to-sequence neural networks-based models to deal with term extraction for low-resource domain. Our empirical experiments, evaluating on the multilingual ACTER dataset provided in the LREC-TermEval 2020 shared task on automatic term extraction, proved the efficiency of deep learning approach, in the case of low-data settings, for the automatic term extraction task.


Sign in / Sign up

Export Citation Format

Share Document