Text Data Augmentation for Deep Learning

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Connor Shorten ◽  
Taghi M. Khoshgoftaar ◽  
Borko Furht

Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
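Token-level operations such as random deletion and random swap are among the simplest augmentation frameworks a survey like this covers. A minimal sketch in the spirit of Easy Data Augmentation (the operation choices and rates are illustrative, not taken from the paper):

```python
import random

def augment(sentence, p_delete=0.1, n_swaps=1, seed=0):
    """Produce one augmented variant of a sentence via random
    word deletion and random word swaps (EDA-style operations)."""
    rng = random.Random(seed)
    words = sentence.split()
    # Random deletion: drop each word with probability p_delete,
    # but never delete everything.
    kept = [w for w in words if rng.random() > p_delete] or words[:1]
    # Random swap: exchange two random positions n_swaps times.
    for _ in range(n_swaps):
        if len(kept) > 1:
            i, j = rng.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
    return " ".join(kept)

print(augment("data augmentation expands a small training corpus"))
```

Each call with a different seed yields a new label-preserving variant, which is the offline-augmentation use case the abstract mentions.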


2020 ◽  
Author(s):  
Olaide N. Oyelade ◽  
Absalom E. Ezugwu

Abstract: The novel coronavirus, also known as COVID-19, is a pandemic that has weighed heavily on the socio-economic affairs of the world. Although research into the production of a vaccine is advancing, there is still a need for a computational solution to aid quick detection of the disease. Different computational solutions comprising natural language processing, knowledge engineering and deep learning have been adopted for this task; however, deep learning solutions have shown interesting performance compared to other methods. This paper therefore aims to advance the application of deep learning techniques to the problem of characterization and detection of the novel coronavirus. The approach adopted in this study proposes a convolutional neural network (CNN) model that is further enhanced using data augmentation, the motive being to investigate whether data augmentation can further improve the performance of deep learning models in the detection of coronavirus. The proposed model is applied to the COVID-19 X-ray dataset used in this study, the National Institutes of Health (NIH) Chest X-Ray dataset obtained from Kaggle, for the purpose of promoting early detection and screening of coronavirus disease. Results showed that our approach achieved 100% accuracy, recall/precision of 0.85, an F-measure of 0.9, and specificity of 1.0. The proposed CNN model and data augmentation solution may be adopted for pre-screening suspected cases of COVID-19 to support the well-known RT-PCR testing.
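The kind of geometric augmentation typically applied to X-ray arrays can be sketched with plain numpy; the flip and small-shift operations below are illustrative assumptions, not the paper's exact augmentation settings:

```python
import numpy as np

def augment_xray(img, shift=2, seed=0):
    """Return simple augmented variants of a 2-D image array:
    a horizontal flip and a small random translation."""
    rng = np.random.default_rng(seed)
    flipped = img[:, ::-1]                     # mirror left-right
    dx = int(rng.integers(-shift, shift + 1))  # random pixel shift
    shifted = np.roll(img, dx, axis=1)
    return [flipped, shifted]

img = np.arange(16, dtype=float).reshape(4, 4)
variants = augment_xray(img)
print([v.shape for v in variants])  # both variants keep the input shape
```

Feeding such variants alongside the originals multiplies the effective training set size without new image acquisition.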


2021 ◽  
Vol 11 (7) ◽  
pp. 3119
Author(s):  
Cristina L. Saratxaga ◽  
Jorge Bote ◽  
Juan F. Ortega-Morán ◽  
Artzai Picón ◽  
Elena Terradillos ◽  
...  

(1) Background: Clinicians demand new tools for early diagnosis and improved detection of colon lesions that are vital for patient prognosis. Optical coherence tomography (OCT) allows microscopical inspection of tissue and might serve as an optical biopsy method that could lead to in-situ diagnosis and treatment decisions; (2) Methods: A database of murine (rat) healthy, hyperplastic and neoplastic colonic samples with more than 94,000 images was acquired. A methodology that includes a data augmentation processing strategy and a deep learning model for automatic classification (benign vs. malignant) of OCT images is presented and validated over this dataset. Comparative evaluation is performed both over individual B-scan images and C-scan volumes; (3) Results: A model was trained and evaluated with the proposed methodology using six different data splits to present statistically significant results. Considering this, 0.9695 (±0.0141) sensitivity and 0.8094 (±0.1524) specificity were obtained when diagnosis was performed over B-scan images. On the other hand, 0.9821 (±0.0197) sensitivity and 0.7865 (±0.205) specificity were achieved when diagnosis was made considering all the images in the whole C-scan volume; (4) Conclusions: The proposed methodology based on deep learning showed great potential for the automatic characterization of colon polyps and future development of the optical biopsy paradigm.
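The B-scan versus C-scan comparison amounts to aggregating per-image predictions into a per-volume decision before computing sensitivity and specificity. A small sketch, assuming a simple majority vote over B-scans (the paper's aggregation rule may differ):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP);
    labels: 1 = malignant, 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def volume_decision(bscan_preds):
    """Aggregate per-B-scan predictions into one C-scan label
    by majority vote."""
    return int(sum(bscan_preds) * 2 >= len(bscan_preds))

# Two toy volumes: a malignant one with 3/4 positive B-scans,
# a benign one with 1/4; dict keys are the true volume labels.
volumes = {1: [1, 1, 0, 1], 0: [0, 0, 1, 0]}
y_true = list(volumes)
y_pred = [volume_decision(p) for p in volumes.values()]
print(sensitivity_specificity(y_true, y_pred))  # (1.0, 1.0)
```

Voting over a whole volume can tolerate some misclassified B-scans, which is consistent with the higher volume-level sensitivity reported above.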


2021 ◽  
Author(s):  
Khloud Al Jallad

Abstract: New attacks are used by attackers every day, but many of them are not detected by intrusion detection systems, as most IDS ignore raw packet information and only consider basic statistical features extracted from PCAP files. Using networking programs to extract fixed statistical features from packets is useful, but may not be enough to meet today's challenges. We think it is time to utilize big data and deep learning for automatic, dynamic feature extraction from packets, and to take inspiration from the pre-trained deep learning models of computer vision and natural language processing, so that security deep learning solutions will have their own models pre-trained on big datasets for use in future research. In this paper, we propose a new approach for embedding packets based on character-level embeddings, inspired by the success of FastText on text data. We call this approach FastPacket. Results are measured on subsets of the CIC-IDS-2017 dataset, but we expect promising results from big-data pre-trained models. We suggest building a pre-trained FastPacket model on the large MAWI dataset and making it available to the community, similar to FastText, in order to outperform currently used NIDS and start a new era of packet-level NIDS that can better detect complex attacks.
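The character-level embedding idea behind FastPacket can be sketched with FastText-style n-gram hashing over a packet's raw bytes; the bucket count, vector dimension, and n-gram length below are illustrative assumptions, not the paper's values:

```python
import zlib

import numpy as np

def packet_embedding(packet: bytes, table: np.ndarray, n: int = 3) -> np.ndarray:
    """Embed a raw packet as the mean of hashed byte n-gram vectors,
    in the spirit of FastText's subword (character n-gram) embeddings."""
    buckets, dim = table.shape
    grams = [packet[i:i + n] for i in range(max(1, len(packet) - n + 1))]
    idx = [zlib.crc32(g) % buckets for g in grams]  # hashing trick: n-gram -> bucket
    return table[idx].mean(axis=0)                  # average the bucket vectors

rng = np.random.default_rng(0)
table = rng.normal(size=(1024, 16))  # 1024 hash buckets, 16-dim vectors
vec = packet_embedding(b"\x45\x00\x00\x54\x08\x00", table)
print(vec.shape)  # (16,)
```

In a trained model the table rows would be learned parameters; hashing keeps the vocabulary of byte n-grams bounded, just as FastText bounds its subword vocabulary.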


A lot of research has gone into natural language processing, and state-of-the-art deep learning algorithms now help convert English text into a data structure without loss of meaning. The advent of neural networks for learning word representations as vectors has also helped revolutionize automatic feature extraction from text corpora. Combining word embeddings with a deep learning algorithm such as a convolutional neural network improves accuracy for text classification. In this era of the Internet of Things, with voluminous amounts of data overwhelming users, determining the veracity of that data is a very challenging task. There are many truth discovery algorithms in the literature that help resolve the conflicts arising from multiple sources of data; these algorithms estimate the trustworthiness of the data and the reliability of the sources. In this paper, a convolution-based truth discovery method with multitasking is proposed to estimate the genuineness of the data in a given text corpus. The proposed algorithm has been tested on the Quora questions dataset, and experimental results showed improved accuracy and speed over other existing approaches.
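The word-embedding-plus-convolution pipeline described here can be sketched in a few lines of numpy; the embedding dimension, filter width, and max-over-time pooling are illustrative choices, not the proposed algorithm itself:

```python
import numpy as np

def conv1d_features(embeddings, filters):
    """Slide each filter over a (seq_len, dim) embedding matrix and
    max-pool over positions, as in a basic text-classification CNN."""
    seq_len, dim = embeddings.shape
    n_filters, width, _ = filters.shape
    feats = np.empty(n_filters)
    for f in range(n_filters):
        acts = [np.sum(embeddings[i:i + width] * filters[f])
                for i in range(seq_len - width + 1)]
        feats[f] = max(acts)  # max-over-time pooling
    return feats

rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 8))         # 7 tokens, 8-dim word embeddings
filters = rng.normal(size=(4, 3, 8))  # 4 learned filters of width 3
print(conv1d_features(emb, filters).shape)  # (4,)
```

The pooled feature vector is what a classifier head would consume; each filter acts as a learned detector for a short word pattern.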


Author(s):  
Zhiwen Xiong

Abstract: Machine learning is a branch of artificial intelligence. Deep learning is a complex machine learning approach that has unique advantages in image recognition, speech recognition, natural language processing, and industrial process control, and it is widely used in the field of wireless communication. Prediction of geological disasters such as landslides is currently a difficult problem. Because landslides are difficult to detect in their early stage, this paper proposes a GPS-based continuous detection system over a wireless communication network and applies it to landslide deformation monitoring to achieve early treatment and prevention. The article introduces a GPS multi-antenna detection system based on deep learning and wireless communication, together with the time series analysis method and its application. Test results show that the GPS multi-antenna detection system of the wireless communication network has great advantages in response time, with high accuracy and small error: horizontal accuracy is within 0–2 mm and vertical accuracy is about 1 mm. The analysis method is simple and efficient, and obtains good results for short-term deformation prediction.
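A short-term deformation forecast from a displacement time series can be sketched with a simple incremental rule; the readings and window size below are illustrative, not the paper's method:

```python
import numpy as np

def forecast_next(series, window=3):
    """Predict the next deformation reading as the last value plus the
    mean increment over a sliding window (a simple time-series rule)."""
    s = np.asarray(series, dtype=float)
    increments = np.diff(s)[-window:]  # most recent step-to-step changes
    return s[-1] + increments.mean()

# Displacement readings in mm from a monitored point (illustrative values).
readings = [0.0, 0.4, 0.9, 1.3, 1.8]
print(round(forecast_next(readings), 2))  # 2.27
```

A monitoring system would compare such a forecast against the next GPS reading and flag accelerating deformation.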


Author(s):  
Yuchen Hou ◽  
Lawrence B. Holder

Abstract: Deep learning has been successful in various domains, including image recognition, speech recognition and natural language processing. However, research on its application in graph mining is still at an early stage. Here we present Model R, a neural network model created to provide a deep learning approach to the link weight prediction problem. This model uses a node embedding technique that extracts node embeddings (knowledge of nodes) from the known links' weights (relations between nodes) and uses this knowledge to predict the unknown links' weights. We demonstrate the power of Model R through experiments and compare it with the stochastic block model and its derivatives. Model R shows that deep learning can be successfully applied to link weight prediction, and it outperforms the stochastic block model and its derivatives by up to 73% in terms of prediction accuracy. We analyze the node embeddings to confirm that closeness in embedding space correlates with stronger relationships as measured by link weight. We anticipate this new approach will provide effective solutions to more graph mining tasks.
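Model R's core idea, learning node vectors from known link weights and then predicting unknown weights from those vectors, can be sketched with a tiny dot-product model trained by SGD (Model R itself is a deeper network; this is a simplified stand-in):

```python
import numpy as np

def fit_embeddings(links, n_nodes, dim=4, lr=0.05, epochs=2000, seed=0):
    """Learn node embeddings so that dot(e[u], e[v]) approximates the
    known weight of each link (u, v, w), via squared-error SGD."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(n_nodes, dim))
    for _ in range(epochs):
        for u, v, w in links:
            err = emb[u] @ emb[v] - w       # prediction error on this link
            gu, gv = err * emb[v], err * emb[u]
            emb[u] -= lr * gu
            emb[v] -= lr * gv
    return emb

links = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0)]  # known weighted links
emb = fit_embeddings(links, n_nodes=3)
print(round(float(emb[0] @ emb[1]), 2))  # close to the target weight 2.0
```

An unknown link's weight is then predicted from the same dot product of its endpoint embeddings, which is also why closeness in embedding space tracks link strength.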


Author(s):  
Vincent Karas ◽  
Björn W. Schuller

Sentiment analysis is an important area of natural language processing that can help inform business decisions by extracting sentiment information from documents. The purpose of this chapter is to introduce the reader to selected concepts and methods of deep learning and show how deep models can be used to increase performance in sentiment analysis. It discusses the latest advances in the field and covers topics including traditional sentiment analysis approaches, the fundamentals of sentence modelling, popular neural network architectures, autoencoders, attention modelling, transformers, data augmentation methods, the benefits of transfer learning, the potential of adversarial networks, and perspectives on explainable AI. The authors' intent is that through this chapter, the reader can gain an understanding of recent developments in this area as well as current trends and potentials for future research.


2020 ◽  
Author(s):  
vinayakumar R

Social media is a platform on which tons of text are generated every day. The data is so large that it cannot be easily understood, which has paved the way for a new field in information technology: natural language processing. In this paper, the text data used for classification consists of tweets, which determine the state of a person according to sentiment: positive, negative or neutral. Emotions are the expression of a person's feelings and have a high influence on decision-making tasks. We propose text representations, Term Frequency-Inverse Document Frequency (TF-IDF) and Keras embeddings, along with machine learning and deep learning algorithms for the classification of sentiment. Among these, the Logistic Regression machine-learning method performs well when a limited number of features is taken; as the number of features increases, the Support Vector Machine (SVM), another machine learning algorithm, performs well, setting a benchmark accuracy for this dataset of 75.8%. The dataset has been made publicly available for research purposes.
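The Logistic Regression baseline named here can be sketched on a toy bag-of-words matrix; the vocabulary and examples are invented for illustration, not taken from the tweet dataset:

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression by batch gradient descent:
    w <- w - lr * X^T (sigmoid(Xw) - y) / n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Toy bag-of-words: columns = ["good", "bad", "love", "hate"]
X = np.array([[1, 0, 1, 0],   # "good ... love"  -> positive
              [0, 1, 0, 1],   # "bad ... hate"   -> negative
              [1, 0, 0, 0],   # "good"           -> positive
              [0, 1, 0, 0]])  # "bad"            -> negative
y = np.array([1.0, 0.0, 1.0, 0.0])
w = train_logreg(X, y)
pred = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)
print(pred.tolist())  # [1, 0, 1, 0] on the training examples
```

Swapping the binary counts for TF-IDF weights gives the feature representation the paragraph describes; the training loop is unchanged.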


2022 ◽  
Author(s):  
Ms. Aayushi Bansal ◽  
Dr. Rewa Sharma ◽  
Dr. Mamta Kathuria

Recent advancements in deep learning architectures have increased their utility in real-life applications. Deep learning models require a large amount of training data, yet in many application domains, such as marketing, computer vision, and medical science, only a limited set of data is available, because collecting new data is either not feasible or requires substantial resources; without enough data, these models suffer from overfitting. One of the data-space solutions to the problem of limited data is data augmentation. This study focuses on various data augmentation techniques that can be used to further improve the accuracy of a neural network. Augmenting available data saves the cost and time required to collect new data for training deep neural networks; it also regularizes the model and improves its capability of generalization. The need for large datasets in different fields such as computer vision, natural language processing, security and healthcare is also covered in this survey paper. The goal of this paper is to provide a comprehensive survey of recent advancements in data augmentation techniques and their application in various domains.

