A Survey of Text Data Augmentation

Author(s):  
Pei Liu ◽  
Xuemin Wang ◽  
Chao Xiang ◽  
Weiye Meng

2021 ◽  
Vol 8 (1) ◽
Author(s):  
Connor Shorten ◽  
Taghi M. Khoshgoftaar ◽  
Borko Furht

Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to name a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
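The offline and online pipelines the survey mentions can be contrasted in a few lines. The sketch below is illustrative, not the survey's code: `synonym_swap` is a toy stand-in for any real transform (synonym replacement, back-translation, etc.), and all function names are assumptions.

```python
import random

def synonym_swap(text, rng):
    # Toy augmentation: swap two adjacent words. Stands in for a real
    # transform such as synonym replacement or back-translation.
    words = text.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def offline_pipeline(dataset, n_copies, seed=0):
    # Offline: augmented copies are generated once, before training,
    # and stored alongside the original samples.
    rng = random.Random(seed)
    return dataset + [synonym_swap(x, rng)
                      for x in dataset for _ in range(n_copies)]

def online_batches(dataset, batch_size, seed=0):
    # Online: each batch is perturbed on the fly, so every epoch sees
    # different variants of the same underlying examples.
    rng = random.Random(seed)
    for i in range(0, len(dataset), batch_size):
        yield [synonym_swap(x, rng) for x in dataset[i:i + batch_size]]
```

The trade-off the survey alludes to: offline augmentation costs storage but keeps the training loop simple, while online augmentation gives effectively unlimited variants at the cost of per-batch compute.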


2021 ◽  
pp. 198-210
Author(s):  
Fei Xia ◽  
Shizhu He ◽  
Kang Liu ◽  
Shengping Liu ◽  
Jun Zhao
2021 ◽  
pp. 97-105
Author(s):  
Dominykas Šeputis

Data augmentation can improve a model's final accuracy by introducing new data samples into the dataset. In this paper, text data augmentation using a translation technique is investigated. Synthetic translations generated by the Opus-MT model are compared to unique foreign data samples in terms of their impact on transformer network-based models' performance. The experimental results showed that multilingual models such as DistilBERT can, in some cases, benefit from the introduction of additional, artificially created data samples presented in a foreign language.
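The scheme described here could be sketched as follows. This is an assumption-laden illustration, not the authors' code: `translate` is a placeholder for a real MT backend (e.g. an Opus-MT checkpoint served through a translation library), and the function names are invented for the example.

```python
def add_synthetic_translations(samples, translate, target_langs=("de",)):
    """Append machine-translated copies of each sample.

    `translate(text, tgt)` is a hypothetical callable wrapping a real
    MT system; a multilingual encoder such as DistilBERT can consume
    the original and translated samples together. For classification,
    each synthetic copy would keep its source sample's label.
    """
    augmented = list(samples)
    for text in samples:
        for lang in target_langs:
            augmented.append(translate(text, tgt=lang))
    return augmented
```

In the paper's setup, the comparison is between such synthetic foreign-language copies and genuinely unique foreign samples; the sketch only covers the synthetic side.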


2021 ◽  
Vol 12 (5-2021) ◽  
pp. 22-34
Author(s):  
Pavel A. Lomov ◽  
Marina L. Malozemova

This paper continues research focused on solving the problem of ontology population by training on an automatically generated training set and subsequently using a neural-network language model to analyze texts in order to discover new concepts to add to the ontology. The article is devoted to text data augmentation: increasing the size of the training set by modifying its samples. Along with this, a solution to the problem of clarifying concepts (i.e., adjusting their boundaries in sentences) that were found during the automatic formation of the training set is considered. A brief overview of existing approaches to text data augmentation, as well as approaches to extracting so-called nested named entities (nested NER), is presented. A procedure is proposed for clarifying the boundaries of the discovered concepts in the training set and for augmenting the set prior to training a neural-network language model to identify new ontology concepts in domain texts. The results of an experimental evaluation of the trained model and the main directions of further research are considered.


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Huilin Yuan ◽  
Yufan Song ◽  
Jianlu Hu ◽  
Yatao Ma

With the development of society, more and more attention has been paid to cultural festivals. Beyond the government's emphasis, growing festival consumption also shows that cultural festivals play an increasingly important role in public life. It is therefore vital to grasp public festival sentiment. Text sentiment analysis has been an important research topic in machine learning in recent years. However, there are currently few studies on festival sentiment, and sentiment classifiers are limited by domain or language; far fewer classifiers exist for Chinese text than for English. This paper takes Sina Weibo as the text information carrier and Chinese festival microblogs as the research object. CHN-EDA is used for Chinese text data augmentation, and the traditional classifiers CNN, DNN, and naïve Bayes are then compared to obtain higher accuracy. A matching optimizer is selected, and relevant parameters are determined through experiments. This paper addresses the problem of unbalanced Chinese sentiment data and establishes a more targeted festival text classifier. This festival sentiment classifier can collect public festival emotion effectively, which benefits cultural inheritance and the adjustment of business decisions.
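CHN-EDA itself is not specified in this abstract, but the EDA family of edits it builds on (random swap, random deletion, and so on) can be sketched on token lists, which works for segmented Chinese text as well as English. Function names and parameter defaults below are illustrative assumptions, not the paper's implementation.

```python
import random

def random_swap(tokens, n, rng):
    # EDA random swap: exchange two randomly chosen positions, n times.
    tokens = list(tokens)
    for _ in range(n):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    # EDA random deletion: drop each token with probability p,
    # always keeping at least one token.
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [tokens[rng.randrange(len(tokens))]]

def eda_augment(tokens, n_aug=4, alpha=0.1, seed=0):
    # Produce n_aug perturbed variants of one tokenised sentence,
    # alternating between swap and deletion edits.
    rng = random.Random(seed)
    n_swaps = max(1, int(alpha * len(tokens)))
    out = []
    for k in range(n_aug):
        if k % 2 == 0:
            out.append(random_swap(tokens, n_swaps, rng))
        else:
            out.append(random_deletion(tokens, alpha, rng))
    return out
```

Generating several variants per minority-class sample is one way such edits can be used to rebalance an uneven sentiment dataset, as this paper does for festival microblogs.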



Author(s):  
Hugo Queiroz Abonizio ◽  
Emerson Cabrera Paraiso ◽  
Sylvio Barbon Junior

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Hai He ◽  
Haibo Yang

Language and vision are the two most essential parts of human intelligence for interpreting the real world around us. How to make connections between language and vision is a key point in current research. Multimodal methods such as visual semantic embedding, which unify images and their corresponding texts in the same feature space, have been widely studied recently. Inspired by recent developments in text data augmentation and a simple but powerful technique called EDA (easy data augmentation), we can expand the information in the given data using EDA to improve model performance. In this paper, we take advantage of text data augmentation and word embedding initialization for multimodal retrieval. We utilize EDA for text data augmentation and word embedding initialization for a text encoder based on recurrent neural networks, and we minimize the gap between the two spaces with a triplet ranking loss with hard negative mining. On two Flickr-based datasets, we achieve the same recall with only 60% of the training dataset as normal training with the full available data. Experimental results show the improvement of our proposed model: on all datasets in this paper (Flickr8k, Flickr30k, and MS-COCO), our model performs better on image annotation and image retrieval tasks. The experiments also demonstrate that text data augmentation is more suitable for smaller datasets, while word embedding initialization is suitable for larger ones.
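A triplet ranking loss with hard negative mining, as used here to align the two embedding spaces, can be sketched over one batch of paired image and text embeddings. The sketch assumes cosine similarity on L2-normalised vectors; the function name and margin value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def triplet_loss_hard_negative(image_emb, text_emb, margin=0.2):
    # Similarity matrix between L2-normalised image and text embeddings;
    # row i / column i is the matching (positive) pair.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T
    pos = np.diag(sims)
    # Mask out the positives, then mine the hardest (highest-scoring)
    # negative in each retrieval direction.
    neg_mask = sims - np.eye(len(sims)) * 1e9
    hardest_text = neg_mask.max(axis=1)   # image -> text retrieval
    hardest_image = neg_mask.max(axis=0)  # text -> image retrieval
    # Hinge: a positive pair must beat its hardest negative by `margin`.
    loss_i2t = np.maximum(0.0, margin - pos + hardest_text)
    loss_t2i = np.maximum(0.0, margin - pos + hardest_image)
    return float((loss_i2t + loss_t2i).mean())
```

Mining only the hardest negative per pair (rather than summing over all negatives) is the design choice that makes the loss focus training on the most confusable examples.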

