A Survey of Text Data Augmentation

Author(s):  
Pei Liu ◽  
Xuemin Wang ◽  
Chao Xiang ◽  
Weiye Meng

2021 ◽  
Vol 8 (1) ◽
Author(s):  
Connor Shorten ◽  
Taghi M. Khoshgoftaar ◽  
Borko Furht

Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to name a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
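The offline and online pipelines the survey mentions can be contrasted in a few lines. The sketch below is illustrative, not the survey's code: `synonym_swap` is a toy stand-in for any real transform (synonym replacement, back-translation, etc.), and all function names are assumptions.

```python
import random

def synonym_swap(text, rng):
    # Toy augmentation: swap two adjacent words. Stands in for a real
    # transform such as synonym replacement or back-translation.
    words = text.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def offline_pipeline(dataset, n_copies, seed=0):
    # Offline: augmented copies are generated once, before training,
    # and stored alongside the original samples.
    rng = random.Random(seed)
    return dataset + [synonym_swap(x, rng)
                      for x in dataset for _ in range(n_copies)]

def online_batches(dataset, batch_size, seed=0):
    # Online: each batch is perturbed on the fly, so every epoch sees
    # different variants of the same underlying examples.
    rng = random.Random(seed)
    for i in range(0, len(dataset), batch_size):
        yield [synonym_swap(x, rng) for x in dataset[i:i + batch_size]]
```

The trade-off the survey alludes to: offline augmentation costs storage but keeps the training loop simple, while online augmentation gives effectively unlimited variants at the cost of per-batch compute.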


2021 ◽  
pp. 198-210
Author(s):  
Fei Xia ◽  
Shizhu He ◽  
Kang Liu ◽  
Shengping Liu ◽  
Jun Zhao
2021 ◽  
pp. 97-105
Author(s):  
Dominykas Šeputis

Data augmentation can improve a model's final accuracy by introducing new data samples into the dataset. In this paper, text data augmentation using a translation technique is investigated. Synthetic translations generated by the Opus-MT model are compared to unique foreign data samples in terms of their impact on transformer network-based models' performance. The experimental results showed that multilingual models such as DistilBERT can, in some cases, benefit from the introduction of additional, artificially created data samples presented in a foreign language.
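The scheme described here could be sketched as follows. This is an assumption-laden illustration, not the authors' code: `translate` is a placeholder for a real MT backend (e.g. an Opus-MT checkpoint served through a translation library), and the function names are invented for the example.

```python
def add_synthetic_translations(samples, translate, target_langs=("de",)):
    """Append machine-translated copies of each sample.

    `translate(text, tgt)` is a hypothetical callable wrapping a real
    MT system; a multilingual encoder such as DistilBERT can consume
    the original and translated samples together. For classification,
    each synthetic copy would keep its source sample's label.
    """
    augmented = list(samples)
    for text in samples:
        for lang in target_langs:
            augmented.append(translate(text, tgt=lang))
    return augmented
```

In the paper's setup, the comparison is between such synthetic foreign-language copies and genuinely unique foreign samples; the sketch only covers the synthetic side.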


2021 ◽  
Vol 12 (5-2021) ◽  
pp. 22-34
Author(s):  
Pavel A. Lomov ◽  
Marina L. Malozemova

This paper continues research focused on solving the problem of ontology population by training on an automatically generated training set and subsequently using a neural-network language model to analyze texts in order to discover new concepts to add to the ontology. The article is devoted to text data augmentation: increasing the size of the training set by modifying its samples. Along with this, a solution to the problem of clarifying concepts (i.e., adjusting their boundaries in sentences) that were found during the automatic formation of the training set is considered. A brief overview of existing approaches to text data augmentation, as well as approaches to extracting so-called nested named entities (nested NER), is presented. A procedure is proposed for clarifying the boundaries of the discovered concepts in the training set and for augmenting the set prior to training a neural-network language model to identify new ontology concepts in domain texts. The results of an experimental evaluation of the trained model and the main directions of further research are considered.


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Huilin Yuan ◽  
Yufan Song ◽  
Jianlu Hu ◽  
Yatao Ma

With the development of society, more and more attention has been paid to cultural festivals. Beyond the government's emphasis, growing festival consumption also shows that cultural festivals play an increasingly important role in public life. It is therefore vital to grasp public festival sentiment. Text sentiment analysis has been an important research topic in machine learning in recent years. However, there are currently few studies on festival sentiment, and sentiment classifiers are limited by domain or language; far fewer classifiers exist for Chinese text than for English. This paper takes Sina Weibo as the text information carrier and Chinese festival microblogs as the research object. CHN-EDA is used for Chinese text data augmentation, and the traditional classifiers CNN, DNN, and naïve Bayes are then compared to obtain higher accuracy. A matching optimizer is selected, and relevant parameters are determined through experiments. This paper addresses the problem of unbalanced Chinese sentiment data and establishes a more targeted festival text classifier. This festival sentiment classifier can collect public festival emotion effectively, which benefits cultural inheritance and the adjustment of business decisions.
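CHN-EDA itself is not specified in this abstract, but the EDA family of edits it builds on (random swap, random deletion, and so on) can be sketched on token lists, which works for segmented Chinese text as well as English. Function names and parameter defaults below are illustrative assumptions, not the paper's implementation.

```python
import random

def random_swap(tokens, n, rng):
    # EDA random swap: exchange two randomly chosen positions, n times.
    tokens = list(tokens)
    for _ in range(n):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    # EDA random deletion: drop each token with probability p,
    # always keeping at least one token.
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [tokens[rng.randrange(len(tokens))]]

def eda_augment(tokens, n_aug=4, alpha=0.1, seed=0):
    # Produce n_aug perturbed variants of one tokenised sentence,
    # alternating between swap and deletion edits.
    rng = random.Random(seed)
    n_swaps = max(1, int(alpha * len(tokens)))
    out = []
    for k in range(n_aug):
        if k % 2 == 0:
            out.append(random_swap(tokens, n_swaps, rng))
        else:
            out.append(random_deletion(tokens, alpha, rng))
    return out
```

Generating several variants per minority-class sample is one way such edits can be used to rebalance an uneven sentiment dataset, as this paper does for festival microblogs.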



Author(s):  
Hugo Queiroz Abonizio ◽  
Emerson Cabrera Paraiso ◽  
Sylvio Barbon Junior

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Hai He ◽  
Haibo Yang

Language and vision are the two most essential parts of human intelligence for interpreting the real world around us. How to make connections between language and vision is a key point in current research. Multimodal methods such as visual semantic embedding, which unify images and their corresponding texts in the same feature space, have been widely studied recently. Inspired by recent developments in text data augmentation and a simple but powerful technique called EDA (easy data augmentation), we can expand the information in the given data using EDA to improve model performance. In this paper, we take advantage of text data augmentation and word embedding initialization for multimodal retrieval. We utilize EDA for text data augmentation and word embedding initialization for a text encoder based on recurrent neural networks, and we minimize the gap between the two spaces with a triplet ranking loss with hard negative mining. On two Flickr-based datasets, we achieve the same recall with only 60% of the training dataset as normal training with the full available data. Experimental results show the improvement of our proposed model: on all datasets in this paper (Flickr8k, Flickr30k, and MS-COCO), our model performs better on image annotation and image retrieval tasks. The experiments also demonstrate that text data augmentation is more suitable for smaller datasets, while word embedding initialization is suitable for larger ones.
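A triplet ranking loss with hard negative mining, as used here to align the two embedding spaces, can be sketched over one batch of paired image and text embeddings. The sketch assumes cosine similarity on L2-normalised vectors; the function name and margin value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def triplet_loss_hard_negative(image_emb, text_emb, margin=0.2):
    # Similarity matrix between L2-normalised image and text embeddings;
    # row i / column i is the matching (positive) pair.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T
    pos = np.diag(sims)
    # Mask out the positives, then mine the hardest (highest-scoring)
    # negative in each retrieval direction.
    neg_mask = sims - np.eye(len(sims)) * 1e9
    hardest_text = neg_mask.max(axis=1)   # image -> text retrieval
    hardest_image = neg_mask.max(axis=0)  # text -> image retrieval
    # Hinge: a positive pair must beat its hardest negative by `margin`.
    loss_i2t = np.maximum(0.0, margin - pos + hardest_text)
    loss_t2i = np.maximum(0.0, margin - pos + hardest_image)
    return float((loss_i2t + loss_t2i).mean())
```

Mining only the hardest negative per pair (rather than summing over all negatives) is the design choice that makes the loss focus training on the most confusable examples.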

