Multiple Data Augmentation Strategies for Improving Performance on Automatic Short Answer Scoring

Jiaqi Lun; Jia Zhu; Yong Tang; Min Yang

doi:10.1609/aaai.v34i09.7062

Multiple Data Augmentation Strategies for Improving Performance on Automatic Short Answer Scoring

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i09.7062 ◽

2020 ◽

Vol 34 (09) ◽

pp. 13389-13396 ◽

Cited By ~ 2

Author(s):

Jiaqi Lun ◽

Jia Zhu ◽

Yong Tang ◽

Min Yang

Keyword(s):

Natural Language ◽

Language Processing ◽

Data Augmentation ◽

Training Data ◽

Short Answer ◽

External Knowledge ◽

Multiple Data ◽

Language Representation ◽

Augmentation Strategies ◽

Back Translation

Automatic short answer scoring (ASAS) is a research subject of intelligent education, which is a hot field of natural language understanding. Many experiments have confirmed that the ASAS system is not good enough, because its performance is limited by the training data. Focusing on the problem, we propose MDA-ASAS, multiple data augmentation strategies for improving performance on automatic short answer scoring. MDA-ASAS is designed to learn language representation enhanced by data augmentation strategies, which includes back-translation, correct answer as reference answer, and swap content. We argue that external knowledge has a profound impact on the ASAS process. Meanwhile, the Bidirectional Encoder Representations from Transformers (BERT) model has been shown to be effective for improving many natural language processing tasks, which acquires more semantic, grammatical and other features in large amounts of unsupervised data, and actually adds external knowledge. Combining with the latest BERT model, our experimental results on the ASAS dataset show that MDA-ASAS brings a significant gain over state-of-art. We also perform extensive ablation studies and suggest parameters for practical use.

Download Full-text

UMLS-based data augmentation for natural language processing of clinical research literature

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa309 ◽

2020 ◽

Author(s):

Tian Kang ◽

Adler Perotte ◽

Youlan Tang ◽

Casey Ta ◽

Chunhua Weng

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Data Augmentation ◽

Training Data ◽

Entity Recognition ◽

Classification Model ◽

Learning Models ◽

Sentence Classification

Abstract Objective The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity. Materials and Methods We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT. Results UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, + 17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82). Conclusions This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.

Download Full-text

An AdaBoost Using a Weak-Learner Generating Several Weak-Hypotheses for Large Training Data of Natural Language Processing

IEEJ Transactions on Electronics Information and Systems ◽

10.1541/ieejeiss.130.83 ◽

2010 ◽

Vol 130 (1) ◽

pp. 83-91 ◽

Cited By ~ 1

Author(s):

Tomoya Iwakura ◽

Seishi Okamoto ◽

Kazuo Asakawa

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Training Data ◽

Weak Learner

Download Full-text

Building Damage Detection from Post-Event Aerial Imagery Using Single Shot Multibox Detector

Applied Sciences ◽

10.3390/app9061128 ◽

2019 ◽

Vol 9 (6) ◽

pp. 1128 ◽

Cited By ~ 12

Author(s):

Yundong Li ◽

Wei Hu ◽

Han Dong ◽

Xueyan Zhang

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Hurricane Sandy ◽

Training Data ◽

Aerial Images ◽

Detection Methods ◽

Single Shot ◽

Data Set ◽

Augmentation Strategies ◽

Post Disaster

Using aerial cameras, satellite remote sensing or unmanned aerial vehicles (UAV) equipped with cameras can facilitate search and rescue tasks after disasters. The traditional manual interpretation of huge aerial images is inefficient and could be replaced by machine learning-based methods combined with image processing techniques. Given the development of machine learning, researchers find that convolutional neural networks can effectively extract features from images. Some target detection methods based on deep learning, such as the single-shot multibox detector (SSD) algorithm, can achieve better results than traditional methods. However, the impressive performance of machine learning-based methods results from the numerous labeled samples. Given the complexity of post-disaster scenarios, obtaining many samples in the aftermath of disasters is difficult. To address this issue, a damaged building assessment method using SSD with pretraining and data augmentation is proposed in the current study and highlights the following aspects. (1) Objects can be detected and classified into undamaged buildings, damaged buildings, and ruins. (2) A convolution auto-encoder (CAE) that consists of VGG16 is constructed and trained using unlabeled post-disaster images. As a transfer learning strategy, the weights of the SSD model are initialized using the weights of the CAE counterpart. (3) Data augmentation strategies, such as image mirroring, rotation, Gaussian blur, and Gaussian noise processing, are utilized to augment the training data set. As a case study, aerial images of Hurricane Sandy in 2012 were maximized to validate the proposed method’s effectiveness. Experiments show that the pretraining strategy can improve of 10% in terms of overall accuracy compared with the SSD trained from scratch. These experiments also demonstrate that using data augmentation strategies can improve mAP and mF1 by 72% and 20%, respectively. Finally, the experiment is further verified by another dataset of Hurricane Irma, and it is concluded that the paper method is feasible.

Download Full-text

A Sequential and Intensive Weighted Language Modeling Scheme for Multi-Task Learning-Based Natural Language Understanding

Applied Sciences ◽

10.3390/app11073095 ◽

2021 ◽

Vol 11 (7) ◽

pp. 3095

Author(s):

Suhyune Son ◽

Seonjeong Hwang ◽

Sohyeun Bae ◽

Soo Jun Park ◽

Jang-Hwan Choi

Keyword(s):

Natural Language ◽

Language Processing ◽

Empirical Investigation ◽

Natural Language Understanding ◽

Language Modeling ◽

Language Understanding ◽

Task Learning ◽

Language Representation ◽

Internal Transfer ◽

Two Stages

Multi-task learning (MTL) approaches are actively used for various natural language processing (NLP) tasks. The Multi-Task Deep Neural Network (MT-DNN) has contributed significantly to improving the performance of natural language understanding (NLU) tasks. However, one drawback is that confusion about the language representation of various tasks arises during the training of the MT-DNN model. Inspired by the internal-transfer weighting of MTL in medical imaging, we introduce a Sequential and Intensive Weighted Language Modeling (SIWLM) scheme. The SIWLM consists of two stages: (1) Sequential weighted learning (SWL), which trains a model to learn entire tasks sequentially and concentrically, and (2) Intensive weighted learning (IWL), which enables the model to focus on the central task. We apply this scheme to the MT-DNN model and call this model the MTDNN-SIWLM. Our model achieves higher performance than the existing reference algorithms on six out of the eight GLUE benchmark tasks. Moreover, our model outperforms MT-DNN by 0.77 on average on the overall task. Finally, we conducted a thorough empirical investigation to determine the optimal weight for each GLUE task.

Download Full-text

Enhancing Natural Language Inference Using New and Expanded Training Data Sets and New Learning Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6371 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8504-8511

Author(s):

Arindam Mitra ◽

Ishan Shrivastava ◽

Chitta Baral

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Question Answering ◽

Training Data ◽

Data Sets ◽

Learning Models ◽

New Learning ◽

Word Attention ◽

Attention Function

Natural Language Inference (NLI) plays an important role in many natural language processing tasks such as question answering. However, existing NLI modules that are trained on existing NLI datasets have several drawbacks. For example, they do not capture the notion of entity and role well and often end up making mistakes such as “Peter signed a deal” can be inferred from “John signed a deal”. As part of this work, we have developed two datasets that help mitigate such issues and make the systems better at understanding the notion of “entities” and “roles”. After training the existing models on the new dataset we observe that the existing models do not perform well on one of the new benchmark. We then propose a modification to the “word-to-word” attention function which has been uniformly reused across several popular NLI architectures. The resulting models perform as well as their unmodified counterparts on the existing benchmarks and perform significantly well on the new benchmarks that emphasize “roles” and “entities”.

Download Full-text

Learning Structural Kernels for Natural Language Processing

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00151 ◽

2015 ◽

Vol 3 ◽

pp. 461-473 ◽

Cited By ~ 3

Author(s):

Daniel Beck ◽

Trevor Cohn ◽

Christian Hardmeier ◽

Lucia Specia

Keyword(s):

Natural Language Processing ◽

Model Selection ◽

Natural Language ◽

Language Processing ◽

Bayesian Methods ◽

Training Data ◽

Coarse Grained ◽

Grid Search ◽

Hyperparameter Optimization ◽

Gradient Based

Structural kernels are a flexible learning paradigm that has been widely used in Natural Language Processing. However, the problem of model selection in kernel-based methods is usually overlooked. Previous approaches mostly rely on setting default values for kernel hyperparameters or using grid search, which is slow and coarse-grained. In contrast, Bayesian methods allow efficient model selection by maximizing the evidence on the training data through gradient-based methods. In this paper we show how to perform this in the context of structural kernels by using Gaussian Processes. Experimental results on tree kernels show that this procedure results in better prediction performance compared to hyperparameter optimization via grid search. The framework proposed in this paper can be adapted to other structures besides trees, e.g., strings and graphs, thereby extending the utility of kernel-based methods.

Download Full-text

Infusing Knowledge into the Textual Entailment Task Using Graph Convolutional Networks

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6318 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8074-8081

Author(s):

Pavan Kapanipathi ◽

Veronika Thost ◽

Siva Sankalp Patel ◽

Spencer Whitehead ◽

Ibrahim Abdelaziz ◽

...

Keyword(s):

Language Processing ◽

Prediction Accuracy ◽

Semantic Information ◽

Training Data ◽

Personalized Pagerank ◽

External Knowledge ◽

Convolutional Networks ◽

Textual Entailment ◽

Knowledge Graphs ◽

Textual Content

Textual entailment is a fundamental task in natural language processing. Most approaches for solving this problem use only the textual content present in training data. A few approaches have shown that information from external knowledge sources like knowledge graphs (KGs) can add value, in addition to the textual content, by providing background knowledge that may be critical for a task. However, the proposed models do not fully exploit the information in the usually large and noisy KGs, and it is not clear how it can be effectively encoded to be useful for entailment. We present an approach that complements text-based entailment models with information from KGs by (1) using Personalized PageRank to generate contextual subgraphs with reduced noise and (2) encoding these subgraphs using graph convolutional networks to capture the structural and semantic information in KGs. We evaluate our approach on multiple textual entailment datasets and show that the use of external knowledge helps the model to be robust and improves prediction accuracy. This is particularly evident in the challenging BreakingNLI dataset, where we see an absolute improvement of 5-20% over multiple text-based entailment models.

Download Full-text

Deep Persian sentiment analysis: Cross-lingual training for low-resource languages

Journal of Information Science ◽

10.1177/0165551520962781 ◽

2020 ◽

pp. 016555152096278

Author(s):

Rouzbeh Ghasemi ◽

Seyed Arad Ashrafi Asli ◽

Saeedeh Momtazi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Low Resource ◽

Proposed Model ◽

Significant Difference ◽

Cross Lingual

With the advent of deep neural models in natural language processing tasks, having a large amount of training data plays an essential role in achieving accurate models. Creating valid training data, however, is a challenging issue in many low-resource languages. This problem results in a significant difference between the accuracy of available natural language processing tools for low-resource languages compared with rich languages. To address this problem in the sentiment analysis task in the Persian language, we propose a cross-lingual deep learning framework to benefit from available training data of English. We deployed cross-lingual embedding to model sentiment analysis as a transfer learning model which transfers a model from a rich-resource language to low-resource ones. Our model is flexible to use any cross-lingual word embedding model and any deep architecture for text classification. Our experiments on English Amazon dataset and Persian Digikala dataset using two different embedding models and four different classification networks show the superiority of the proposed model compared with the state-of-the-art monolingual techniques. Based on our experiment, the performance of Persian sentiment analysis improves 22% in static embedding and 9% in dynamic embedding. Our proposed model is general and language-independent; that is, it can be used for any low-resource language, once a cross-lingual embedding is available for the source–target language pair. Moreover, by benefitting from word-aligned cross-lingual embedding, the only required data for a reliable cross-lingual embedding is a bilingual dictionary that is available between almost all languages and the English language, as a potential source language.

Download Full-text

Improving Brain Tumor Segmentation with Data Augmentation Strategies

10.21467/proceedings.114.2 ◽

2021 ◽

Author(s):

Radhika Malhotra ◽

Jasleen Saini ◽

Barjinder Singh Saini ◽

Savita Gupta

Keyword(s):

Brain Tumor ◽

Data Augmentation ◽

Large Data ◽

Training Data ◽

Tumor Segmentation ◽

Biomedical Image Processing ◽

Brain Tumor Segmentation ◽

Biomedical Image ◽

Augmentation Strategies ◽

Unseen Data

In the past decade, there has been a remarkable evolution of convolutional neural networks (CNN) for biomedical image processing. These improvements are inculcated in the basic deep learning-based models for computer-aided detection and prognosis of various ailments. But implementation of these CNN based networks is highly dependent on large data in case of supervised learning processes. This is needed to tackle overfitting issues which is a major concern in supervised techniques. Overfitting refers to the phenomenon when a network starts learning specific patterns of the input such that it fits well on the training data but leads to poor generalization abilities on unseen data. The accessibility of enormous quantity of data limits the field of medical domain research. This paper focuses on utility of data augmentation (DA) techniques, which is a well-recognized solution to the problem of limited data. The experiments were performed on the Brain Tumor Segmentation (BraTS) dataset which is available online. The results signify that different DA approaches have upgraded the accuracies for segmenting brain tumor boundaries using CNN based model.

Download Full-text

Using distant supervision to augment manually annotated data for relation extraction

10.1101/626226 ◽

2019 ◽

Author(s):

Peng Su ◽

Gang Li ◽

Cathy Wu ◽

K. Vijay-Shanker

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Relation Extraction ◽

Biomedical Literature ◽

Training Data ◽

Distant Supervision ◽

Large Size ◽

Domain Expertise

AbstractSignificant progress has been made in applying deep learning on natural language processing tasks recently. However, deep learning models typically require a large amount of annotated training data while often only small labeled datasets are available for many natural language processing tasks in biomedical literature. Building large-size datasets for deep learning is expensive since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data using distant supervision. However, data obtained by distant supervision is often noisy, we first apply some heuristics to remove some of the incorrect annotations. Then using methods inspired from transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.

Download Full-text