Low-Resource Language Discrimination toward Chinese Dialects with Transfer Learning and Data Augmentation

Author(s):  
Fan Xu ◽  
Yangjie Dan ◽  
Keyu Yan ◽  
Yong Ma ◽  
Mingwen Wang

Chinese dialect discrimination is a challenging natural language processing task due to scarce annotation resources. In this article, we develop a novel Chinese dialect discrimination framework with transfer learning and data augmentation (CDDTLDA) to overcome this shortage of resources. More specifically, we first use a relatively large Chinese dialect corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise perturbation) to augment the target-side low-resource Chinese dialects, and fine-tune a target-side ASR model initialized from the source-side ASR model. Meanwhile, the potential common semantic features between the source-side and target-side ASR models are captured by a self-attention mechanism. Finally, we extract the hidden semantic representation from the target ASR model to perform Chinese dialect discrimination. Extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.
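
As a rough illustration of the augmentation step (not the authors' implementation), the sketch below applies speed, pitch, and noise perturbation to a single utterance with librosa; the input file name and perturbation strengths are assumptions.

```python
# Minimal sketch of the three perturbations (speed, pitch, noise) using
# librosa and numpy. File path and parameter values are illustrative only.
import numpy as np
import librosa

y, sr = librosa.load("dialect_utterance.wav", sr=16000)  # hypothetical input file

# 1) Speed perturbation: stretch/compress the waveform in time.
y_speed = librosa.effects.time_stretch(y, rate=1.1)

# 2) Pitch perturbation: shift the pitch by a small number of semitones.
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# 3) Noise perturbation: add low-amplitude Gaussian noise.
y_noise = y + np.random.normal(0.0, 0.005, size=y.shape)

augmented = [y_speed, y_pitch, y_noise]  # each copy is added to the training set
```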

2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Gong-Xu Luo ◽  
Ya-Ting Yang ◽  
Rui Dong ◽  
Yan-Hong Chen ◽  
Wen-Bo Zhang

Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. It is widely recognized that data augmentation and transfer learning are both straightforward and effective approaches to low-resource problems. However, existing methods that use only one of these techniques limit the capacity of NMT models in low-resource settings. To make full use of the advantages of existing methods and further improve translation performance for low-resource languages, we propose a new method that integrates back-translation with mainstream transfer learning architectures: it not only initializes the NMT model by transferring parameters from pretrained models, but also generates synthetic parallel data by translating large-scale monolingual data on the target side to boost the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into both the parent-child and the hierarchical transfer learning architectures. In addition, different preprocessing and training methods are explored to obtain better performance. Experimental results on Uyghur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over baselines that use either technique alone.
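
The core of the back-translation step can be illustrated with a short sketch. Here `translate_target_to_source` is a hypothetical stand-in for an existing target-to-source translation model; it is not part of the paper's code.

```python
# Minimal sketch of back-translation for synthetic parallel data, assuming a
# target->source translation model is already available.
from typing import Callable, List, Tuple

def build_synthetic_parallel(
    monolingual_target: List[str],
    translate_target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each monolingual target sentence with its back-translated source."""
    pairs = []
    for tgt in monolingual_target:
        synthetic_src = translate_target_to_source(tgt)  # back-translation step
        pairs.append((synthetic_src, tgt))                # (noisy source, clean target)
    return pairs

# The synthetic pairs are then mixed with the genuine parallel data and used to
# fine-tune the child model initialized from the pretrained parent model.
```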


Author(s):  
NGOC TAN LE ◽  
Fatiha Sadat

With the emergence of neural network-based approaches, research on information extraction has benefited from large-scale raw texts by leveraging pre-trained embeddings and other data augmentation techniques to address challenges in natural language processing tasks. In this paper, we propose an approach using sequence-to-sequence neural network models for term extraction in low-resource domains. Our empirical experiments on the multilingual ACTER dataset, provided in the LREC-TermEval 2020 shared task on automatic term extraction, demonstrate the effectiveness of the deep learning approach to automatic term extraction in low-data settings.
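
One way to picture the sequence-to-sequence framing is shown below: the sentence is the input and a separator-joined list of its terms is the target. The example sentence, gold terms, separator, prompt prefix, and the t5-small checkpoint are assumptions for illustration, not details of the ACTER setup.

```python
# Illustrative encoding of a single term-extraction training pair for a
# generic encoder-decoder model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

sentence = "Dyspnea is a common symptom of heart failure."   # hypothetical example
terms = ["dyspnea", "heart failure"]                          # hypothetical gold terms

model_inputs = tokenizer("extract terms: " + sentence, truncation=True)
labels = tokenizer("; ".join(terms), truncation=True)
model_inputs["labels"] = labels["input_ids"]  # training target for the decoder
```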


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Michael P. Broderick ◽  
Giovanni M. Di Liberto ◽  
Andrew J. Anderson ◽  
Adrià Rofes ◽  
Edmund C. Lalor

Abstract: Healthy ageing leads to changes in the brain that impact upon sensory and cognitive processing. It is not fully clear how these changes affect the processing of everyday spoken language. Prediction is thought to play an important role in language comprehension, where information about upcoming words is pre-activated across multiple representational levels. However, evidence from electrophysiology suggests differences in how older and younger adults use context-based predictions, particularly at the level of semantic representation. We investigate these differences during natural speech comprehension by presenting older and younger subjects with continuous, narrative speech while recording their electroencephalogram. We use time-lagged linear regression to test how distinct computational measures of (1) semantic dissimilarity and (2) lexical surprisal are processed in the brains of both groups. Our results reveal dissociable neural correlates of these two measures that suggest differences in how younger and older adults successfully comprehend speech. Specifically, our results suggest that, while younger and older subjects both employ context-based lexical predictions, older subjects are significantly less likely to pre-activate the semantic features relating to upcoming words. Furthermore, across our group of older adults, we show that the weaker the neural signature of this semantic pre-activation mechanism, the lower a subject’s semantic verbal fluency score. We interpret these findings as evidence that prediction plays a generally reduced role at the semantic level in the brains of older listeners during speech comprehension, and that these changes may be part of an overall strategy for successfully comprehending speech with reduced cognitive resources.
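
For readers unfamiliar with time-lagged linear regression, the sketch below shows the general pattern: time-shifted copies of a word-level stimulus feature are regressed against the EEG signal. Array shapes, the lag window, and the regularization strength are assumptions, not the study's parameters.

```python
# Minimal sketch of time-lagged (ridge) regression relating a stimulus feature
# (e.g., semantic dissimilarity or surprisal) to a single EEG channel.
import numpy as np
from sklearn.linear_model import Ridge

def lagged_design(stimulus: np.ndarray, lags: range) -> np.ndarray:
    """Stack time-shifted copies of a 1-D stimulus into a design matrix."""
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        X[lag:, j] = stimulus[: n - lag]
    return X

fs = 128                                  # sampling rate (Hz), assumed
stimulus = np.random.randn(fs * 60)       # placeholder 1-min feature time series
eeg = np.random.randn(fs * 60)            # placeholder single EEG channel
lags = range(0, int(0.6 * fs))            # 0-600 ms lag window, assumed

X = lagged_design(stimulus, lags)
model = Ridge(alpha=1.0).fit(X, eeg)      # model.coef_ approximates the temporal response function
```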


Author(s):  
Zolzaya Byambadorj ◽  
Ryota Nishimura ◽  
Altangerel Ayush ◽  
Kengo Ohta ◽  
Norihide Kitaoka

Abstract: Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting such data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target-language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the two. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target-language data for training in all three approaches, and for generating the augmented data used in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data together with 30 min of target-language data, and one using the entire 12 h of the original target-language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target-language dataset. Overall, we found that our proposed TTS system, consisting of a spectrogram prediction network and a PWG neural vocoder, was able to achieve reasonable performance using only 30 min of target-language training data. We also found that, by using 3 h of target-language data for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model trained with 12 h of target-language data.
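
The cross-lingual transfer pattern itself can be sketched generically: initialize from weights pretrained on the high-resource language, swap the language-specific phoneme embedding for the target inventory, and fine-tune on the small target set. The toy model, phoneme inventory sizes, and checkpoint name below are assumptions, not the authors' architecture.

```python
# Minimal sketch of cross-lingual parameter transfer for a spectrogram predictor.
import torch
import torch.nn as nn

class ToySpectrogramPredictor(nn.Module):
    def __init__(self, n_phonemes: int, hidden: int = 256, n_mels: int = 80):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, hidden)   # language-specific input layer
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids):
        x = self.embedding(phoneme_ids)
        h, _ = self.encoder(x)
        return self.to_mel(h)

# Model pretrained on the high-resource language (phoneme count assumed).
source_model = ToySpectrogramPredictor(n_phonemes=70)
# source_model.load_state_dict(torch.load("pretrained_source.pt"))  # hypothetical checkpoint

# Target-language model with its own phoneme inventory (size assumed).
target_model = ToySpectrogramPredictor(n_phonemes=55)
state = {k: v for k, v in source_model.state_dict().items()
         if not k.startswith("embedding")}                   # transfer all but the embedding
target_model.load_state_dict(state, strict=False)

optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)
# ...then fine-tune on the 30 min of (augmented) target-language data as usual.
```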


2020 ◽  
Vol 21 (S23) ◽  
Author(s):  
Jenna Kanerva ◽  
Filip Ginter ◽  
Sampo Pyysalo

Abstract Background: Syntactic analysis, or parsing, is a key task in natural language processing and a required component of many text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the leading formalism for dependency parsing. While a number of recent shared tasks centering on UD have substantially advanced the state of the art in multilingual parsing, there has been little study of parsing texts from specialized domains such as biomedicine. Methods: We explore the application of state-of-the-art neural dependency parsing methods to biomedical text using the recently introduced CRAFT-SA shared task dataset. The CRAFT-SA task broadly follows the UD representation and recent UD task conventions, allowing us to fine-tune the UD-compatible Turku Neural Parser and UDify parsers to the task. We further evaluate the effect of transfer learning using a broad selection of BERT models, including several models pre-trained specifically for biomedical text processing. Results: We find that recently introduced neural parsing technology is capable of generating highly accurate analyses of biomedical text, substantially improving on the best performance reported in the original CRAFT-SA shared task. We also find that initialization using a deep transfer learning model pre-trained on in-domain texts is key to maximizing the performance of the parsing methods.


Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1562
Author(s):  
Chanjun Park ◽  
Yeongwook Yang ◽  
Kinam Park ◽  
Heuiseok Lim

Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering to retain only data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post-editing and the application of various strategies during decoding in the translation process. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties increase when PFA is used for low-resource languages, as PFA requires large amounts of data and the data available for low-resource languages are often insufficient. Building on the premise that NMT model performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean–English NMT, which relies on a low-resource language pair. Through comparative experiments, we show that translation performance can be enhanced without changes to the model. We experimentally examined how performance changed in response to changes in beam size and n-gram blocking, and whether performance was enhanced when a length penalty was applied. The results show that various decoding strategies enhance performance and compare well with previous Korean–English NMT approaches. Therefore, the proposed methodology can improve the performance of NMT models without the use of PFA, presenting a new perspective on improving machine translation performance.
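
The three decoding strategies mentioned (beam size, n-gram blocking, length penalty) map directly onto standard generation parameters; the sketch below shows them with the Hugging Face generate() API. The public opus-mt Korean-English checkpoint and the example sentence are assumptions for illustration, not the paper's model or data.

```python
# Illustrative decoding with explicit beam size, n-gram blocking, and length penalty.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "Helsinki-NLP/opus-mt-ko-en"          # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("오늘 날씨가 좋다", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,               # beam size
    no_repeat_ngram_size=3,    # n-gram blocking
    length_penalty=1.2,        # length penalty applied during beam search
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```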


2022 ◽  
Vol 31 (2) ◽  
pp. 1-34
Author(s):  
Patrick Keller ◽  
Abdoul Kader Kaboré ◽  
Laura Plein ◽  
Jacques Klein ◽  
Yves Le Traon ◽  
...  

Recent successes in training word embeddings for Natural Language Processing (NLP) tasks have encouraged a wave of research on representation learning for source code, which builds on similar NLP methods. The overall objective is to produce code embeddings that capture as much program semantics as possible. State-of-the-art approaches invariably rely on a syntactic representation (i.e., raw lexical tokens, abstract syntax trees, or intermediate representation tokens) to generate embeddings, which are criticized in the literature as non-robust or non-generalizable. In this work, we investigate a novel embedding approach based on the intuition that source code has visual patterns of semantics. We further use these patterns to address the outstanding challenge of identifying semantic code clones. We propose the WySiWiM ("What You See Is What It Means") approach, in which visual representations of source code are fed into powerful pre-trained image classification neural networks from the field of computer vision to benefit from the practical advantages of transfer learning. We evaluate the proposed embedding approach on the task of vulnerable code prediction and on two variations of the task of semantic code clone identification: code clone detection (a binary classification problem) and code classification (a multi-class classification problem). We show with experiments on BigCloneBench (Java) and Open Judge (C) that, although simple, our WySiWiM approach performs as effectively as state-of-the-art approaches such as ASTNN or TBCNN. We also show with data from NVD and SARD that the WySiWiM representation can be used to learn a vulnerable code detector with reasonable performance (accuracy ∼90%). We further explore the influence of different steps in our approach, such as the choice of visual representation or classification algorithm, and discuss the promises and limitations of this research direction.
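
The "code as image" idea can be sketched in a few lines: render a snippet onto a blank canvas and embed it with a pretrained vision backbone. The rendering details, image size, and the ResNet-18 choice below are assumptions for illustration, not the WySiWiM implementation.

```python
# Rough sketch: render source code to an image, then extract a visual embedding.
import torch
from PIL import Image, ImageDraw
from torchvision import models, transforms

def render_code(code: str, size=(224, 224)) -> Image.Image:
    img = Image.new("RGB", size, color="white")
    ImageDraw.Draw(img).multiline_text((4, 4), code, fill="black")
    return img

snippet = "int add(int a, int b) {\n    return a + b;\n}"
image = render_code(snippet)

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep penultimate features as the embedding
backbone.eval()

with torch.no_grad():
    embedding = backbone(transforms.ToTensor()(image).unsqueeze(0))  # 1 x 512 code embedding
# Embeddings of two snippets can then be compared, or fed to a classifier,
# for clone detection or vulnerability prediction.
```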


Author(s):  
Nisrine Ait Khayi ◽  
Vasile Rus ◽  
Lasang Tamang

The transfer learning pretraining-finetuning paradigm has revolutionized the natural language processing field, yielding state-of-the-art results in several subfields such as text classification and question answering. However, little work has investigated pretrained language models for the open student answer assessment task. In this paper, we fine-tune pretrained T5, BERT, RoBERTa, DistilBERT, ALBERT, and XLNet models on the DT-Grade dataset, which contains freely generated (or open) student answers together with judgments of their correctness. The experimental results demonstrate the effectiveness of these models, based on the transfer learning pretraining-finetuning paradigm, for open student answer assessment. An improvement of 8%-15% in accuracy was obtained over previous methods. In particular, a T5-based method led to state-of-the-art results, with an accuracy and F1 score of 0.88.
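
The pretraining-finetuning recipe for this kind of task reduces to attaching a classification head to a pretrained encoder and training on (answer, label) pairs. The sketch below uses RoBERTa via the Hugging Face Trainer; the example answers, binary label set, and training arguments are assumptions rather than the DT-Grade specifics.

```python
# Minimal fine-tuning sketch for answer correctness classification.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

texts = ["The ball keeps moving because no force stops it.",   # hypothetical answers
         "The ball stops because gravity pulls it sideways."]
labels = [1, 0]                                                 # 1 = correct, 0 = incorrect
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class AnswerDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=AnswerDataset(),
)
trainer.train()
```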


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Chenggang Mi ◽  
Shaolin Zhu ◽  
Rui Nie

Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages such as Uyghur and Mongolian, loanword identification tends to perform worse because of limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve performance by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) tags into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.
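
To illustrate how multiple feature embeddings can be combined in a single recurrent tagger, the toy sketch below concatenates word-level and character-level representations before a bidirectional LSTM. It is only in the spirit of the multi-feature model described above: vocabulary sizes, dimensions, and the two-tag scheme are assumptions, and the pronunciation-similarity and POS features (as well as the log-linear combination) are omitted.

```python
# Toy word + character embedding tagger for loanword identification.
import torch
import torch.nn as nn

class LoanwordTagger(nn.Module):
    def __init__(self, n_words=5000, n_chars=100, w_dim=64, c_dim=32, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.char_rnn = nn.LSTM(c_dim, c_dim, batch_first=True)
        self.rnn = nn.LSTM(w_dim + c_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)       # 2 tags: loanword / native word

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, max_word_len)
        b, s, c = char_ids.shape
        char_feats = self.char_rnn(self.char_emb(char_ids.view(b * s, c)))[0][:, -1, :]
        x = torch.cat([self.word_emb(word_ids), char_feats.view(b, s, -1)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                         # per-token tag scores

scores = LoanwordTagger()(torch.zeros(2, 7, dtype=torch.long),
                          torch.zeros(2, 7, 12, dtype=torch.long))
```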


