Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements on various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of texts, cross-lingual language models have proven to be effective in leveraging high-resource languages to enhance low-resource language processing and outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting, in which a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.

Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models

Applied Sciences ◽

10.3390/app11051974 ◽

2021 ◽

Vol 11 (5) ◽

pp. 1974 ◽

Cited By ~ 1

Author(s):

Chanhee Lee ◽

Kisu Yang ◽

Taesun Whang ◽

Chanjun Park ◽

Andrew Matteson ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Model ◽

Language Modeling ◽

Language Models ◽

Low Resource ◽

High Resource ◽

Cross Lingual ◽

Data Efficiency

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. This is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.

FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/622 ◽

2020 ◽

Cited By ~ 1

Author(s):

Zhuang Liu ◽

Degen Huang ◽

Kaiyu Huang ◽

Zhuang Li ◽

Jun Zhao

Keyword(s):

Deep Learning ◽

Text Mining ◽

Language Processing ◽

Large Scale ◽

Language Model ◽

Training Data ◽

Domain Specific ◽

Current State ◽

Language Representation ◽

Financial Domain

There is growing interest in the tasks of financial text mining. Over the past few years, Natural Language Processing (NLP) based on deep learning has advanced rapidly, and deep learning has shown promising results on financial text mining models. However, as NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. In FinBERT, unlike in BERT, we construct six pre-training tasks covering more knowledge, trained simultaneously on general corpora and financial domain corpora, which enables the FinBERT model to better capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.

Low-Resource Named Entity Recognition via the Pre-Training Model

Symmetry ◽

10.3390/sym13050786 ◽

2021 ◽

Vol 13 (5) ◽

pp. 786

Author(s):

Siqi Chen ◽

Yijie Pei ◽

Zunwang Ke ◽

Wushour Silamu

Keyword(s):

Data Augmentation ◽

Language Model ◽

Named Entity Recognition ◽

Name Entity Recognition ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

Low Resource ◽

Named Entity ◽

High Resource

Named entity recognition (NER) is an important task in natural language processing, which requires determining entity boundaries and classifying them into pre-defined categories. For low-resource languages, most state-of-the-art systems require tens of thousands of annotated sentences to obtain high performance. However, there is minimal annotated data available for Uyghur and Hungarian (UH languages) NER tasks. Each task also has its own specificities: differences in words and word order across languages make it a challenging problem. In this paper, we present an effective solution for providing a meaningful and easy-to-use feature extractor for named entity recognition tasks: fine-tuning the pre-trained language model. Specifically, we propose a fine-tuning method for a low-resource language model, which constructs a fine-tuning dataset through data augmentation; then the dataset of a high-resource language is added; and finally the cross-language pre-trained model is fine-tuned on this dataset. In addition, we propose an attention-based fine-tuning strategy that uses symmetry to better select relevant semantic and syntactic information from pre-trained language models and applies these symmetry features to named entity recognition tasks. We evaluated our approach on Uyghur and Hungarian datasets, where it performed well compared to some strong baselines. We close with an overview of the available resources for named entity recognition and some of the open research questions.

FPM: A Collection of Large-scale Foundation Pre-trained Language Models

10.21203/rs.3.rs-1061146/v1 ◽

2021 ◽

Author(s):

Dezhou Shen

Keyword(s):

Language Processing ◽

Large Scale ◽

Language Model ◽

Language Models ◽

Classification Task ◽

Accuracy Rate ◽

Basic Model ◽

Transformer Model ◽

Effective Models ◽

Parameters

Recent work in language modeling has shown that training large-scale Transformer models has driven the latest developments in natural language processing applications. However, there is very little work to unify the current effective models. In this work, we use the current effective model structures to launch a model set built with the most mainstream current technology. We think this will become the basic model in the future. For Chinese, using the GPT-2[9] model, a 10.3 billion parameter language model was trained on the Chinese dataset, and, in particular, a 2.9 billion parameter language model based on dialogue data was trained; the BERT model was trained on the Chinese dataset with 495 million parameters; and the Transformer model was used to train a language model with 5.6 billion parameters on the Chinese dataset. In English, corresponding training work has also been done. Using the GPT-2 model, a language model with 6.4 billion parameters was trained on the English dataset; the BERT[3] model was used to train a language model with 1.24 billion parameters on the English dataset and, in particular, a 688 million parameter language model based on single-card training technology; and the Transformer model was used to train a language model with 5.6 billion parameters on the English dataset. In the TNEWS classification task evaluated by CLUE[13], the BERT-C model exceeded the 59.46% accuracy of ALBERT-xxlarge with an accuracy rate of 59.99%, an increase of 0.53 percentage points. In the QQP classification task evaluated by GLUE[11], the accuracy rate of 78.95% surpassed the 72.1% accuracy of BERT-Large, an increase of 6.85 percentage points, and surpassed the 75.2% accuracy of ERNIE, the current first place in the GLUE evaluation, an increase of 3.75 percentage points.
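
The reported gains are plain differences between accuracy figures, so they can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract; the helper name `point_gain` is hypothetical and not part of the paper.

```python
def point_gain(new_acc: float, baseline_acc: float) -> float:
    """Return the gain in percentage points, rounded to two decimals."""
    return round(new_acc - baseline_acc, 2)


# Accuracy figures (in percent) as quoted in the abstract.
tnews_gain = point_gain(59.99, 59.46)      # BERT-C vs. ALBERT-xxlarge on TNEWS
qqp_gain_bert = point_gain(78.95, 72.1)    # reported model vs. BERT-Large on QQP
qqp_gain_ernie = point_gain(78.95, 75.2)   # reported model vs. ERNIE on QQP

print(tnews_gain, qqp_gain_bert, qqp_gain_ernie)
```

The three values match the increases stated in the abstract, which are absolute differences in accuracy rather than relative improvements.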

Automatic Wordnet Development for Low-Resource Languages using Cross-Lingual WSD

Journal of Artificial Intelligence Research ◽

10.1613/jair.4968 ◽

2016 ◽

Vol 56 ◽

pp. 61-87 ◽

Cited By ~ 5

Author(s):

Nasrin Taghizadeh ◽

Hesham Faili

Keyword(s):

Language Processing ◽

Semantic Processing ◽

Large Scale ◽

Word Sense Disambiguation ◽

Expectation Maximization Algorithm ◽

Word Sense ◽

Low Resource ◽

Persian Language ◽

Sense Disambiguation ◽

Cross Lingual

Wordnets are an effective resource for natural language processing and information retrieval, especially for semantic processing and meaning-related tasks. So far, wordnets have been constructed for many languages. However, the automatic development of wordnets for low-resource languages has not been well studied. In this paper, an Expectation-Maximization algorithm is used to create high-quality and large-scale wordnets for low-resource languages. The proposed method benefits from cross-lingual word sense disambiguation and develops a wordnet using only a bilingual dictionary and a monolingual corpus. The proposed method has been applied to the Persian language, and the resulting wordnet has been evaluated through several experiments. The results show that the induced wordnet has a precision score of 90% and a recall score of 35%.
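
To put the reported precision and recall on a single scale, one can combine them into the standard F1 score (their harmonic mean). The abstract does not report an F1 value; the figure computed below is only an illustrative derivation from the two quoted numbers, and the function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract (90% and 35%).
f1 = f1_score(0.90, 0.35)
print(f"F1 = {f1:.3f}")
```

The harmonic mean is pulled toward the lower of the two values, so the high precision only partly offsets the low recall.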

Unsupervised cross-lingual model transfer for named entity recognition with contextualized word representations

PLoS ONE ◽

10.1371/journal.pone.0257230 ◽

2021 ◽

Vol 16 (9) ◽

pp. e0257230

Author(s):

Huijiong Yan ◽

Tao Qian ◽

Liang Xie ◽

Shanguang Chen

Keyword(s):

Language Processing ◽

Large Scale ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Competitive Performance ◽

Named Entity ◽

Model Transfer ◽

The Cross ◽

Cross Lingual

Named entity recognition (NER) is a fundamental task in the natural language processing (NLP) community. Supervised neural network models based on contextualized word representations can achieve highly competitive performance, but they require a large-scale manually annotated corpus for training. For resource-scarce languages, the construction of such a corpus is always expensive and time-consuming. Thus, unsupervised cross-lingual transfer is a good solution to the problem. In this work, we investigate unsupervised cross-lingual NER with model transfer based on contextualized word representations, which greatly advances cross-lingual NER performance. We study several model transfer settings of unsupervised cross-lingual NER, including (1) different types of pretrained transformer-based language models as input, (2) the exploration strategies of the multilingual contextualized word representations, and (3) multi-source adaptation. In particular, we propose an adapter-based word representation method combined with a parameter generation network (PGN) to better capture the relationship between the source and target languages. We conduct experiments on a benchmark CoNLL dataset involving four languages to simulate the cross-lingual setting. Results show that we can obtain highly competitive performance by cross-lingual model transfer. In particular, our proposed adapter-based PGN model can lead to significant improvements for cross-lingual NER.

Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Computational Intelligence and Neuroscience ◽

10.1155/2021/9975078 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Chenggang Mi ◽

Shaolin Zhu ◽

Rui Nie

Keyword(s):

Language Processing ◽

Data Augmentation ◽

Feature Fusion ◽

Training Data ◽

Low Resource ◽

High Resource ◽

Part Of Speech ◽

Word Level ◽

Cross Lingual ◽

Log Linear

Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, loanword identification tends to perform worse because of limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource language loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.

Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation

Applied Sciences ◽

10.3390/app11146584 ◽

2021 ◽

Vol 11 (14) ◽

pp. 6584

Author(s):

Sugyeong Eo ◽

Chanjun Park ◽

Hyeonseok Moon ◽

Jaehyung Seo ◽

Heuiseok Lim

Keyword(s):

Machine Translation ◽

Large Scale ◽

Data Augmentation ◽

Language Model ◽

Performance Comparison ◽

Language Models ◽

Shared Task ◽

Quality Estimation ◽

Using Data ◽

Cross Lingual

Quality estimation (QE) has recently gained increasing interest as it can predict the quality of machine translation results without a reference translation. QE is an annual shared task at the Conference on Machine Translation (WMT), and most recent studies have applied the multilingual pretrained language model (mPLM) to address this task. Recent studies have focused on the performance improvement of this task using data augmentation with finetuning based on a large-scale mPLM. In this study, we eliminate the effects of data augmentation and conduct a pure performance comparison between various mPLMs. Separate from the recent performance-driven QE research involved in competitions addressing a shared task, we utilize the comparison for sub-tasks from WMT20 and identify an optimal mPLM. Moreover, we demonstrate QE using the multilingual BART model, which has not yet been utilized, and conduct comparative experiments and analyses with cross-lingual language models (XLMs), multilingual BERT, and XLM-RoBERTa.

Neural methods for effective, efficient, and exposure-aware information retrieval

ACM SIGIR Forum ◽

10.1145/3476415.3476434 ◽

2021 ◽

Vol 55 (1) ◽

pp. 1-2

Author(s):

Bhaskar Mitra

Keyword(s):

Information Retrieval ◽

Language Processing ◽

Large Scale ◽

Web Search ◽

Real Life ◽

Inverted Index ◽

Information Need ◽

Product Model ◽

Performance Improvements ◽

Deep Model

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents (or short passages) in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms not seen during training (such as a person's name or a product model number), and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections, such as the document index of a commercial Web search engine, containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018].
Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model, which enables large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
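
The query term independence idea mentioned above can be illustrated with a minimal sketch: if a document's score is a sum of per-term scores, those scores can be precomputed offline and stored in an inverted index, so that query time reduces to lookups and additions. This is not the thesis's implementation. The names `build_index`, `rank`, and `term_frequency` are hypothetical, the term-frequency scorer is a placeholder standing in for a learned model, and only terms that occur in a document are indexed, which is a simplification.

```python
from collections import defaultdict


def term_frequency(term: str, text: str) -> float:
    """Placeholder per-term scorer (assumption: stands in for a neural model)."""
    return float(text.lower().split().count(term))


def build_index(docs: dict, term_score) -> dict:
    """Precompute a score for every (term, document) pair offline."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term][doc_id] = term_score(term, text)
    return index


def rank(query: str, index: dict) -> list:
    """Score each document as the sum of its precomputed per-term scores."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy collection, invented purely for illustration.
docs = {
    "d1": "neural ranking models for web search",
    "d2": "inverted index for fast retrieval",
    "d3": "neural retrieval with inverted index",
}
index = build_index(docs, term_frequency)
results = rank("neural inverted index", index)
print(results)
```

Because each query term contributes independently, no model inference is needed at query time; the cost is that interactions between query terms are not modeled.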

Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, related resources (both datasets and technology) are not yet well developed. Moreover, affixes are an important part of word composition in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. Targeting the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.
