Detecting computer-generated disinformation

Author(s):  
Harald Stiff ◽  
Fredrik Johansson

Abstract Modern neural language models can be used by malicious actors to automatically produce textual content that looks as if it had been written by genuine human users. Due to progress in the controllability of computer-generated text, there is a risk that state-sponsored actors may start using such methods to conduct large-scale information operations. Various detection algorithms have been suggested in the research literature to identify texts produced by language-model-based generators, but these are often evaluated mainly on test data drawn from the same distribution they were trained on. We evaluate promising Transformer-based detection algorithms in a large variety of experiments involving both in-distribution and out-of-distribution test data, as well as on more realistic in-the-wild data. The results show that the generalizability of the detectors can be questioned, especially when they are applied to short social media posts. Moreover, the best-performing (RoBERTa-based) detector is shown to be non-robust even to basic adversarial attacks, illustrating how easy it is for malicious actors to evade the current state-of-the-art detection algorithms.
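
To make the detection setup concrete, the sketch below fine-tunes a RoBERTa-based binary classifier to flag machine-generated text with the Hugging Face transformers library. It is a minimal illustration, not the authors' implementation; the checkpoint name, toy data, and hyperparameters are assumptions.

```python
# Minimal sketch (not the authors' code): fine-tuning a RoBERTa-based binary
# classifier to flag machine-generated text. Checkpoint name, toy data and
# hyperparameters are illustrative assumptions.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

texts = ["a human-written post ...", "a model-generated post ..."]  # placeholder data
labels = torch.tensor([0, 1])                                       # 0 = human, 1 = generated

enc = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                    # a few passes over the toy batch
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: probability that a new post is machine-generated
model.eval()
with torch.no_grad():
    probe = tokenizer(["suspicious social media post"], return_tensors="pt")
    p_generated = torch.softmax(model(**probe).logits, dim=-1)[0, 1].item()
print(p_generated)
```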

2021 ◽  
Author(s):  
Roshan Rao ◽  
Jason Liu ◽  
Robert Verkuil ◽  
Joshua Meier ◽  
John F. Canny ◽  
...  

Abstract Unsupervised protein language models trained across millions of diverse sequences learn the structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
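
To illustrate the interleaved row and column attention described above, the following is a minimal sketch of one such block operating on a multiple sequence alignment tensor of shape (num_sequences, alignment_length, features). The module structure and dimensions are assumptions, not the authors' architecture.

```python
# Minimal sketch (assumed, not the authors' implementation) of a block that
# interleaves row attention (across positions within each aligned sequence) and
# column attention (across sequences at each alignment column).
import torch
import torch.nn as nn

class MSABlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                    # x: (rows, columns, features)
        r = self.norm1(x)                    # row attention: each sequence attends over columns
        r, _ = self.row_attn(r, r, r)
        x = x + r
        c = self.norm2(x).transpose(0, 1)    # column attention: each column attends over sequences
        c, _ = self.col_attn(c, c, c)
        return x + c.transpose(0, 1)

msa = torch.randn(16, 128, 64)               # 16 aligned sequences, 128 positions
print(MSABlock()(msa).shape)                  # torch.Size([16, 128, 64])
```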


2014 ◽  
Vol 40 (3) ◽  
pp. 687-723 ◽  
Author(s):  
Cyril Allauzen ◽  
Bill Byrne ◽  
Adrià de Gispert ◽  
Gonzalo Iglesias ◽  
Michael Riley

This article describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with that of a decoder based on a finite-state automaton representation, showing that PDAs provide a more suitable framework for achieving exact decoding with larger synchronous context-free grammars and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy that uses a weaker language model in the first pass, motivated by the results of the PDA complexity analysis. We study in depth the experimental conditions and tradeoffs under which HiPDT can achieve state-of-the-art performance for large-scale SMT.
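
The stack-based expansion that a PDA encodes compactly can be illustrated with a toy example: the sketch below enumerates the target-side strings derivable from a tiny context-free grammar using an explicit stack, where rewriting a nonterminal plays the role of the push/pop transitions. The grammar is invented for illustration and is unrelated to the HiPDT rule sets.

```python
# Toy sketch of the idea behind a PDA representation: the target side of a
# (synchronous) context-free grammar is expanded with an explicit stack, where
# rewriting a nonterminal corresponds to push/pop transitions. The tiny grammar
# below is invented for illustration, not taken from the paper.
TARGET_RULES = {
    "S": [["X", "X"]],
    "X": [["the", "house"], ["a", "car"]],
}

def expand(symbols):
    """Enumerate all terminal strings derivable from a sequence of symbols."""
    stack = [(symbols, [])]                      # (remaining symbols, emitted words)
    while stack:
        rest, out = stack.pop()
        if not rest:
            yield " ".join(out)
            continue
        head, tail = rest[0], rest[1:]
        if head in TARGET_RULES:                 # nonterminal: push each expansion
            for rhs in TARGET_RULES[head]:
                stack.append((rhs + tail, out))
        else:                                    # terminal: emit the word
            stack.append((tail, out + [head]))

print(sorted(expand(["S"])))
# ['a car a car', 'a car the house', 'the house a car', 'the house the house']
```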


Author(s):  
Zhuang Liu ◽  
Degen Huang ◽  
Kaiyu Huang ◽  
Zhuang Li ◽  
Jun Zhao

There is growing interest in financial text mining tasks. Over the past few years, deep-learning-based Natural Language Processing (NLP) has advanced rapidly, and deep learning has shown promising results on financial text mining tasks. However, because NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful owing to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. Unlike BERT, FinBERT is trained with six pre-training tasks covering more knowledge, simultaneously on general corpora and financial-domain corpora, which enables the model to better capture language knowledge and semantic information. The results show that FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.
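
As a rough illustration of domain-adaptive pre-training, the sketch below continues masked-language-model training of a BERT checkpoint on a small placeholder financial corpus with Hugging Face transformers and datasets. It does not reproduce FinBERT's six pre-training tasks or its mixed general/financial corpora; the checkpoint name and settings are assumptions.

```python
# Minimal sketch (an assumption, not the FinBERT recipe): continued masked-language-model
# pre-training of a BERT checkpoint on a placeholder financial corpus.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder financial sentences; a real run would stream a large-scale corpus.
corpus = Dataset.from_dict({"text": ["Q3 revenue rose 12% year over year.",
                                     "The central bank held interest rates steady."]})
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                       batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="finbert-mlm", num_train_epochs=1,
                                         per_device_train_batch_size=8),
                  train_dataset=tokenized,
                  data_collator=collator)
trainer.train()
```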


2017 ◽  
Vol 16 (31) ◽  
pp. 41-59
Author(s):  
Dídac Pujol

This article shows how the language of Shakespeare’s plays has been rendered into Catalan in three especially significant periods: the late 19th century, the early 20th century, and the late 20th and early 21st centuries. The first section centres on the contrast between natural and unnatural language in Hamlet, and considers how this differentiation is carried out (by linguistic techniques that differ substantially from Shakespeare’s) in a late 19th-century Catalan adaptation by Gaietà Soler. The second part of the article investigates the reasons why, in an early 20th-century translation of King Lear, the translator, Anfòs Par, resorts to medieval rather than present-day language. The last section of the article illustrates how, and explores why, Salvador Oliva’s first (1985) version of The Tempest was retranslated in 2006 using a different language model. The ultimate aim of the paper is to put forward the hypothesis that, in the case of Catalan, Shakespearean translations are both a reflection of the current state of the language and a major linguistic experiment that shapes and creates (sometimes through a via negativa) the Catalan literary language.


Author(s):  
Hao Fei ◽  
Yafeng Ren ◽  
Yue Zhang ◽  
Donghong Ji ◽  
Xiaohui Liang

Abstract Biomedical information extraction (BioIE) is an important task. The aim is to analyze biomedical texts and extract structured information such as named entities and the semantic relations between them. In recent years, pre-trained language models have largely improved the performance of BioIE. However, they neglect to incorporate external structural knowledge, which can provide rich factual information to support the underlying understanding and reasoning required for biomedical information extraction. In this paper, we first evaluate current extraction methods, including vanilla neural networks, general language models and pre-trained contextualized language models, on biomedical information extraction tasks, including named entity recognition, relation extraction and event extraction. We then propose to enrich a contextualized language model by integrating large-scale biomedical knowledge graphs (the resulting model is named BioKGLM). In order to effectively encode knowledge, we explore a three-stage training procedure and introduce different fusion strategies to facilitate knowledge injection. Experimental results on multiple tasks show that BioKGLM consistently outperforms state-of-the-art extraction models. A further analysis proves that BioKGLM can capture the underlying relations between biomedical knowledge concepts, which are crucial for BioIE.
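
One simple form of the knowledge-injection idea can be sketched as follows: aligned knowledge-graph entity embeddings are concatenated with the contextual token representations before a tagging head. This is an assumed illustration, not the BioKGLM architecture or its three-stage training procedure.

```python
# Minimal sketch (an assumption, not the BioKGLM architecture): one simple fusion
# strategy that injects knowledge-graph entity embeddings into contextual token
# representations before a tagging head for named entity recognition.
import torch
import torch.nn as nn

class KnowledgeFusionTagger(nn.Module):
    def __init__(self, hidden=768, kg_dim=100, num_entities=5000, num_tags=9):
        super().__init__()
        self.kg_emb = nn.Embedding(num_entities, kg_dim)   # pretrained KG embeddings in practice
        self.fuse = nn.Linear(hidden + kg_dim, hidden)
        self.head = nn.Linear(hidden, num_tags)

    def forward(self, token_states, entity_ids):
        # token_states: (batch, seq, hidden) from a pre-trained LM encoder
        # entity_ids:   (batch, seq) id of the KG entity linked to each token (0 = none)
        fused = torch.cat([token_states, self.kg_emb(entity_ids)], dim=-1)
        return self.head(torch.tanh(self.fuse(fused)))      # (batch, seq, num_tags)

states = torch.randn(2, 16, 768)                            # stand-in for encoder outputs
entities = torch.randint(0, 5000, (2, 16))
print(KnowledgeFusionTagger()(states, entities).shape)      # torch.Size([2, 16, 9])
```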


Author(s):  
Juntao Li ◽  
Ruidan He ◽  
Hai Ye ◽  
Hwee Tou Ng ◽  
Lidong Bing ◽  
...  

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements on various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of text, cross-lingual language models have proven effective in leveraging high-resource languages to enhance low-resource language processing, and they outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting, in which a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.
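
A minimal sketch of the decomposition idea is given below: two projection heads split a pooled cross-lingual representation into intended domain-invariant and domain-specific parts, and a small statistics network provides a Donsker-Varadhan (MINE-style) estimate of the mutual information between the two parts, which a full objective could minimize or combine with other terms. The network sizes and the exact use of the estimate are assumptions, not the authors' objective.

```python
# Minimal sketch (assumptions, not the authors' objective): splitting a pretrained
# representation into two parts with separate projections, plus a MINE-style
# statistics network that lower-bounds the mutual information between the parts.
import math
import torch
import torch.nn as nn

class Decomposer(nn.Module):
    def __init__(self, dim=768, part=256):
        super().__init__()
        self.invariant = nn.Linear(dim, part)   # intended: domain-invariant features
        self.specific = nn.Linear(dim, part)    # intended: domain-specific features
        self.stat_net = nn.Sequential(nn.Linear(2 * part, 128), nn.ReLU(), nn.Linear(128, 1))

    def mi_estimate(self, a, b):
        """Donsker-Varadhan style lower bound on I(a; b) over a batch."""
        joint = self.stat_net(torch.cat([a, b], dim=-1)).mean()
        shuffled = b[torch.randperm(b.size(0))]
        scores = self.stat_net(torch.cat([a, shuffled], dim=-1)).squeeze(-1)
        marginal = torch.logsumexp(scores, dim=0) - math.log(b.size(0))
        return joint - marginal

    def forward(self, h):
        return self.invariant(h), self.specific(h)

h = torch.randn(32, 768)                         # pooled cross-lingual encoder outputs
model = Decomposer()
inv, spec = model(h)
print(model.mi_estimate(inv, spec))              # scalar MI estimate for this batch
```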


2021 ◽  
Author(s):  
Dezhou Shen

Abstract Recent work in language modeling has shown that training large-scale Transformer models has promoted the latest developments in natural language processing applications. However, there is very little work to unify the current effective models. In this work, we use the current effective model structures to launch a model set through the most mainstream technology, which we believe will become the basic models in the future. For Chinese, using the GPT-2 [9] model, a 10.3 billion parameter language model was trained on the Chinese dataset, and, in particular, a 2.9 billion parameter language model was trained on dialogue data; the BERT model was trained on the Chinese dataset with 495 million parameters; and the Transformer model was trained as a language model with 5.6 billion parameters on the Chinese dataset. Corresponding training work was also done for English. Using the GPT-2 model, a language model with 6.4 billion parameters was trained on the English dataset; the BERT [3] model was trained as a language model with 1.24 billion parameters on the English dataset, and, in particular, a 688 million parameter language model was trained based on single-card training technology; and the Transformer model was trained as a language model with 5.6 billion parameters on the English dataset. In the TNEWS classification task evaluated by CLUE [13], the BERT-C model exceeded the 59.46% accuracy of ALBERT-xxlarge with an accuracy of 59.99%, an increase of 0.53%. In the QQP classification task evaluated by GLUE [11], its accuracy of 78.95% surpassed the 72.1% accuracy of BERT-Large, an increase of 6.85%. Compared with ERNIE, currently in first place in the GLUE evaluation at 75.2%, this is an increase of 3.75%.


2020 ◽  
pp. 1-34
Author(s):  
Yi Han ◽  
Mohsen Moghaddam

Abstract Eliciting user needs for individual components and features of a product or a service on a large scale is a key requirement for innovative design. Gathering and analyzing data as an initial discovery phase of a design process is usually accomplished with a small number of participants, employing qualitative research methods such as observations, focus groups, and interviews. This leaves an entire swath of pertinent user behavior, preferences, and opinions uncaptured. Sentiment analysis is a key enabler for large-scale need finding from online user reviews generated on a regular basis. A major limitation of current sentiment analysis approaches used in design sciences, however, is the need for laborious labeling and annotation of large review datasets for training, which in turn hinders their scalability and transferability across different domains. This article proposes an efficient and scalable methodology for automated and large-scale elicitation of attribute-level user needs. The methodology builds on the state-of-the-art pretrained deep language model, BERT (Bidirectional Encoder Representations from Transformers), with new convolutional net and named-entity recognition (NER) layers for extracting attribute, description, and sentiment words from online user review corpora. The machine translation algorithm BLEU (BiLingual Evaluation Understudy) is utilized to extract need expressions in the form of predefined part-of-speech combinations (e.g., adjective-noun, verb-noun). Numerical experiments are conducted on a large dataset scraped from a major e-commerce retail store for apparel and footwear to demonstrate the performance, feasibility, and potential of the developed methodology.
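
As a stand-in for the part-of-speech-combination step, the sketch below extracts adjective-noun and verb-noun pairs from a review sentence with spaCy. It does not reproduce the paper's BERT, NER, or BLEU-based components; the library and model name are assumptions.

```python
# Minimal sketch (an illustration only): extracting adjective-noun and verb-noun
# pairs from review text as candidate need expressions. The paper's BERT + NER
# layers and BLEU-based extraction are not reproduced; spaCy and the model name
# are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")
review = "The waterproof jacket fits well but the zipper broke after one wash."

pairs = []
for token in nlp(review):
    if token.pos_ == "NOUN":
        for child in token.children:
            if child.pos_ == "ADJ":                      # adjective-noun, e.g. "waterproof jacket"
                pairs.append((child.text, token.text))
    if token.pos_ == "VERB":
        for child in token.children:
            if child.dep_ in ("nsubj", "dobj") and child.pos_ == "NOUN":
                pairs.append((token.text, child.text))   # verb-noun, e.g. "broke zipper"
print(pairs)
```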


Author(s):  
Xiaolong Wang ◽  
Daniel S. Yeung ◽  
James N. K. Liu ◽  
Robert Luk ◽  
Xuan Wang

Language modeling is a current research topic in many domains including speech recognition, optical character recognition, handwriting recognition, machine translation and spelling correction. There are two main types of language models, the mathematical and the linguistic. The most widely used mathematical language model is the n-gram model inferred from statistics. This model has three problems: long distance restriction, recursive nature and partial language understanding. Language models based on linguistics present many difficulties when applied to large scale real texts. We present here a new hybrid language model that combines the advantages of the n-gram statistical language model with those of a linguistic language model which makes use of grammatical or semantic rules. Using suitable rules, this hybrid model can solve problems such as long distance restriction, recursive nature and partial language understanding. The new language model has been effective in experiments and has been incorporated in Chinese sentence input products for Windows and Macintosh OS.
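
The interpolation idea behind such a hybrid model can be sketched with a toy example: a smoothed bigram probability is linearly combined with a rule-based score so that a simple grammatical constraint can compensate for sparse statistics. The counts, rules, and weights are invented for illustration and are not the model described in the article.

```python
# Toy sketch (invented numbers and rules, not the paper's model): interpolating
# a bigram statistical probability with a simple rule-based score so that a
# grammatical constraint can override sparse n-gram statistics.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(w_prev, w, alpha=0.1, vocab=len(unigrams)):
    """Add-alpha smoothed bigram probability P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + alpha) / (unigrams[w_prev] + alpha * vocab)

def p_rule(w_prev, w):
    """Stand-in linguistic rule: determiners should be followed by nouns."""
    if w_prev == "the":
        return 1.0 if w in {"cat", "mat"} else 0.1
    return 0.5

def p_hybrid(w_prev, w, lam=0.7):
    """Linear interpolation of the statistical and rule-based scores."""
    return lam * p_bigram(w_prev, w) + (1 - lam) * p_rule(w_prev, w)

print(p_hybrid("the", "cat"), p_hybrid("the", "sat"))
```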


2021 ◽  
Vol 11 (14) ◽  
pp. 6584
Author(s):  
Sugyeong Eo ◽  
Chanjun Park ◽  
Hyeonseok Moon ◽  
Jaehyung Seo ◽  
Heuiseok Lim

Quality estimation (QE) has recently gained increasing interest as it can predict the quality of machine translation results without a reference translation. QE is an annual shared task at the Conference on Machine Translation (WMT), and most recent studies have applied multilingual pretrained language models (mPLMs) to address this task. Recent studies have focused on improving performance on this task using data augmentation with fine-tuning based on a large-scale mPLM. In this study, we eliminate the effects of data augmentation and conduct a pure performance comparison between various mPLMs. Separately from the recent performance-driven QE research geared toward shared-task competitions, we carry out the comparison on sub-tasks from WMT20 and identify an optimal mPLM. Moreover, we demonstrate QE using the multilingual BART model, which has not yet been utilized for this task, and conduct comparative experiments and analyses with cross-lingual language models (XLMs), multilingual BERT, and XLM-RoBERTa.
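
A minimal sketch of sentence-level QE as regression is shown below: the source sentence and the machine translation are fed to XLM-RoBERTa as a sentence pair, and a single-logit head predicts the quality score. The checkpoint name, score scale, and training setup are assumptions rather than the paper's configuration.

```python
# Minimal sketch (an assumption, not the paper's setup): sentence-level quality
# estimation as regression with XLM-RoBERTa on a source/translation pair.
import torch
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=1)

src = "Das Haus ist klein."
mt = "The house is small."
enc = tokenizer(src, mt, return_tensors="pt", truncation=True)

# Training step against a gold quality score (e.g. a direct-assessment score)
gold = torch.tensor([[0.85]])
out = model(**enc, labels=gold)        # MSE loss is used automatically for num_labels=1
out.loss.backward()

# Inference: the single logit is the predicted quality score
with torch.no_grad():
    score = model(**enc).logits.item()
print(score)
```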

