Analyzing Information Leakage of Updates to Natural Language Models

The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence’s grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.

Download Full-text

Pre-Trained Transformer-Based Language Models for Sundanese

10.21203/rs.3.rs-907893/v1 ◽

2021 ◽

Author(s):

Wilson Wongso ◽

Henry Lucky ◽

Derwin Suhartono

Keyword(s):

Natural Language ◽

Text Classification ◽

Training Data ◽

Language Models ◽

Classification Task ◽

Language Understanding ◽

Training Corpus ◽

Low Resource ◽

Corpus Size ◽

Fine Tune

Abstract The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

Download Full-text

Viable Threat on News Reading: Generating Biased News Using Natural Language Models

10.18653/v1/2020.nlpcss-1.7 ◽

2020 ◽

Author(s):

Saurabh Gupta ◽

Hong Huy Nguyen ◽

Junichi Yamagishi ◽

Isao Echizen

Keyword(s):

Natural Language ◽

Language Models

Download Full-text

Text: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning

10.31234/osf.io/293kt ◽

2021 ◽

Author(s):

Oscar Nils Erik Kjell ◽

H. Andrew Schwartz ◽

Salvatore Giorgi

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Rating Scale ◽

State Of The Art ◽

R Package ◽

Language Models ◽

Categorical Variables ◽

Human Language

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. Text is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as providing a framework for the advanced users to build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed: to transform text to traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain: to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest: to computing semantic similarity scores between texts and significance test the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot: to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating scale scores).

Download Full-text

Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

10.18653/v1/2021.emnlp-main.167 ◽

2021 ◽

Author(s):

Taichi Iki ◽

Akiko Aizawa

Keyword(s):

Natural Language ◽

Natural Language Understanding ◽

Language Models ◽

Language Understanding ◽

Vision And Language

Download Full-text

Database Tuning using Natural Language Processing

ACM SIGMOD Record ◽

10.1145/3503780.3503788 ◽

2021 ◽

Vol 50 (3) ◽

pp. 27-28

Author(s):

Immanuel Trummer

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Training Data ◽

Language Models ◽

Learning Approaches ◽

Training Samples ◽

Starting Point ◽

Training Cost ◽

Transformer Model

Introduction. We have seen significant advances in the state of the art in natural language processing (NLP) over the past few years [20]. These advances have been driven by new neural network architectures, in particular the Transformer model [19], as well as the successful application of transfer learning approaches to NLP [13]. Typically, training for specific NLP tasks starts from large language models that have been pre-trained on generic tasks (e.g., predicting obfuscated words in text [5]) for which large amounts of training data are available. Using such models as a starting point reduces task-specific training cost as well as the number of required training samples by orders of magnitude [7]. These advances motivate new use cases for NLP methods in the context of databases.

Download Full-text

Nominalizations in Requirements Engineering Natural Language Models

Encyclopedia of Information Science and Technology, Fourth Edition ◽

10.4018/978-1-5225-2255-3.ch445 ◽

2018 ◽

pp. 5127-5135 ◽

Cited By ~ 1

Author(s):

Claudia S. Litvak ◽

Graciela Dora Susana Hadad ◽

Jorge Horacio Doorn

Keyword(s):

Software Development ◽

Natural Language ◽

Requirements Engineering ◽

Language Models ◽

Application Domain ◽

Engineering Process ◽

Usual Practice ◽

Precise Meaning ◽

Requirements Engineering Process ◽

Nominal Form

It is a usual practice to use natural language in any document intended for clients and users in the requirements engineering process of a software development. This facilitates the comprehension of the requirements engineer's proposals to clients and users. However, natural language introduces some drawbacks, such as ambiguity and incompleteness, which attempt against a good comprehension of those documents. Glossaries help by reducing ambiguity, though they introduce their own linguistic weaknesses. The nominalization of verbs is one of them. There are sometimes appreciable differences between using a verb form or its nominal form, while in other cases they may be synonyms. Therefore, the requirements engineer must be aware of the precise meaning of each term used in the application domain, in order to correctly define them and properly use them in every document. In this chapter, guidelines about treatment of verb nominalization are given when constructing a specific glossary, called Language Extended Lexicon.

Download Full-text

HIDING CRITICAL INFORMATION WHEN TRAINING LANGUAGE MODELS

EurasianUnionScientists ◽

10.31618/esu.2413-9335.2021.1.86.1349 ◽

2021 ◽

pp. 15-18

Author(s):

A. Evtushenko

Keyword(s):

Natural Language ◽

Language Processing ◽

Text Processing ◽

Language Model ◽

Personal Data ◽

Language Models ◽

Training Dataset ◽

Critical Information ◽

Research Company ◽

Learning Language

Machine learning language models are combinations of algorithms and neural networks designed for text processing composed in natural language (Natural Language Processing, NLP). In 2020, the largest language model from the artificial intelligence research company OpenAI, GPT-3, was released, the maximum number of parameters of which reaches 175 billion. The parameterization of the model increased by more than 100 times made it possible to improve the quality of generated texts to a level that is hard to distinguish from human-written texts. It is noteworthy that this model was trained on a training dataset mainly collected from open sources on the Internet, the volume of which is estimated at 570 GB. This article discusses the problem of memorizing critical information, in particular, personal data of individual, at the stage of training large language models (GPT-2/3 and derivatives), and also describes an algorithmic approach to solving this problem, which consists in additional preprocessing training dataset and refinement of the model inference in the context of generating pseudo-personal data and embedding into the results of work on the tasks of summarization, text generation, formation of answers to questions and others from the field of seq2seq.

Download Full-text

DEVS Natural Language Models and Elaborations

Guide to Modeling and Simulation of Systems of Systems - Simulation Foundations, Methods and Applications ◽

10.1007/978-0-85729-865-2_4 ◽

2013 ◽

pp. 39-62

Author(s):

Bernard P. Zeigler ◽

Hessam S. Sarjoughian

Keyword(s):

Natural Language ◽

Language Models

Download Full-text

Automatisierte Abbildung semantisch heterogener I4.0-Verwaltungsschalen durch Methoden des Natural Language Processing

at - Automatisierungstechnik ◽

10.1515/auto-2021-0050 ◽

2021 ◽

Vol 69 (11) ◽

pp. 940-951

Author(s):

Maximilian Both ◽

Jochen Müller ◽

Christian Diedrich

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Industrie 4.0 ◽

Language Models ◽

Wird Eine ◽

Iso Standards

Zusammenfassung Systeme im Bereich Industrie 4.0 sollen interoperabel miteinander agieren können. Damit dies automatisiert realisiert werden kann, müssen sie semantisch interoperabel sein. Hierfür fokussiert der aktuelle Industrie 4.0 Forschungsansatz einen semantisch homogenen Sprachraum. In diesem Paper wird eine Methode vorgestellt, die diesen Ansatz um heterogene Semantik erweitert. Die Abbildung unbekannter Vokabulare auf eine Zielontologie ermöglicht die Interaktionen heterogener Verwaltungsschalen. Basis der Abbildung sind Methoden aus dem Bereich Natural Language Processing. Hierzu werden auf ISO Standards vortrainierte language models und sentence embeddings kombiniert. Dies führt zu einer vielversprechenden Genauigkeit bei dem erstellten Evaluationsdatensatz, welcher unterschiedliche Semantiken für Identifikation- und Design-Teilmodelle des Projektes Pumpe 4.0 enthält.

Download Full-text