Language Modeling
Recently Published Documents

TOTAL DOCUMENTS: 904 (FIVE YEARS: 224)
H-INDEX: 39 (FIVE YEARS: 7)

2021, Vol 12 (5-2021), pp. 57-66
Author(s): Dzavdet Sh. Suleimanov, Alexander Ya. Fridman, Rinat A. Gilmullin, Boris A. Kulik, et al.

A system analysis of the problem of modeling natural language (NL) made it possible to identify the root cause of the low efficiency of modern tools for accumulating and processing knowledge in such languages: the difficulty of intellectualizing tools that are built on primitive artificial programming languages, which in practice represent a subset of inflectional analytical languages or artificial constructions based on them. To mitigate this problem, it is proposed to build NL modeling systems on technological tools for the verbalization and recognition of sense. These tools consist of semiotic models of NL lexical and grammatical means. This approach appears especially promising for agglutinative languages and is to be implemented using the Tatar language as an example.


2021, Vol 2021, pp. 1-19
Author(s): Raghavendra Rao Althar, Debabrata Samanta, Manjit Kaur, Abeer Ali Alnuaim, Nouf Aljaffan, et al.

The security of a software system is a prime focus area for software development teams. This paper explores data science methods for building a knowledge management system that can assist the software development team in ensuring that a secure software system is being developed. Various approaches in this context are explored using data from software development in the insurance domain. These approaches facilitate an understanding of the practical challenges associated with real-world implementation. The paper also discusses the capabilities of language modeling and its role in the knowledge system. The source code is modeled to build a deep software security analysis model. The proposed model can help software engineers build secure software by assessing software security during development. Extensive experiments show that the proposed models can efficiently exploit software language modeling capabilities to classify security vulnerabilities in software systems.
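As an illustration of how source code can be treated as a language modeling problem for vulnerability classification, the sketch below fine-tunes a pre-trained code encoder as a binary classifier. It is not the authors' pipeline; the checkpoint name, the two labels, and the example snippets are assumptions.

```python
# Minimal sketch (not the authors' pipeline): fine-tuning a pre-trained code
# language model to flag potentially vulnerable code snippets.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # 0 = benign, 1 = vulnerable (assumed labels)
)

snippets = [
    'query = "SELECT * FROM users WHERE id = " + user_input',  # SQL built by string concatenation
    'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))',  # parameterized query
]
labels = torch.tensor([1, 0])

batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one illustrative gradient step; a real run loops over a labeled corpus
```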


Author(s): Yevhen Kostiuk, Mykola Lukashchuk, Alexander Gelbukh, Grigori Sidorov

Probabilistic Bayesian methods are widely used in machine learning. The Variational Autoencoder (VAE) is a common architecture for solving the language modeling task in a self-supervised way. A VAE introduces latent variables into the model: random variables whose distribution is fit to the data. To date, latent variables have in most cases been assumed to be normally distributed. The normal distribution is well understood and can easily be included in any pipeline; moreover, it is a good choice when the Central Limit Theorem (CLT) holds, which makes it effective when working with i.i.d. (independent and identically distributed) random variables. However, the conditions of the CLT are not easy to verify in natural language processing, so the choice of distribution family in this domain is unclear. This paper studies the impact of the choice of continuous prior distributions on the low-resource language modeling task with a VAE. The experiments show a statistically significant difference between different priors in the encoder-decoder architecture. We show that the distribution family is an important hyperparameter in the low-resource language modeling task and should be considered when training the model.
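To make the prior-family hyperparameter concrete, the following sketch shows a VAE latent head in which the distribution family is swappable. The layer sizes and the Laplace alternative are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumed architecture): a latent head whose prior/posterior
# family is a hyperparameter, with a Monte Carlo estimate of the KL term.
import torch
import torch.nn as nn
from torch.distributions import Normal, Laplace

class LatentHead(nn.Module):
    """Maps an encoder state to the parameters of the chosen latent family."""
    def __init__(self, hidden, latent, family="normal"):
        super().__init__()
        self.loc = nn.Linear(hidden, latent)
        self.log_scale = nn.Linear(hidden, latent)
        self.family = {"normal": Normal, "laplace": Laplace}[family]

    def forward(self, h):
        q = self.family(self.loc(h), self.log_scale(h).exp())                  # approximate posterior
        p = self.family(torch.zeros_like(q.loc), torch.ones_like(q.scale))     # prior
        z = q.rsample()                                                        # reparameterized sample
        kl = (q.log_prob(z) - p.log_prob(z)).sum(-1)                           # Monte Carlo KL estimate
        return z, kl

h = torch.randn(4, 256)                        # stand-in encoder states for a batch of 4 sentences
z, kl = LatentHead(256, 32, family="laplace")(h)
print(z.shape, kl.shape)                       # torch.Size([4, 32]) torch.Size([4])
```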


2021, Vol 72, pp. 1343-1384
Author(s): Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet, Madina Babazhanova, Matthias Gallé, et al.

There is an ongoing debate in the NLP community about whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis. First, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives to linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English.
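A probe in this sense is typically a small supervised model trained on frozen representations. The sketch below, using clearly labeled placeholder arrays, shows the kind of linear-probe comparison between a full and a compressed model that such measurements rely on; it is not the paper's framework.

```python
# Minimal sketch (placeholder data, not the paper's framework): a linear probe that
# tests whether frozen LM representations encode a linguistic label such as part of speech.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(representations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen representations and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(representations, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Placeholder stand-ins for per-token hidden states from a full and a compressed model.
rng = np.random.default_rng(0)
pos_tags = rng.integers(0, 12, size=2000)          # 12 hypothetical POS classes
full_model_states = rng.normal(size=(2000, 768))
compressed_states = rng.normal(size=(2000, 256))

print("full:", probe_accuracy(full_model_states, pos_tags))
print("compressed:", probe_accuracy(compressed_states, pos_tags))
```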


2021
Author(s): Henriette Capel, Robin Weiler, Maurits J.J. Dijkstra, Reinier Vleugels, Peter Bloem, et al.

Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether they capture generally useful properties. We introduce the ProteinGLUE benchmark: a set of seven tasks for evaluating learned protein representations. We also offer reference code and provide two baseline models, with hyperparameters, trained specifically for these benchmarks. Pre-training was done on two tasks: masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks, such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets are available at https://github.com/ibivu/protein-glue
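To illustrate the masked symbol prediction objective used for pre-training, the sketch below masks amino-acid positions BERT-style. The 15% masking rate and the vocabulary handling are assumptions, not the ProteinGLUE reference code.

```python
# Minimal sketch (assumed 15% masking, BERT-style): the masked symbol
# prediction objective on amino-acid sequences.
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(VOCAB)  # extra [MASK] symbol

def mask_sequence(seq: str, mask_prob: float = 0.15):
    """Return (input_ids, labels) where labels are -100 except at masked positions."""
    ids = torch.tensor([VOCAB[aa] for aa in seq])
    labels = torch.full_like(ids, -100)        # -100 is ignored by cross-entropy
    masked = torch.rand(len(ids)) < mask_prob
    labels[masked] = ids[masked]               # keep original symbols as targets
    ids[masked] = MASK_ID                      # hide them from the model
    return ids, labels

inputs, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# A transformer encoder is then trained to recover `labels` from `inputs`
# with cross-entropy (ignore_index=-100), as in BERT-style pre-training.
print(inputs.tolist())
print(labels.tolist())
```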


2021, Vol 11 (21), pp. 10324
Author(s): YongSuk Yoo, Kang-moon Park

This paper applies the neural architecture search (NAS) method to Korean and English grammaticality judgment tasks. Building on previous research, which discusses the application of NAS only to a Korean dataset, we extend the method to English grammaticality tasks and compare the two resulting architectures for Korean and English. Since complex syntactic operations lie beneath the word order being computed, the two different architectures produced by automated NAS language modeling provide an interesting testbed for future research. To the best of our knowledge, the methodology adopted here has not been tested in the literature. Crucially, the structure produced by the NAS application is unexpected from the perspective of human experts. Furthermore, NAS generated different models for Korean and English, which involve different syntactic operations.
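As a simplified illustration of architecture search for a grammaticality judgment task, the sketch below runs a random search over a toy recurrent search space with placeholder data. It does not reproduce the NAS method used in the paper; the search space, model, and data are assumptions.

```python
# Minimal sketch (toy random search, placeholder data): selecting an architecture
# for a binary grammaticality (acceptability) judgment task.
import random
import torch
import torch.nn as nn

SEARCH_SPACE = {"cell": ["RNN", "GRU", "LSTM"], "hidden": [32, 64, 128], "layers": [1, 2]}

class Judge(nn.Module):
    def __init__(self, vocab, cell, hidden, layers):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        rnn_cls = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}[cell]
        self.rnn = rnn_cls(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, 2)        # grammatical vs. ungrammatical

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h[:, -1])              # classify from the final hidden state

def evaluate(config, x, y):
    """Train briefly and return accuracy on the same toy batch (a real search uses held-out data)."""
    model = Judge(vocab=100, **config)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(20):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    return (model(x).argmax(-1) == y).float().mean().item()

x = torch.randint(0, 100, (64, 12))            # placeholder token ids
y = torch.randint(0, 2, (64,))                 # placeholder acceptability labels
candidates = [{k: random.choice(v) for k, v in SEARCH_SPACE.items()} for _ in range(5)]
best = max(candidates, key=lambda c: evaluate(c, x, y))
print("selected architecture:", best)
```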


2021
Author(s): Abdul Wahab, Rafet Sifa

In this paper, we propose a new model named DIBERT, which stands for Dependency Injected Bidirectional Encoder Representations from Transformers. DIBERT is a variation of BERT with an additional third objective, Parent Prediction (PP), alongside Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). PP injects the syntactic structure of a dependency tree while pre-training DIBERT, which generates syntax-aware generic representations. We use the WikiText-103 benchmark dataset to pre-train both BERT-Base and DIBERT. After fine-tuning, we observe that DIBERT performs better than BERT-Base on various downstream tasks, including semantic similarity, natural language inference, and sentiment analysis.
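To make the Parent Prediction objective concrete, the sketch below computes a PP loss in which each token's representation must point at its dependency-tree head, added to a placeholder MLM term. The dot-product scoring head and tensor shapes are assumptions, not the DIBERT implementation.

```python
# Minimal sketch (assumed dot-product head): a parent prediction (PP) loss
# combined with an MLM term in a joint pre-training objective.
import torch
import torch.nn.functional as F

def parent_prediction_loss(hidden, parent_idx):
    """hidden: (batch, seq, dim) encoder states; parent_idx: (batch, seq) gold head positions."""
    scores = hidden @ hidden.transpose(1, 2)            # (batch, seq, seq) token-to-token scores
    return F.cross_entropy(scores.reshape(-1, scores.size(-1)), parent_idx.reshape(-1))

batch, seq, dim = 2, 6, 32
hidden = torch.randn(batch, seq, dim, requires_grad=True)      # stand-in for transformer outputs
parent_idx = torch.randint(0, seq, (batch, seq))               # gold parents from a dependency parse
mlm_loss = torch.tensor(2.3)                                   # placeholder MLM term
total = mlm_loss + parent_prediction_loss(hidden, parent_idx)  # joint objective
total.backward()
```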

