Language Modeling
Recently Published Documents

TOTAL DOCUMENTS: 904 (FIVE YEARS: 224)
H-INDEX: 39 (FIVE YEARS: 7)

2021, Vol 12 (5-2021), pp. 57-66
Author(s): Dzavdet Sh. Suleimanov, Alexander Ya. Fridman, Rinat A. Gilmullin, Boris A. Kulik, et al.

A system analysis of the problem of modeling natural language (NL) made it possible to identify the root cause of the low efficiency of modern tools for accumulating and processing knowledge in such languages: the difficulty of intellectualizing tools that are built on primitive artificial programming languages, which in practice represent a subset of inflectional analytical languages or artificial constructions based on them. To mitigate this problem, it is proposed to build NL modeling systems on technological tools for the verbalization and recognition of sense. These tools consist of semiotic models of NL lexical and grammatical means. This approach appears especially promising for agglutinative languages and is to be implemented using the Tatar language as an example.


2021, Vol 2021, pp. 1-19
Author(s): Raghavendra Rao Althar, Debabrata Samanta, Manjit Kaur, Abeer Ali Alnuaim, Nouf Aljaffan, et al.

The security of a software system is a prime focus area for software development teams. This paper explores data science methods for building a knowledge management system that can assist the software development team in ensuring that a secure software system is being developed. Various approaches in this context are explored using data from software development in the insurance domain. These approaches facilitate an understanding of the practical challenges associated with real-world implementation. The paper also discusses the capabilities of language modeling and its role in the knowledge system. The source code is modeled to build a deep software security analysis model. The proposed model can help software engineers build secure software by assessing software security during development. Extensive experiments show that the proposed models can efficiently exploit software language modeling capabilities to classify security vulnerabilities in software systems.
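As an illustration of how source code can be treated as a language modeling problem for vulnerability classification, the sketch below fine-tunes a pre-trained code encoder as a binary classifier. It is not the authors' pipeline; the checkpoint name, the two labels, and the example snippets are assumptions.

```python
# Minimal sketch (not the authors' pipeline): fine-tuning a pre-trained code
# language model to flag potentially vulnerable code snippets.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # 0 = benign, 1 = vulnerable (assumed labels)
)

snippets = [
    'query = "SELECT * FROM users WHERE id = " + user_input',  # SQL built by string concatenation
    'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))',  # parameterized query
]
labels = torch.tensor([1, 0])

batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one illustrative gradient step; a real run loops over a labeled corpus
```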


Author(s): Yevhen Kostiuk, Mykola Lukashchuk, Alexander Gelbukh, Grigori Sidorov

Probabilistic Bayesian methods are widely used in machine learning. The Variational Autoencoder (VAE) is a common architecture for solving the language modeling task in a self-supervised way. A VAE introduces latent variables into the model: random variables whose distribution is fit to the data. To date, latent variables have in most cases been assumed to be normally distributed. The normal distribution is well understood and can easily be included in any pipeline; moreover, it is a good choice when the Central Limit Theorem (CLT) holds, which makes it effective when working with i.i.d. (independent and identically distributed) random variables. However, the conditions of the CLT are not easy to verify in natural language processing, so the choice of distribution family in this domain is unclear. This paper studies the impact of the choice of continuous prior distributions on the low-resource language modeling task with a VAE. The experiments show a statistically significant difference between different priors in the encoder-decoder architecture. We show that the distribution family is an important hyperparameter in the low-resource language modeling task and should be considered when training the model.
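To make the prior-family hyperparameter concrete, the following sketch shows a VAE latent head in which the distribution family is swappable. The layer sizes and the Laplace alternative are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumed architecture): a latent head whose prior/posterior
# family is a hyperparameter, with a Monte Carlo estimate of the KL term.
import torch
import torch.nn as nn
from torch.distributions import Normal, Laplace

class LatentHead(nn.Module):
    """Maps an encoder state to the parameters of the chosen latent family."""
    def __init__(self, hidden, latent, family="normal"):
        super().__init__()
        self.loc = nn.Linear(hidden, latent)
        self.log_scale = nn.Linear(hidden, latent)
        self.family = {"normal": Normal, "laplace": Laplace}[family]

    def forward(self, h):
        q = self.family(self.loc(h), self.log_scale(h).exp())                  # approximate posterior
        p = self.family(torch.zeros_like(q.loc), torch.ones_like(q.scale))     # prior
        z = q.rsample()                                                        # reparameterized sample
        kl = (q.log_prob(z) - p.log_prob(z)).sum(-1)                           # Monte Carlo KL estimate
        return z, kl

h = torch.randn(4, 256)                        # stand-in encoder states for a batch of 4 sentences
z, kl = LatentHead(256, 32, family="laplace")(h)
print(z.shape, kl.shape)                       # torch.Size([4, 32]) torch.Size([4])
```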


2021, Vol 72, pp. 1343-1384
Author(s): Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet, Madina Babazhanova, Matthias Gallé, et al.

There is an ongoing debate in the NLP community about whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis. First, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives to linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English.
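A probe in this sense is typically a small supervised model trained on frozen representations. The sketch below, using clearly labeled placeholder arrays, shows the kind of linear-probe comparison between a full and a compressed model that such measurements rely on; it is not the paper's framework.

```python
# Minimal sketch (placeholder data, not the paper's framework): a linear probe that
# tests whether frozen LM representations encode a linguistic label such as part of speech.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(representations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen representations and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(representations, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Placeholder stand-ins for per-token hidden states from a full and a compressed model.
rng = np.random.default_rng(0)
pos_tags = rng.integers(0, 12, size=2000)          # 12 hypothetical POS classes
full_model_states = rng.normal(size=(2000, 768))
compressed_states = rng.normal(size=(2000, 256))

print("full:", probe_accuracy(full_model_states, pos_tags))
print("compressed:", probe_accuracy(compressed_states, pos_tags))
```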


2021
Author(s): Henriette Capel, Robin Weiler, Maurits J.J. Dijkstra, Reinier Vleugels, Peter Bloem, et al.

Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether they capture generally useful properties. We introduce the ProteinGLUE benchmark: a set of seven tasks for evaluating learned protein representations. We also offer reference code and provide two baseline models, with hyperparameters, trained specifically for these benchmarks. Pre-training was done on two tasks: masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks, such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets are available at https://github.com/ibivu/protein-glue
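To illustrate the masked symbol prediction objective used for pre-training, the sketch below masks amino-acid positions BERT-style. The 15% masking rate and the vocabulary handling are assumptions, not the ProteinGLUE reference code.

```python
# Minimal sketch (assumed 15% masking, BERT-style): the masked symbol
# prediction objective on amino-acid sequences.
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(VOCAB)  # extra [MASK] symbol

def mask_sequence(seq: str, mask_prob: float = 0.15):
    """Return (input_ids, labels) where labels are -100 except at masked positions."""
    ids = torch.tensor([VOCAB[aa] for aa in seq])
    labels = torch.full_like(ids, -100)        # -100 is ignored by cross-entropy
    masked = torch.rand(len(ids)) < mask_prob
    labels[masked] = ids[masked]               # keep original symbols as targets
    ids[masked] = MASK_ID                      # hide them from the model
    return ids, labels

inputs, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# A transformer encoder is then trained to recover `labels` from `inputs`
# with cross-entropy (ignore_index=-100), as in BERT-style pre-training.
print(inputs.tolist())
print(labels.tolist())
```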


2021, Vol 11 (21), pp. 10324
Author(s): YongSuk Yoo, Kang-moon Park

This paper applies the neural architecture search (NAS) method to Korean and English grammaticality judgment tasks. Building on previous research, which discusses the application of NAS only to a Korean dataset, we extend the method to English grammaticality tasks and compare the two resulting architectures for Korean and English. Since complex syntactic operations lie beneath the word order being computed, the two different architectures produced by automated NAS language modeling provide an interesting testbed for future research. To the best of our knowledge, the methodology adopted here has not been tested in the literature. Crucially, the structure produced by the NAS application is unexpected from the perspective of human experts. Furthermore, NAS generated different models for Korean and English, which involve different syntactic operations.
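As a simplified illustration of architecture search for a grammaticality judgment task, the sketch below runs a random search over a toy recurrent search space with placeholder data. It does not reproduce the NAS method used in the paper; the search space, model, and data are assumptions.

```python
# Minimal sketch (toy random search, placeholder data): selecting an architecture
# for a binary grammaticality (acceptability) judgment task.
import random
import torch
import torch.nn as nn

SEARCH_SPACE = {"cell": ["RNN", "GRU", "LSTM"], "hidden": [32, 64, 128], "layers": [1, 2]}

class Judge(nn.Module):
    def __init__(self, vocab, cell, hidden, layers):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        rnn_cls = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}[cell]
        self.rnn = rnn_cls(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, 2)        # grammatical vs. ungrammatical

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h[:, -1])              # classify from the final hidden state

def evaluate(config, x, y):
    """Train briefly and return accuracy on the same toy batch (a real search uses held-out data)."""
    model = Judge(vocab=100, **config)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(20):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    return (model(x).argmax(-1) == y).float().mean().item()

x = torch.randint(0, 100, (64, 12))            # placeholder token ids
y = torch.randint(0, 2, (64,))                 # placeholder acceptability labels
candidates = [{k: random.choice(v) for k, v in SEARCH_SPACE.items()} for _ in range(5)]
best = max(candidates, key=lambda c: evaluate(c, x, y))
print("selected architecture:", best)
```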


2021
Author(s): Abdul Wahab, Rafet Sifa

In this paper, we propose a new model named DIBERT, which stands for Dependency Injected Bidirectional Encoder Representations from Transformers. DIBERT is a variation of BERT with an additional third objective, Parent Prediction (PP), alongside Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). PP injects the syntactic structure of a dependency tree while pre-training DIBERT, which generates syntax-aware generic representations. We use the WikiText-103 benchmark dataset to pre-train both BERT-Base and DIBERT. After fine-tuning, we observe that DIBERT performs better than BERT-Base on various downstream tasks, including semantic similarity, natural language inference, and sentiment analysis.
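To make the Parent Prediction objective concrete, the sketch below computes a PP loss in which each token's representation must point at its dependency-tree head, added to a placeholder MLM term. The dot-product scoring head and tensor shapes are assumptions, not the DIBERT implementation.

```python
# Minimal sketch (assumed dot-product head): a parent prediction (PP) loss
# combined with an MLM term in a joint pre-training objective.
import torch
import torch.nn.functional as F

def parent_prediction_loss(hidden, parent_idx):
    """hidden: (batch, seq, dim) encoder states; parent_idx: (batch, seq) gold head positions."""
    scores = hidden @ hidden.transpose(1, 2)            # (batch, seq, seq) token-to-token scores
    return F.cross_entropy(scores.reshape(-1, scores.size(-1)), parent_idx.reshape(-1))

batch, seq, dim = 2, 6, 32
hidden = torch.randn(batch, seq, dim, requires_grad=True)      # stand-in for transformer outputs
parent_idx = torch.randint(0, seq, (batch, seq))               # gold parents from a dependency parse
mlm_loss = torch.tensor(2.3)                                   # placeholder MLM term
total = mlm_loss + parent_prediction_loss(hidden, parent_idx)  # joint objective
total.backward()
```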

