statistical language models
Recently Published Documents


TOTAL DOCUMENTS: 89 (FIVE YEARS: 10)
H-INDEX: 14 (FIVE YEARS: 0)

2021, Vol. 3, pp. 4
Author(s): Tai-Danae Bradley, Yiannis Vlassopoulos

This work originates from the observation that today's state-of-the-art statistical language models are impressive not only for their performance, but also---and quite crucially---because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.
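
A minimal sketch of the kind of enriched structure described, with notation assumed here rather than taken from the paper: take the monoidal preorder $([0,1], \le, \times)$ and, for texts $s, t$ over a fixed finite alphabet, write $s \sqsubseteq t$ when $t$ extends $s$. One candidate enrichment assigns

$$
\mathcal{L}(s,t) =
\begin{cases}
\pi(t \mid s) & \text{if } s \sqsubseteq t,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\mathcal{L}(t,u)\,\mathcal{L}(s,t) \le \mathcal{L}(s,u),
\qquad
1 \le \mathcal{L}(s,s).
$$

The last two inequalities are the enriched composition and identity laws; for nested extensions $s \sqsubseteq t \sqsubseteq u$ the chain rule $\pi(u \mid s) = \pi(u \mid t)\,\pi(t \mid s)$ makes composition hold with equality, which is one sense in which conditional probabilities of continuations equip text with the structure of a category enriched over probabilities.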


Author(s): Sarra Hasni

The task of geolocating textual data shared on social networks such as Twitter is attracting growing attention. Because such data feed advanced geographic information systems used for multipurpose spatial analysis, extending the paradigm of geolocated data has become an emerging trend. Unlike the statistical language models widely adopted in prior work, the authors propose a new approach that applies embedding models to the geolocation of both tweets and users. They strengthen the geolocation strategy with sequential modelling based on recurrent neural networks, which weighs the importance of words in tweets against their contextual information. They evaluate the power of this strategy to determine the locations of unstructured texts that reflect users' unrestricted writing styles. In particular, the authors demonstrate that semantic properties and word forms can be effective for geolocating texts without specifying local words or per-region topic descriptions.
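
A minimal PyTorch sketch of the general strategy described, an embedding layer feeding a recurrent network whose final state is classified into a region; the module, hyperparameters, and toy batch below are illustrative assumptions, not the authors' implementation.

# Illustrative sketch: embeddings + GRU + linear classifier over candidate regions.
import torch
import torch.nn as nn

class TweetGeolocator(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_regions=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_regions)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded tweet tokens
        embedded = self.embedding(token_ids)
        _, last_hidden = self.rnn(embedded)              # (1, batch, hidden_dim)
        return self.classifier(last_hidden.squeeze(0))   # (batch, num_regions) logits

# Toy usage with two already-tokenised, padded tweets of length 20.
model = TweetGeolocator(vocab_size=10_000)
batch = torch.randint(1, 10_000, (2, 20))
predicted_region = model(batch).argmax(dim=-1)

The recurrent layer is what lets the model weigh each word against its context rather than treating the tweet as a bag of location-indicative terms.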


Author(s): Md. Riazur Rahman, Md. Tarek Habib, Md. Sadekur Rahman, Gazi Zahirul Islam, Md. Abbas Ali Khan

N-gram-based language models are popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem, and Kneser-Ney is one of the most prominent and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results and outperformed previous methods. In this work, we propose an improved method that uses a Kneser-Ney smoothed n-gram language model for grammar checking, and we perform a comparative performance analysis of the Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking the grammatical correctness of Bangla sentences.
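
A minimal sketch of the comparison using NLTK's interpolated smoothers; the toy English corpus, the trigram order, and the perplexity-threshold decision rule are illustrative assumptions rather than the authors' Bangla setup.

# Illustrative sketch: train trigram LMs with Kneser-Ney and Witten-Bell smoothing
# and flag a sentence as ungrammatical when its perplexity exceeds a threshold.
from nltk.lm import KneserNeyInterpolated, WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3
corpus = [["this", "is", "a", "sentence"],
          ["this", "is", "another", "sentence"]]   # toy tokenised training corpus

def train(smoother_cls):
    train_data, vocab = padded_everygram_pipeline(ORDER, corpus)
    lm = smoother_cls(ORDER)
    lm.fit(train_data, vocab)
    return lm

kn_lm = train(KneserNeyInterpolated)
wb_lm = train(WittenBellInterpolated)

def is_grammatical(lm, tokens, threshold=50.0):
    padded = list(pad_both_ends(tokens, n=ORDER))
    test_ngrams = list(ngrams(padded, ORDER))
    return lm.perplexity(test_ngrams) <= threshold    # assumed decision rule

sentence = ["this", "is", "a", "sentence"]
print(is_grammatical(kn_lm, sentence), is_grammatical(wb_lm, sentence))

In this framing, the optimum threshold mentioned above would be tuned on held-out grammatical and ungrammatical sentences rather than fixed by hand.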


IEEE Access, 2020, Vol. 8, pp. 146263-146283
Author(s): Willian Antonio dos Santos, Joao Ribeiro Bezerra, Luis Fabricio Wanderley Goes, Flavia Magalhaes Freitas Ferreira

2019, Vol. 24 (1), pp. 98-130
Author(s): Martijn Bentum, Louis ten Bosch, Antal van den Bosch, Mirjam Ernestus

Previous research has demonstrated that language use can vary depending on the context of situation. The present paper extends this finding by comparing word predictability differences between 14 speech registers, ranging from highly informal conversations to read-aloud books. We trained 14 statistical language models to compute register-specific word predictability and trained a register classifier on the perplexity score vector of the language models. The classifier distinguishes perfectly between samples from all speech registers, and this result generalizes to unseen materials. We show that differences in vocabulary and sentence length cannot explain the speech register classifier’s performance. The combined results show that speech registers differ in word predictability.
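
A minimal scikit-learn sketch of the perplexity-vector idea; the random placeholder matrix, the log transform, and the logistic-regression classifier are assumptions standing in for the 14 register-specific language models and the authors' actual classifier.

# Illustrative sketch: each sample is represented by its perplexity under each of
# the 14 register-specific language models, and a classifier maps that vector to
# the sample's register.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_registers = 1400, 14

# Placeholder features: in the real setup, X[i, j] is the perplexity of sample i
# under the language model trained on register j, and y[i] is its true register.
X = rng.gamma(shape=2.0, scale=50.0, size=(n_samples, n_registers))
y = rng.integers(0, n_registers, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(np.log(X_train), y_train)
print("held-out accuracy:", clf.score(np.log(X_test), y_test))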

