Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

2016, Vol 4, pp. 477-490
Author(s): Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
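
As a rough illustration of the query-time idea only (not the paper's actual data structures or smoothing), the sketch below computes an m-gram probability on the fly from pattern counts using simple absolute-discount interpolation; in the paper these counts come from compressed suffix tree queries and the smoothing is modified Kneser-Ney. The build_counts and prob helpers are hypothetical stand-ins.

# Minimal sketch: on-the-fly m-gram probability assembled from raw pattern
# counts with simple absolute-discount interpolation. The paper instead
# answers the count queries with a compressed suffix tree and uses modified
# Kneser-Ney; this only illustrates that probabilities are computed at query
# time from counts rather than precomputed and stored.

from collections import Counter

def build_counts(tokens, max_order):
    """Count all n-grams up to max_order (a stand-in for a CST index)."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prob(word, context, counts, vocab_size, discount=0.75):
    """P(word | context) via recursive absolute-discount interpolation."""
    if not context:                      # base case: discounted unigram + uniform mass
        unigrams = {ng: c for ng, c in counts.items() if len(ng) == 1}
        total = sum(unigrams.values())
        leftover = discount * len(unigrams) / total
        return max(unigrams.get((word,), 0) - discount, 0) / total + leftover / vocab_size
    ctx_count = counts[tuple(context)]
    if ctx_count == 0:                   # unseen context: back off entirely
        return prob(word, context[1:], counts, vocab_size, discount)
    ngram_count = counts[tuple(context) + (word,)]
    distinct = sum(1 for ng in counts
                   if len(ng) == len(context) + 1 and ng[:-1] == tuple(context))
    backoff_mass = discount * distinct / ctx_count
    return (max(ngram_count - discount, 0) / ctx_count
            + backoff_mass * prob(word, context[1:], counts, vocab_size, discount))

tokens = "the cat sat on the mat the cat ran".split()
counts = build_counts(tokens, max_order=3)
print(prob("sat", ("the", "cat"), counts, vocab_size=len(set(tokens))))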

2016, Vol 105 (1), pp. 51-61
Author(s): Jorge Ferrández-Tordera, Sergio Ortiz-Rojas, Antonio Toral

Abstract Language models (LMs) are an essential element in statistical approaches to natural language processing for tasks such as speech recognition and machine translation (MT). The advent of big data has led to the availability of massive amounts of data for building LMs, and in fact, for the most prominent languages, it is not feasible with current techniques and hardware to train LMs on all the data available nowadays. At the same time, it has been shown that the more data is used for an LM the better the performance, e.g. for MT, without any indication yet of reaching a plateau. This paper presents CloudLM, an open-source cloud-based LM intended for MT, which allows distributed LMs to be queried. CloudLM relies on Apache Solr and provides the functionality of state-of-the-art language modelling (it builds upon KenLM), while allowing massive LMs to be queried (as the use of local memory is drastically reduced), at the expense of slower decoding speed.
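
To make the remote-query idea concrete, here is a hypothetical sketch of a decoder-side client that fetches n-gram records from a Solr core over HTTP. The core name, the field names (ngram, logprob) and the omission of proper ARPA backoff weights are illustrative assumptions, not CloudLM's actual schema or API.

# Illustrative sketch only: looking up n-gram records stored in a Solr core
# over HTTP. The point is that the LM lives remotely and the decoder fetches
# probabilities per query instead of loading the whole model into local memory.

import requests

SOLR_URL = "http://localhost:8983/solr/ngrams/select"   # hypothetical core name

def lookup_ngram(tokens):
    """Fetch the stored log-probability record of one n-gram, if present."""
    query = 'ngram:"%s"' % " ".join(tokens)
    resp = requests.get(SOLR_URL, params={"q": query, "wt": "json", "rows": 1})
    docs = resp.json()["response"]["docs"]
    return docs[0] if docs else None

def sentence_logprob(words, order=5):
    """Sum n-gram log-probabilities, backing off to shorter histories.

    Backoff weights are omitted for brevity; a real LM applies them.
    """
    total = 0.0
    for i in range(len(words)):
        for n in range(min(order, i + 1), 0, -1):        # longest match first
            doc = lookup_ngram(words[i - n + 1:i + 1])
            if doc is not None:
                total += doc["logprob"]
                break
    return total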


Author(s): Ehsan Shareghi, Gholamreza Haffari, Trevor Cohn

Hierarchical Pitman-Yor Process (HPYP) priors are compelling for learning language models, outperforming point-estimate based methods. However, these models remain unpopular due to computational and statistical inference issues, such as memory and time usage, as well as poor mixing of the sampler. In this work we propose a novel framework which represents the HPYP model compactly using compressed suffix trees. We then develop an efficient approximate inference scheme in this framework that has a much lower memory footprint than the full HPYP and is fast at inference time. The experimental results illustrate that our model can be built on significantly larger datasets than previous HPYP models, while being several orders of magnitude smaller, fast for training and inference, and outperforming the perplexity of state-of-the-art Modified Kneser-Ney count-based LM smoothing by up to 15%.
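
For readers unfamiliar with the model, the sketch below shows the Pitman-Yor predictive probability at a single context node (one "restaurant"), with the base distribution supplied by the parent node; the HPYP LM evaluates this quantity recursively down the context hierarchy. The function name and count representation are illustrative, not the paper's implementation.

# Pitman-Yor predictive probability at one restaurant, given customer and
# table counts and the parent (base) probability of the word.

def pyp_predictive(word, customer_counts, table_counts, discount, concentration, base_prob):
    """P(word | seating) under a Pitman-Yor process.

    customer_counts[w] = number of customers eating dish w (c_w)
    table_counts[w]    = number of tables serving dish w   (t_w)
    base_prob          = parent/base probability of `word`
    """
    c_total = sum(customer_counts.values())
    t_total = sum(table_counts.values())
    if c_total == 0:                      # empty restaurant: fall back to the base
        return base_prob
    c_w = customer_counts.get(word, 0)
    t_w = table_counts.get(word, 0)
    return ((c_w - discount * t_w)
            + (concentration + discount * t_total) * base_prob) / (concentration + c_total)

# Example: counts at the context "the cat", parent probability 0.01 for "sat".
print(pyp_predictive("sat", {"sat": 3, "ran": 1}, {"sat": 1, "ran": 1},
                     discount=0.8, concentration=1.0, base_prob=0.01))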


2021
Author(s): Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, ...

Abstract Unsupervised protein language models trained across millions of diverse sequences learn the structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
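
The interleaved attention pattern can be sketched as follows (a minimal PyTorch illustration, not the authors' code): row attention lets each sequence attend over its own positions, and column attention lets each alignment column attend across sequences; tied attention, dropout, layer norms and the masked-LM head are omitted for brevity.

import torch
import torch.nn as nn

class RowColumnBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, msa):                                # msa: (rows, cols, d)
        # Row attention: each sequence attends across its own positions.
        row_out, _ = self.row_attn(msa, msa, msa)          # batch dim = rows
        msa = msa + row_out
        # Column attention: each alignment column attends across sequences.
        cols = msa.transpose(0, 1)                         # (cols, rows, d)
        col_out, _ = self.col_attn(cols, cols, cols)       # batch dim = cols
        msa = msa + col_out.transpose(0, 1)
        return msa + self.ffn(msa)

block = RowColumnBlock()
msa_embeddings = torch.randn(16, 128, 64)                  # 16 aligned sequences, 128 positions
print(block(msa_embeddings).shape)                         # torch.Size([16, 128, 64])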


2020, Vol 34 (10), pp. 13773-13774
Author(s): Shumin Deng, Ningyu Zhang, Zhanlin Sun, Jiaoyan Chen, Huajun Chen

Text classification tends to be difficult when data are deficient or when it is required to adapt to unseen classes. In such challenging scenarios, recent studies have often used meta-learning to simulate the few-shot task, thus neglecting implicit common linguistic features across tasks. This paper addresses such problems using meta-learning and unsupervised language models. Our approach is based on the insight that good generalization from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. We show that our approach is not only simple but also achieves state-of-the-art performance on a well-studied sentiment classification dataset. This further suggests that pretraining could be a promising solution for few-shot learning of many other NLP tasks. The code and the dataset to replicate the experiments are made available at https://github.com/zxlzr/FewShotNLP.
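
A minimal sketch of the episodic setup described above (not the paper's exact algorithm): a shared initialization, for instance pretrained language-model weights, is adapted on each task's small support set, and a Reptile-style outer update moves the initialization toward the adapted weights. Evaluation on the query set and the concrete classifier are omitted and assumed to be supplied by the caller.

import copy
import torch

def meta_train_step(model, tasks, inner_lr=1e-3, meta_lr=0.1, inner_steps=5):
    """One meta-iteration over a batch of few-shot tasks.

    `tasks` yields (support_x, support_y) pairs; `model` is any classifier.
    """
    init_state = copy.deepcopy(model.state_dict())
    for support_x, support_y in tasks:
        model.load_state_dict(init_state)
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                    # adapt on the support set
            loss = torch.nn.functional.cross_entropy(model(support_x), support_y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        adapted = model.state_dict()
        # Outer update: nudge the shared initialization toward the adapted weights.
        for k, v in init_state.items():
            if v.dtype.is_floating_point:
                init_state[k] = v + meta_lr * (adapted[k] - v)
    model.load_state_dict(init_state)
    return model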


2013, Vol 21 (2), pp. 201-226
Author(s): Deyi Xiong, Min Zhang

Abstract The language model is one of the most important knowledge sources for statistical machine translation. In this article, we present two extensions to standard n-gram language models in statistical machine translation: a backward language model that augments the conventional forward language model, and a mutual information trigger model which captures long-distance dependencies that go beyond the scope of standard n-gram language models. We introduce algorithms to integrate the two proposed models into two kinds of state-of-the-art phrase-based decoders. Our experimental results on Chinese/Spanish/Vietnamese-to-English show that both models are able to significantly improve translation quality in terms of BLEU and METEOR over a competitive baseline.
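
As a sketch of the two scoring directions (the mutual information trigger model is not shown), a backward n-gram model conditions each word on the words that follow it rather than those that precede it; forward_prob and backward_prob below stand in for any smoothed n-gram estimator and are assumptions, not the paper's code.

import math

def forward_logprob(words, forward_prob, order=3):
    """Standard left-to-right scoring: P(w_i | preceding words)."""
    return sum(math.log(forward_prob(w, tuple(words[max(0, i - order + 1):i])))
               for i, w in enumerate(words))

def backward_logprob(words, backward_prob, order=3):
    """Backward scoring: P(w_1..w_N) = prod_i P(w_i | w_{i+1}, ..., w_{i+order-1})."""
    return sum(math.log(backward_prob(w, tuple(words[i + 1:i + order])))
               for i, w in enumerate(words))

# In decoding, both scores (plus a trigger-model score) can be combined
# log-linearly as separate features of the phrase-based decoder.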


2014, Vol 102 (1), pp. 81-92
Author(s): Paul Baltescu, Phil Blunsom, Hieu Hoang

Abstract This paper presents an open source implementation of a neural language model for machine translation. Neural language models deal with the problem of data sparsity by learning distributed representations for words in a continuous vector space. The language modelling probabilities are estimated by projecting a word's context in the same space as the word representations and by assigning probabilities proportional to the distance between the words and the context's projection. Neural language models are notoriously slow to train and test. Our framework is designed with scalability in mind and provides two optional techniques for reducing the computational cost: the so-called class decomposition trick and a training algorithm based on noise contrastive estimation. Our models may be extended to incorporate direct n-gram features to learn weights for every n-gram in the training data. Our framework comes with wrappers for the cdec and Moses translation toolkits, allowing our language models to be incorporated as normalized features in their decoders (inside the beam search).
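
The class decomposition trick mentioned above can be sketched as follows (an illustrative PyTorch fragment, not the library's implementation): the model first predicts the word's class and then the word within that class, so normalization costs roughly |C| + |V|/|C| instead of |V|. Class assignments (e.g. from Brown clusters or frequency binning) are assumed given, and NCE training is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassFactoredSoftmax(nn.Module):
    def __init__(self, hidden_dim, word_to_class, class_to_words):
        super().__init__()
        self.word_to_class = word_to_class              # dict: word id -> class id
        self.class_to_words = class_to_words            # dict: class id -> list of word ids
        vocab_size, num_classes = len(word_to_class), len(class_to_words)
        self.class_layer = nn.Linear(hidden_dim, num_classes)
        self.word_layer = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, hidden, word_id):
        """log P(word | context) = log P(class | h) + log P(word | class, h)."""
        cls = self.word_to_class[word_id]
        class_logp = F.log_softmax(self.class_layer(hidden), dim=-1)[cls]
        members = torch.tensor(self.class_to_words[cls])
        word_scores = self.word_layer(hidden)[members]  # normalize only within the class
        idx = self.class_to_words[cls].index(word_id)
        word_logp = F.log_softmax(word_scores, dim=-1)[idx]
        return class_logp + word_logp

# Toy usage: a 6-word vocabulary split into 2 classes.
w2c = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
c2w = {0: [0, 1, 2], 1: [3, 4, 5]}
model = ClassFactoredSoftmax(hidden_dim=8, word_to_class=w2c, class_to_words=c2w)
print(model.log_prob(torch.randn(8), word_id=4))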


2021, Vol 6 (1), pp. 1-4
Author(s): Alexander MacLean, Alexander Wong

The introduction of Bidirectional Encoder Representations from Transformers (BERT) was a major breakthrough for transfer learning in natural language processing, enabling state-of-the-art performance across a large variety of complex language understanding tasks. In the realm of clinical language modeling, the advent of BERT led to the creation of ClinicalBERT, a state-of-the-art deep transformer model pretrained on a wealth of patient clinical notes to facilitate downstream predictive tasks in the clinical domain. While ClinicalBERT has been widely leveraged by the research community as the foundation for building clinical domain-specific predictive models, given its improved performance on the Medical Natural Language Inference (MedNLI) challenge compared to the seminal BERT model, the fine-grained behaviour and intricacies of this popular clinical language model have not been well studied. Without this deeper understanding, it is very challenging to understand where ClinicalBERT does well given its additional exposure to clinical knowledge, where it doesn't, and where it can be improved in a meaningful manner. Motivated to garner a deeper understanding, this study presents a critical behaviour exploration of the ClinicalBERT deep transformer model using the MedNLI challenge dataset to better understand the following intricacies: 1) decision-making similarities between ClinicalBERT and BERT (leveraging a new metric we introduce called Model Alignment), 2) where ClinicalBERT holds advantages over BERT given its clinical knowledge exposure, and 3) where ClinicalBERT struggles when compared to BERT. The insights gained about the behaviour of ClinicalBERT will help guide new directions for designing and training clinical language models in a way that not only addresses the remaining gaps and facilitates further improvements in clinical language understanding performance, but also highlights the limitations and boundaries of use for such models.


2021, Vol 11 (18), pp. 8354
Author(s): Raymond Ian Osolo, Zhan Yang, Jun Long

Many vision–language models that output natural language, such as image-captioning models, use image features merely for grounding the captions; most of the good performance of the model can be attributed to the language model, which does all the heavy lifting. This phenomenon has persisted even with the emergence of transformer-based architectures as the preferred base architecture of recent state-of-the-art vision–language models. In this paper, we make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. This is achieved by performing a 1D Fourier transformation on the image features, first in the hidden dimension and then in the sequence dimension. These extracted features, alongside the region proposal image features, result in a richer image representation that can then be queried to produce the associated captions, which showcase a deeper understanding of image–object–location relationships than similar models. Extensive experiments on the MSCOCO benchmark dataset demonstrate CIDEr-D, BLEU-1, and BLEU-4 scores of 130, 80.5, and 39, respectively.
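
A minimal sketch of the Fourier mixing step described above, under the assumption that the real part of the transform is kept (the authors' exact layer may differ): a 1D FFT over the hidden dimension followed by one over the sequence dimension of the region features. The surrounding captioning model is not shown.

import torch

def fourier_mix(image_features):
    """image_features: (num_regions, hidden_dim) tensor of visual features."""
    x = torch.fft.fft(image_features, dim=-1)    # FFT over the hidden dimension
    x = torch.fft.fft(x, dim=0)                  # then over the sequence dimension
    return x.real                                # keep the real part as the mixed features

regions = torch.randn(36, 512)                   # e.g. 36 region-proposal features
print(fourier_mix(regions).shape)                # torch.Size([36, 512])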


Author(s): Linshu Ouyang, Yongzheng Zhang, Hui Liu, Yige Chen, Yipeng Wang

Authorship verification is an important problem that has many applications. State-of-the-art deep authorship verification methods typically leverage character-level language models to encode author-specific writing styles. However, they often fail to capture syntactic-level patterns, leading to sub-optimal accuracy in cross-topic scenarios. Also, due to imperfect cross-author parameter sharing, it is difficult for them to distinguish author-specific writing style from common patterns, leading to data-inefficient learning. This paper introduces a novel POS-level (part of speech) gated RNN based language model to effectively learn author-specific syntactic styles. The author-agnostic syntactic information obtained from a POS tagger pre-trained on large external datasets greatly reduces the number of effective parameters of our model, enabling the model to learn accurate author-specific syntactic styles with limited training data. We also utilize a gated architecture to learn the common syntactic writing styles with a small set of shared parameters and let the author-specific parameters focus on each author's special syntactic styles. Extensive experimental results show that our method achieves significantly better accuracy than state-of-the-art competing methods, especially in cross-topic scenarios (by over 5% in terms of AUC-ROC).
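
The core idea can be sketched as follows (an illustrative fragment, not the paper's model): a small gated RNN language model over POS-tag sequences, so the parameter count depends on the tagset rather than on the vocabulary. The paper's gated sharing of author-specific and common parameters is not reproduced here, and the POS tagger itself is assumed to run beforehand.

import torch
import torch.nn as nn

class POSLanguageModel(nn.Module):
    def __init__(self, num_tags, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_tags, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # gated RNN
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, tag_ids):                    # (batch, seq_len) of POS-tag ids
        h, _ = self.rnn(self.embed(tag_ids))
        return self.out(h)                         # next-tag logits at every position

# Verification signal: how well an author-specific POS LM predicts the tag
# sequence of a questioned document (lower loss -> more consistent style).
model = POSLanguageModel(num_tags=45)              # e.g. the Penn Treebank tagset
tags = torch.randint(0, 45, (1, 20))
logits = model(tags[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 45), tags[:, 1:].reshape(-1))
print(float(loss))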


2021, Vol 27 (10), pp. 1128-1148
Author(s): Hamda Slimi, Ibrahim Bounhas, Yahya Slimani

Fake news has invaded social media platforms, where false information is propagated with malicious intent at a fast pace. These circumstances require solutions that monitor and detect rumors in a timely manner. In this paper, we propose an approach that seeks to detect emerging and unseen rumors on Twitter by adapting a pre-trained language model, namely RoBERTa, to the task of rumor detection. A comparison against content-based characteristics has shown the capability of the model to surpass handcrafted features. Experimental results show that our approach outperforms state-of-the-art ones on all metrics and that fine-tuning RoBERTa leads to richer word embeddings that consistently and significantly enhance the precision of rumor recognition.
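
A minimal sketch of the adaptation step described above, using the standard Hugging Face interface for RoBERTa sequence classification; the texts, labels, and hyperparameters are illustrative toy values, and evaluation on unseen rumors is omitted.

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["BREAKING: dam has burst near the city", "Official statement confirms road closure"]
labels = torch.tensor([1, 0])                      # 1 = rumor, 0 = non-rumor (toy examples)

model.train()
for _ in range(3):                                 # a few passes over the toy batch
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)        # cross-entropy loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()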

