N-Gram Collection from a Large-Scale Corpus of Polish Internet

Author(s):  
Szymon Roziewski ◽  
Wojciech Stokowiec ◽  
Antoni Sobkowicz
2019 ◽  
Vol 12 (12) ◽  
pp. 2206-2217
Author(s):  
Qiang Long ◽  
Wei Wang ◽  
Jinfu Deng ◽  
Song Liu ◽  
Wenhao Huang ◽  
...  

2012 ◽  
Vol 38 (3) ◽  
pp. 631-671 ◽  
Author(s):  
Ming Tan ◽  
Wenli Zhou ◽  
Lei Zheng ◽  
Shaojun Wang

This paper presents an attempt at building a large-scale distributed composite language model, formed by seamlessly integrating an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, so as to simultaneously account for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model is trained by a convergent N-best-list approximate EM algorithm and a follow-up EM algorithm to improve word prediction power on corpora of up to a billion tokens, stored on a supercomputer. The large-scale distributed composite language model yields a drastic perplexity reduction over n-grams and achieves significantly better translation quality, measured by the BLEU score and the “readability” of translations, when applied to re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
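As an illustrative aside (not the authors' system), the idea of combining component model scores and measuring perplexity can be sketched as follows; the component names, interpolation weights, and toy probabilities are assumptions for the example only.

```python
import math

def interpolate(p_ngram, p_syntax, p_semantic, weights=(0.5, 0.3, 0.2)):
    """Linearly combine per-word probabilities from three component models."""
    w1, w2, w3 = weights
    return w1 * p_ngram + w2 * p_syntax + w3 * p_semantic

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Toy per-word probabilities from three hypothetical component models:
probs = [interpolate(0.1, 0.2, 0.15), interpolate(0.05, 0.1, 0.2)]
print(perplexity(probs))
```

Lower perplexity means the combined model assigns higher probability to the held-out text; the paper's reported gains come from a far richer (directed Markov random field) combination than this linear sketch.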


Information ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 413
Author(s):  
Andry Alamsyah ◽  
Nidya Dudija ◽  
Sri Widiyanesti

Human online activities leave digital traces that provide an excellent opportunity to understand behavior better. Social media is a natural place to spark conversations or state opinions, and it thus generates large-scale textual data. In this paper, we harness those data to support personality measurement. Our first contribution is a Big Five personality-trait model that detects human personality from textual data in the Indonesian language. The model uses an ontology approach instead of the more common machine-learning models; the former better captures the meaning and intention of phrases and words in the domain of human personality. The traditional, more thorough ways to assess personality are interviews or questionnaires, but many real-life applications call for an alternative method that is cheaper and faster for selecting individuals based on their personality. The second contribution is a personality measurement platform that implements the model. The platform relies on two distinct features: an n-gram sorting algorithm to parse the textual data, and a crowdsourcing mechanism through which the public contributes to extending and filtering the ontology corpus.
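A minimal sketch of an n-gram sorting step of the kind described: extract word n-grams and rank them by frequency. Whitespace tokenization and the sample sentence are assumptions; the platform's actual parser for Indonesian may differ.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return word n-grams as tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(text, n=2, k=3):
    """Count n-grams in the text and sort them by descending frequency."""
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, n))
    return counts.most_common(k)

print(top_ngrams("saya suka membaca dan saya suka menulis"))
```

Ranked n-grams like these could then be matched against entries in the personality ontology.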


2020 ◽  
Vol 8 ◽  
pp. 810-827
Author(s):  
Ananya B. Sai ◽  
Akash Kumar Mohankumar ◽  
Siddhartha Arora ◽  
Mitesh M. Khapra

There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references, n-gram-based metrics and embedding-based metrics do not perform well at separating relevant responses from even random negatives. While model-based metrics perform better than n-gram and embedding-based metrics on random negatives, their performance drops substantially when evaluated on adversarial examples. To check if large-scale pretraining could help, we propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset. DEB significantly outperforms existing models, showing better correlation with human judgments and better performance on random negatives (88.27% accuracy). However, its performance again drops substantially when evaluated on adversarial responses, thereby highlighting that even large-scale pretrained evaluation models are not robust to the adversarial examples in our dataset. The dataset and code are publicly available.
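To make the weakness of n-gram-based metrics concrete, here is a toy n-gram precision score (not ADEM, RUBER, or DEB) that rewards surface overlap: a generic, barely relevant response can score perfectly against a reference it happens to share words with. The tokenizer and example strings are assumptions for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference."""
    def grams(s):
        toks = s.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

# A dull, generic response overlaps fully with this reference:
print(ngram_precision("i do not know", "i really do not know at all"))
```

Surface overlap says nothing about whether the response fits the dialog context, which is why the paper turns to trained, model-based metrics.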


2017 ◽  
Author(s):  
Hamid Reza Hassanzadeh ◽  
Ying Sha ◽  
May D. Wang

Multiple cause-of-death data provide a valuable source of information that can be used to enhance health standards by predicting health-related trajectories in societies with large populations. These data are often available in large quantities across U.S. states and require Big Data techniques to uncover complex hidden patterns. We design two different classes of models suitable for large-scale analysis of mortality data: a Hadoop-based ensemble of random forests trained over N-grams, and DeepDeath, a deep classifier based on a recurrent neural network (RNN). We apply both classes to the mortality data provided by the National Center for Health Statistics and show that while both perform significantly better than a random classifier, the deep model, which utilizes long short-term memory networks (LSTMs), surpasses the N-gram-based models and is capable of learning the temporal aspects of the data without the need for ad-hoc, expert-driven feature engineering.
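A small sketch of the N-gram featurization such a pipeline might use before training a tree ensemble: an ordered sequence of cause-of-death codes becomes n-gram count features. The ICD codes shown are illustrative, and the paper's actual pipeline runs on Hadoop rather than in-process.

```python
from collections import Counter

def ngram_features(codes, n=2):
    """Map an ordered list of cause-of-death codes to n-gram count features."""
    grams = ("_".join(codes[i:i + n]) for i in range(len(codes) - n + 1))
    return Counter(grams)

record = ["I25", "E11", "N18"]  # hypothetical multiple causes, ordered
print(ngram_features(record))
```

Counts like these can be fed to a random forest, whereas the RNN in the paper consumes the code sequence directly and so captures ordering without explicit n-gram features.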


2018 ◽  
Author(s):  
Nikos Kostagiolas ◽  
Nikiforos Pittaras ◽  
Christoforos Nikolaou ◽  
George Giannakopoulos

Nucleosomes form the first level of DNA compaction and thus play a critical role in overall genome organization. At the same time, they modulate chromatin accessibility and, through a dynamic equilibrium with other DNA-binding proteins, may shape gene expression. A number of large-scale nucleosome positioning maps, obtained for various genomes, have underscored the importance of nucleosomes in the regulation of gene expression and have shown constraints on the relative positions of nucleosomes to be much stronger around regulatory elements (i.e., promoters, splice junctions and enhancers). At the same time, the great majority of nucleosome positions appears to be rather flexible. Various computational methods have been used in the past to capture the sequence determinants of nucleosome positioning but, as the extent to which DNA sequence preferences guide nucleosome occupancy varies widely, this has proved difficult. In order to focus on highly specific sequence attributes, in this work we analyzed two well-defined sets of nucleosome-occupied sites (NOS) and nucleosome-free regions (NFR) from the genome of S. cerevisiae with the use of textual representations. We employed three different genomic sequence representations (Hidden Markov Models, Bag-of-Words and N-gram Graphs) combined with a number of machine learning algorithms on the task of classifying genomic sequences as nucleosome-free (NFR) or nucleosome-occupied (NOS). We found that approaches involving different representations or algorithms can be more or less effective at predicting nucleosome positioning from the textual data of the underlying genomic sequence.
More interestingly, we show that N-gram Graphs, a sequence representation that takes into account both k-mer occurrences and relative positioning at various length scales, outperforms the other methodologies and may thus be a choice of preference for the analysis of DNA sequences with subtle constraints.
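A minimal sketch of an n-gram (k-mer) graph in the spirit of this representation: nodes are overlapping k-mers and edge weights count co-occurrences within a sliding window, so both k-mer content and relative positioning are retained. The k, window size, and toy sequence are assumptions; the actual N-gram Graph framework is more elaborate.

```python
from collections import defaultdict

def ngram_graph(seq, k=3, window=2):
    """Build a weighted co-occurrence graph of overlapping k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    edges = defaultdict(int)
    for i, a in enumerate(kmers):
        # Connect each k-mer to the next `window` k-mers downstream.
        for b in kmers[i + 1:i + 1 + window]:
            edges[(a, b)] += 1
    return dict(edges)

print(ngram_graph("ACGTAC"))
```

Graphs built this way can be compared (e.g., by edge-set similarity) to classify a sequence as NFR-like or NOS-like, which is what a bag-of-words count of k-mers alone cannot do.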


2014 ◽  
Vol 19 (4) ◽  
pp. 919-927 ◽  
Author(s):  
Hiroaki Yamane ◽  
Masafumi Hagiwara