LanguageCrawl: a generic tool for building language models upon Common Crawl

Language Resources and Evaluation, 2021. DOI: 10.1007/s10579-021-09551-7

Author(s): Szymon Roziewski, Marek Kozłowski

Keyword(s): Language Processing, Deep Neural Networks, Language Model, Language Models, Unstructured Data, Web Pages, Data Intensive, The Common, Internet Community, N Gram

The exponential growth of the internet community has resulted in the production of a vast amount of unstructured data, including web pages, blogs and social media. Such a volume, consisting of hundreds of billions of words, is unlikely to be analyzed by humans. In this work we introduce the tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily build web-scale corpora using the Common Crawl Archive, an open repository of web crawl information that contains petabytes of data. We present three use cases in the course of this work: filtering of Polish websites, the construction of n-gram corpora and the training of a continuous skipgram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility to adjust the specified language and n-gram ranks. This paper focuses particularly on high computing efficiency by applying highly concurrent multitasking. Our tool utilizes effective libraries and design. LanguageCrawl has been made publicly available to enrich the current set of NLP resources. We strongly believe that our work will facilitate further NLP research, especially in under-resourced languages, in which the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods, such as deep neural networks.
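The n-gram corpus construction mentioned in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the LanguageCrawl implementation: the whitespace tokenizer, the function names `tokenize` and `count_ngrams`, and the two sample sentences are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    # Naive lower-cased whitespace tokenizer (an assumption; a real pipeline
    # would use language-aware tokenization).
    return text.lower().split()


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Count all n-grams of order n over an iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        tokens = tokenize(line)
        # Slide a window of width n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample lines standing in for filtered web text.
sample = ["Ala ma kota", "Ala ma psa"]
bigrams = count_ngrams(sample, 2)
```

Changing `n` adjusts the n-gram order, which loosely mirrors the adjustable "n-gram ranks" the abstract mentions.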


An exploratory research on grammar checking of Bangla sentences using statistical language models

International Journal of Electrical and Computer Engineering (IJECE), 2020, Vol 10 (3), pp. 3244. DOI: 10.11591/ijece.v10i3.pp3244-3252

Author(s): M. D. Riazur Rahman, M. D. Tarek Habib, M. D. Sadekur Rahman, Gazi Zahirul Islam, M. D. Abbas Ali Khan

Keyword(s): Language Processing, Language Model, Language Models, Exploratory Research, Smoothing Technique, Comparative Performance, Statistical Language Models, Language Modelling, N Gram, Improved Technique

N-gram based language models are very popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominently used and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking grammatical correctness of Bangla sentences, which showed promising results, outperforming previous methods. In this work, we proposed an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and performed a comparative performance analysis between Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provided an improved technique for calculating the optimum threshold, which further enhanced the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking grammatical correctness of Bangla sentences.
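As a rough illustration of the idea, the sketch below implements a standard interpolated Kneser-Ney bigram model and scores a sentence by its average log-probability, which is then compared with a threshold. It is not the authors' method: the bigram order, the fixed discount of 0.75, the probability floor, the English toy corpus, and the threshold value are all assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import List


class KneserNeyBigram:
    """Interpolated Kneser-Ney bigram model with a single fixed discount."""

    def __init__(self, sentences: List[List[str]], discount: float = 0.75):
        self.d = discount
        self.bigram: Counter = Counter()
        self.history: Counter = Counter()
        followers = defaultdict(set)   # distinct words seen after a history
        preceders = defaultdict(set)   # distinct histories seen before a word
        for sent in sentences:
            toks = ["<s>"] + sent + ["</s>"]
            for v, w in zip(toks, toks[1:]):
                self.bigram[(v, w)] += 1
                self.history[v] += 1
                followers[v].add(w)
                preceders[w].add(v)
        self.n_followers = {v: len(ws) for v, ws in followers.items()}
        self.n_preceders = {w: len(vs) for w, vs in preceders.items()}
        self.n_bigram_types = len(self.bigram)

    def p_cont(self, w: str) -> float:
        # Continuation probability: share of bigram types that end in w.
        return self.n_preceders.get(w, 0) / self.n_bigram_types

    def prob(self, v: str, w: str) -> float:
        c_v = self.history.get(v, 0)
        if c_v == 0:
            return self.p_cont(w)  # unseen history: fall back entirely
        discounted = max(self.bigram.get((v, w), 0) - self.d, 0.0) / c_v
        backoff_weight = self.d * self.n_followers[v] / c_v
        return discounted + backoff_weight * self.p_cont(w)

    def avg_logprob(self, sent: List[str], floor: float = 1e-8) -> float:
        # The floor (an assumption) avoids log(0) for out-of-vocabulary words.
        toks = ["<s>"] + sent + ["</s>"]
        logs = [math.log(max(self.prob(v, w), floor))
                for v, w in zip(toks, toks[1:])]
        return sum(logs) / len(logs)


def is_grammatical(model: KneserNeyBigram, sent: List[str],
                   threshold: float) -> bool:
    # Hypothetical decision rule: accept if the score clears the threshold.
    return model.avg_logprob(sent) >= threshold


corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"],
          ["a", "cat", "runs"]]
kn = KneserNeyBigram(corpus)
good = kn.avg_logprob(["the", "cat", "sleeps"])
bad = kn.avg_logprob(["sleeps", "the", "cat"])
```

In practice the threshold would be tuned on held-out data; the abstract reports an improved technique for choosing it but does not describe the procedure here.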


INTEGRATION OF n-GRAM LANGUAGE MODELS IN MULTIPLE CLASSIFIER SYSTEMS FOR OFFLINE HANDWRITTEN TEXT LINE RECOGNITION

International Journal of Pattern Recognition and Artificial Intelligence, 2008, Vol 22 (07), pp. 1301-1321. DOI: 10.1142/s0218001408006855. Cited By ~ 2

Author(s): ROMAN BERTOLAMI, HORST BUNKE

Keyword(s): Language Model, Language Models, Combination Method, Text Line, Multiple Classifier Systems, Classifier Systems, Handwritten Text, Handwritten Text Recognition, Multiple Classifier, N Gram

Current multiple classifier systems for unconstrained handwritten text recognition do not provide a straightforward way to utilize language model information. In this paper, we describe a generic method to integrate a statistical n-gram language model into the combination of multiple offline handwritten text line recognizers. The proposed method first builds a word transition network and then rescores this network with an n-gram language model. Experimental evaluation conducted on a large dataset of offline handwritten text lines shows that the proposed approach improves the recognition accuracy over a reference system as well as over the original combination method that does not include a language model.
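The rescoring step can be sketched in Python under strong simplifying assumptions. Here the word transition network is reduced to one set of alternative words per position, each with a recognizer log-score, and a bigram language model is combined with those scores by a Viterbi search. The network layout, the names `rescore_network` and `toy_lm`, the toy probabilities, and the `lm_weight` parameter are illustrative assumptions, not the paper's construction.

```python
import math
from typing import Callable, Dict, List, Tuple

Slot = Dict[str, float]  # word -> recognizer log-score at one position


def rescore_network(network: List[Slot],
                    lm_logprob: Callable[[str, str], float],
                    lm_weight: float = 1.0) -> List[str]:
    """Return the best word sequence under recognizer plus bigram LM scores."""
    # best[word] = (score, path) of the best partial path ending in word.
    best: Dict[str, Tuple[float, List[str]]] = {"<s>": (0.0, [])}
    for slot in network:
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for word, rec_logprob in slot.items():
            nxt[word] = max(
                ((score + rec_logprob + lm_weight * lm_logprob(prev, word),
                  path + [word])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Toy bigram probabilities; unseen pairs get a small default (assumption).
BIGRAMS = {("<s>", "the"): 0.5, ("the", "cat"): 0.4, ("cat", "sat"): 0.3}


def toy_lm(prev: str, word: str) -> float:
    return math.log(BIGRAMS.get((prev, word), 0.01))


network = [
    {"the": math.log(0.6), "she": math.log(0.4)},
    {"cat": math.log(0.45), "cut": math.log(0.55)},
    {"sat": 0.0},
]
with_lm = rescore_network(network, toy_lm)
without_lm = rescore_network(network, toy_lm, lm_weight=0.0)
```

With the language model switched off, the recognizer's slightly preferred "cut" wins; with it on, the more plausible "cat" is selected.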


Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches

AI, 2021, Vol 2 (1), pp. 1-16. DOI: 10.3390/ai2010001

Author(s): Juan Cruz-Benito, Sanjay Vishwakarma, Francisco Martin-Fernandez, Ismael Faro

Keyword(s): Deep Learning, Learning Community, Programming Languages, Language Processing, Code Generation, Language Model, Language Models, Stochastic Gradient Descent, Network Architectures, Learning Architectures

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.


Astrid

Proceedings of the VLDB Endowment, 2020, Vol 14 (4), pp. 471-484. DOI: 10.14778/3436905.3436907

Author(s): Suraj Shetiya, Saravanan Thirumuruganathan, Nick Koudas, Gautam Das

Keyword(s): Deep Learning, Objective Function, Pattern Matching, Language Processing, Language Model, Language Models, Selectivity Estimation, Statistical Correlations, Benchmark Datasets, Traditional Approaches

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries, followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates, resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing, such as embeddings, to encode the strings and use them to train a model. While this is an improvement over traditional approaches, there is a large scope for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
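The prefix-frequency example from the abstract can be reproduced with a short Python sketch. It only computes exact prefix counts and selectivities over a synthetic collection; it does not implement Astrid's embedding or its neural language model. The function names and the synthetic data (including the filler string 'abe') are assumptions chosen so that the counts match the 1000, 800 and 100 quoted above.

```python
from collections import Counter
from typing import Iterable


def prefix_counts(strings: Iterable[str]) -> Counter:
    """Count, for every prefix, how many strings start with it."""
    counts: Counter = Counter()
    for s in strings:
        for k in range(1, len(s) + 1):
            counts[s[:k]] += 1
    return counts


def prefix_selectivity(counts: Counter, prefix: str, total: int) -> float:
    # Fraction of strings in the collection that match the prefix predicate.
    return counts.get(prefix, 0) / total


# Synthetic collection: 800 'abc', 100 'abd', 100 'abe' (assumption).
data = ["abc"] * 800 + ["abd"] * 100 + ["abe"] * 100
counts = prefix_counts(data)

# Frequency gaps that a selectivity-aware embedding would aim to reflect.
gap_abc = abs(counts["ab"] - counts["abc"])
gap_abd = abs(counts["ab"] - counts["abd"])
```

Here `gap_abc` is smaller than `gap_abd`, which matches the intuition that 'ab' should sit closer to 'abc' than to 'abd' in a selectivity-aware embedding space.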


Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Applied Sciences, 2019, Vol 9 (18), pp. 3648. DOI: 10.3390/app9183648

Author(s): Casper S. Shikali, Zhou Sijie, Liu Qihe, Refuoe Mokhosi

Keyword(s): Language Processing, Critical Role, Language Model, Central Africa, Spoken Language, Language Models, Word Embeddings, Word Representation

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low-resource and widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-the-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.


Comparing gated and simple recurrent neural network architectures as models of human sentence processing

2018. DOI: 10.31234/osf.io/wec74

Author(s): Christoph Aurnhammer, Stefan L. Frank

Keyword(s): Language Processing, Sentence Processing, Language Model, Cell Types, Recurrent Network, Cognitive Models, Language Models, Model Quality, Sentence Reading, Human Sentence Processing

The Simple Recurrent Network (SRN) has a long tradition in cognitive models of language processing. More recently, gated recurrent networks have been proposed that often outperform the SRN on natural language processing tasks. Here, we investigate whether two types of gated networks perform better as cognitive models of sentence reading than SRNs, beyond their advantage as language models. This will reveal whether the filtering mechanism implemented in gated networks corresponds to an aspect of human sentence processing. We train a series of language models differing only in the cell types of their recurrent layers. We then compute word surprisal values for stimuli used in self-paced reading, eye-tracking, and electroencephalography experiments, and quantify the surprisal values' fit to experimental measures that indicate human sentence reading effort. While the gated networks provide better language models, they do not outperform their SRN counterpart as cognitive models when language model quality is equal across network types. Our results suggest that the different architectures are equally valid as models of human sentence processing.
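Word surprisal is the negative log-probability of a word given its preceding context. The Python sketch below computes it from a toy add-one-smoothed bigram model instead of the recurrent networks used in the study; the corpus, the smoothing scheme, the base-2 logarithm, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter, Set[str]]:
    """Collect history counts, bigram counts and the vocabulary."""
    history: Counter = Counter()
    bigram: Counter = Counter()
    vocab: Set[str] = set()
    for sent in sentences:
        toks = ["<s>"] + sent
        vocab.update(sent)
        for prev, word in zip(toks, toks[1:]):
            history[prev] += 1
            bigram[(prev, word)] += 1
    return history, bigram, vocab


def surprisal(prev: str, word: str, history: Counter, bigram: Counter,
              vocab: Set[str]) -> float:
    # Add-one smoothed conditional probability, then surprisal in bits.
    p = (bigram[(prev, word)] + 1) / (history[prev] + len(vocab))
    return -math.log2(p)


toy_corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"],
              ["the", "cat", "sleeps"]]
history, bigram, vocab = train_bigram(toy_corpus)
s_dog = surprisal("the", "dog", history, bigram, vocab)
s_cat = surprisal("the", "cat", history, bigram, vocab)
```

The less frequent continuation ("cat" after "the") receives the higher surprisal, which is the quantity the study relates to measures of human reading effort.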


WATS-SMS: A T5-Based French Wikipedia Abstractive Text Summarizer for SMS

Future Internet, 2021, Vol 13 (9), pp. 238. DOI: 10.3390/fi13090238

Author(s): Jean Louis Ebongue Kedieng Fendji, Désiré Manuel Taira, Marcellin Atemkeng, Adam Musa Ali

Keyword(s): Mobile Devices, Language Processing, Rural Areas, Text Summarization, Web Pages, A Value, Gsm Network, The Common, To Receive, Made In

Text summarization remains a challenging task in the natural language processing field despite the plethora of applications in enterprises and daily life. One of the common use cases is the summarization of web pages, which has the potential to provide an overview of web pages to devices with limited features. In fact, despite the increasing penetration rate of mobile devices in rural areas, the bulk of those devices offer limited features, and these areas are covered only by limited connectivity such as the GSM network. Summarizing web pages into SMS therefore becomes an important task to provide information to limited devices. This work introduces WATS-SMS, a T5-based French Wikipedia Abstractive Text Summarizer for SMS. It is built through a transfer learning approach. The T5 English pre-trained model is used to generate a French text summarization model by retraining the model on 25,000 Wikipedia pages, and the result is then compared with different approaches in the literature. The objective is twofold: (1) to check the assumption made in the literature that abstractive models provide better results compared to extractive ones; and (2) to evaluate the performance of our model compared to other existing abstractive models. A score based on ROUGE metrics gave us a value of 52% for articles with length up to 500 characters against 34.2% for transformer-ED and 12.7% for seq-2seq-attention; and a value of 77% for articles with larger size against 37% for transformers-DMCA. Moreover, an architecture including a software SMS-gateway has been developed to allow owners of mobile devices with limited features to send requests and to receive summaries through the GSM network.
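The abstract reports a score "based on ROUGE metrics" without naming the variant. As an illustration only, the Python sketch below computes ROUGE-1 (unigram overlap) precision, recall and F1 between a candidate summary and a reference; the choice of ROUGE-1, the whitespace tokenization, and the French sample sentences are assumptions, not the paper's evaluation setup.

```python
from collections import Counter
from typing import Dict


def rouge_1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall and F1 (ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical candidate and reference summaries.
scores = rouge_1("le chat dort sur le canapé", "le chat dort sur le tapis")
```

Five of the six words overlap in this example, so precision, recall and F1 all equal 5/6.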


HIDING CRITICAL INFORMATION WHEN TRAINING LANGUAGE MODELS

EurasianUnionScientists, 2021, pp. 15-18. DOI: 10.31618/esu.2413-9335.2021.1.86.1349

Author(s): A. Evtushenko

Keyword(s): Natural Language, Language Processing, Text Processing, Language Model, Personal Data, Language Models, Training Dataset, Critical Information, Research Company, Learning Language

Machine learning language models are combinations of algorithms and neural networks designed for processing text written in natural language (Natural Language Processing, NLP). In 2020, the artificial intelligence research company OpenAI released GPT-3, the largest language model at the time, with up to 175 billion parameters. Increasing the number of model parameters by more than 100 times made it possible to improve the quality of generated texts to a level that is hard to distinguish from human-written texts. It is noteworthy that this model was trained on a dataset collected mainly from open sources on the Internet, the volume of which is estimated at 570 GB. This article discusses the problem of memorizing critical information, in particular personal data of individuals, at the stage of training large language models (GPT-2/3 and derivatives). It also describes an algorithmic approach to solving this problem, which consists of additional preprocessing of the training dataset and refinement of the model inference in the context of generating pseudo-personal data and embedding it into the results of work on the tasks of summarization, text generation, formation of answers to questions and others from the field of seq2seq.


USING GRAPHEME n-GRAMS IN SPELLING CORRECTION AND AUGMENTATIVE TYPING SYSTEMS

New Mathematics and Natural Computation, 2008, Vol 04 (01), pp. 87-106. DOI: 10.1142/s1793005708000970

Author(s): ALKET MEMUSHAJ, TAREK M. SOBH

Keyword(s): Natural Language Processing, Language Processing, Computational Efficiency, Probabilistic Models, Language Modeling, Language Models, Text Corpora, N Gram, Changes Over Time, Over Time

Probabilistic language models have gained popularity in Natural Language Processing due to their ability to successfully capture language structures and constraints with computational efficiency. Probabilistic language models are flexible and easily adapted to language changes over time as well as to some new languages. Probabilistic language models can be trained, and their accuracy is strongly related to the availability of large text corpora. In this paper, we investigate the usability of grapheme probabilistic models, specifically grapheme n-gram models, in spellchecking as well as augmentative typing systems. Grapheme n-gram models require substantially smaller training corpora, and that is one of the main drivers for this thesis, in which we build grapheme n-gram language models for the Albanian language. There are presently no available Albanian language corpora to be used for probabilistic language modeling. Our technique attempts to augment spellchecking and typing systems by utilizing grapheme n-gram language models to improve suggestion accuracy. Our technique can be implemented in a standalone tool or incorporated in another tool to offer additional selection/scoring criteria.
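One way a grapheme n-gram model can supply an extra scoring criterion for spelling suggestions is sketched below in Python. This is not the authors' system: the trigram order, add-one smoothing, length normalization, the padding symbols, the tiny training word list, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Model = Tuple[Counter, Counter, Set[str]]


def train_grapheme_model(words: List[str], n: int = 3) -> Model:
    """Count character n-grams and their (n-1)-character contexts."""
    ngram: Counter = Counter()
    context: Counter = Counter()
    alphabet: Set[str] = set()
    for word in words:
        alphabet.update(word + "$")          # '$' marks the end of a word
        padded = "^" * (n - 1) + word + "$"  # '^' pads the start
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngram[gram] += 1
            context[gram[:-1]] += 1
    return ngram, context, alphabet


def log_score(word: str, model: Model, n: int = 3) -> float:
    """Average add-one smoothed log-probability per grapheme."""
    ngram, context, alphabet = model
    padded = "^" * (n - 1) + word + "$"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    total = sum(
        math.log((ngram[g] + 1) / (context[g[:-1]] + len(alphabet)))
        for g in grams
    )
    return total / len(grams)


def rank_suggestions(candidates: List[str], model: Model) -> List[str]:
    # Most plausible grapheme sequences first.
    return sorted(candidates, key=lambda w: log_score(w, model), reverse=True)


# Tiny illustrative word list (assumption, not a real corpus).
model = train_grapheme_model(["shtëpi", "shkollë", "shumë", "shok", "shqip"])
ranked = rank_suggestions(["xqzëpi", "shtëpi"], model)
```

A spellchecker could combine such a score with edit distance or word frequency when ordering its suggestions.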


Half-Context Language Models

Computational Linguistics, 2011, Vol 37 (4), pp. 843-865. DOI: 10.1162/coli_a_00078

Author(s): Hinrich Schütze, Michael Walsh

Keyword(s): Clustering Algorithm, Language Model, Model Performance, Language Models, Fine Grained, Specific Analysis, Distributional Information, N Gram, Context Specific, Context Models

This article investigates the effects of different degrees of contextual granularity on language model performance. It presents a new language model that combines clustering and half-contextualization, a novel representation of contexts. Half-contextualization is based on the half-context hypothesis that states that the distributional characteristics of a word or bigram are best represented by treating its context distribution to the left and right separately and that only directionally relevant distributional information should be used. Clustering is achieved using a new clustering algorithm for class-based language models that compares favorably to the exchange algorithm. When interpolated with a Kneser-Ney model, half-context models are shown to have better perplexity than commonly used interpolated n-gram models and traditional class-based approaches. A novel, fine-grained, context-specific analysis highlights those contexts in which the model performs well and those which are better treated by existing non-class-based models.
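The core idea of treating left and right contexts separately can be shown with a small Python sketch that collects, for each word, one distribution over its left neighbours and another over its right neighbours. It does not implement the article's clustering algorithm or language model; the function name, the one-word context window, and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple

Contexts = DefaultDict[str, Counter]


def half_contexts(sentences: List[List[str]]) -> Tuple[Contexts, Contexts]:
    """Collect separate left-neighbour and right-neighbour counts per word."""
    left: Contexts = defaultdict(Counter)
    right: Contexts = defaultdict(Counter)
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(toks) - 1):
            left[toks[i]][toks[i - 1]] += 1
            right[toks[i]][toks[i + 1]] += 1
    return left, right


toy = [["the", "dog", "barks"], ["a", "dog", "sleeps"],
       ["the", "cat", "sleeps"]]
left, right = half_contexts(toy)
```

In this toy data "barks" and "sleeps" share a left neighbour ("dog") and have identical right contexts (sentence end), while "dog" and "cat" overlap only partly on each side. Keeping the two directions apart lets a clustering step group words by one side without being affected by differences on the other.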
