Large-Scale Distributed Language Modeling

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.

Download Full-text

Social (distributed) language modeling, clustering and dialectometry

Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing - TextGraphs-4 ◽

10.3115/1708124.1708126 ◽

2009 ◽

Cited By ~ 2

Author(s):

David Ellis

Keyword(s):

Language Modeling ◽

Distributed Language

Download Full-text

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) ◽

10.1109/cahpc.2018.8645935 ◽

2018 ◽

Cited By ~ 5

Author(s):

Raul Puri ◽

Robert Kirby ◽

Nikolai Yakovenko ◽

Bryan Catanzaro

Keyword(s):

Large Scale ◽

Language Modeling

Download Full-text

VetTag: improving automated veterinary diagnosis coding via large-scale language modeling

npj Digital Medicine ◽

10.1038/s41746-019-0113-1 ◽

2019 ◽

Vol 2 (1) ◽

Cited By ~ 2

Author(s):

Yuhui Zhang ◽

Allen Nie ◽

Ashley Zehnder ◽

Rodney L. Page ◽

James Zou

Keyword(s):

Large Scale ◽

Language Modeling ◽

Diagnosis Coding

Download Full-text

Automatic Topic Identification for Large Scale Language Modeling Data Filtering

Text, Speech and Dialogue - Lecture Notes in Computer Science ◽

10.1007/978-3-642-23538-2_9 ◽

2011 ◽

pp. 64-71 ◽

Cited By ~ 11

Author(s):

Lucie Skorkovská ◽

Pavel Ircing ◽

Aleš Pražák ◽

Jan Lehečka

Keyword(s):

Large Scale ◽

Language Modeling ◽

Data Filtering ◽

Topic Identification ◽

Modeling Data

Download Full-text

KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00360 ◽

2021 ◽

Vol 9 ◽

pp. 176-194

Author(s):

Xiaozhi Wang ◽

Tianyu Gao ◽

Zhaocheng Zhu ◽

Zhengyan Zhang ◽

Zhiyuan Liu ◽

...

Keyword(s):

Link Prediction ◽

Large Scale ◽

State Of The Art ◽

Source Code ◽

Language Modeling ◽

Unified Model ◽

Factual Knowledge ◽

Textual Information ◽

Language Representation ◽

Knowledge Graphs

Abstract Pre-trained language representation models (PLMs) cannot well capture factual knowledge from text. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagERepresentation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with the strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performances on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Furthermore, for pre-training and evaluating KEPLER, we construct Wikidata5M1 , a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate the research on large KG, inductive KE, and KG with text. The source code can be obtained from https://github.com/THU-KEG/KEPLER.

Download Full-text

Large-Scale Language Modeling with Random Forests for Mandarin Chinese Speech-to-Text

Advances in Natural Language Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-642-14770-8_31 ◽

2010 ◽

pp. 269-280 ◽

Cited By ~ 1

Author(s):

Ilya Oparin ◽

Lori Lamel ◽

Jean-Luc Gauvain

Keyword(s):

Mandarin Chinese ◽

Random Forests ◽

Large Scale ◽

Language Modeling

Download Full-text

Distributed language modeling forN-best list re-ranking

Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06 ◽

10.3115/1610075.1610108 ◽

2006 ◽

Cited By ~ 16

Author(s):

Ying Zhang ◽

Almut Silja Hildebrand ◽

Stephan Vogel

Keyword(s):

Language Modeling ◽

Distributed Language

Download Full-text

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6795 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11336-11344 ◽

Cited By ~ 1

Author(s):

Gen Li ◽

Nan Duan ◽

Yuejian Fang ◽

Ming Gong ◽

Daxin Jiang

Keyword(s):

Large Scale ◽

State Of The Art ◽

Language Modeling ◽

Context Aware ◽

Commonsense Reasoning ◽

Output Layer ◽

The Cross ◽

Cross Lingual ◽

Vision And Language ◽

Image Caption

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrow ideas from cross-lingual pre-trained models, such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for the cross-modal pre-training, where three pre-trained tasks are employed, including Masked Language Modeling(MLM), Masked Object Classification(MOC) and Visual-linguistic Matching(VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both two tasks and show the powerful ability of the cross-modal pre-training.

Download Full-text

Application of Lemmatization and Summarization Methods in Topic Identification Module for Large Scale Language Modeling Data Filtering

Text, Speech and Dialogue - Lecture Notes in Computer Science ◽

10.1007/978-3-642-32790-2_23 ◽

2012 ◽

pp. 191-198 ◽

Cited By ~ 3

Author(s):

Lucie Skorkovská

Keyword(s):

Large Scale ◽

Language Modeling ◽

Data Filtering ◽

Topic Identification ◽

Modeling Data

Download Full-text