Language Modeling with Reduced Densities

Compositionality ◽

10.32408/compositionality-3-4 ◽

2021 ◽

Vol 3 ◽

pp. 4

Author(s):

Tai-Danae Bradley ◽

Yiannis Vlassopoulos

Keyword(s):

Mathematical Structure ◽

Positive Semidefinite ◽

Fundamental Question ◽

Language Models ◽

Finite Alphabet ◽

Text Data ◽

Enriched Category ◽

Unstructured Text ◽

Statistical Language Models ◽

Categorical Structure

This work originates from the observation that today's state-of-the-art statistical language models are impressive not only for their performance, but also, and quite crucially, because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.

Statistical Language Models for Information Retrieval: A Critical Review

10.1561/9781601981875 ◽

2007 ◽

Cited By ~ 4

Author(s):

ChengXiang Zhai

Keyword(s):

Information Retrieval ◽

Critical Review ◽

Language Models ◽

Statistical Language Models

Predicting reading difficulty with statistical language models

Journal of the American Society for Information Science and Technology ◽

10.1002/asi.20243 ◽

2005 ◽

Vol 56 (13) ◽

pp. 1448-1462 ◽

Cited By ~ 63

Author(s):

Kevyn Collins-Thompson ◽

Jamie Callan

Keyword(s):

Language Models ◽

Reading Difficulty ◽

Statistical Language Models

Analysis of unstructured text data for a person social profile

Proceedings of the International Conference on Electronic Governance and Open Society Challenges in Eurasia - eGose '17 ◽

10.1145/3129757.3129758 ◽

2017 ◽

Cited By ~ 1

Author(s):

Alexey Y. Timonin ◽

Alexander S. Bozhday ◽

Alexander M. Bershadsky

Keyword(s):

Text Data ◽

Unstructured Text

Evaluating Commonsense in Pre-Trained Language Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6523 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9733-9740 ◽

Cited By ~ 1

Author(s):

Xuhui Zhou ◽

Yue Zhang ◽

Leyang Cui ◽

Dandan Huang

Keyword(s):

Reading Comprehension ◽

Question Answering ◽

Deep Level ◽

Language Models ◽

Future Research ◽

Correct Prediction ◽

Test Cases ◽

Word Sense ◽

Training Set ◽

Text Data

Contextualized representations trained over large raw text data have given remarkable improvements for NLP tasks including question answering and reading comprehension. Prior work has shown that syntactic, semantic and word sense knowledge is contained in such representations, which explains why they benefit such tasks. However, relatively little work has investigated the commonsense knowledge contained in contextualized representations, which is crucial for human question answering and reading comprehension. We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models' commonsense ability, while bi-directional context and a larger training set are bonuses. We additionally find that current models do poorly on tasks that require more inference steps. Finally, we test the robustness of models by making dual test cases, which are correlated so that the correct prediction of one sample should lead to the correct prediction of the other. Interestingly, the models show confusion on these test cases, which suggests that they learn commonsense at the surface rather than the deep level. We publicly release a test set, named CATs, for future research.

Generalized algorithms for constructing statistical language models

10.3115/1075096.1075102 ◽

2003 ◽

Cited By ~ 37

Author(s):

Cyril Allauzen ◽

Mehryar Mohri ◽

Brian Roark

Keyword(s):

Language Models ◽

Statistical Language Models

Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

Informatica ◽

10.15388/informatica.2004.079 ◽

2004 ◽

Vol 15 (4) ◽

pp. 565-580 ◽

Cited By ~ 1

Author(s):

Airenas Vaičiūnas ◽

Vytautas Kaminskas ◽

Gailius Raškinis

Keyword(s):

Language Models ◽

Morphological Decomposition ◽

Statistical Language Models ◽

Word Clustering

Huffman and Linear Scanning Methods with Statistical Language Models

Augmentative and Alternative Communication ◽

10.3109/07434618.2014.997890 ◽

2015 ◽

Vol 31 (1) ◽

pp. 37-50 ◽

Cited By ~ 8

Author(s):

Brian Roark ◽

Melanie Fried-Oken ◽

Chris Gibbons

Keyword(s):

Language Models ◽

Statistical Language Models ◽

Linear Scanning

Dynamic Web log session identification with statistical language models

Journal of the American Society for Information Science and Technology ◽

10.1002/asi.20084 ◽

2004 ◽

Vol 55 (14) ◽

pp. 1290-1303 ◽

Cited By ~ 37

Author(s):

Xiangji Huang ◽

Fuchun Peng ◽

Aijun An ◽

Dale Schuurmans

Keyword(s):

Language Models ◽

Web Log ◽

Dynamic Web ◽

Statistical Language Models

An Extended System for Labeling Graphical Documents Using Statistical Language Models

Lecture Notes in Computer Science - Graphics Recognition. Ten Years Review and Future Perspectives ◽

10.1007/11767978_6 ◽

2006 ◽

pp. 61-75

Author(s):

Andrew O’Sullivan ◽

Laura Keyes ◽

Adam Winstanley

Keyword(s):

Language Models ◽

Extended System ◽

Statistical Language Models

Incorporating Text OLAP in Business Intelligence

Business Intelligence Applications and the Web - Advances in Business Information Systems and Analytics ◽

10.4018/978-1-61350-038-5.ch004 ◽

2011 ◽

pp. 77-101 ◽

Cited By ~ 1

Author(s):

Byung-Kwon Park ◽

Il-Yeol Song

Keyword(s):

Information Retrieval ◽

Text Mining ◽

Business Intelligence ◽

Multidimensional Analysis ◽

Web Pages ◽

Data Types ◽

Text Documents ◽

Text Data ◽

Platform Architecture ◽

Unstructured Text

As the amount of data inside and outside an enterprise grows rapidly, it is becoming important to analyze that data seamlessly for total business intelligence. The data can be classified into two categories: structured and unstructured. To obtain total business intelligence, both must be analyzed together. In particular, because most business data are unstructured text documents, including Web pages on the Internet, we need a Text OLAP solution to perform multidimensional analysis of text documents in the same way as for structured relational data. We first survey representative works that demonstrate how text mining and information retrieval, the major technologies for handling text data, can be applied to multidimensional analysis of text documents. We then survey representative works that demonstrate how to associate and consolidate unstructured text documents and structured relational data to obtain total business intelligence. Finally, we present a future business intelligence platform architecture as well as related research topics. We expect that the proposed heterogeneous business intelligence architecture, which integrates information retrieval, text mining, and information extraction technologies together with relational OLAP technologies, will provide a better platform for total business intelligence.