A Refutation of Finite-State Language Models through Zipf’s Law for Factual Knowledge

We present a hypothetical argument against finite-state processes in statistical language modeling that is based on semantics rather than syntax. In this theoretical model, we suppose that the semantic properties of texts in a natural language could be approximately captured by a recently introduced concept of a perigraphic process. Perigraphic processes are a class of stochastic processes that satisfy a Zipf-law accumulation of a subset of factual knowledge, which is time-independent, compressed, and effectively inferrable from the process. We show that the classes of finite-state processes and of perigraphic processes are disjoint, and we present a new simple example of perigraphic processes over a finite alphabet called Oracle processes. The disjointness result makes use of the Hilberg condition, i.e., the almost sure power-law growth of algorithmic mutual information. Using a strongly consistent estimator of the number of hidden states, we show that finite-state processes do not satisfy the Hilberg condition whereas Oracle processes satisfy the Hilberg condition via the data-processing inequality. We discuss the relevance of these mathematical results for theoretical and computational linguistics.
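
The contrast between bounded and power-law growth of mutual information can be illustrated with a Shannon-entropy analogue. This is only a sketch under stated assumptions, not the paper's proof: the abstract works with algorithmic mutual information and an estimator of the number of hidden states, whereas the symbols below ($X_i$ for the observed symbols, $Y_i$ for a hidden state taking at most $k$ values, $I$ and $H$ for Shannon mutual information and entropy, $\beta$ for a growth exponent) are introduced here purely for illustration.

```latex
% Assumption (not from the source): (X_i) is emitted by a hidden Markov chain (Y_i)
% with at most k states, each X_i depending only on Y_i, so that
% X_1^n -> Y_{n+1} -> X_{n+1}^{2n} forms a Markov chain.
\begin{align*}
I\bigl(X_1^n ; X_{n+1}^{2n}\bigr)
  &\le I\bigl(X_1^n ; Y_{n+1}\bigr) && \text{(data-processing inequality)} \\
  &\le H\bigl(Y_{n+1}\bigr) \\
  &\le \log k .
\end{align*}
% A Hilberg-type condition instead asks for growth on the order of n^{\beta}
% for some \beta > 0, which a bound that is constant in n cannot accommodate.
```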

Morphological Theory and Computational Linguistics

The Oxford Handbook of Morphological Theory ◽

10.1093/oxfordhb/9780199668984.013.32 ◽

2018 ◽

pp. 572-593

Author(s):

Vito Pirrelli

Keyword(s):

Computational Linguistics ◽

Word Processing ◽

Computational Models ◽

Lexical Processing ◽

Theoretical Models ◽

Language Models ◽

Test Bed ◽

Theoretical Frameworks ◽

Finite State ◽

Morphological Theory

The chapter provides a computer-based, algorithmic view of issues of lexical processing, ranging from the encoding of input data to the structure of output representations, going through the basic operations of word splitting, storage, access, retrieval, and assembly of intermediate representations. By illustrating the contribution of different computational frameworks (such as finite state automata, hierarchical lexica, artificial neural networks, and statistical language models) to our understanding of aspects of lexical organization, the chapter discusses the implications of theoretical models of morphology for computational models of word processing, as well as the implications of computer models for theoretical issues. In this perspective, much current work in computational morphology not only provides a challenging test bed for box-and-arrow models of lexical knowledge but also promises to bridge the persisting gap between theoretical frameworks and behaviourally oriented research in lexical modelling.

Probabilistic distances between finite-state finite-alphabet hidden Markov models

IEEE Transactions on Automatic Control ◽

10.1109/tac.2005.844896 ◽

2005 ◽

Vol 50 (4) ◽

pp. 505-511 ◽

Cited By ~ 26

Author(s):

Li Xie ◽

V.A. Ugrinovskii ◽

I.R. Petersen

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Finite Alphabet ◽

Probabilistic Distances ◽

Finite State

Finite-State Technology

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.39 ◽

2018 ◽

Author(s):

Mans Hulden

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Computational Linguistics ◽

Language Processing ◽

Finite State Machines ◽

Regular Languages ◽

Finite State Automata ◽

State Machines ◽

Computational Phonology ◽

Finite State

Finite-state machines—automata and transducers—are ubiquitous in natural-language processing and computational linguistics. This chapter introduces the fundamentals of finite-state automata and transducers, both probabilistic and non-probabilistic, illustrating the technology with example applications and common usage. It also covers the construction of transducers, which correspond to regular relations, and automata, which correspond to regular languages. The technologies introduced are widely employed in natural language processing, computational phonology and morphology in particular, and this is illustrated through common practical use cases.
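
The distinction drawn above between automata (which accept or reject strings of a regular language) and transducers (which map input strings to output strings, realising a regular relation) can be sketched in a few lines of Python. Everything in this sketch is a hypothetical toy chosen for illustration, not an example taken from the chapter: the alphabet `{a, b}`, the language "strings ending in ab", the rewrite rule "a becomes b after b", and the names `accepts` and `transduce`.

```python
from typing import Dict, Tuple

# Hypothetical toy automaton: accepts strings over {a, b} that end in "ab".
# State 0: no progress, state 1: last symbol was "a", state 2: last two were "ab".
DFA_TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
DFA_START = 0
DFA_FINAL = {2}


def accepts(s: str) -> bool:
    """Run the deterministic automaton and report whether it ends in a final state."""
    state = DFA_START
    for ch in s:
        key = (state, ch)
        if key not in DFA_TRANSITIONS:
            return False  # symbol outside the toy alphabet
        state = DFA_TRANSITIONS[key]
    return state in DFA_FINAL


# Hypothetical toy transducer: copies its input, except that an "a" directly
# preceded by an input "b" is rewritten as "b". Each transition carries
# (next_state, output_symbol). State 1 means "previous input symbol was b".
FST_TRANSITIONS: Dict[Tuple[int, str], Tuple[int, str]] = {
    (0, "a"): (0, "a"), (0, "b"): (1, "b"),
    (1, "a"): (0, "b"), (1, "b"): (1, "b"),
}


def transduce(s: str) -> str:
    """Map an input string to its output; every state is treated as final.

    Raises KeyError for symbols outside the toy alphabet {a, b}.
    """
    state = 0
    output = []
    for ch in s:
        state, out_symbol = FST_TRANSITIONS[(state, ch)]
        output.append(out_symbol)
    return "".join(output)
```

Storing transitions in a dictionary keyed by `(state, symbol)` keeps both machines deterministic; practical toolkits additionally support non-determinism, weights, and operations such as composition, which this sketch leaves out.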

Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study (Preprint)

10.2196/preprints.15371 ◽

2019 ◽

Author(s):

Derek Howard ◽

Marta M Maslej ◽

Justin Lee ◽

Jacob Ritchie ◽

Geoffrey Woollard ◽

...

Keyword(s):

Mental Health ◽

Machine Learning ◽

Social Media ◽

Transfer Learning ◽

Computational Linguistics ◽

Feature Representation ◽

Fine Tuning ◽

Language Models ◽

Universal Sentence ◽

Text Feature

BACKGROUND: Mental illness affects a significant portion of the worldwide population. Online mental health forums can provide a supportive environment for those afflicted and also generate a large amount of data that can be mined to predict mental health states using machine learning methods.

OBJECTIVE: This study aimed to benchmark multiple methods of text feature representation for social media posts and compare their downstream use with automated machine learning (AutoML) tools. We tested on datasets that contain posts labeled for perceived suicide risk or moderator attention in the context of self-harm. Specifically, we assessed the ability of the methods to prioritize posts that a moderator would identify for immediate response.

METHODS: We used 1588 labeled posts from the Computational Linguistics and Clinical Psychology (CLPsych) 2017 shared task collected from the Reachout.com forum. Posts were represented using lexicon-based tools, including Valence Aware Dictionary and sEntiment Reasoner, Empath, and Linguistic Inquiry and Word Count, and also using pretrained artificial neural network models, including DeepMoji, Universal Sentence Encoder, and Generative Pretrained Transformer-1 (GPT-1). We used Tree-based Optimization Tool and Auto-Sklearn as AutoML tools to generate classifiers to triage the posts.

RESULTS: The top-performing system used features derived from the GPT-1 model, which was fine-tuned on over 150,000 unlabeled posts from Reachout.com. Our top system had a macroaveraged F1 score of 0.572, providing a new state-of-the-art result on the CLPsych 2017 task. This was achieved without additional information from metadata or preceding posts. Error analyses revealed that this top system often misses expressions of hopelessness. In addition, we have presented visualizations that aid in the understanding of the learned classifiers.

CONCLUSIONS: In this study, we found that transfer learning is an effective strategy for predicting risk with relatively little labeled data and noted that fine-tuning of pretrained language models provides further gains when large amounts of unlabeled text are available.
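
The headline figure in this abstract is a macroaveraged F1 score. As a minimal sketch of what that metric measures, the function below computes per-label F1 and takes the unweighted mean over all labels that occur in either the gold or the predicted sequence. The function name `macro_f1` is hypothetical, and the exact averaging convention used by the shared task (for example, which labels enter the average) is not stated in the abstract and may differ from this generic version.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because
        # every label considered occurs at least once in y_true or y_pred.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)
```

Because every label contributes equally regardless of how many posts carry it, macro-averaging rewards systems that also perform well on rare labels, which is typically what matters in a triage setting.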

Reconsidering the Past: Optimizing Hidden States in Language Models

10.18653/v1/2021.findings-emnlp.346 ◽

2021 ◽

Author(s):

Davis Yoshida ◽

Kevin Gimpel

Keyword(s):

Language Models ◽

The Past ◽

Hidden States

Joint optimization on decoding graphs using minimum classification error criterion

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2014.5 ◽

2014 ◽

Vol 3 ◽

Author(s):

Abdelaziz A. Abdelhamid ◽

Waleed H. Abdulla

Keyword(s):

Likelihood Estimation ◽

Discriminative Training ◽

Language Models ◽

Classification Error ◽

Error Criterion ◽

Speech Corpora ◽

Finite State ◽

Minimum Classification Error ◽

Speech Features ◽

Weighted Finite State Transducers

Motivated by the inherent correlation between speech features and their lexical words, we propose in this paper a new framework for jointly learning the parameters of the corresponding acoustic and language models. The proposed framework is based on discriminative training of the models' parameters using the minimum classification error criterion. To verify the effectiveness of the proposed framework, a set of four large decoding graphs is constructed using weighted finite-state transducers as a composition of two sets of context-dependent acoustic models and two sets of n-gram-based language models. Experiments on this set of decoding graphs validated the effectiveness of the proposed framework when compared with four baseline systems based on maximum likelihood estimation and separate discriminative training of acoustic and language models, in benchmark testing on two speech corpora, namely TIMIT and RM1.

Automatic training of stochastic finite-state language models for speech understanding

[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing ◽

10.1109/icassp.1992.225944 ◽

1992 ◽

Cited By ~ 3

Author(s):

E.P. Giachin

Keyword(s):

Language Models ◽

Speech Understanding ◽

Finite State ◽

State Language

Rate distortion functions for finite-state finite-alphabet Markov sources

IEEE Transactions on Information Theory ◽

10.1109/tit.1971.1054604 ◽

1971 ◽

Vol 17 (2) ◽

pp. 127-134 ◽

Cited By ~ 38

Author(s):

R. Gray

Keyword(s):

Rate Distortion ◽

Finite Alphabet ◽

Distortion Functions ◽

Finite State ◽

Markov Sources

A Biscriptual Morphological Transducer for Crimean Tatar

10.33011/computel.v1i.423 ◽

2019 ◽

Author(s):

Francis M. Tyers ◽

Jonathan N. Washington ◽

Darya Kavitskaya ◽

Memduh Gökırmak

Keyword(s):

Computational Linguistics ◽

Morphological Analysis ◽

State Of The Art ◽

Full Range ◽

The State ◽

Loan Words ◽

The Core ◽

Finite State ◽

Morphological Modelling ◽

Crimean Tatar

This paper describes a weighted finite-state morphological transducer for Crimean Tatar able to analyse and generate in both Latin and Cyrillic orthographies. This transducer was developed by a team including a community member and language expert, a field linguist who works with the community, a Turkologist with computational linguistics expertise, and an experienced computational linguist with Turkic expertise. Dealing with two orthographic systems in the same transducer is challenging as they employ different strategies to deal with the spelling of loan words and encode the full range of the language's phonemes and their interaction. We develop the core transducer using the Latin orthography and then design a separate transliteration transducer to map the surface forms to Cyrillic. To help control the non-determinism in the orthographic mapping, we use weights to prioritise forms seen in the corpus. We perform an evaluation of all components of the system, finding an accuracy above 90% for morphological analysis and near 90% for orthographic conversion. This constitutes the state of the art for Crimean Tatar morphological modelling, and, to our knowledge, is the first single biscriptual morphological transducer for any language.

MorphoBr: an open source large-coverage full-form lexicon for morphological analysis of Portuguese

Texto Livre Linguagem e Tecnologia ◽

10.17851/1983-3652.11.3.1-25 ◽

2018 ◽

Vol 11 (3) ◽

pp. 1-25

Author(s):

Leonel Figueiredo de Alencar ◽

Bruno Cuconato ◽

Alexandre Rademaker

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Open Source ◽

Computational Linguistics ◽

Language Processing ◽

Morphological Analysis ◽

Computational Techniques ◽

Processing Technologies ◽

Finite State ◽

Full Form

ABSTRACT: One of the prerequisites for many natural language processing technologies is the availability of large lexical resources. This paper reports on MorphoBr, an ongoing project aiming at building a comprehensive full-form lexicon for morphological analysis of Portuguese. A first version of the resource is already freely available online under an open source, free software license. MorphoBr combines analogous free resources, correcting several thousand errors and gaps, and systematically adding new entries. In comparison to the integrated resources, lexical entries in MorphoBr follow a more user-friendly format, which can be straightforwardly compiled into finite-state transducers for morphological analysis, e.g. in the context of syntactic parsing with a grammar in the LFG formalism using the XLE system. MorphoBr results from a combination of computational techniques. Errors and the more obvious gaps in the integrated resources were automatically corrected with scripts. However, MorphoBr's main contribution is the expansion in the inventory of nouns and adjectives. This was carried out by systematically modeling diminutive formation in the paradigm of finite-state morphology. This allowed MorphoBr to significantly outperform analogous resources in the coverage of diminutives. The first evaluation results show MorphoBr to be a promising initiative which will directly contribute to the development of more robust natural language processing tools and applications which depend on wide-coverage morphological analysis.

KEYWORDS: computational linguistics; natural language processing; morphological analysis; full-form lexicon; diminutive formation.
