Cumulative Frequency Can Explain Cognate Facilitation in Language Models

2021
Author(s): Irene Elisabeth Winther, Yevgen Matusevych, Martin John Pickering

Cognates – words which share form and meaning across two languages – have been extensively studied to understand the bilingual mental lexicon. One consistent finding is that bilingual speakers process cognates faster than non-cognates, an effect known as cognate facilitation. Yet, there is no agreement on the underlying factors driving this effect. In this paper, we use computational modeling to test whether the effect can be explained by the cumulative frequency hypothesis. We train a computational language model on two language pairs (Dutch–English, Norwegian–English) under different conditions of input presentation and test it on sentence stimuli from two existing studies with bilingual speakers of those languages. We find that our model can exhibit a cognate effect, lending support to the cumulative frequency hypothesis. Further analyses reveal that the size of the effect in the model depends on its linguistic accuracy. We interpret our results within the literature on cognate processing.
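As a toy illustration of the cumulative frequency hypothesis (not the authors' implementation), the sketch below sums a word form's frequency across both languages of an invented Dutch–English corpus, so a cognate such as "winter" receives a higher effective frequency than either monolingual count alone; the word lists, counts, and scoring rule are all placeholders.

```python
# Toy illustration of the cumulative frequency hypothesis: a cognate's
# effective frequency is the sum of its per-language frequencies, because
# the same form occurs in both languages. All counts below are invented.

freq_dutch = {"winter": 120, "huis": 300, "fiets": 150}
freq_english = {"winter": 400, "house": 350, "bicycle": 90}

cognates = {"winter"}  # same form and meaning in both languages

def effective_frequency(word: str) -> int:
    """Cumulative frequency: sum counts across languages for shared forms."""
    if word in cognates:
        return freq_dutch.get(word, 0) + freq_english.get(word, 0)
    return max(freq_dutch.get(word, 0), freq_english.get(word, 0))

for w in ["winter", "house", "huis"]:
    print(w, effective_frequency(w))
# A higher effective frequency would predict faster processing (facilitation).
```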

Author(s): Roman Bertolami, Horst Bunke

Current multiple classifier systems for unconstrained handwritten text recognition do not provide a straightforward way to utilize language model information. In this paper, we describe a generic method to integrate a statistical n-gram language model into the combination of multiple offline handwritten text line recognizers. The proposed method first builds a word transition network and then rescores this network with an n-gram language model. Experimental evaluation conducted on a large dataset of offline handwritten text lines shows that the proposed approach improves the recognition accuracy over a reference system as well as over the original combination method that does not include a language model.
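A minimal sketch of the general idea of rescoring a word transition network with a bigram language model (the paper's actual network construction and score combination are not reproduced here): the lattice, recognizer scores, bigram probabilities, and the interpolation weight LM_WEIGHT are all invented for illustration.

```python
# Rescore a small word transition network with a bigram LM (toy example).
# Each position holds candidate words with combined recognizer log-scores.
lattice = [
    {"the": -0.2, "she": -1.1},
    {"cat": -0.5, "cut": -0.6},
    {"sat": -0.4, "set": -0.9},
]

# Invented bigram log-probabilities; unseen bigrams get a floor value.
bigram_logp = {("<s>", "the"): -0.3, ("the", "cat"): -0.4,
               ("cat", "sat"): -0.5, ("the", "cut"): -2.0,
               ("she", "cut"): -1.0, ("cut", "set"): -1.5}
FLOOR = -5.0
LM_WEIGHT = 0.8  # hypothetical interpolation weight

def rescore(lattice):
    """Viterbi search over the word transition network, adding LM scores."""
    beams = {"<s>": (0.0, [])}
    for position in lattice:
        new_beams = {}
        for prev, (score, path) in beams.items():
            for word, rec_score in position.items():
                lm = bigram_logp.get((prev, word), FLOOR)
                total = score + rec_score + LM_WEIGHT * lm
                if word not in new_beams or total > new_beams[word][0]:
                    new_beams[word] = (total, path + [word])
        beams = new_beams
    return max(beams.values())

score, words = rescore(lattice)
print(" ".join(words), round(score, 2))
```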


AI, 2021, Vol 2 (1), pp. 1-16
Author(s): Juan Cruz-Benito, Sanjay Vishwakarma, Francisco Martin-Fernandez, Ismael Faro

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like auto-completing, generating, fixing, or evaluating code written by humans. Given the increasing popularity of deep-learning-enabled language models, we found a lack of empirical papers that compare different deep learning architectures for creating and using language models based on programming code. This paper compares different neural network architectures, namely Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD Quasi-Recurrent Neural Networks (QRNNs), and Transformers, using transfer learning and different forms of tokenization, to see how they behave when building language models from a Python dataset for code-generation and fill-mask tasks. Based on the results, we discuss the strengths and weaknesses of each approach and the gaps we found in evaluating the language models and applying them in a real programming context.
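As a small illustration of the kind of tokenization choices such a comparison involves (not the paper's pipeline), the snippet below tokenizes the same line of Python code at the character, whitespace-word, and Python-token level, and shows a masked position that a model would be asked to fill; the masking scheme is invented.

```python
import io
import tokenize

code = "def add(a, b):\n    return a + b\n"

# Three tokenization granularities a code language model might use.
char_tokens = list(code)
word_tokens = code.split()
py_tokens = [tok.string
             for tok in tokenize.generate_tokens(io.StringIO(code).readline)
             if tok.string.strip()]

print("chars:", len(char_tokens), "words:", word_tokens, "python tokens:", py_tokens)

# A fill-mask example: hide one token and ask the model to recover it.
masked = py_tokens.copy()
masked[masked.index("return")] = "<mask>"
print("masked sequence:", masked)  # the model should predict "return"
```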


Author(s): Nur Zalikha Mat Radzi, Nasirin Abdillah, Daeng Haliza Daeng Jamal

Hatimu Aisyah is a work by Zurinah Hassan, the 13th National Laureate and recipient of the Southeast Asian Writers Award (SEA Write Award) in 2004. Her string of successes has made her work a focus for researchers examining women's authorship. Hatimu Aisyah is the first novel produced by Zurinah Hassan; it emphasizes the customary practices of earlier times as they are overtaken by modernization, and it foregrounds women who put custom first in the context of communal life. This study of Zurinah Hassan's work draws on Elaine Showalter's model of women's language, from a gynocritical perspective, to examine the female characters. The discussion focuses on symbolic language and on language as an expression of women's consciousness. The overall findings show that Zurinah Hassan uses language that fits Showalter's notion of women's language, though in a somewhat muted form, owing to constraints on language use that follow the sociocultural norms of Malay society. The study's findings on the model of women's language are visible in symbolic language and in language as an expression of women's consciousness. Looking ahead, the study suggests that women express protest and criticism through the patterns of their writing, even while doing so in a controlled manner.


2020, Vol 14 (4), pp. 471-484
Author(s): Suraj Shetiya, Saravanan Thirumuruganathan, Nick Koudas, Gautam Das

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries, followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates, resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing, such as embeddings, to encode the strings and use them to train a model. While this improves over traditional approaches, there remains substantial room for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc', and 'abd' whose prefix frequencies are 1000, 800, and 100, respectively. Our approach ensures that the embedding of 'ab' is closer to that of 'abc' than to that of 'abd'. Second, we describe how neural language models can be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it can be used for estimating the selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
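A minimal sketch of the selectivity-aware embedding idea described above (not Astrid's actual training algorithm): using the invented prefix frequencies from the 'ab'/'abc'/'abd' example, a triplet-style loss would pull 'ab' toward 'abc' (similar selectivity) and push it away from 'abd' (dissimilar selectivity). The embeddings, dimensionality, and margin are placeholders.

```python
import numpy as np

# Invented prefix frequencies from the example above.
freq = {"ab": 1000, "abc": 800, "abd": 100}

rng = np.random.default_rng(0)
emb = {s: rng.normal(size=8) for s in freq}  # placeholder 8-dim embeddings

def dist(u, v):
    return float(np.linalg.norm(u - v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Selectivity-aware triplet loss: strings with similar selectivity
    ('ab' vs 'abc') should embed closer than dissimilar ones ('ab' vs 'abd')."""
    return max(0.0, dist(emb[anchor], emb[positive])
               - dist(emb[anchor], emb[negative]) + margin)

# 'abc' is the positive (800 is close to 1000), 'abd' the negative (100).
print("loss before any training:", round(triplet_loss("ab", "abc", "abd"), 3))
```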


Author(s): Yuta Ojima, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii

This paper describes automatic music transcription with chord estimation for music audio signals. We focus on the fact that concurrent structures of musical notes, such as chords, form the basis of harmony and are a central consideration in music composition. Since chords and musical notes are deeply linked with each other, we propose joint pitch and chord estimation based on a Bayesian hierarchical model that consists of an acoustic model representing the generative process of a spectrogram and a language model representing the generative process of a piano roll. The acoustic model is formulated as a variant of non-negative matrix factorization with binary variables indicating a piano roll. The language model is formulated as a hidden Markov model that has chord labels as latent variables and emits a piano roll; the sequential dependency of the piano roll is thus represented in the language model. The two models are integrated through the piano roll in a hierarchical Bayesian manner, and all latent variables and parameters are estimated using Gibbs sampling. The experimental results showed the great potential of the proposed method for unified music transcription and grammar induction.
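The sketch below is a heavily simplified, forward-sampling caricature of the generative story described above (chords, then piano roll, then spectrogram), not the paper's Bayesian inference: a Markov chain over two invented chord labels emits binary piano-roll columns, which then weight invented spectral templates in an NMF-like way. All matrices and probabilities are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

chords = ["C", "G"]                      # invented chord vocabulary
trans = np.array([[0.8, 0.2],            # invented chord transition matrix
                  [0.3, 0.7]])
# Probability that each of 4 pitches is active under each chord (invented).
emit = np.array([[0.9, 0.1, 0.8, 0.1],   # C
                 [0.1, 0.9, 0.1, 0.8]])  # G

templates = rng.random((6, 4))           # invented spectral template per pitch

def sample(n_frames=8):
    """Forward-sample chords, then a binary piano roll, then a spectrogram."""
    state = 0
    roll = np.zeros((n_frames, 4), dtype=int)
    labels = []
    for t in range(n_frames):
        labels.append(chords[state])
        roll[t] = rng.random(4) < emit[state]     # language model emission
        state = rng.choice(2, p=trans[state])     # chord transition
    spectrogram = roll @ templates.T              # acoustic model (NMF-like)
    return labels, roll, spectrogram

labels, roll, spec = sample()
print(labels)
print(roll)
```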


Author(s): Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, Percy Liang

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.
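A minimal, non-neural sketch of the prototype-then-edit generative process (the actual model uses learned neural encoders and decoders and a continuous latent edit vector): sample a prototype sentence from a tiny invented corpus, sample a placeholder "edit vector", and apply a trivial word-swap edit. The corpus, synonym table, and edit operation are all invented.

```python
import random

random.seed(0)

corpus = [
    "the food was great and the service was friendly",
    "the room was clean but the wifi was slow",
]

synonyms = {"great": "excellent", "friendly": "welcoming", "slow": "spotty"}

def sample_sentence():
    """Prototype-then-edit: pick a prototype, then apply an edit.
    In the real model the edit is driven by a latent edit vector and a
    neural decoder; here a toy boolean vector just selects words to swap."""
    prototype = random.choice(corpus)
    words = prototype.split()
    edit_vector = [random.random() < 0.5 for _ in words]  # placeholder latent
    edited = [synonyms.get(w, w) if flip else w
              for w, flip in zip(words, edit_vector)]
    return prototype, " ".join(edited)

proto, new = sample_sentence()
print("prototype:", proto)
print("edited:   ", new)
```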


2021
Author(s): Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, ...

Unsupervised protein language models trained across millions of diverse sequences learn the structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model that takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
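A shape-level sketch of interleaved row and column attention over a multiple sequence alignment, assuming a tensor of per-position embeddings with shape (num_sequences, alignment_length, dim); this is not the published architecture, just an illustration of which axis each attention type mixes over, using single-head, unparameterized attention in numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
R, C, D = 5, 12, 16          # sequences in the MSA, aligned columns, embed dim
x = rng.normal(size=(R, C, D))

def attention(q, k, v):
    """Plain scaled dot-product attention over the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Row attention: each sequence attends across alignment columns.
row_out = attention(x, x, x)                         # shape (R, C, D)

# Column attention: transpose so each column attends across sequences.
xt = np.swapaxes(x, 0, 1)                            # (C, R, D)
col_out = np.swapaxes(attention(xt, xt, xt), 0, 1)   # back to (R, C, D)

print(row_out.shape, col_out.shape)
```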


Author(s): Hakan Cangır

The chapter starts with a definition and models of the mental lexicon. It then builds on bilingual lexical activation models and goes on to discuss formulaic language (collocations in particular). After explaining the basics of formulaic language processing, the author addresses Hoey's theory of lexical and collocational priming, which has its roots in cognitive linguistics and usage-based language models. Last but not least, some suggestions for future research are provided to address the needs of the lexical research literature in the Turkish setting.


2014, Vol 2, pp. 181-192
Author(s): Dani Yogatama, Chong Wang, Bryan R. Routledge, Noah A. Smith, Eric P. Xing

We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.
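As a toy illustration of conditioning a language model on non-linguistic context (not the paper's model), the sketch below mixes two invented unigram distributions according to a scalar context feature, e.g. a market-volatility indicator for economics news; the vocabulary, probabilities, and mixing rule are placeholders.

```python
# Toy conditioning of a unigram LM on a non-linguistic context feature.
# All probabilities and the mixing rule are invented for illustration.

p_calm = {"stocks": 0.2, "rally": 0.1, "steady": 0.5, "plunge": 0.2}
p_volatile = {"stocks": 0.2, "rally": 0.2, "steady": 0.1, "plunge": 0.5}

def conditional_unigram(word: str, volatility: float) -> float:
    """Interpolate two unigram models using a context feature in [0, 1]."""
    return ((1 - volatility) * p_calm.get(word, 0.0)
            + volatility * p_volatile.get(word, 0.0))

for v in (0.1, 0.9):
    print(v, {w: round(conditional_unigram(w, v), 2) for w in p_calm})
```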


2016, Vol 4, pp. 477-490
Author(s): Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
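A count-level sketch of the kind of query such a model answers on the fly: an m-gram score computed from substring counts, which in the paper's approach would be served by the compressed suffix tree rather than the plain Python dictionary used here. The toy corpus and backoff constant are invented, and this uses simple stupid backoff rather than the smoothing the paper actually supports.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # invented toy corpus

# Counts of all n-grams up to order 3; a compressed suffix tree would
# answer these count queries without materialising this dictionary.
counts = Counter()
for n in range(1, 4):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i + n])] += 1

def stupid_backoff(ngram, alpha=0.4):
    """Score an n-gram from raw counts, backing off to shorter histories."""
    ngram = tuple(ngram)
    history = ngram[:-1]
    if counts[ngram] > 0 and (not history or counts[history] > 0):
        denom = counts[history] if history else len(corpus)
        return counts[ngram] / denom
    if len(ngram) == 1:
        return 1 / len(corpus)          # unseen unigram floor (placeholder)
    return alpha * stupid_backoff(ngram[1:], alpha)

print(stupid_backoff(("the", "cat", "sat")))
print(stupid_backoff(("the", "dog", "sat")))
```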

