Frequency, informativity and word length: Insights from typologically diverse corpora

Mapping Intimacies ◽

10.31234/osf.io/sdjur ◽

2021 ◽

Author(s):

Natalia Levshina

Keyword(s):

Word Frequency ◽

Negative Correlation ◽

Word Length ◽

Noun Phrases ◽

Strongly Correlated ◽

Zipf's Law ◽

Web Based ◽

Linguistic Differences ◽

Morphological Complexity ◽

Methodological Choices

Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) can be more strongly correlated with word length, although this tendency is not observed consistently, depending on several methodological choices. The present study, which examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish), reveals intriguing cross-linguistic differences, which can be explained by typological properties of the languages. I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters, as well as word frequency, informativity given previous word and informativity given next word, applying different methods of bigram processing. The results show consistent cross-linguistic differences in the size of correlations between word length and the corpus-based measures. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly, the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
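The measures named in this abstract can be illustrated with a minimal sketch. It is not the author's pipeline: the function names (word_stats, spearman) are hypothetical, the input is assumed to be an already tokenized list of words, length is counted in Unicode code points via len(), and informativity is taken to be the mean of -log2 P(word | previous word) over the word's occurrences, estimated from raw bigram counts without smoothing.

```python
import math
from collections import Counter, defaultdict

def word_stats(tokens):
    """Map each word to (length, frequency, informativity given previous word).

    Assumption: informativity is the average surprisal -log2 P(w | prev)
    over all occurrences of w that have a preceding token.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter(tokens[:-1])  # times each word acts as "previous word"
    surprisal_sum = defaultdict(float)
    occurrences = Counter()
    for (prev, word), n in bigrams.items():
        p = n / context_totals[prev]
        surprisal_sum[word] += n * -math.log2(p)
        occurrences[word] += n
    return {
        w: (len(w), unigrams[w], surprisal_sum[w] / occurrences[w])
        for w in unigrams
        if occurrences[w] > 0  # skip words seen only as the very first token
    }

def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation; undefined if either input is constant."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

On a real corpus one would collect the lengths, frequencies and informativity values from word_stats into three lists and compare spearman(lengths, frequencies) with spearman(lengths, informativities). Informativity given the next word follows by running the same function on the reversed token list.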

The Challenges of Large-Scale, Web-based Language Datasets: Word Length and Predictability Revisited

10.31234/osf.io/6832r ◽

2021 ◽

Author(s):

Stephan Meylan ◽

Tom Griffiths

Keyword(s):

Best Practices ◽

Word Frequency ◽

Word Length ◽

Large Scale ◽

Strongly Correlated ◽

Web Based ◽

Language Research ◽

Efficient Communication ◽

Average Information Content ◽

Average Information

Language research has come to rely heavily on large-scale, web-based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long-standing challenges in corpus-based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisiting "Word lengths are optimized for efficient communication" (Piantadosi, Tily, & Gibson, 2011), which found that word lengths in 11 languages are more strongly correlated with their average predictability (or average information content) than their frequency. Using what we argue to be best practices for large-scale corpus analyses, we find significantly attenuated support for this result, and demonstrate that a stronger relationship obtains between word frequency and length for a majority of the languages in the sample. We consider the implications of the results for language research more broadly and provide several recommendations to researchers regarding best practices.

ZIPF'S LAW AND RANDOM TEXTS

Advances in Complex Systems ◽

10.1142/s0219525902000468 ◽

2002 ◽

Vol 05 (01) ◽

pp. 1-6 ◽

Cited By ~ 38

Author(s):

RAMON FERRER i CANCHO ◽

RICARD V. SOLÉ

Keyword(s):

Word Frequency ◽

Power Law ◽

Word Length ◽

A Priori ◽

Strict Sense ◽

A Priori Information ◽

Zipf's Law ◽

Priori Information ◽

Random Text

Random-text models have been proposed as an explanation for the power law relationship between word frequency and rank, the so-called Zipf's law. They are generally regarded as null hypotheses rather than models in the strict sense. In this context, recent theories of language emergence and evolution assume this law as a priori information with no need of explanation. Here, random texts and real texts are compared through (a) the so-called lexical spectrum and (b) the distribution of words having the same length. It is shown that real texts fill the lexical spectrum much more efficiently and regardless of the word length, suggesting that the meaningfulness of Zipf's law is high.
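A random-text model of the kind discussed here can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact setup: the function names, the four-letter alphabet, the space probability p_space and the fixed seed are all hypothetical choices. The sketch generates "monkey typing" text and computes the two quantities the abstract compares: the lexical spectrum (how many distinct words occur exactly f times) and the frequencies of words grouped by length.

```python
import random
from collections import Counter, defaultdict

def random_text(n_chars, alphabet="abcd", p_space=0.2, seed=0):
    """Emit n_chars symbols: a space with probability p_space, else a
    uniformly chosen letter; return the resulting list of words."""
    rng = random.Random(seed)
    chars = [
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_chars)
    ]
    return "".join(chars).split()

def rank_frequency(words):
    """Word frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(words).values(), reverse=True)

def lexical_spectrum(words):
    """Map each frequency f to the number of distinct words occurring f times."""
    return Counter(Counter(words).values())

def frequencies_by_length(words):
    """Map each word length to the sorted frequencies of words of that length."""
    grouped = defaultdict(list)
    for word, freq in Counter(words).items():
        grouped[len(word)].append(freq)
    return {length: sorted(freqs, reverse=True) for length, freqs in grouped.items()}
```

Running the same three functions on a tokenized real text and on random_text output of comparable size gives the two sides of the comparison described in the abstract.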

The Rules of Early Stuttering

Journal of Speech and Hearing Disorders ◽

10.1044/jshd.3904.379 ◽

1974 ◽

Vol 39 (4) ◽

pp. 379-394 ◽

Cited By ~ 41

Author(s):

Oliver Bloodstein

Keyword(s):

Conceptual Model ◽

Word Frequency ◽

Early Phase ◽

Word Length ◽

Noun Phrases ◽

Syntactic Structures ◽

Prepositional Phrases ◽

Subordinate Clauses ◽

Verb Phrases

Brief samples of the speech of six stuttering children, aged three to six years, are analyzed on the basis of a conceptual model of stuttering as tension and fragmentation in speech. The hypothesis is advanced that while the older stutterer tends to fragment words, the early phase of stuttering is characterized chiefly by fragmentation of whole syntactic structures such as sentences, coordinate and subordinate clauses, verb phrases, noun phrases, and prepositional phrases. This is suggested by the predominance of repetitions of words and other large fragments, by their occurrence at the beginnings of syntactic structures, and by their absence from the ends of such structures. The young stutterer’s frequent tendency to stutter on pronouns and conjunctions is related to the model, and the prediction is made that the loci of early stuttering will not prove to be influenced directly by word-bound factors such as initial sound, word length, or word frequency.

The Challenges of Large‐Scale, Web‐Based Language Datasets: Word Length and Predictability Revisited

Cognitive Science ◽

10.1111/cogs.12983 ◽

2021 ◽

Vol 45 (6) ◽

Author(s):

Stephan C. Meylan ◽

Thomas L. Griffiths

Keyword(s):

Word Length ◽

Large Scale ◽

Web Based

Optimization of morpheme length: a cross-linguistic assessment of Zipf’s and Menzerath’s laws

Linguistics Vanguard ◽

10.1515/lingvan-2019-0076 ◽

2021 ◽

Vol 7 (s3) ◽

Author(s):

Matthew Stave ◽

Ludger Paschen ◽

François Pellegrino ◽

Frank Seifart

Keyword(s):

Structural Information ◽

Unit Length ◽

Zipf's Law ◽

Linguistic Structure ◽

Morphological Complexity ◽

Linguistic Assessment ◽

Linguistic Units

Zipf’s Law of Abbreviation and Menzerath’s Law both make predictions about the length of linguistic units, based on corpus frequency and the length of the carrier unit. Each contributes to the efficiency of languages: for Zipf, units are more likely to be reduced when they are highly predictable, due to their frequency; for Menzerath, units are more likely to be reduced when there are more sub-units to contribute to the structural information of the carrier unit. However, it remains unclear how the two laws work together in determining unit length at a given level of linguistic structure. We examine this question regarding the length of morphemes in spoken corpora of nine typologically diverse languages drawn from the DoReCo corpus, showing that Zipf’s Law is a stronger predictor, but that the two laws interact with one another. We also explore how this is affected by specific typological characteristics, such as morphological complexity.

Generational Gaps in Media Trust and its Antecedents in Europe

The International Journal of Press/Politics ◽

10.1177/19401612211039440 ◽

2021 ◽

pp. 194016122110394

Author(s):

Anna Brosius ◽

Jakob Ohme ◽

Claes H. de Vreese

Keyword(s):

Negative Correlation ◽

Generational Differences ◽

Political Trust ◽

Media Bias ◽

Political Interest ◽

Strongly Correlated ◽

Strong Negative Correlation ◽

Media Trust ◽

Original Survey

We test generational differences in media trust and its antecedents, including political trust, interest, and orientation, as well as perceptions of media inaccuracy and media bias. We rely on original survey data from ten European countries, collected in 2019. We find no differences in the levels of media trust between generations, but we find that key correlates of media trust relate differently to it in different generations. For example, political interest is more strongly correlated with media trust for Millennials than for other generations. Perceptions of bias and inaccuracy have a strong negative correlation with media trust overall, but it is stronger for older generations. These results suggest that, in the long term, societal developments, and in particular debates about media bias and misinformation, may impact media trust of young generations differently as they grow older; however, our data give no indication of that creating generational gaps in media trust.

Picture-naming agreement in monolinguals and bilinguals

Applied Psycholinguistics ◽

10.1017/s0142716400005312 ◽

1994 ◽

Vol 15 (2) ◽

pp. 177-193 ◽

Cited By ~ 7

Author(s):

Judith P. Goggin ◽

Patricia Estrada ◽

Ronald P. Villarreal

Keyword(s):

Word Frequency ◽

Picture Naming ◽

Word Length ◽

Age Of Acquisition ◽

Language Skill ◽

Frequency Word ◽

Line Drawings ◽

Name Agreement ◽

The Relationship

Name agreement in Spanish and English in response to 264 pictures was assessed in monolinguals and in bilinguals, who varied in rated skill in the two languages. Most of the pictures were adapted from a standardized set of line drawings of common objects (Snodgrass & Vanderwart, 1980). Name agreement decreased as language skill decreased, and agreement was lower when labels were given in Spanish rather than in English. The relationship between name agreement and word frequency, word length, and (in the case of English) age of acquisition was assessed; both word frequency and word length were found to be related to agreement. Modal responses given by monolingual subjects were nearly identical in the two languages, and the types of non-modal responses were affected by both naming language and language skill.

Empathy and compassion toward other species decrease with evolutionary divergence time

Scientific Reports ◽

10.1038/s41598-019-56006-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 8

Author(s):

Aurélien Miralles ◽

Michel Raymond ◽

Guillaume Lecointre

Keyword(s):

Negative Correlation ◽

Online Survey ◽

Divergence Time ◽

Affective Responses ◽

Evolutionary Divergence ◽

Strongly Correlated ◽

Minimum Level ◽

Strong Negative Correlation ◽

Human Relationships ◽

Time Of Divergence

Currently the planet is inhabited by several millions of extremely diversified species. Not all of them arouse emotions of the same nature or intensity in humans. Little is known about the extent of our affective responses toward them and the factors that may explain these differences. Our online survey involved 3500 raters who had to make choices depending on specific questions designed to either assess their empathic perceptions or their compassionate reactions toward an extended photographic sampling of organisms. Results show a strong negative correlation between empathy scores and the divergence time separating them from us. However, beyond a certain time of divergence, our empathic perceptions stabilize at a minimum level. Compassion scores, although based on less spontaneous choices, remain strongly correlated to empathy scores and time of divergence. The mosaic of features characterizing humans has been acquired gradually over the course of the evolution, and the phylogenetically closer a species is to us, the more it shares common traits with us. Our results could be explained by the fact that many of these traits may arouse sensory biases. These anthropomorphic signals could be able to mobilize cognitive circuitry and to trigger prosocial behaviors usually at work in human relationships.
