Quantifying Geographical Determinants of Large-Scale Distributions of Linguistic Features

In this article, the authors propose a novel search model: Multi-Target Search (MT search in brief). MT search is a keyword-based search model on Semantic Associations in Linked Data. Each search contains multiple sub-queries, in which each sub-query represents a certain user need for a certain object in a group relationship. They first formularize the problem of association search, and then introduce their approach to discover Semantic Associations in large-scale Linked Data. Next, they elaborate their novel search model, the notion of Virtual Document they use to extract linguistic features, and the details of search process. The authors then discuss the way search results are organized and summarized. Quantitative experiments are conducted on DBpedia to validate the effectiveness and efficiency of their approach.

Download Full-text

Linguistic effects on news headline success: Evidence from thousands of online field experiments (Registered Report Protocol)

PLoS ONE ◽

10.1371/journal.pone.0257091 ◽

2021 ◽

Vol 16 (9) ◽

pp. e0257091

Author(s):

Kristina Gligorić ◽

George Lifchits ◽

Robert West ◽

Ashton Anderson

Keyword(s):

Large Scale ◽

Field Experiments ◽

Linguistic Features ◽

Written Text ◽

Psychological Mechanisms ◽

Large Scale Dataset ◽

Prior Literature ◽

Competing Hypotheses

What makes written text appealing? In this registered report protocol, we propose to study the linguistic characteristics of news headline success using a large-scale dataset of field experiments (A/B tests) conducted on the popular website Upworthy comparing multiple headline variants for the same news articles. This unique setup allows us to control for factors that can have crucial confounding effects on headline success. Based on prior literature and a pilot partition of the data, we formulate hypotheses about the linguistic features that are associated with statistically superior headlines. We will test our hypotheses on a much larger partition of the data that will become available after the publication of this registered report protocol. Our results will contribute to resolving competing hypotheses about the linguistic features that affect the success of text and will provide avenues for research into the psychological mechanisms that are activated by those features.

Download Full-text

Register and Register Variation

Linguistics ◽

10.1093/obo/9780199772810-0278 ◽

2021 ◽

Keyword(s):

Large Scale ◽

Linguistic Features ◽

Register Variation ◽

Functional Perspective ◽

Communicative Functions ◽

Situational Characteristics ◽

Register Studies ◽

Primary Focus ◽

Key Resources ◽

Linguistic Approach

Register research has been approached from differing theoretical and methodological approaches, resulting in different definitions of the term register. In the text-linguistic approach, which is the primary focus of this bibliography, register refers to text varieties that are defined by their situational characteristics, such as the purpose of writing and the mode of communication, among others. Texts that are similar in their situational characteristics also tend to share similar linguistic profiles, as situational characteristics motivate or require the use of specific linguistic features. Text-linguistic research on register tends to focus on two aspects: attempts to describe a register, or attempts to understand patterns of register variation. This research happens via comparative analyses, specific examinations of single linguistic features or situational parameters, and often via examinations of co-occurrence of linguistic features that are analyzed from a functional perspective. That is, certain lexico-grammatical features co-occur in a given text because they together serve important communicative functions that are motivated by the situational characteristics of the text (e.g., communicative purpose, mode, setting, interactivity). Furthermore, corpus methods are often relied upon in register studies, which allows for large-scale examinations of both general and specialized registers. Thus, the bibliography gives priority to research that uses corpus tools and methods. Finally, while the broadest examinations on register focus on the distinction between written and spoken domains, additional divisions of register studies fall under the categories of written registers, spoken registers, academic registers, historical registers, and electronic/online registers. This bibliography primarily introduces some of the key resources on English registers, a decision that was made to reach a broader audience.

Download Full-text

THE LINGUISTIC NATURE OF ONOMATOPOEIA

Verhnevolzhski Philological Bulletin ◽

10.20323/2499-9679-2020-4-23-67-73 ◽

2020 ◽

Vol 23 (4) ◽

pp. 67-73

Author(s):

Marina A. Droga ◽

◽

Nataliya V. Yurchenko ◽

Svetlana V. Funikova ◽

◽

...

Keyword(s):

Functional Role ◽

Large Scale ◽

Linguistic Features ◽

Cultural Elements ◽

Surrounding World ◽

Oral Speech ◽

Points Of View ◽

The Difference ◽

Writing Chinese

The problem of onomatopoeias as a special lexical group has existed in the language for many decades. Onomatopoeias imitate the sounds of nature, the language of animals, objects of the surrounding world. In the text, onomatopoeia can perform various functions: emotional influence, imitation, as well as the function of language economy. But one of its main functions remains sound imaging. In Russia and China, different language pictures, specific cultural elements and linguistic features are noted. All this confirms the large-scale differences in the sound imitations of both languages, and in various aspects: in the composition of the components, in the functional role, in the meanings. Despite the fact that the differences in the phonetic system of Russian and Chinese are quite large, the onomatopoeias and their functions in the languages under consideration have the same features. Onomatopes are an expression of the same emotions, feelings, sounds both in oral speech and in writing. Chinese onomatopes are a graphic copy that attributes us to the actual sounding. This fact makes onomatopoeias in Chinese similar to onomatopes in Russian. The connection of sound and meaning is especially important: linguists study the nature of this connection from different points of view. It is also important to note the difference between sound imitations and similar interjections. Onomatopes are not only part of the system of the Russian and Chinese languages, but are also a progressive link that develops the resources of the language, its word-forming capabilities, as well as the expressive sphere of expression.

Download Full-text

Geographical axis effects in large-scale linguistic distributions

Language Dispersal, Diversification, and Contact ◽

10.1093/oso/9780198723813.003.0004 ◽

2020 ◽

pp. 58-77

Author(s):

Tom Güldemann ◽

Harald Hammarström

Keyword(s):

Large Scale ◽

Linguistic Diversity ◽

Human History ◽

Linguistic Features ◽

Large Space ◽

Language Groups ◽

Comprehensive Theory ◽

Geographical Factors ◽

First Results ◽

Long Time

Taking up Diamond’s (1999) geographical axis hypothesis regarding the different population histories of continental areas, Güldemann (2008, 2010) proposed that macro-areal aggregations of linguistic features are influenced by geographical factors. This chapter explores this idea by extending it to the whole world in testing whether the way linguistic features assemble over long time spans and large space is influenced by what we call “latitude spread potential” and “longitude spread constraint.” Regarding the former, the authors argue in particular that contact-induced feature distributions as well as genealogically defined language groups with a sufficient geographical extension tend to have a latitudinal orientation. Regarding the latter, the authors provide first results suggesting that linguistic diversity within language families tends to be higher along longitude axes. If replicated by more extensive and diverse testing, the authors’ findings promise to become important ingredients for a comprehensive theory of human history across space and time within linguistics and beyond.

Download Full-text

Developing linguistic literacy: perspectives from corpus linguistics and multi-dimensional analysis

Journal of Child Language ◽

10.1017/s0305000902235345 ◽

2002 ◽

Vol 29 (2) ◽

pp. 449-488 ◽

Cited By ~ 3

Author(s):

DOUGLAS BIBER ◽

RANDI REPPEN ◽

SUSAN CONRAD

Keyword(s):

Literacy Development ◽

Corpus Linguistics ◽

Large Scale ◽

List Type ◽

Linguistic Features ◽

The Past ◽

Patterns Of Use ◽

Special Concern ◽

Literacy Perspectives ◽

Research Studies

In their conceptual framework for linguistic literacy development, Ravid & Tolchinsky synthesize research studies from several perspectives. One of these is corpus-based research, which has been used for several large-scale research studies of spoken and written registers over the past 20 years. In this approach, a large, principled collection of natural texts (a ‘corpus’) is analysed using computational and interactive techniques, to identify the salient linguistic characteristics of each register or text variety. Three characteristics of corpus-based analysis are particularly important (see Biber, Conrad & Reppen 1998):[bull ] a special concern for the representativeness of the text sample being analysed, and for the generalizability of findings;[bull ] overt recognition of the interactions among linguistic features: the ways in which features co-occur and alternate;[bull ] a focus on register as the most important parameter of linguistic variation: strong patterns of use in one register often represent only weak patterns in other registers.

Download Full-text

C Versus Fortran-77 for Scientific Programming

Scientific Programming ◽

10.1155/1992/439510 ◽

1992 ◽

Vol 1 (2) ◽

pp. 99-114

Author(s):

Tom MacDonald

Keyword(s):

Programming Language ◽

Large Scale ◽

Scientific Applications ◽

Linguistic Features ◽

Standard Library ◽

Fortran 77

The predominant programming language for numeric and scientific applications is Fortran-77 and supercomputers are primarily used to run large-scale numeric and scientific applications. Standard C* is not widely used for numerical and scientific programming, yet Standard C provides many desirable linguistic features not present in Fortran-77. Furthermore, the existence of a standard library and preprocessor eliminates the worst portability problems. A comparison of Standard C and Fortran-77 shows several key deficiencies in C that reduce its ability to adequately solve some numerical problems. Some of these problems have already been addressed by the C standard but others remain. Standard C with a few extensions and modifications could be suitable for all numerical applications and could become more popular in supercomputing environments.

Download Full-text

Towards enhancement of a lexicon-based approach for Saudi dialect sentiment analysis

Journal of Information Science ◽

10.1177/0165551516688143 ◽

2017 ◽

Vol 44 (2) ◽

pp. 184-202 ◽

Cited By ~ 12

Author(s):

Adel Assiri ◽

Ahmed Emam ◽

Hmood Al-Dossari

Keyword(s):

Sentiment Analysis ◽

Large Scale ◽

Training Data ◽

Linguistic Features ◽

Algorithm Development ◽

Domain Independence ◽

Low Performance

Sentiment analysis (SA) techniques are applied to assess aspects of language that are used to express feelings, evaluations and opinions in areas such as customer sentiment extraction. Most studies have focused on SA techniques for widely used languages such as English, but less attention has been paid to Arabic, particularly the Saudi dialect. Most Arabic SA studies have built systems using supervised approaches that are domain dependent; hence, they achieve low performance when applied to a new domain different from the learning domain, and they require manually labelled training data, which are usually difficult to obtain. In this article, we propose a novel lexicon-based algorithm for Saudi dialect SA that features domain independence. We created an annotated Saudi dialect dataset and built a large-scale lexicon for the Saudi dialect. Then, we developed our weighted lexicon-based algorithm. The proposed algorithm mines the associations between polarity and non-polarity words for the dataset and then weights these words based on their associations. During algorithm development, we also proposed novel rules for handling some linguistic features such as negation and supplication. Several experiments were performed to evaluate the performance of the proposed algorithm.

Download Full-text

Dimensions of variation across American television registers

International Journal of Corpus Linguistics ◽

10.1075/ijcl.15014.ber ◽

2019 ◽

Vol 24 (1) ◽

pp. 3-32 ◽

Cited By ~ 4

Author(s):

Tony Berber Sardinha ◽

Marcia Veirano Pinto

Keyword(s):

Mass Communication ◽

Large Scale ◽

Multidimensional Analysis ◽

Television Programs ◽

Linguistic Features ◽

American Television ◽

Linguistic Patterns ◽

Dimension 2 ◽

Dimensions Of Variation ◽

Large Corpus

Abstract The goal of this study is to identify the dimensions of variation across American television programs, following the multidimensional analysis (MD) framework introduced by Biber (1988). Although television is a major form of mass communication, there has been no previous large-scale MD study of television dialogue. A large corpus containing the key types of contemporary American television programs was collected, annotated with the Biber tagger, and subjected to multi-dimensional analysis, which indicated four factors of statistically correlated linguistic features. Each of these factors was interpreted communicatively to reveal the underlying dimensions of variation on American television, namely “Exposition and discussion vs. Simplified interaction” (Dimension 1), “Simulated conversation” (Dimension 2), “Recount” (Dimension 3) and “Engaging presentation” (Dimension 4). This article presents, illustrates, and discusses each of these dimensions, showing the macro linguistic patterns in use across hundreds of American television programs.

Download Full-text

Learning Hierarchical Lexical Hyponymy

Developments in Natural Intelligence Research and Knowledge Engineering ◽

10.4018/978-1-4666-1743-8.ch015 ◽

2012 ◽

pp. 205-219

Author(s):

Jiayu Zhou ◽

Shi Wang ◽

Cungen Cao

Keyword(s):

Information Processing ◽

Machine Translation ◽

Large Scale ◽

Computational Approach ◽

Experimental Results ◽

Ontology Learning ◽

Linguistic Features ◽

Critical Step ◽

Chinese Information Processing ◽

Novel Approach

Chinese information processing is a critical step toward cognitive linguistic applications like machine translation. Lexical hyponymy relation, which exists in some Eastern languages like Chinese, is a kind of hyponymy that can be directly inferred from the lexical compositions of concepts, and of great importance in ontology learning. However, a key problem is that the lexical hyponymy is so commonsense that it cannot be discovered by any existing acquisition methods. In this paper, we systematically define lexical hyponymy relationship, its linguistic features and propose a computational approach to semi-automatically learn hierarchical lexical hyponymy relations from a large-scale concept set, instead of analyzing lexical structures of concepts. Our novel approach discovered lexical hyponymy relation by examining statistic features in a Common Suffix Tree. The experimental results show that our approach can correctly discover most lexical hyponymy relations in a given large-scale concept set.

Download Full-text