Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size

Entropy ◽  
2019 ◽  
Vol 21 (5) ◽  
pp. 464 ◽  
Author(s):  
Alexander Koplenig ◽  
Sascha Wolfer ◽  
Carolin Müller-Spitzer

Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
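To make the role of the free parameter α concrete, here is a minimal Python sketch of a generalized entropy computed from a word frequency distribution. The abstract does not state the formula, so the form used here, H_α(p) = (Σ p_i^α − 1) / (1 − α) with the Shannon entropy as the limit α → 1, is an assumption, and the function name `generalized_entropy` is hypothetical. It is a sketch of the general idea, not the authors' implementation.

```python
import math
from collections import Counter


def generalized_entropy(tokens, alpha):
    """Generalized entropy of order alpha for a sequence of word tokens.

    Assumed form (not given in the abstract):
        H_alpha(p) = (sum_i p_i**alpha - 1) / (1 - alpha)
    For alpha == 1 the Shannon entropy (natural log) is returned as the limit.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # Relative frequency of each word type in the sample.
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** alpha for p in probs) - 1.0) / (1.0 - alpha)


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat".split()
    for a in (0.5, 1.0, 2.0):
        print(a, generalized_entropy(sample, a))
```

With α above 1 the sum is dominated by high-frequency word types, while α below 1 gives more weight to rare types. Because the estimated frequencies come from a finite sample, the value changes with text length, which is the sample-size dependence the abstract describes.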

2018 ◽  
Vol 2 (2) ◽  
pp. 18-25
Author(s):  
Waqas Ahmed ◽  
Qudrat Ullah ◽  
Mughees Ahmed ◽  
Asif Hanif

Abstract. Background: Obstructive lung disease (OLD) is one of the main causes of mortality and morbidity worldwide. Obstructive lung disease is the narrowing of bronchioles, mainly due to excessive smooth muscle contraction. The objective of this study is to evaluate the frequency of HIV in obstructive lung disease patients. Methodology: Samples were collected randomly, and the study was completed in about six months. 100 samples were taken, with informed consent obtained from all patients. EDTA and clotted blood were collected for HIV ELISA and HIV screening. Results: Of the patients in this study, 69% were male and 31% female; 34% were smokers, 26% were hypertensive, 10% were diabetic, and 3% were diagnosed HIV positive by screening and ELISA. Conclusion: The frequency of HIV in obstructive lung disease patients in this study is not very high compared with previous studies, which showed a high frequency and a relationship between HIV and obstructive lung disease. The low frequency is attributed to the small sample size; increasing the sample size would give a better understanding of the frequency of HIV in obstructive lung disease patients. Another reason for the insignificant results is the low prevalence of HIV in Pakistan compared with the countries covered by previous research.


2009 ◽  
Vol 146 (6) ◽  
pp. 917-930 ◽  
Author(s):  
S. HELAMA ◽  
J. K. NIELSEN ◽  
M. MACIAS FAURIA ◽  
I. VALOVIRTA

Abstract. A growing body of literature is using sclerochronological information to infer past climates. Sclerochronologies are based on series of skeletal growth records of molluscs that have been correctly aligned in time. Incremental series are obtained from a number of shells to assess the temporal control and improve the climate signal in the final chronology. Much of the sclerochronological theory has been adopted from tree-ring science, due to the longer tradition and more firmly established concepts of chronology construction in dendrochronology. Compared to tree-ring studies, however, sclerochronological datasets are often characterized by relatively small sample size. Here we evaluate how effectively palaeoclimatic signal can be extracted from such a suite of samples. In so doing, the influences of the very basic methods that are applied in nearly every sclerochronological study to remove the non-climatic growth variability prior to palaeoclimatic interpretations are ranked by their capability to amplify the desired signal. The study is performed in the context of six shells that constitute a bicentennial growth record from annual shell increments of freshwater pearl mussel. It was shown that when the individual series were detrended using the models set by the mean or the median summary curves for ageing (that is, applying Regional Curve Standardization, RCS), instead of fitting the ageing mode statistically to each series, the resulting sclerochronology displayed more low-frequency variability. Consistently, the added low-frequency variability evoked higher proxy–climate correlations. These results show the particular benefit of using the RCS method to develop sclerochronologies and preserve their low-frequency variations. Moreover, calculating the ageing curve and the final chronology by median, instead of mean, resulted in an amplified low-frequency climate signal.
The results help to answer a growing need to better understand the behaviour of the sclerochronological data. In addition, we discuss the pitfalls that may potentially disrupt palaeoclimate signal detection in similar sclerochronological studies. Pitfalls may arise from shell taphonomy, water chemistry, time-variant characters of biological growth trends and small sample size.


2018 ◽  
Author(s):  
Anubha Mahajan ◽  
Daniel Taliun ◽  
Matthias Thurner ◽  
Neil R Robertson ◽  
Jason M Torres ◽  
...  

We aggregated genome-wide genotyping data from 32 European-descent GWAS (74,124 T2D cases, 824,006 controls) imputed to high-density reference panels of >30,000 sequenced haplotypes. Analysis of ~27M variants (~21M with minor allele frequency [MAF]<5%) identified 243 genome-wide significant loci (p<5×10⁻⁸; MAF 0.02%-50%; odds ratio [OR] 1.04-8.05), 135 not previously implicated in T2D predisposition. Conditional analyses revealed 160 additional distinct association signals (p<10⁻⁵) within the identified loci. The combined set of 403 T2D-risk signals includes 56 low-frequency (0.5%≤MAF<5%) and 24 rare (MAF<0.5%) index SNPs at 60 loci, including 14 with estimated allelic OR>2. Forty-one of the signals displayed effect-size heterogeneity between BMI-unadjusted and adjusted analyses. Increased sample size and improved imputation led to substantially more precise localisation of causal variants than previously attained: at 51 signals, the lead variant after fine-mapping accounted for >80% posterior probability of association (PPA) and at 18 of these, PPA exceeded 99%. Integration with islet regulatory annotations enriched for T2D association further reduced median credible set size (from 42 variants to 32) and extended the number of index variants with PPA>80% to 73. Although most signals mapped to regulatory sequence, we identified 18 genes as human validated therapeutic targets through coding variants that are causal for disease. Genome-wide chip heritability accounted for 18% of T2D-risk, and individuals in the 2.5% extremes of a polygenic risk score generated from the GWAS data differed >9-fold in risk. Our observations highlight how increases in sample size and variant diversity deliver enhanced discovery and single-variant resolution of causal T2D-risk alleles, and the consequent impact on mechanistic insights and clinical translation.


2016 ◽  
Author(s):  
◽  
Chris Cotsapas

Abstract. A recent study by Wang et al. claims the low-frequency variant NR1H3 p.Arg415Gln is pathological for multiple sclerosis and determines a patient’s likelihood of primary progressive disease. We sought to replicate this finding in the International MS Genetics Consortium (IMSGC) patient collection, which is 13-fold larger than the collection of Wang et al., but we find no evidence that this variant is associated either with MS or disease subtype. Wang et al. also report a common variant association in the region, which we show captures the association the IMSGC reported in 2013. Therefore, we conclude that the reported low-frequency association is a false positive, likely generated by insufficient sample size. The claim of NR1H3 mutations describing a Mendelian form of MS, of which no examples exist, can therefore not be substantiated by data.


2007 ◽  
Vol 11 (3) ◽  
pp. 437-474 ◽  
Author(s):  
LARS HINRICHS ◽  
BENEDIKT SZMRECSANYI

This study of present-day English genitive variation is based on all interchangeable instances of s- and of-genitives from the ‘Reportage’ and ‘Editorial’ categories of the ‘Brown family’ of corpora. Variation is studied by tapping into a number of independent variables, such as precedence of either construction in the text, length of the possessor and possessum phrases, phonological constraints, discourse flow, and animacy of the possessor. In addition to distributional analyses, we use logistic regression to investigate the probabilistic factor weights of these variables, thus tracking language change in progress as evidenced in the language of the press. This method, married to our large database, yields the most detailed perspective to date on frequently discussed issues, such as the relative importance of possessor animacy and end-weight in genitive choice (cf. most recently Rosenbach 2005), or on the exact factorial dynamics responsible for the ongoing spread of the s-genitive.


2016 ◽  
Author(s):  
Hio-Been Han

Abstract. Recent functional magnetic resonance imaging (fMRI) studies have found distinctive functional connectivity in the schizophrenic brain. However, most of these studies used the correlation value to define functional connectivity for BOLD fluctuations between brain regions, which has limited the understanding of the network properties of altered wiring in the schizophrenic brain. Here I characterized the distinctiveness of the BOLD connectivity pattern in the schizophrenic brain relative to the healthy brain with various similarity measures in the time-frequency domain, while participants were performing a working memory task in the MRI scanner. To assess the distinctiveness of the connectivity pattern, the discrimination performances of pattern classifiers trained with each similarity measure were compared. Interestingly, the classifier trained on time-lagging patterns of low-frequency fluctuation (LFF) produced higher classification sensitivity than the classifiers trained on other measures. The classifier trained on the coherence pattern in the LFF band also performed better than the classifier trained on the correlation-based connectivity pattern. These results indicate that there are unobserved but considerable features in the functional connectivity pattern of the schizophrenic brain which the traditional emphasis on correlation analysis does not capture.


Author(s):  
C. Sutyarsah

The vocabulary of a text is an aspect that needs to be identified. It is claimed that the condition of the words in a text has a great influence on readers' comprehension. It is also commonly believed that comprehension depends on the extent to which the words in a text are familiar to the readers. This case study was carried out in the English Education Department of University of Malang. The aim of the study is to identify and describe the vocabulary in the texts and to determine whether the texts are useful for reading skill development. The reading materials under investigation were a collection of reading passages based on the syllabus (Reading Comprehension I) and limited to the passages that were used in class during the second semester, 1999. Based on the nature of the investigation, a descriptive qualitative design was applied to obtain the data. For this purpose, some available computer programs were used to obtain a description of the vocabulary in the texts. The vocabulary analyses reveal some constraints. It was found that the texts, 20 different texts containing 7,945 words, are dominated by low frequency words, which account for 16.97% of the words in the texts. Among the high frequency words occurring in the texts, function words dominate. Of the 50 most frequent words, only two content words (people and say) were found. In terms of word level, it was found that the texts have a very limited number of words from the GSL (General Service List of English Words) (West, 1953). The proportion of the first 1,000 words of the GSL only accounts for 44.6%. The data also show that the texts contain too large a proportion of words which are not in the three levels (the first 2,000 and UWL). These words account for 26.44% of the running words in the texts.
Based on the findings, some conclusions were drawn. It is believed that the constraints are due to the selection of the texts, which consist of a series of short, unrelated texts (20 different topics). This kind of text is subject to the accumulation of low frequency words, especially content words, and to a limited number of words from the GSL. This vocabulary condition could hinder the development of students' reading skills and vocabulary enrichment.


2019 ◽  
Vol 14 (1) ◽  
pp. 208-235
Author(s):  
Jurgita Vaičenonienė ◽  
Jolanta Kovalevskaitė

Summary. In Lithuanian public and academic discourse, discussions about the influence of English have received considerable attention. Much has been written on English borrowings in Lithuanian and on various translation strategies applied at the word, phrase or syntactic levels of language, whereas there have been only a few attempts to investigate how Lithuanian translated from English differs from the original language. This is why we found it interesting to investigate lexical and morphological features of translated Lithuanian by applying the methods of corpus linguistics. For research purposes, we used ORVELIT, a morphologically annotated comparable 4-million-word corpus of original and translated fiction and popular science literature. It has been found that translations deviate in certain ways from original Lithuanian. Translated Lithuanian has a lower lexical density, a higher proportion of function words, shorter sentences, and a higher proportion of list heads; translated fiction has a lower lexical variability and a smaller proportion of low frequency words, whereas in popular science translations these differences are less evident. Keyword analysis has shown content differences in originals and translations and the overuse of personal and possessive pronouns in popular science translations. The distribution of content and function words differs in originals and translations and in different registers. Translated Lithuanian has more verbs (especially finite forms and adverbial participles) but fewer nouns and adjectives; fiction translations have fewer and popular science translations more adverbs than originals; there are more pronouns and prepositions in both popular science and fiction translations; depending on the register, there are higher or lower numbers of conjunctions, particles and interjections.
Some of the differences may be explained by English language interference, such as the overuse of the optional 1st person pronoun in subject position, the overuse of the optional preposition “su” with the instrumental case, or the overuse of the optional link verb in the complex predicate. In other words, the English influence is seen in transferring certain features obligatory for an analytical language where omission would be a more natural choice in original Lithuanian. These findings in most cases agree with previous research on translationese in other languages. It is hoped that the identified tendencies to over- or under-use certain lexical and morphological features as a result of English language interference may prove useful when editing and translating.


2021 ◽  
Vol 4 ◽  
Author(s):  
Jelske Dijkstra ◽  
Wilbert Heeringa ◽  
Lysbeth Jongbloed-Faber ◽  
Hans Van de Velde

This paper investigates the usability of Twitter as a resource for the study of language change in progress in low-resource languages. It is a panel study of a vigorous change in progress, the loss of final t in four relative pronouns (dy't, dêr't, wêr't, wa't) in Frisian, a language spoken by ± 450,000 speakers in the north-west of the Netherlands. This paper deals with the issues encountered in retrieving and analyzing tweets in low-resource languages, in the analysis of low-frequency variables, and in gathering background information on Twitterers. In this panel study we were able to identify and track 159 individual Twitterers, whose Frisian (and Dutch) tweets posted in the era 2010–2019 were collected. Nevertheless, a solid analysis of the sociolinguistic factors in this language change in progress was hampered by unequal age distributions among the Twitterers, the fact that the youngest birth cohorts have given up Twitter almost completely after 2014 and that the variables have a low frequency and are unequally spread over Twitterers.

