Using word n-grams to identify authors and idiolects

2017, Vol 22 (2), pp. 212-241
Author(s): David Wright

Abstract Forensic authorship attribution is concerned with identifying the writers of anonymous criminal documents. Over the last twenty years, computer scientists have developed a wide range of statistical procedures using a number of different linguistic features to measure similarity between texts. However, much of this work is not of practical use to forensic linguists who need to explain in reports or in court why a particular method of identifying potential authors works. This paper sets out to address this problem using a corpus linguistic approach and the 176-author 2.5 million-word Enron Email Corpus. Drawing on literature positing the idiolectal nature of collocations, phrases and word sequences, this paper tests the accuracy of word n-grams in identifying the authors of anonymised email samples. Moving beyond the statistical analysis, the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
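
A minimal sketch of the general idea (not the paper's exact procedure or feature set): attribute a disputed sample to the candidate author whose known writing shares the largest proportion of word n-grams with it, measured here with Jaccard similarity.

import re

def word_ngrams(text, n=3):
    # Lowercased word tokens; contiguous sequences of n tokens form the feature set.
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def attribute(disputed_text, candidate_texts, n=3):
    # candidate_texts: dict mapping author name -> concatenated known writing.
    disputed = word_ngrams(disputed_text, n)
    scores = {author: jaccard(disputed, word_ngrams(known, n))
              for author, known in candidate_texts.items()}
    return max(scores, key=scores.get), scores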

Target, 2018, Vol 30 (2), pp. 240-259
Author(s): Lorenzo Mastropierro

Abstract This paper discusses the issue of translator style through the comparison of two Italian translations of H. P. Lovecraft’s At the Mountains of Madness. Using a corpus linguistic approach, this paper proposes a method for the identification of potential indicators of translator style based on key cluster analysis. Comparing the two translations with this method identifies which clusters – i.e., repeated sequences of words – are used more frequently by one translator compared to the other. The analysis shows that the two translators differ in their usage of some linguistic features, specifically Italian euphonic -d, locative clitics, and distal demonstratives, which are then analysed as stylistic divergences.
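
For readers unfamiliar with key cluster analysis, the sketch below shows one common way to operationalize it: clusters used significantly more often in one translation than in the other are flagged as candidate indicators of translator style. The log-likelihood keyness statistic, cluster length, tokenizer, and threshold are illustrative choices, not the article's exact settings.

import math
import re
from collections import Counter

def clusters(text, n=3):
    # Contiguous n-word sequences ("clusters"), counted over lowercased tokens.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def log_likelihood(a, b, total_a, total_b):
    # Dunning's log-likelihood for a cluster seen a times in corpus A (size total_a)
    # and b times in corpus B (size total_b).
    e1 = total_a * (a + b) / (total_a + total_b)
    e2 = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def key_clusters(translation_a, translation_b, n=3, threshold=6.63):  # 6.63 ~ p < 0.01
    ca, cb = clusters(translation_a, n), clusters(translation_b, n)
    ta, tb = sum(ca.values()), sum(cb.values())
    keys = [(c, ca[c], cb[c], log_likelihood(ca[c], cb[c], ta, tb))
            for c in set(ca) | set(cb)]
    return sorted([k for k in keys if k[3] >= threshold], key=lambda k: -k[3])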


Entropy, 2021, Vol 23 (4), pp. 421
Author(s): Dariusz Puchala, Kamil Stokfiszewski, Mykhaylo Yatsymirskyy

In this paper, the authors analyze in more detail an image encryption scheme, proposed in their earlier work, which preserves input image statistics and can be used in connection with the JPEG compression standard. The image encryption process takes advantage of fast linear transforms parametrized with private keys and is carried out prior to the compression stage in a way that does not alter those statistical characteristics of the input image that are crucial for the subsequent compression. This feature makes the encryption process transparent to the compression stage and enables the JPEG algorithm to maintain its full compression capabilities even though it operates on encrypted image data. The main advantage of the considered approach is that the JPEG algorithm can be used without any modifications as part of an encrypt-then-compress image processing framework. The paper includes a detailed mathematical model of the examined scheme, allowing for theoretical analysis of the impact of the image encryption step on the effectiveness of the compression process. A combinatorial and statistical analysis of the encryption process is also included, making it possible to evaluate its cryptographic strength. In addition, the paper considers several practical use-case scenarios with different characteristics of the compression and encryption stages. The final part of the paper contains additional results of experimental studies on the general effectiveness of the presented scheme. The results show that, for a wide range of compression ratios, the considered scheme performs comparably to the JPEG algorithm alone (i.e., without the encryption stage) in terms of the quality measures of reconstructed images. Moreover, the results of statistical analysis, as well as those obtained with generally accepted quality measures for image cryptographic systems, demonstrate the high strength and efficiency of the scheme's encryption stage.
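
The toy sketch below illustrates only the encrypt-then-compress ordering the paper relies on, not the authors' key-parametrized fast linear transforms: a key-seeded shuffle of pixel blocks leaves the global pixel histogram untouched, so a standard, unmodified JPEG encoder can still be applied to the encrypted data (numpy is assumed; image I/O is omitted).

import numpy as np

def shuffle_blocks(img, key, block=8, inverse=False):
    # Key-seeded permutation of non-overlapping pixel blocks.  The global
    # histogram of pixel values is unchanged, so the JPEG encoder still has
    # meaningful data to compress, although spatial correlation (and therefore
    # the achievable ratio) suffers compared to the plain image.
    # img: 2-D numpy array whose sides are multiples of `block`; key: integer seed.
    h, w = img.shape[0] // block, img.shape[1] // block
    blocks = [img[i*block:(i+1)*block, j*block:(j+1)*block].copy()
              for i in range(h) for j in range(w)]
    perm = np.random.default_rng(key).permutation(len(blocks))
    order = np.argsort(perm) if inverse else perm
    out = np.zeros_like(img)
    for dst, src in enumerate(order):
        i, j = divmod(dst, w)
        out[i*block:(i+1)*block, j*block:(j+1)*block] = blocks[src]
    return out

# Encrypt-then-compress: cipher = shuffle_blocks(pixels, key), then JPEG-encode;
# after decoding, shuffle_blocks(decoded, key, inverse=True) restores the layout
# (up to the usual lossy-compression error).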


Author(s): Gyula Zsombok

ABSTRACT In France, English is often perceived as a negative influence on the language by purist institutions such as the French Academy. Terminological commissions have been established to replace foreign expressions with French terminology that is regularly published in the Journal officiel de la République française. Although the Toubon Law of 1994 prescribes the use of this terminology in government publications, other speakers are merely encouraged to adopt it. This article investigates the variation between English lexical borrowings and their prescribed equivalents in a large newspaper corpus containing articles from 2000 to 2017, in order to see whether formal written language complies with the purist recommendations. Time is treated with a new dynamic approach: the probability of using a prescribed term is estimated three years before and three years after official prescription. Fifty-four target terms are selected from the lexical fields of computer science, the entertainment industry, and telecommunications, including emblematic prescribed words such as courriel and mot-dièse. The analysis reveals that prescription is only effective when it follows already attested use. Furthermore, conservative newspapers show higher proportions of recommended terminology, especially compared to newspapers specializing in technology.
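
A hypothetical sketch of the before/after comparison (the corpus interface, counting method, and term pair are placeholders, not the study's data or model): for each year in a window around the date a recommendation appeared in the Journal officiel, compute the share of the prescribed term relative to its English competitor.

from collections import Counter

def prescribed_share(articles, anglicism, prescribed, prescription_year, window=3):
    # articles: iterable of (year, text) pairs from the newspaper corpus.
    # Crude substring counting is used here purely for illustration.
    counts = Counter()
    for year, text in articles:
        offset = year - prescription_year
        if -window <= offset <= window:
            t = text.lower()
            counts[(offset, "prescribed")] += t.count(prescribed)
            counts[(offset, "anglicism")] += t.count(anglicism)
    shares = {}
    for offset in range(-window, window + 1):
        p, a = counts[(offset, "prescribed")], counts[(offset, "anglicism")]
        shares[offset] = p / (p + a) if (p + a) else None  # None: no attestations that year
    return shares

# e.g. prescribed_share(corpus, "hashtag", "mot-dièse", 2013), taking 2013 as the
# year the mot-dièse recommendation was published.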


2020, Vol 8
Author(s): Devasis Bassu, Peter W. Jones, Linda Ness, David Shallcross

Abstract In this paper, we present a theoretical foundation for a representation of a data set as a measure in a very large hierarchically parametrized family of positive measures, whose parameters can be computed explicitly (rather than estimated by optimization), and illustrate its applicability to a wide range of data types. The preprocessing step then consists of representing data sets as simple measures. The theoretical foundation consists of a dyadic product formula representation lemma, and a visualization theorem. We also define an additive multiscale noise model that can be used to sample from dyadic measures and a more general multiplicative multiscale noise model that can be used to perturb continuous functions, Borel measures, and dyadic measures. The first two results are based on theorems in [15, 3, 1]. The representation uses the very simple concept of a dyadic tree and hence is widely applicable, easily understood, and easily computed. Since the data sample is represented as a measure, subsequent analysis can exploit statistical and measure theoretic concepts and theories. Because the representation uses the very simple concept of a dyadic tree defined on the universe of a data set, and the parameters are simply and explicitly computable and easily interpretable and visualizable, we hope that this approach will be broadly useful to mathematicians, statisticians, and computer scientists who are intrigued by or involved in data science, including its mathematical foundations.
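
A minimal one-dimensional sketch of the underlying idea (far simpler than the paper's general construction): a sample in [0, 1) is represented by recording, for each dyadic interval down to a fixed depth, the fraction of its mass falling in the left half. These parameters are computed directly from counts rather than estimated by optimization.

def dyadic_parameters(points, depth=4, lo=0.0, hi=1.0, level=0, params=None):
    # points: sample of values in [lo, hi).  For every dyadic interval down to
    # `depth`, record the fraction of the interval's points falling in its left
    # half -- the explicitly computable parameters of the dyadic measure.
    if params is None:
        params = {}
    if level == depth or not points:
        return params
    mid = (lo + hi) / 2.0
    left = [x for x in points if x < mid]
    params[(lo, hi)] = len(left) / len(points)
    dyadic_parameters(left, depth, lo, mid, level + 1, params)
    dyadic_parameters([x for x in points if x >= mid], depth, mid, hi, level + 1, params)
    return params

# The measure of any dyadic interval is the product of these conditional
# parameters (or their complements) along the path from the root -- the
# "product formula" flavor of the representation.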


Author(s): Ilze Zumente, Nataļja Lāce, Jūlija Bistrova

The goal of this article is to provide evidence on the volume of ESG disclosures of 34 companies listed on the NASDAQ Baltic stock exchange. It provides a broad view of non-financial disclosure thoroughness and offers conclusions on the key characteristics of the Baltic listed companies in terms of ESG. By performing content analysis of the publicly available reports based on 106 ESG criteria and statistical analysis of the retrieved data, the disclosure patterns across reporting dimensions, industries, and company characteristics are analyzed. The authors find a wide range of ESG transparency scores (8% to 67%), with an average of 41%. On aggregate, the governance and social dimensions are reported better (49% and 44%) than the environmental dimension (24%). Correlation analysis between the ESG disclosure score and selected financial metrics reveals that the score correlates with the firm's market capitalization.
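
An illustrative sketch with invented numbers (the study's 106-criterion checklist and company data are not reproduced here): score each company as the share of criteria it discloses, then compute the Pearson correlation between the score and market capitalization.

from statistics import correlation  # Python 3.10+ (Pearson's r)

def disclosure_score(disclosed_criteria, total_criteria=106):
    # Share of the checklist criteria a company actually reports on.
    return len(disclosed_criteria) / total_criteria

companies = {  # invented example data: {name: (criteria disclosed, market cap in MEUR)}
    "Firm A": ({"GHG emissions", "board diversity"}, 250.0),
    "Firm B": ({"GHG emissions", "audit policy", "waste management"}, 890.0),
    "Firm C": (set(), 35.0),
}
scores = [disclosure_score(criteria) for criteria, _ in companies.values()]
caps = [cap for _, cap in companies.values()]
print(correlation(scores, caps))  # Pearson correlation between disclosure and size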


Linguistics, 2021

Register research has been approached from differing theoretical and methodological perspectives, resulting in different definitions of the term register. In the text-linguistic approach, which is the primary focus of this bibliography, register refers to text varieties that are defined by their situational characteristics, such as the purpose of writing and the mode of communication, among others. Texts that are similar in their situational characteristics also tend to share similar linguistic profiles, as situational characteristics motivate or require the use of specific linguistic features. Text-linguistic research on register tends to focus on two aspects: attempts to describe a register, or attempts to understand patterns of register variation. This research proceeds via comparative analyses, specific examinations of single linguistic features or situational parameters, and often via examinations of the co-occurrence of linguistic features analyzed from a functional perspective. That is, certain lexico-grammatical features co-occur in a given text because together they serve important communicative functions that are motivated by the situational characteristics of the text (e.g., communicative purpose, mode, setting, interactivity). Furthermore, corpus methods are often relied upon in register studies, which allows for large-scale examinations of both general and specialized registers. Thus, the bibliography gives priority to research that uses corpus tools and methods. Finally, while the broadest examinations of register focus on the distinction between written and spoken domains, additional divisions of register studies fall under the categories of written registers, spoken registers, academic registers, historical registers, and electronic/online registers. This bibliography primarily introduces some of the key resources on English registers, a decision made to reach a broader audience.


2018, Vol 16 (1), pp. 78-96
Author(s): Ohad Abudraham

Abstract This article presents four new linguistic features that link Early-Mandaic and Neo-Mandaic: 1. Diphthongisation and fortition of long vowels ū/ī (ࡈࡁࡅࡊࡕࡀ ṭbukta instead of ࡈࡀࡁࡅࡕࡀ ṭabuta “grace”, ࡀࡓࡁࡉࡊࡕࡉࡍࡊࡉࡀ arbiktinkia instead of ࡀࡓࡁࡉࡕࡉࡍࡊࡉࡀ arbitinkia “four of you [f.pl.]”); 2. Apheresis of y in the gentilic noun ‮יהודיא‬‎ (ࡄࡅࡃࡀࡉࡉࡀ hudaiia “Jews”); 3. Assimilation of z in the root ʾzl (ࡕࡏࡋࡅࡍ tʿlun “you [m.pl.] will go”); and 4. Internal analogy in the system of cardinal numbers (ࡕࡀࡓࡕࡀ tarta “two”). The presence of these forms in the two extreme phases of the language, as opposed to their almost total absence in the canonical collections of Mandaic scriptures, proves not only the ancient origin of some Neo-Mandaic peculiarities but also the wide range of varieties of Mandaic that flourished in Mesopotamia in Late Antiquity.


2018, Vol 31 (4), pp. 583-602
Author(s): Stephen Skalicky

Abstract Satire is a type of discourse commonly employed to mock or criticize a satirical target, typically resulting in humor. Current understandings of satire place strong emphasis on the role that background and pragmatic knowledge play during satire recognition. However, there may also be specific linguistic cues that signal satirical intent. Researchers using corpus linguistic methods, specifically Lexical Priming, have demonstrated that other types of creative language use, such as irony, puns, and verbal jokes, purposefully deviate from expected language patterns (e.g. collocations). The purpose of this study is to investigate whether humorous satirical headlines also subvert typical linguistic patterns, using the theory of Lexical Priming. To do so, a corpus of newspaper headlines taken from the satirical American newspaper The Onion is analyzed and compared to a generalized corpus of American English. Results of this analysis suggest that satirical headlines exploit linguistic expectations through the use of low-frequency collocations and semantic preferences, but also contain higher-level discourse and genre deviations that cannot be captured in the surface-level linguistic features of the headlines.
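
A rough sketch of how the collocational side of such a comparison can be operationalized (the Mutual Information measure, threshold, and reference-corpus counts are illustrative assumptions, not the study's exact method): headline bigrams that are unattested or only weakly associated in a general reference corpus are flagged as deviations from primed expectations.

import math
import re

def bigrams(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))

def low_expectation_bigrams(headline, ref_unigrams, ref_bigrams, ref_size, mi_floor=3.0):
    # ref_unigrams / ref_bigrams: frequency dicts (e.g. collections.Counter) built
    # from a general reference corpus; ref_size: its total token count.  Bigrams the
    # reference corpus never attests, or that fall below the MI floor, are returned.
    unexpected = []
    for w1, w2 in bigrams(headline):
        pair = ref_bigrams[(w1, w2)]
        if pair == 0:
            unexpected.append(((w1, w2), None))  # unattested in the reference corpus
            continue
        mi = math.log2(pair * ref_size / (ref_unigrams[w1] * ref_unigrams[w2]))
        if mi < mi_floor:
            unexpected.append(((w1, w2), mi))
    return unexpected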


2020, pp. 109-117
Author(s): Wei Hung, Yao-Wen Hsu

This analysis focuses on the effects of Information Technology (IT) and how it affects Supply Chain Management (SCM) in logistics and manufacturing Small and Medium-Sized Enterprises (SMEs). Beyond that, our purpose is to evaluate how IT affects Organizational Performance (OP) in these enterprises. Although IT cannot be applied in every enterprise, the statistical analysis underlying the findings shows that a wide range of the modern workforce has adopted IT initiatives, both to cope with the complexities of SCM and, above all, to maximize OP. The research is based on an analysis of SMEs in the logistics and manufacturing sector in India. The sample used in this research makes it reasonable to assume that managers and CEOs are responsible for coordinating enterprise operations in SMEs. The evaluation shows that the workforce is obliged to formulate strategies that allow employees to enhance their IT competency. In that regard, the findings are essential for improving decision-making processes, SCM, and OP.

