Computational stylistics and authorship attribution: what it measures and why it works

2020 ◽  
Author(s):  
Jeremi Ochab

This thesis concerns computational methods for measuring authorial style and algorithms for authorship attribution.

The first aim of the thesis was to attempt a quantifiable separation of different layers of authorial style (here, the lexical and grammatical layers) in order to estimate their influence on the results of a chosen attribution method. Within the scope of these studies I compared the distance, the so-called Burrows's Delta, between pairs of English novels by two chosen authors and automatically generated texts whose statistical distributions of parts of speech were borrowed from one author while the vocabulary was borrowed from the other; additionally, in these artificial texts I retained the first author's words when they belonged to a selected part of speech. This procedure made it possible to create a hybrid text that was attributed to the first author even though the majority of its lexical items came from the second author.

The second aim was to identify the influence of the style and language of the original on the style of the translation. This part of the research involved, among other things, adapting the Polish and English part-of-speech tag sets to form a common translatorial tag set. Besides making a few simple observations concerning the distributions and co-occurrences of parts of speech in the two languages, I was able to determine some features of the selected translatorial corpus that lie on the fringes of what appears to be the norm for Polish.

The third aim was to test the accuracy of state-of-the-art (unsupervised) clustering methods for automatically grouping texts by author. The results show that these methods recognise authorship less accurately than the known supervised machine-learning methods.

In the thesis I made use of corpora totalling around 550 digitised English-language novels and 100 Polish ones, as well as a parallel corpus of 39 novels by a single English author together with their translations by a single Polish translator. The research involved existing part-of-speech taggers (for both English and Polish), authorship attribution programmes, and programmes for graph clustering.
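Burrows's Delta, the distance used in the first study, is simple enough to sketch. The following is a minimal illustration, not the thesis's own code; all function and variable names are assumptions, and texts are represented as pre-tokenised word lists:

```python
import statistics
from collections import Counter

def burrows_delta(text_a, text_b, corpus, n_mfw=150):
    """Burrows's Delta between two tokenised texts, z-scoring word
    frequencies against a reference corpus (a list of tokenised texts).
    Illustrative only; n_mfw in the 100-500 range is typical in practice."""
    def rel_freqs(tokens):
        total = len(tokens)
        return {w: c / total for w, c in Counter(tokens).items()}

    # The n most frequent words across the whole reference corpus.
    mfw = [w for w, _ in Counter(t for doc in corpus for t in doc).most_common(n_mfw)]

    # Per-word mean and standard deviation of relative frequency over the corpus.
    doc_freqs = [rel_freqs(doc) for doc in corpus]
    mean = {w: statistics.mean(f.get(w, 0.0) for f in doc_freqs) for w in mfw}
    sd = {w: statistics.pstdev(f.get(w, 0.0) for f in doc_freqs) or 1.0 for w in mfw}

    fa, fb = rel_freqs(text_a), rel_freqs(text_b)

    def z(freqs, w):
        return (freqs.get(w, 0.0) - mean[w]) / sd[w]

    # Delta is the mean absolute difference of z-scores over the MFW list.
    return sum(abs(z(fa, w) - z(fb, w)) for w in mfw) / len(mfw)
```

Smaller Delta values indicate more similar usage of the most frequent words; because those are overwhelmingly function words, the measure is dominated by grammar rather than topic, which is what makes the lexical/grammatical hybrid experiment above informative.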

2015 ◽  
Vol 28 (33) ◽  
pp. 89-98
Author(s):  
Kazimierz Luciński

The paper focuses on the loanword “fake”, borrowed from English, which has become very popular on Russian soil. The author shows the derivational productivity of the word, which has led to the formation of new words belonging to other parts of speech: the verb “фейковать”, the adjective “фейковый”, and the noun “фейковость”. Each is analysed in terms of the paradigmatic relations in which the word participates, along with its sociolinguistic characteristics, such as its field of use and the social strata of the native speakers who use it. The author does not limit himself to an ordinary linguistic description of this loanword and its sense correlates; instead, he tries to present the socio-cultural peculiarities of reality that made the wide use of this English loanword possible.


2018 ◽  
Vol 2018 (1) ◽  
pp. 127-144 ◽  
Author(s):  
Lucy Simko ◽  
Luke Zettlemoyer ◽  
Tadayoshi Kohno

Source code attribution classifiers have recently become powerful. We consider the possibility that an adversary could craft code with the intention of causing a misclassification, i.e., creating a forgery of another author’s programming style in order to hide the forger’s own identity or blame the other author. We find that it is possible for a non-expert adversary to defeat such a system. In order to inform the design of adversarially resistant source code attribution classifiers, we conduct two studies with C/C++ programmers to explore the potential tactics and capabilities both of such adversaries and, conversely, of human analysts doing source code authorship attribution. Through the quantitative and qualitative analysis of these studies, we (1) evaluate a state-of-the-art machine classifier against forgeries, (2) evaluate programmers as human analysts/forgery detectors, and (3) compile a set of modifications made to create forgeries. Based on our analyses, we then suggest features that future source code attribution systems might incorporate in order to be adversarially resistant.
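To make concrete what such classifiers operate on, the toy sketch below extracts one common family of stylometric features, character n-gram frequencies, from a source file. It is an illustration only, not the classifier evaluated in the paper; the function name and parameters are assumptions, and real attribution systems combine lexical, layout, and syntactic (AST-derived) features:

```python
from collections import Counter

def char_ngram_features(source_code: str, n: int = 3, top_k: int = 500):
    """Relative frequencies of the top_k most common character n-grams.
    A toy stand-in for the richer feature sets attribution classifiers use."""
    grams = Counter(source_code[i:i + n]
                    for i in range(len(source_code) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(top_k)}
```

In these terms, a forger is someone who shifts such distributions toward another author's without breaking the program.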


2018 ◽  
Vol 54 (8) ◽  
pp. 971-988
Author(s):  
Joost Jansen

While the practice of nationality swapping in sport dates back as far as the ancient Olympics, it appears to have increased over the past decades. Cases of Olympic athletes who switch their national allegiance are often surrounded by controversy. Two strands of thought could help explain this controversy. First, these cases are believed to be indicative of the marketisation of citizenship. Second, they challenge established discourses of national identity, as the question ‘who may represent the nation?’ becomes contested. Using state-of-the-art machine learning techniques, I analysed 1534 English-language newspaper articles about Olympic athletes who changed their nationality (1978–2017). The results indicate: (i) that switching national allegiance has not necessarily become more controversial; (ii) that most media reports do not frame nationality switching in economic terms; and (iii) that nationality swapping often goes fairly unnoticed. I therefore conclude that a marketisation of citizenship is less apparent in nationality switching than some claim. Moreover, nationality switches are often mentioned rather casually, indicating the generally banal character of nationalism. Only under certain conditions does ‘hot’ nationalism spark the issue of nationhood.


Author(s):  
Alan Libert

Interjections are one of the traditional parts of speech (along with nouns, verbs, etc.), although some linguists have considered them not to be a part of language but rather instinctive reactions to a situation. The word interjection comes from the Latin interjicere “to throw between,” as interjections were seen as words tossed into a sentence without being syntactically related to other items. Examples of English interjections are oh!, ah!, ugh!, and ouch! Interjections such as these, which are not (zero-)derived from words belonging to other parts of speech and which have only an interjectional function, are called primary interjections; interjections that have evolved from words of other classes and have retained their original function in addition to their new one are known as secondary interjections. Secondary interjections are often swear words, e.g. shit!, or religious terms, e.g. Jesus! Some (putative) interjections, interjectional phrases, consist of more than one word, e.g. my God!; they could be problematic for the view that interjections are a word class or part of speech. Interjections have received considerably less attention from linguists than the other parts of speech. This may be due, in part, to the view just mentioned that they are not really linguistic items and are thus of little or no interest from a linguistic point of view. However, to say that they have been neglected, as some authors do, is an overstatement; as can be seen in this article, scholars have been thinking and writing about different aspects of interjections for a long time (note that this article covers only works devoted, at least in large part, to interjections, not works on other subjects that also discuss them). One will thus find here works on the phonetics/phonology, syntax, semantics, and pragmatics of interjections, among other topics. There does, however, seem to be one gap in the literature: few, if any, papers focus on the morphology of interjections. A problem in compiling a bibliography on interjections is that authors disagree on what should be included in the set of interjections; for example, are onomatopoeias interjections (and should works on onomatopoeias therefore be included)? In this article a conservative policy has been adopted, and works dealing only with onomatopoeias (or greetings, etc.) have been excluded.


2015 ◽  
pp. 15-33
Author(s):  
Jadwiga Wajszczuk

Functional class (so-called “part of speech”) assignment as a kind of meaning-bound word syntactic information

The traditional division of the lexicon into parts of speech, which seems to satisfy the requirements of a syntactic description on the one hand and of a word-formation description on the other, cannot be regarded as the result of a strict classification that covers the totality of the lexicon and rests on a coherent set of criteria. Making the criteria more precise, or correcting them, is an issue of extreme importance and urgency in work on the theory of language; progress here can help solve many other problems, syntactic ones in particular. The article presents a scheme of several preliminary steps of such an amelioration program (a scheme improved over the author’s earlier attempts in the same direction). The program is based on the combinability characteristics of words, i.e. on those properties that are responsible for the tasks a given class of expressions accomplishes in making up a higher-order unit, the syntagm (the author emphasises that it is the syntagm, rather than the sentence, on which the recommended approach focuses), and that, importantly, determine the limits of syntactic rules, i.e. where the rules apply and where they do not (limits concerning the overall stock of words).


SOCIETY ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 20-26
Author(s):  
Najamuddin Najamuddin

Conjunctions are used to give a sentence cohesion and coherence within a text; the absence of the right conjunction results in an illogical meaning and an unclear message. Because of the important role of conjunctions in the writing process, this study aims to reveal students’ common errors in the use of conjunctions in their writing and to investigate the types of errors that occur most frequently. In grammar, a conjunction (abbreviated CONJ or CNJ) is a part of speech that connects words, phrases, or clauses, which are called the conjuncts of the conjoining construction. The term discourse marker is mostly used for conjunctions joining sentences. This definition may overlap with that of other parts of speech, so what constitutes a “conjunction” must be defined for each language. In general, a conjunction is an invariable grammatical particle, and it may or may not stand between the items it conjoins. A knowledge of words, phrases, and clauses is essential to good writing and speaking, but this does not mean that the other parts of grammar may be neglected. The point of correct writing and speaking is to be able to concentrate on what we are saying rather than on how we are saying it.


2016 ◽  
Vol 4 (7) ◽  
pp. 240-247
Author(s):  
Savith Vongsena ◽  
Nutprapha Dennis

The purpose of this study is to analyse English vocabulary usage in online Lao recipes from a cooking website, specifically the frequency of eight categories of parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and determiner. The study is survey research; the data consist of 21 Lao recipes selected for analysis. The researcher used the frequency and percentage of each part of speech as the unit of analysis, with a view to making the teaching of English more effective. Summarising the usage frequency across the 21 recipes, the study found that nouns were the most frequent part of speech in recipe writing, with a count of 1390 words, or 42.78% of the total. That is more than twice the frequency of verbs, which ranked second among the eight parts of speech with a total of 594 words, or 18.28%. In contrast, determiners were the least frequently used part of speech in the selected recipes, with a total word count of 46, or 1.44%.
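An analysis of this kind is straightforward to reproduce with off-the-shelf tools. Below is a minimal sketch, assuming NLTK with its tokeniser and tagger models installed; the mapping from Penn Treebank tags to the study's eight categories is a simplification (for instance, the tag IN covers subordinating conjunctions as well as prepositions), and the function name is illustrative:

```python
from collections import Counter
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

# Map Penn Treebank tags onto the eight categories used in the study.
CATEGORY = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "PRP": "pronoun", "PRP$": "pronoun", "WP": "pronoun",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb",
    "VBP": "verb", "VBZ": "verb",
    "RB": "adverb", "RBR": "adverb", "RBS": "adverb",
    "IN": "preposition", "CC": "conjunction", "DT": "determiner",
}

def pos_percentages(recipe_texts):
    """Tag each recipe and return each category's share of all
    categorised tokens, as a percentage."""
    counts = Counter()
    for text in recipe_texts:
        for _, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag in CATEGORY:
                counts[CATEGORY[tag]] += 1
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.most_common()}
```

Called on a list of 21 recipe strings, this returns a ranking of the eight categories comparable to the frequencies and percentages reported above.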


2020 ◽  
Vol 34 (07) ◽  
pp. 12597-12604 ◽  
Author(s):  
Fengxiang Yang ◽  
Ke Li ◽  
Zhun Zhong ◽  
Zhiming Luo ◽  
Xing Sun ◽  
...  

Person re-identification (re-ID) is a challenging task due to the high variance within identity samples and across imaging conditions. Although recent advances in deep learning have achieved remarkable accuracy in settled scenes, i.e., the source domain, few works generalize well to an unseen target domain. One popular solution is to assign unlabeled target images pseudo-labels by clustering and then retrain the model. However, clustering methods tend to introduce noisy labels and to discard low-confidence samples as outliers, which may hinder the retraining process and thus limit the generalization ability. In this study, we argue that by explicitly adding a sample-filtering procedure after the clustering, the mined examples can be used much more efficiently. To this end, we design an asymmetric co-teaching framework, which resists noisy labels by having two models cooperate to select data with probably clean labels for each other. Meanwhile, one of the models receives samples that are as pure as possible, while the other takes in samples that are as diverse as possible. This procedure encourages the selected training samples to be both clean and diverse, and lets the two models promote each other iteratively. Extensive experiments show that the proposed framework consistently benefits most clustering-based methods and boosts the state-of-the-art adaptation accuracy. Our code is available at https://github.com/FlyingRoastDuck/ACT_AAAI20.
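The cooperative selection step can be sketched as follows: each network keeps, for its peer, the fraction of the batch on which its own loss is smallest, on the intuition that low-loss samples are more likely to carry clean pseudo-labels. This is a schematic PyTorch illustration of symmetric small-loss co-teaching, not the authors' released code (see the repository above); it omits the paper's asymmetry, in which one model is fed pure samples and the other diverse ones, and keep_ratio and all names are assumptions:

```python
import torch

def select_small_loss(logits_a, logits_b, labels, keep_ratio=0.8):
    """Each model keeps the batch samples on which the *other* model's
    loss is smallest, i.e. those the peer considers most likely clean."""
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    loss_a = loss_fn(logits_a, labels)  # per-sample losses of model A
    loss_b = loss_fn(logits_b, labels)  # per-sample losses of model B
    k = max(1, int(keep_ratio * len(labels)))
    keep_for_a = torch.topk(-loss_b, k).indices  # B picks training data for A
    keep_for_b = torch.topk(-loss_a, k).indices  # A picks training data for B
    return keep_for_a, keep_for_b
```

Training each model only on the indices its peer selected is what gives the scheme its resistance to the noisy labels produced by clustering.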


Author(s):  
Bozinka Petronijevic

Drawing on a large corpus of Serbian and German, this paper examines whether the part of speech in question, the adverb, exists as such in each of these languages, and whether it is a universal or a language-specific category. The research shows that most languages recognise the adverb as a distinct part of speech, which implies that it is a universal category definable by the following criteria: a) morphological (adverbs are uninflected, though the relevant subclass undergoes comparison) and syntactic (they are conditioned by verbs as their nucleus, in most cases assuming the function of adverbials as verb complements); b) a rare attributive function before nouns and before adverbs themselves; c) differences between individual languages, German and Serbian included, which result from their respective word-formation systems. In this particular case, each of the two languages has relatively few simple adverbs (simplicia); on the other hand, explicit (suffixal) derivation is highly productive in Serbian, whereas the situation in German is the complete opposite (although the process is attested there as well); finally, adverb derivatives in Serbian correspond, as a rule, to adjectives and prepositional phrases functioning as adverbials in German.


2016 ◽  
Vol 1 (16) ◽  
pp. 15-27 ◽  
Author(s):  
Henriette W. Langdon ◽  
Terry Irvine Saenz

The number of English Language Learners (ELLs) is increasing in all regions of the United States. Although the majority (71%) speak Spanish as their first language, the remaining 29% may speak any of 100 or more different languages. In spite of an increasing number of speech-language pathologists (SLPs) who can provide bilingual services, the likelihood of a match between a given student's primary language and an SLP's is rather minimal. The second-best option is to work with a trained language interpreter in the student's language. However, very frequently, the available interpreter is bilingual but not trained for the job.

