Computational stylistics and authorship attribution: what it measures and why it works

2020 ◽  
Author(s):  
Jeremi Ochab

This thesis concerns computational methods for measuring authorial style and algorithms for authorship attribution.

The first aim of the thesis was to attempt a quantifiable separation of different layers of authorial style (here, the lexical and grammatical layers) in order to estimate their influence on the results of a chosen attribution method. Within the scope of these studies I compared the distance, the so-called Burrows's Delta, between pairs of English novels by two chosen authors and automatically generated texts whose statistical distributions of parts of speech were borrowed from one author while the vocabulary was borrowed from the other; additionally, in these artificial texts I retained the first author's words when they belonged to a selected part of speech. This procedure made it possible to create a hybrid text that was attributed to the first author even though the majority of its lexical items came from the second author.

The second aim was to identify the influence of the style and language of the original on the style of the translation. This part of the research involved, among other things, adapting the Polish and English part-of-speech tag sets to form a common translatorial tag set. Besides making a few simple observations concerning the distributions and co-occurrences of parts of speech in the two languages, I was able to determine some features of the selected translatorial corpus that lie on the fringes of what appears to be the norm for Polish.

The third aim was to test the accuracy of state-of-the-art (unsupervised) clustering methods for automatically grouping texts by author. The results show that these methods recognise authorship less accurately than the known supervised machine-learning methods.

In the thesis I made use of corpora totalling around 550 digitised English-language novels and 100 Polish ones, as well as a parallel corpus of 39 novels by a single English author together with their translations by a single Polish translator. The research involved existing part-of-speech taggers (for both English and Polish), authorship attribution programmes, and programmes for graph clustering.
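Burrows's Delta, the distance used in the first study, is simple enough to sketch. The following is a minimal illustration, not the thesis's own code; all function and variable names are assumptions, and texts are represented as pre-tokenised word lists:

```python
import statistics
from collections import Counter

def burrows_delta(text_a, text_b, corpus, n_mfw=150):
    """Burrows's Delta between two tokenised texts, z-scoring word
    frequencies against a reference corpus (a list of tokenised texts).
    Illustrative only; n_mfw in the 100-500 range is typical in practice."""
    def rel_freqs(tokens):
        total = len(tokens)
        return {w: c / total for w, c in Counter(tokens).items()}

    # The n most frequent words across the whole reference corpus.
    mfw = [w for w, _ in Counter(t for doc in corpus for t in doc).most_common(n_mfw)]

    # Per-word mean and standard deviation of relative frequency over the corpus.
    doc_freqs = [rel_freqs(doc) for doc in corpus]
    mean = {w: statistics.mean(f.get(w, 0.0) for f in doc_freqs) for w in mfw}
    sd = {w: statistics.pstdev(f.get(w, 0.0) for f in doc_freqs) or 1.0 for w in mfw}

    fa, fb = rel_freqs(text_a), rel_freqs(text_b)

    def z(freqs, w):
        return (freqs.get(w, 0.0) - mean[w]) / sd[w]

    # Delta is the mean absolute difference of z-scores over the MFW list.
    return sum(abs(z(fa, w) - z(fb, w)) for w in mfw) / len(mfw)
```

Smaller Delta values indicate more similar usage of the most frequent words; because those are overwhelmingly function words, the measure is dominated by grammar rather than topic, which is what makes the lexical/grammatical hybrid experiment above informative.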

2015 ◽  
Vol 28 (33) ◽  
pp. 89-98
Author(s):  
Kazimierz Luciński

The paper focuses on the loanword “fake”, borrowed from English, which has become very popular on Russian soil. The author shows the derivational productivity of the word, which has led to the formation of new words belonging to other parts of speech: the verb “фейковать”, the adjective “фейковый”, and the noun “фейковость”. Each is analysed in terms of the paradigmatic relations in which the word participates, along with its sociolinguistic characteristics, such as its field of use and the social strata of the native speakers who use it. The author does not limit himself to an ordinary linguistic description of this loanword and its sense correlates; instead, he tries to present the socio-cultural peculiarities of reality that made the wide use of this English loanword possible.


2018 ◽  
Vol 2018 (1) ◽  
pp. 127-144 ◽  
Author(s):  
Lucy Simko ◽  
Luke Zettlemoyer ◽  
Tadayoshi Kohno

Source code attribution classifiers have recently become powerful. We consider the possibility that an adversary could craft code with the intention of causing a misclassification, i.e., creating a forgery of another author’s programming style in order to hide the forger’s own identity or blame the other author. We find that it is possible for a non-expert adversary to defeat such a system. In order to inform the design of adversarially resistant source code attribution classifiers, we conduct two studies with C/C++ programmers to explore the potential tactics and capabilities both of such adversaries and, conversely, of human analysts doing source code authorship attribution. Through the quantitative and qualitative analysis of these studies, we (1) evaluate a state-of-the-art machine classifier against forgeries, (2) evaluate programmers as human analysts/forgery detectors, and (3) compile a set of modifications made to create forgeries. Based on our analyses, we then suggest features that future source code attribution systems might incorporate in order to be adversarially resistant.
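To make concrete what such classifiers operate on, the toy sketch below extracts one common family of stylometric features, character n-gram frequencies, from a source file. It is an illustration only, not the classifier evaluated in the paper; the function name and parameters are assumptions, and real attribution systems combine lexical, layout, and syntactic (AST-derived) features:

```python
from collections import Counter

def char_ngram_features(source_code: str, n: int = 3, top_k: int = 500):
    """Relative frequencies of the top_k most common character n-grams.
    A toy stand-in for the richer feature sets attribution classifiers use."""
    grams = Counter(source_code[i:i + n]
                    for i in range(len(source_code) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(top_k)}
```

In these terms, a forger is someone who shifts such distributions toward another author's without breaking the program.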


2018 ◽  
Vol 54 (8) ◽  
pp. 971-988
Author(s):  
Joost Jansen

While the practice of nationality swapping in sport dates back as far as the ancient Olympics, it appears to have increased over the past decades. Cases of Olympic athletes who switch their national allegiance are often surrounded by controversy. Two strands of thought could help explain this controversy. First, these cases are believed to be indicative of the marketisation of citizenship. Second, they challenge established discourses of national identity, as the question ‘who may represent the nation?’ becomes contested. Using state-of-the-art machine learning techniques, I analysed 1534 English-language newspaper articles about Olympic athletes who changed their nationality (1978–2017). The results indicate: (i) that switching national allegiance has not necessarily become more controversial; (ii) that most media reports do not frame nationality switching in economic terms; and (iii) that nationality swapping often goes fairly unnoticed. I therefore conclude that a marketisation of citizenship is less apparent in nationality switching than some claim. Moreover, nationality switches are often mentioned rather casually, indicating the generally banal character of nationalism. Only under certain conditions does ‘hot’ nationalism spark the issue of nationhood.


Author(s):  
Alan Libert

Interjections are one of the traditional parts of speech (along with nouns, verbs, etc.), although some linguists have considered them not to be a part of language but rather instinctive reactions to a situation. The word interjection comes from the Latin interjicere “to throw between,” as interjections were seen as words tossed into a sentence without being syntactically related to other items. Examples of English interjections are oh!, ah!, ugh!, and ouch! Interjections such as these, which are not (zero-)derived from words belonging to other parts of speech and which have only an interjectional function, are called primary interjections; interjections that have evolved from words of other classes and have retained their original function in addition to their new one are known as secondary interjections. Secondary interjections are often swear words, e.g. shit!, or religious terms, e.g. Jesus! Some (putative) interjections, interjectional phrases, consist of more than one word, e.g. my God!; they could be problematic for the view that interjections are a word class or part of speech. Interjections have received considerably less attention from linguists than the other parts of speech. This may be due, in part, to the view just mentioned that they are not really linguistic items and are thus of little or no interest from a linguistic point of view. However, to say that they have been neglected, as some authors do, is an overstatement; as can be seen in this article, scholars have been thinking and writing about different aspects of interjections for a long time (note that this article covers only works devoted, at least in large part, to interjections, not works on other subjects that also discuss them). One will thus find here works on the phonetics/phonology, syntax, semantics, and pragmatics of interjections, among other topics. There does, however, seem to be one gap in the literature: few, if any, papers focus on the morphology of interjections. A problem in compiling a bibliography on interjections is that authors disagree on what should be included in the set of interjections; for example, are onomatopoeias interjections (and should works on onomatopoeias therefore be included)? In this article a conservative policy has been adopted, and works dealing only with onomatopoeias (or greetings, etc.) have been excluded.


2015 ◽  
pp. 15-33
Author(s):  
Jadwiga Wajszczuk

Functional class (so-called “part of speech”) assignment as a kind of meaning-bound word syntactic information

The traditional division of the lexicon into parts of speech, which seems to satisfy the requirements of a syntactic description on the one hand and of a word-formation description on the other, cannot be regarded as the result of a strict classification that covers the totality of the lexicon and rests on a coherent set of criteria. Making the criteria more precise, or correcting them, is an issue of extreme importance and urgency in work on the theory of language; progress here can help solve many other problems, syntactic ones in particular. The article presents a scheme of several preliminary steps of such an amelioration program (a scheme improved over the author’s earlier attempts in the same direction). The program is based on the combinability characteristics of words, i.e. on those properties that are responsible for the tasks a given class of expressions accomplishes in making up a higher-order unit, the syntagm (the author emphasises that it is the syntagm, rather than the sentence, on which the recommended approach focuses), and that, importantly, determine the limits of syntactic rules, i.e. where the rules apply and where they do not (limits concerning the overall stock of words).


SOCIETY ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 20-26
Author(s):  
Najamuddin Najamuddin

Conjunctions are used to give a sentence cohesion and coherence within a text; the absence of the right conjunction results in an illogical meaning and an unclear message. Because of the important role of conjunctions in the writing process, this study aims to reveal students’ common errors in the use of conjunctions in their writing and to investigate the types of errors that occur most frequently. In grammar, a conjunction (abbreviated CONJ or CNJ) is a part of speech that connects words, phrases, or clauses, which are called the conjuncts of the conjoining construction. The term discourse marker is mostly used for conjunctions joining sentences. This definition may overlap with that of other parts of speech, so what constitutes a “conjunction” must be defined for each language. In general, a conjunction is an invariable grammatical particle, and it may or may not stand between the items it conjoins. A knowledge of words, phrases, and clauses is essential to good writing and speaking, but this does not mean that the other parts of grammar may be neglected. The point of correct writing and speaking is to be able to concentrate on what we are saying rather than on how we are saying it.


2016 ◽  
Vol 4 (7) ◽  
pp. 240-247
Author(s):  
Savith Vongsena ◽  
Nutprapha Dennis

The purpose of this study is to analyse English vocabulary usage in online Lao recipes from a cooking website, specifically the frequency of eight categories of parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and determiner. The study is survey research; the data consist of 21 Lao recipes selected for analysis. The researcher used the frequency and percentage of each part of speech as the unit of analysis, with a view to making the teaching of English more effective. Summarising the usage frequency across the 21 recipes, the study found that nouns were the most frequent part of speech in recipe writing, with a count of 1390 words, or 42.78% of the total. That is more than twice the frequency of verbs, which ranked second among the eight parts of speech with a total of 594 words, or 18.28%. In contrast, determiners were the least frequently used part of speech in the selected recipes, with a total word count of 46, or 1.44%.
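An analysis of this kind is straightforward to reproduce with off-the-shelf tools. Below is a minimal sketch, assuming NLTK with its tokeniser and tagger models installed; the mapping from Penn Treebank tags to the study's eight categories is a simplification (for instance, the tag IN covers subordinating conjunctions as well as prepositions), and the function name is illustrative:

```python
from collections import Counter
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

# Map Penn Treebank tags onto the eight categories used in the study.
CATEGORY = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "PRP": "pronoun", "PRP$": "pronoun", "WP": "pronoun",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb",
    "VBP": "verb", "VBZ": "verb",
    "RB": "adverb", "RBR": "adverb", "RBS": "adverb",
    "IN": "preposition", "CC": "conjunction", "DT": "determiner",
}

def pos_percentages(recipe_texts):
    """Tag each recipe and return each category's share of all
    categorised tokens, as a percentage."""
    counts = Counter()
    for text in recipe_texts:
        for _, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag in CATEGORY:
                counts[CATEGORY[tag]] += 1
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.most_common()}
```

Called on a list of 21 recipe strings, this returns a ranking of the eight categories comparable to the frequencies and percentages reported above.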


2020 ◽  
Vol 34 (07) ◽  
pp. 12597-12604 ◽  
Author(s):  
Fengxiang Yang ◽  
Ke Li ◽  
Zhun Zhong ◽  
Zhiming Luo ◽  
Xing Sun ◽  
...  

Person re-identification (re-ID) is a challenging task due to the high variance within identity samples and across imaging conditions. Although recent advances in deep learning have achieved remarkable accuracy in settled scenes, i.e., the source domain, few works generalize well to an unseen target domain. One popular solution is to assign unlabeled target images pseudo-labels by clustering and then retrain the model. However, clustering methods tend to introduce noisy labels and to discard low-confidence samples as outliers, which may hinder the retraining process and thus limit the generalization ability. In this study, we argue that by explicitly adding a sample-filtering procedure after the clustering, the mined examples can be used much more efficiently. To this end, we design an asymmetric co-teaching framework, which resists noisy labels by having two models cooperate to select data with probably clean labels for each other. Meanwhile, one of the models receives samples that are as pure as possible, while the other takes in samples that are as diverse as possible. This procedure encourages the selected training samples to be both clean and diverse, and lets the two models promote each other iteratively. Extensive experiments show that the proposed framework consistently benefits most clustering-based methods and boosts the state-of-the-art adaptation accuracy. Our code is available at https://github.com/FlyingRoastDuck/ACT_AAAI20.
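The cooperative selection step can be sketched as follows: each network keeps, for its peer, the fraction of the batch on which its own loss is smallest, on the intuition that low-loss samples are more likely to carry clean pseudo-labels. This is a schematic PyTorch illustration of symmetric small-loss co-teaching, not the authors' released code (see the repository above); it omits the paper's asymmetry, in which one model is fed pure samples and the other diverse ones, and keep_ratio and all names are assumptions:

```python
import torch

def select_small_loss(logits_a, logits_b, labels, keep_ratio=0.8):
    """Each model keeps the batch samples on which the *other* model's
    loss is smallest, i.e. those the peer considers most likely clean."""
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    loss_a = loss_fn(logits_a, labels)  # per-sample losses of model A
    loss_b = loss_fn(logits_b, labels)  # per-sample losses of model B
    k = max(1, int(keep_ratio * len(labels)))
    keep_for_a = torch.topk(-loss_b, k).indices  # B picks training data for A
    keep_for_b = torch.topk(-loss_a, k).indices  # A picks training data for B
    return keep_for_a, keep_for_b
```

Training each model only on the indices its peer selected is what gives the scheme its resistance to the noisy labels produced by clustering.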


Author(s):  
Bozinka Petronijevic

Drawing on a large corpus of Serbian and German, this paper examines whether the part of speech in question, the adverb, exists as such in each of these languages, and whether it is a universal or a language-specific category. The research shows that most languages recognise the adverb as a distinct part of speech, which implies that it is a universal category definable by the following criteria: a) morphological (adverbs are uninflected, though the relevant subclass undergoes comparison) and syntactic (they are conditioned by verbs as their nucleus, in most cases assuming the function of adverbials as verb complements); b) a rare attributive function before nouns and before adverbs themselves; c) differences between individual languages, German and Serbian included, which result from their respective word-formation systems. In this particular case, each of the two languages has relatively few simple adverbs (simplicia); on the other hand, explicit (suffixal) derivation is highly productive in Serbian, whereas the situation in German is the complete opposite (although the process is attested there as well); finally, adverb derivatives in Serbian correspond, as a rule, to adjectives and prepositional phrases functioning as adverbials in German.


2016 ◽  
Vol 1 (16) ◽  
pp. 15-27 ◽  
Author(s):  
Henriette W. Langdon ◽  
Terry Irvine Saenz

The number of English Language Learners (ELLs) is increasing in all regions of the United States. Although the majority (71%) speak Spanish as their first language, the remaining 29% may speak any of 100 or more different languages. In spite of an increasing number of speech-language pathologists (SLPs) who can provide bilingual services, the likelihood of a match between a given student's primary language and an SLP's is rather minimal. The second-best option is to work with a trained language interpreter in the student's language. However, very frequently, the available interpreter is bilingual but not trained for the job.

