The syntactic structure of the sentences in a document is substantially informative about its author's writing style. Sentence representation learning has been widely explored in recent years, and it has been shown to improve generalization on a variety of downstream tasks across many domains. Although probing studies suggest that these learned contextual representations implicitly encode some amount of syntax, explicit syntactic information further improves the performance of deep neural models on authorship attribution. These observations motivated us to investigate explicit representation learning for the syntactic structure of sentences. In this article, we propose a self-supervised framework for learning structural representations of sentences. The self-supervised network contains two components: a lexical sub-network and a syntactic sub-network, which take as input the sequence of words and their corresponding structural labels, respectively. Owing to the one-to-one mapping of words to their structural labels, each word is embedded into a vector representation that mainly carries structural information. We evaluate the learned structural representations of sentences using different probing tasks, and subsequently utilize them in the authorship attribution task. Our experimental results indicate that the structural embeddings significantly improve the classification tasks when concatenated with existing pre-trained word embeddings.
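The one-to-one word-to-label mapping described above can be illustrated with a minimal sketch, assuming toy dimensions and randomly initialised tables as stand-ins for learned weights (the names, sizes, and POS tags below are illustrative, not the paper's setup):

```python
import random

random.seed(0)

D_WORD, D_POS = 8, 4  # toy dimensions; real models use hundreds

def make_table(vocab, dim):
    # One randomly initialised vector per symbol (stand-in for learned weights).
    return {v: [random.uniform(-1, 1) for _ in range(dim)] for v in vocab}

words = ["the", "cat", "sat"]
pos_tags = ["DET", "NOUN", "VERB"]  # one structural label per word (1-to-1)

word_emb = make_table(set(words), D_WORD)
pos_emb = make_table(set(pos_tags), D_POS)

def embed(sentence, labels):
    # Concatenate each word's lexical vector with its structural-label vector.
    return [word_emb[w] + pos_emb[t] for w, t in zip(sentence, labels)]

vectors = embed(words, pos_tags)  # 3 vectors, each 8 + 4 = 12 dims
```

Because every word has exactly one structural label, the label half of each vector carries purely structural information, which is the property the framework exploits.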
Authorship attribution refers to examining the writing style of authors to determine the likely original author of a document from a given set of potential authors. Owing to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research on Urdu has only just begun, even though Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu address a considerably easier problem, with fewer than 20 candidate authors, which is far from real-world settings, so their findings may not carry over to such settings. To that end, we make three key contributions. First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus comprises over 2.6 million tokens and 21,938 news articles by 94 authors, making it a closer match to real-world settings. Second, we have analyzed hundreds of stylometric features used in the literature, identified 194 features that are applicable to Urdu, and developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task; and (b) the Convolutional Neural Network is the most effective technique, achieving a nearly perfect F1 score of 0.989 on an existing corpus and 0.910 on our newly developed corpus.
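As a flavour of what such stylometric features look like in practice, here is a minimal, language-independent sketch; the four features and their names are illustrative choices on our part, not the 194-feature taxonomy from the study:

```python
import re
from statistics import mean

def stylometric_features(text):
    # A tiny, illustrative subset of classic stylometry features:
    # average word length, average sentence length, lexical richness,
    # and punctuation rate.
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": mean(len(w) for w in words),
        "avg_sent_len": len(words) / len(sentences),
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "punct_rate": sum(c in ",;:!?" for c in text) / len(text),
    }

feats = stylometric_features("He came. He saw; he conquered, quickly!")
```

Features of this kind are computed per document and fed to the traditional or deep classifiers the abstract compares.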
In this paper, we present authorship attribution methods applied to ¡El Mondrigo! (1968), a controversial text allegedly commissioned by the Mexican Government to defame a student strike. Although the authorship of the book has been attributed to several journalists and writers, it has never been demonstrated and remains an open problem. This work aims to establish which of the most commonly proposed writers is the actual author. To do so, we implement methods based on stylometric features using textual distance, supervised learning, and unsupervised learning. The distance-based methods implemented in this work are Kilgarriff's method and Burrows' Delta; an SVM is used as the supervised method, and the k-means algorithm as the unsupervised one. The applied methods were consistent, pointing to a single author as the most likely one.
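Burrows' Delta, one of the distance measures named above, can be sketched in a few lines: z-score the relative frequencies of the most frequent words across the candidate set, then take the mean absolute z-score difference between the disputed text and each candidate. The toy texts, whitespace tokenisation, and author names are our own assumptions; real studies use careful tokenisation and much larger word lists:

```python
from collections import Counter
from statistics import mean, pstdev

def rel_freqs(text, vocab):
    # Relative frequency of each vocabulary word in the text.
    toks = text.lower().split()
    counts = Counter(toks)
    return {w: counts[w] / len(toks) for w in vocab}

def burrows_delta(disputed, candidates, n_mfw=30):
    # Most-frequent words are taken over the whole small corpus;
    # z-scores are computed against the candidate profiles.
    corpus = list(candidates.values()) + [disputed]
    mfw = [w for w, _ in Counter(" ".join(corpus).lower().split()).most_common(n_mfw)]
    profiles = {name: rel_freqs(t, mfw) for name, t in candidates.items()}
    mu = {w: mean(p[w] for p in profiles.values()) for w in mfw}
    sd = {w: pstdev([p[w] for p in profiles.values()]) or 1e-9 for w in mfw}
    z = lambda p: {w: (p[w] - mu[w]) / sd[w] for w in mfw}
    zd = z(rel_freqs(disputed, mfw))
    return {name: mean(abs(zd[w] - z(p)[w]) for w in mfw)
            for name, p in profiles.items()}

candidates = {
    "A": "the cat sat on the mat the cat sat",
    "B": "a dog ran and a dog ran and ran",
}
deltas = burrows_delta("the cat sat on the mat", candidates, n_mfw=10)
likely = min(deltas, key=deltas.get)  # the lowest Delta is the most likely author
```

The candidate with the smallest Delta is the attribution; the consistency reported in the abstract means all methods converged on the same name.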
Authorship attribution is an important field of natural language processing (NLP). Its popularity stems from its relevance to information security, copyright protection, and various linguistic studies, in particular research on social networks. This article continues a series of studies aimed at identifying the authors of Russian-language texts while reducing the required text volume. The study focuses on the attribution of textual data created as a product of human online activity. The effectiveness of the models was evaluated on two Russian-language datasets: literary texts and short comments from users of social networks. Classical machine learning (ML) algorithms, popular neural network (NN) architectures, and their hybrids, including convolutional neural networks (CNN), networks with long short-term memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and fastText, none of which had been used in the previous studies, were applied to the problem. A dedicated experiment was devoted to selecting informative features using genetic algorithms (GA) and evaluating the classifier trained on the optimal feature space. Using fastText, or a combination of a support vector machine (SVM) with a GA, halved the time cost compared with deep NNs at comparable accuracy. The average accuracy for literary texts was 80.4% using the SVM combined with a GA, 82.3% using deep NNs, and 82.1% using fastText. For social media comments, the results were 66.3%, 73.2%, and 68.1%, respectively.
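The GA-based feature selection mentioned above can be sketched as follows. This is a toy, self-contained version: the binary-mask encoding is standard, but the fitness function, per-feature "gains", and all hyperparameters are illustrative assumptions, not the paper's setup:

```python
import random

def ga_select(fitness, n_features, pop=30, gens=60, mut=0.1, seed=42):
    # Toy genetic algorithm for feature selection: individuals are 0/1
    # masks over the feature set; tournament selection, one-point
    # crossover, bit-flip mutation, and elitism (best mask always kept).
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop)]
    best = max(population, key=fitness)
    for _ in range(gens):
        nxt = [best[:]]  # elitism
        while len(nxt) < pop:
            a, b = (max(rng.sample(population, 3), key=fitness)
                    for _ in range(2))
            cut = rng.randrange(1, n_features)           # one-point crossover
            child = [bit ^ (rng.random() < mut)          # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            nxt.append(child)
        population = nxt
        best = max(population + [best], key=fitness)
    return best, fitness(best)

# Hypothetical per-feature "information gain", minus a cost per kept
# feature; in the paper the fitness would be classifier accuracy.
gains = [0.9, 0.8, 0.1, 0.05, 0.02, 0.01]
fit = lambda mask: sum(g * m for g, m in zip(gains, mask)) - 0.2 * sum(mask)

mask, score = ga_select(fit, len(gains))
```

In the actual study the fitness of a mask would be the cross-validated accuracy of an SVM trained on the selected features, which is what makes the search expensive and the reported time savings meaningful.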
This paper presents a computational model for the unsupervised authorship attribution task based on a traditional machine learning scheme. An improvement over the state of the art is achieved by comparing different feature selection methods on the PAN17 author clustering dataset. To achieve this improvement, specific pre-processing and feature extraction methods are proposed, such as a method that separates tokens by type so that each is assigned to only one category. Similarly, special characters are treated as part of the punctuation marks to improve the results obtained when applying typed character n-grams. A weighted cosine similarity measure is applied to improve the B³ F-score by reducing the vector values where attributes are exclusive to one document. This measure is used to define distances between documents, which are then used by the clustering algorithm to perform authorship attribution.
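A minimal sketch of the weighted-cosine idea on character n-gram profiles follows. The specific weighting (shared n-grams get weight 1, exclusive ones 0.5 in the norms) is our own toy choice to illustrate "reducing the vector values where attributes are exclusive"; the paper's exact scheme may differ:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    # Overlapping character n-grams as a frequency profile.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def weighted_cosine(p, q):
    # Down-weight n-grams exclusive to one profile when computing the
    # norms, so unshared attributes contribute less to the distance.
    shared = p.keys() & q.keys()
    dot = sum(p[g] * q[g] for g in shared)
    norm_p = sqrt(sum((1.0 if g in shared else 0.5) * v * v
                      for g, v in p.items()))
    norm_q = sqrt(sum((1.0 if g in shared else 0.5) * v * v
                      for g, v in q.items()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

a = char_ngrams("the quick brown fox")
b = char_ngrams("the quick brown cat")
```

Turning the similarity into a distance (e.g. `1 - weighted_cosine(a, b)`) gives the clustering algorithm the pairwise document distances the abstract describes.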
Source code authorship attribution has lately been studied more often owing to improvements in deep learning techniques. Among existing solutions, two common issues are the inability to add new authors without retraining and the lack of interpretability. We address both of these problems. In our experiments, we were able to correctly classify 75% of authors across different programming languages. Additionally, we applied techniques from explainable AI (XAI) and found that our model appears to attend to distinctive features of the source code.
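One common way to obtain the "new authors without retraining" property is to represent each author by a profile vector and attribute by nearest profile, so enrolling an author only adds a vector. The sketch below uses naive token frequencies under our own assumptions; it illustrates the property, not necessarily the paper's architecture:

```python
from collections import Counter
from math import sqrt

def profile(snippets):
    # Token-frequency vector over an author's known code snippets
    # (naive whitespace tokenisation with parentheses split off).
    c = Counter()
    for s in snippets:
        c.update(s.replace("(", " ( ").replace(")", " ) ").split())
    total = sum(c.values())
    return {t: n / total for t, n in c.items()}

def cosine(p, q):
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    return dot / (sqrt(sum(v * v for v in p.values()))
                  * sqrt(sum(v * v for v in q.values())))

authors = {
    "alice": profile(["for i in range(n): total += i"]),
    "bob": profile(["while (n > 0) { n = n - 1; }"]),
}
# Enrolling a new author is just adding a profile; nothing is retrained.
authors["carol"] = profile(['fn main() { println!("hi"); }'])

def attribute(snippet):
    # Attribute a snippet to the author with the most similar profile.
    q = profile([snippet])
    return max(authors, key=lambda name: cosine(authors[name], q))
```

The same profile vectors also offer a degree of interpretability: the highest-weighted shared tokens are exactly the "distinctive features" an XAI method would surface.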
This article presents a case of authorship attribution within the framework of forensic linguistics using the computational tool ALTXA. To that end, a general definition of forensic linguistics and an explanation of its main areas of study are offered, progressively narrowing the scope of the article until authorship attribution studies are presented and discussed in more depth. Afterwards, a review of the main computational tools with which the authorship of disputed or anonymous texts is analysed is provided, and ALTXA, a software tool developed by the researchers, is presented. This tool combines many of the functionalities offered by other programs in an intuitive interface that allows authorship attribution studies to be carried out in educational settings and facilitates the work of the forensic linguist. Lastly, the article provides a practical demonstration of ALTXA in which the authorship of an undisputed text written by William Shakespeare is analysed to prove the tool's reliability. The analysis consists of an n-gram study, one of the functionalities of ALTXA and a well-established methodological procedure in forensic linguistics.
Forensic linguistics, computational linguistics, authorship attribution, William Shakespeare, n-grams.
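An n-gram study of the kind described can be sketched as follows, here with toy texts and a plain Jaccard overlap of word-bigram inventories; ALTXA's actual scoring is not specified in the abstract and may differ:

```python
from collections import Counter

def word_ngrams(text, n=2):
    # Inventory of overlapping word n-grams.
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def overlap(a, b):
    # Jaccard overlap of two n-gram inventories: the share of
    # distinct n-grams the two texts have in common.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

undisputed = word_ngrams("to be or not to be that is the question")
sample = word_ngrams("to be or not to be")
score = overlap(undisputed, sample)  # 4 shared bigrams of 8 distinct
```

A forensic comparison would compute such overlaps between the disputed text and each candidate's reference corpus, attributing the text to the candidate with the most similar n-gram inventory.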