authorship identification
Recently Published Documents


Total documents: 112 (five years: 40)
H-index: 10 (five years: 2)

Author(s): Raheem Sarwar, Saeed-Ul Hassan

The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, the accuracy of existing Urdu authorship identification solutions drops as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. In addition, due to the unavailability of reliable POS taggers and sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on a nearest-neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a corpus significantly larger than those used in existing studies and conduct several experiments, which show that our solution overcomes the limitations of existing studies and reaches an accuracy of 94.03%, higher than that reported by all previous authorship identification works.
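The pipeline outlined in this abstract (text sample to point set, then nearest-neighbor prediction) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not the authors' method: the three toy features in `chunk_features`, the chunk size, the set distance (the larger of the two directed average nearest-point distances), and the function names are all hypothetical. Texts are assumed to be non-empty.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def chunk_features(words: List[str]) -> Point:
    """Map one non-empty chunk of words to a toy stylometric vector (assumed features)."""
    n = len(words)
    avg_len = sum(len(w) for w in words) / n
    type_token_ratio = len(set(words)) / n
    short_word_rate = sum(1 for w in words if len(w) <= 3) / n
    return (avg_len, type_token_ratio, short_word_rate)


def to_point_set(text: str, chunk_size: int = 5) -> List[Point]:
    """Split a text into fixed-size word chunks and turn each chunk into one point."""
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [chunk_features(c) for c in chunks if c]


def set_distance(a: List[Point], b: List[Point]) -> float:
    """Larger of the two directed average nearest-point distances (an assumed choice)."""
    def directed(src: List[Point], dst: List[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return max(directed(a, b), directed(b, a))


def predict_author(anonymous: str, labelled: Dict[str, List[str]]) -> str:
    """1-nearest-neighbor over point sets: return the author of the closest sample."""
    query = to_point_set(anonymous)
    best_author, best_dist = "", math.inf
    for author, samples in labelled.items():
        for sample in samples:
            d = set_distance(query, to_point_set(sample))
            if d < best_dist:
                best_author, best_dist = author, d
    return best_author
```

Representing a sample as several points rather than one vector keeps local variation in style visible to the classifier; a real system would use a much richer feature space and an index for candidate retrieval instead of the exhaustive loop shown here.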


2021, pp. 177-194
Author(s): Tarun Kumar, S. Gowtham, Udit Kr. Chakraborty

2021, Vol 24 (4), pp. 1-35
Author(s): Mohammed Abuhamad, Tamer Abuhmed, David Mohaisen, Daehun Nyang

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires efficient extraction of authorship attributes. Extracting such attributes is very challenging because software comes in various formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of the attributes is bounded by the availability of software samples: a certain number of samples per author and a specific sample size are required. To this end, this work proposes a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach combines learning deep authorship attributes using a recurrent neural network with an ensemble random forest classifier, which provides scalability, to de-anonymize programmers. Comprehensive experiments are conducted to evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results of our work show high accuracy despite requiring a smaller number of samples per author. Experimenting with source code, our approach allows us to identify 8,903 GCJ authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Using the real-world dataset, we achieved an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language specifics, and thus it can identify authors of four programming languages (C, C++, Java, and Python) and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using the C obfuscator Tigress), with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves 95.74% accuracy for identifying 1,500 programmers of software binaries. Similar results were obtained when software binaries were generated with different compilation options and optimization levels, and with symbol information removed. Moreover, our approach achieves 93.86% accuracy for identifying 1,500 programmers of obfuscated binaries using all features adopted in the Obfuscator-LLVM tool.
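The abstract does not describe how code is turned into model inputs, so the following is only a hedged sketch of one simple way to obtain language-oblivious features: a relative-frequency profile of lexical token bigrams. The regular expression, the bigram choice, and the name `token_bigram_profile` are assumptions for illustration; the recurrent network and random forest stages of the described approach are not shown.

```python
import re
from collections import Counter
from typing import Dict, Tuple

# Assumed tokenizer: identifiers/keywords, integer runs, or any single non-space symbol.
# It relies on no language-specific grammar, so it applies equally to C, Java, or Python.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def token_bigram_profile(source: str) -> Dict[Tuple[str, str], float]:
    """Return the relative frequency of each adjacent token pair in the source text."""
    tokens = TOKEN_RE.findall(source)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in bigrams.items()}
```

For example, `token_bigram_profile("x = x + 1")` yields four distinct bigrams, each with relative frequency 0.25. Such profiles could feed any downstream classifier.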


Author(s): Anastasiya Gromova

The article discusses texts of Internet-based communication and messenger correspondence, with attention paid to describing the similarities and differences between oral and written dialogical speech. It considers the problems of the neutralization of speech features in the format of Internet-based communication and of the transformation of the attributes that individually characterize an author and are demonstrated when exchanging messages in a messenger. It is proposed to define the form of speech typical of messenger correspondence as dialogical written (printed) speech, treating it as a product of intellectual activity in combination with the form of its implementation and taking into account the author's use of technical means for typing. The author presents approaches to identifying the significant speech characteristics that the addresser demonstrates in written messenger correspondence; these are often analyzed in the course of authorship identification tests. The possibility of revealing a set of an author's individualizing features is demonstrated. The paper emphasizes the importance of studying features at the graphic and communicative levels of the analysis of dialogical texts and provides examples of such features. The article also demonstrates the relevance of combining linguistic and quantitative methods of analysis in revealing an author's individualizing identification features and outlines prospects for further research on the linguistic personality of the digital age.


2021, pp. 203-218
Author(s): Inesa Szulska

The present article analyses the Lithuanian translation of Henryk Sienkiewicz’s novel Szkice węglem (1876), published in 1890 by the magazine „Ukinįkas” („Ūkininkas”). Themes covered: the historical and cultural as well as the social and pragmatic contexts of Paisziniai anglimis; the scope of the adaptation work carried out to create a simplified version for less educated or less culturally developed readers; identification of the translation's authorship; and the strategies, techniques and choices used by the translators. The characterization of the translation process and of the final lexical, grammatical and semantic structure of the translation concludes with a consideration, from a theoretical perspective, of the role of this translation in the Lithuanian historical-literary process at the end of the 19th and the beginning of the 20th century.


2021, Vol 3 (3)
Author(s): K. A. Apoorva, S. Sangeetha

Electronic mail is the primary source of different cyber scams, so identifying the author of an email is essential: it forms significant documentary evidence in the field of digital forensics. This paper presents a model for email author identification (attribution) that utilizes deep neural networks and model-based clustering techniques. Stylometric features have gained considerable importance in authorship identification because they enhance the accuracy of the author attribution task. The experiments were performed on the publicly available Enron benchmark dataset, considering many authors. With the deep neural network technique, the proposed model achieves an accuracy of 94% on five authors, 90% on ten authors, 86% on 25 authors and 75% on the entire dataset, which is a good level of accuracy on highly imbalanced data. The second, clustering-based technique yielded 86% accuracy on the entire dataset, with the number of authors considered based on their contribution to the aggregate data.
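Because the abstract reports plain accuracy on highly imbalanced data, it can help to see how accuracy and a class-balanced score differ. The sketch below is an illustration only and is not taken from the paper: the functions and the author labels `alice` and `bob` are hypothetical, and balanced accuracy is computed here as the mean of per-author recall.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of samples whose predicted author matches the true author."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_author_recall(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """For each true author, the fraction of that author's samples predicted correctly."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {author: hits[author] / totals[author] for author in totals}


def balanced_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of per-author recall, so every author counts equally regardless of volume."""
    recalls = per_author_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced case: nine emails by one author, one by another,
# and a classifier that always predicts the prolific author.
y_true = ["alice"] * 9 + ["bob"]
y_pred = ["alice"] * 10
```

In this toy case `accuracy(y_true, y_pred)` is 0.9 while `balanced_accuracy(y_true, y_pred)` is 0.5, which shows why a single accuracy figure can overstate performance when a few authors dominate the data.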


Author(s): Wong Yee Leng, Siti Mariyam Shamsuddin, Nor Azman Hashim

Writer identification based on cursive words is a widely studied behavioural biometric that has attracted many researchers. Recently, its main applications have been in forensic investigation and biometric analysis, since handwriting style can serve as an individual behavioural trait for authenticating an author. In this study, a novel approach to representing the cursive features of authors is presented. The invariants-based discriminability of the features is obtained by discretizing the moment features of each writer using the biometric invariant discretization cutting point (BIDCP). BIDCP is introduced to preserve features so as to obtain better individual representations and discrimination. Our experiments show that, with the proposed method, authorship identification based on cursive words improves significantly, reaching an average identification rate of 99.80%.
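The abstract does not say how BIDCP chooses its cutting points, so the sketch below uses plain equal-width cut points as a stand-in to show the general idea of discretizing continuous feature values into interval indices. The function names and the equal-width rule are assumptions for illustration and should not be read as the BIDCP procedure.

```python
from bisect import bisect_right
from typing import List


def equal_width_cut_points(values: List[float], n_bins: int) -> List[float]:
    """Return the n_bins - 1 interior cut points that split [min, max] into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values: List[float], cut_points: List[float]) -> List[int]:
    """Replace each value by the index of the interval it falls into.

    A value equal to a cut point is assigned to the interval on its right.
    """
    return [bisect_right(cut_points, v) for v in values]


# Hypothetical moment-feature values for one writer.
features = [0.0, 1.0, 2.0, 3.0, 4.0]
cuts = equal_width_cut_points(features, n_bins=4)   # [1.0, 2.0, 3.0]
codes = discretize(features, cuts)                  # [0, 1, 2, 3, 3]
```

Mapping nearby values to the same interval index smooths small within-writer variation, which is the intuition behind using discretized rather than raw feature values.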

