NONPARAMETRIC METHODS OF AUTHORSHIP ATTRIBUTION IN ENGLISH LITERATURE

Journal of Numerical and Applied Mathematics ◽

10.17721/2706-9699.2020.1.04 ◽

2020 ◽

pp. 50-58

Author(s):

D. A. Klyushin ◽

V. Yu. Mykhaylyuk

Keyword(s):

English Literature ◽

Nonparametric Methods ◽

Authorship Attribution ◽

Testing Methods ◽

Authorship Identification

The paper compares two nonparametric methods of authorship identification in English literature. It describes testing methods with and without clustering. A method is also proposed for selecting the n-grams that best serve as markers for identifying the author. More than 800 texts by 16 authors were used for testing. The method based on the density of the distribution is suitable for identifying authors of both large texts (50000+ characters) and small ones (10000+ characters). The method based on p-statistics is suitable only for large texts.
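As an illustration of what n-gram marker selection can look like, the following is a minimal Python sketch. It is not the selection method proposed in the paper, which the abstract does not specify. The function names `ngram_profile` and `rank_marker_ngrams`, and the ranking criterion (absolute difference in relative frequency between two text samples), are assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> dict[str, float]:
    """Relative frequency of each character n-gram in `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return {g: c / total for g, c in counts.items()} if total else {}


def rank_marker_ngrams(text_a: str, text_b: str, n: int = 3,
                       top: int = 5) -> list[tuple[str, float]]:
    """Rank n-grams by how differently two samples use them.

    Assumed criterion (not from the paper): absolute difference
    in relative frequency, ties broken alphabetically.
    """
    pa, pb = ngram_profile(text_a, n), ngram_profile(text_b, n)
    diffs = {g: abs(pa.get(g, 0.0) - pb.get(g, 0.0)) for g in set(pa) | set(pb)}
    return sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    sample_a = "the cat sat on the mat and the hat"
    sample_b = "a dog ran in a fog and a bog"
    for gram, gap in rank_marker_ngrams(sample_a, sample_b, n=2, top=3):
        print(repr(gram), round(gap, 3))
```

Relative rather than raw frequencies are used so that samples of different lengths remain comparable, which matters given the range of text sizes the abstract mentions.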


APPROACH FOR MINIMIZATION OF PHONEME GROUPS IN AUTHORSHIP ATTRIBUTION

International Journal of Computing ◽

10.47839/ijc.19.1.1693 ◽

2020 ◽

pp. 55-62

Author(s):

Iryna Khomytska ◽

Vasyl Teslyuk ◽

Iryna Bazylevych ◽

Inna Shylinska

Keyword(s):

Statistical Model ◽

Statistical Methods ◽

Test Validity ◽

Authorship Attribution ◽

Java Programming ◽

Platform Independence ◽

Combination Of Methods ◽

Student’s t ◽

Authorship Identification ◽

Consonant Phoneme

The mathematical support developed for authorship attribution software combines two statistical methods (Student’s t-test and the Kolmogorov-Smirnov test) with a statistical model for determining significant differences between styles. Combining the methods enhances the test validity of authorship attribution, because a result is accepted when both methods produce it. The model makes it possible to identify a consonant phoneme group with high style identification capability. The position of the phoneme in a word is taken into account. The greater the number of significant differences, the higher the authorship identification capability of the phoneme group. The system software is based on the algorithms of the combined methods and the statistical model. The Java programming language provides platform independence. Minimizing the number of consonant phoneme groups makes style and authorship attribution more automated. Results of comparisons of the scientific, belles-lettres, conversational and newspaper styles are presented. The data obtained allow us to assert that the combination of methods and the developed statistical model improve the test validity of style and authorship attribution.
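The idea of accepting a difference only when two tests agree can be sketched as follows. This is a minimal Python illustration, not the authors' Java software. The function names, the sample data, and the decision rule are assumptions. The caller supplies the critical t value from a table, and the Kolmogorov-Smirnov threshold uses the standard large-sample approximation c(α)·sqrt((n+m)/(n·m)) with c(0.05) ≈ 1.358, which is only rough for small samples.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, variance


def student_t(x: list[float], y: list[float]) -> float:
    """Two-sample Student's t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))


def ks_statistic(x: list[float], y: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def both_significant(x: list[float], y: list[float], t_crit: float,
                     ks_coeff: float = 1.358) -> bool:
    """True only if both tests reject equality (assumed agreement rule).

    t_crit: two-sided critical value for len(x) + len(y) - 2 degrees of freedom.
    ks_coeff: c(alpha) in the asymptotic KS threshold; 1.358 corresponds to alpha = 0.05.
    """
    n, m = len(x), len(y)
    d_crit = ks_coeff * sqrt((n + m) / (n * m))
    return abs(student_t(x, y)) > t_crit and ks_statistic(x, y) > d_crit


if __name__ == "__main__":
    # Hypothetical per-sample frequencies of one consonant phoneme group in two styles.
    style_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    style_b = [6.0, 7.0, 8.0, 9.0, 10.0]
    print(both_significant(style_a, style_b, t_crit=2.306))  # 2.306: t table, df = 8, alpha = 0.05
```

Counting how often `both_significant` returns True across style pairs would give one possible score for a phoneme group, in the spirit of the abstract's statement that more significant differences mean higher identification capability.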


Authorship Attribution for Online Social Media

Advances in Business Information Systems and Analytics - Social Network Analytics for Contemporary Business Organizations ◽

10.4018/978-1-5225-5097-6.ch008 ◽

2018 ◽

pp. 141-165 ◽

Cited By ~ 1

Author(s):

Ritu Banga ◽

Akanksha Bhardwaj ◽

Sheng-Lung Peng ◽

Gulshan Shrivastava

Keyword(s):

Machine Learning ◽

Social Media ◽

Authorship Attribution ◽

The Internet ◽

Writing Style ◽

Machine Learning Classifiers ◽

Online Social Media ◽

Comprehensive Knowledge ◽

Authorship Identification ◽

Cyber Ethics

This chapter gives a comprehensive overview of machine learning classifiers for authorship attribution (AA) on short texts, specifically tweets. The need for authorship identification arises from increasing crime on the internet, which breaches cyber ethics by raising the level of anonymity. AA of online messages has attracted interest from many research communities. Linguists and researchers have proposed many methods, both statistical and computational, to identify an author from their writing style. Various ways of extracting and selecting features on the basis of the dataset are reviewed. The authors focus on n-gram features, as they proved to be very effective in identifying the true author from a given list of known authors. The study demonstrates that AA is achievable on small texts given suitable selection criteria for features and methods, and that the accuracy of the analysis changes with the combination of features. The authors found that character n-grams are good features for identifying the author but are not yet able to identify the author on their own.
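To make the role of character n-gram features concrete, here is a minimal Python sketch of attribution against a closed list of known authors. It uses a nearest-profile rule with cosine similarity, which is a simple baseline chosen for this example and not one of the classifiers evaluated in the chapter. The function names and the toy messages are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Counts of lowercase character n-grams in `text`."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def attribute(unknown: str, known: dict[str, list[str]], n: int = 3) -> str:
    """Return the known author whose pooled n-gram profile is closest to `unknown`."""
    profiles = {
        author: sum((char_ngrams(t, n) for t in texts), Counter())
        for author, texts in known.items()
    }
    target = char_ngrams(unknown, n)
    return max(profiles, key=lambda author: cosine(target, profiles[author]))


if __name__ == "__main__":
    # Hypothetical short messages standing in for tweets.
    known = {
        "alice": ["lol omg lol so good lol"],
        "bob": ["indeed, quite remarkable indeed"],
    }
    print(attribute("omg lol that is so good", known))
```

A similarity score alone gives no confidence estimate, so in practice such features are usually fed to a trained classifier and combined with other feature types, consistent with the chapter's finding that character n-grams are not sufficient by themselves.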


Large-scale and Robust Code Authorship Identification with Deep Feature Learning

ACM Transactions on Privacy and Security ◽

10.1145/3461666 ◽

2021 ◽

Vol 24 (4) ◽

pp. 1-35

Author(s):

Mohammed Abuhamad ◽

Tamer Abuhmed ◽

David Mohaisen ◽

Daehun Nyang

Keyword(s):

Programming Languages ◽

Real World ◽

Large Scale ◽

Source Code ◽

Feature Learning ◽

Identification Accuracy ◽

Authorship Attribution ◽

Deep Feature ◽

Public Repositories ◽

Authorship Identification

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires efficient extraction of authorship attributes. Extracting such attributes is very challenging because software comes in many formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of the attributes is bounded by the availability of software samples: a certain number of samples per author and a specific size per sample. To this end, this work proposes a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach learns deep authorship attributes using a recurrent neural network and uses an ensemble random forest classifier for scalability to de-anonymize programmers. Comprehensive experiments are conducted to evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results of our work show high accuracy despite requiring a smaller number of samples per author. Experimenting with source code, our approach allows us to identify 8,903 GCJ authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Using the real-world dataset, we achieved an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language specifics, and thus it can identify authors of four programming languages (C, C++, Java, and Python), and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress) with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves 95.74% accuracy for identifying 1,500 programmers of software binaries. Similar results were obtained when software binaries were generated with different compilation options, optimization levels, and removal of symbol information. Moreover, our approach achieves 93.86% accuracy for identifying 1,500 programmers of obfuscated binaries using all features adopted in the Obfuscator-LLVM tool.
