Zero-Shot Source Code Author Identification: A Lexicon and Layout Independent Approach

Source code authorship identification has become a major concern in a wide variety of situations. These include authorship disputes, proof of authorship in court, and cyber attacks in the form of viruses, Trojan horses, logic bombs, fraud, and credit card cloning. Source code author identification is the task of identifying the most likely author of a computer program, given a set of predefined candidate authors. We present a new approach, called SCAP (Source Code Author Profiles), that uses byte-level n-grams to represent a source code author’s style. Experiments on data sets in different programming languages (Java, C++, and Common Lisp) and of varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of source code authors. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
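The idea of a byte-level n-gram author profile can be illustrated with a minimal Python sketch. The abstract does not spell out how profiles are built or compared, so the choices below are assumptions made for illustration: a profile is taken to be the set of the most frequent byte n-grams in an author's known code, and candidates are ranked by how many n-grams their profile shares with the unknown sample. The function names and the defaults `n=3` and `size=1500` are hypothetical, not values from the paper.

```python
from collections import Counter


def ngram_profile(source: bytes, n: int = 3, size: int = 1500) -> set[bytes]:
    """Return the `size` most frequent byte-level n-grams of `source`.

    Assumption: a profile is a plain set of frequent n-grams; `n` and
    `size` are arbitrary illustrative defaults.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def most_likely_author(unknown: bytes, samples: dict[str, bytes],
                       n: int = 3, size: int = 1500) -> str:
    """Pick the candidate whose profile shares the most n-grams with `unknown`.

    Assumption: similarity is the size of the profile intersection.
    """
    target = ngram_profile(unknown, n, size)
    overlap = {
        author: len(target & ngram_profile(code, n, size))
        for author, code in samples.items()
    }
    return max(overlap, key=overlap.get)


# Hypothetical training samples for two candidate authors.
samples = {
    "alice": b"for (int i = 0; i < n; i++) { total += values[i]; }\n" * 3,
    "bob": b"(defun sum-list (xs) (reduce #'+ xs))\n" * 3,
}
unknown = b"for (int j = 0; j < m; j++) { count += items[j]; }\n"
guess = most_likely_author(unknown, samples)
```

Because the profile works on raw bytes, the sketch needs no tokenizer or parser for any particular language, which is consistent with the language independence the abstract claims.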


Multi-χ: Identifying Multiple Authors from Source Code Files

Proceedings on Privacy Enhancing Technologies, 2020, Vol 2020 (3), pp. 25-41. DOI: 10.2478/popets-2020-0044

Author(s): Mohammed Abuhamad, Tamer Abuhmed, DaeHun Nyang, David Mohaisen

Keyword(s): Source Code, Compact Representation, Integration Process, Software Projects, Identification Schemes, Author Identification, Multiple Dimensions, Authorship Identification, Verification Model, Tree Extraction

Abstract: Most authorship identification schemes assume that code samples are written by a single author. However, real software projects are typically the result of a team effort, making it essential to consider fine-grained multi-author identification in a single code sample, which we address with Multi-χ. Multi-χ leverages a deep learning-based approach for multi-author identification in source code, is lightweight, uses a compact representation for efficiency, and does not require any code parsing, syntax tree extraction, or feature selection. In Multi-χ, code samples are divided into small segments, which are then represented as a sequence of n-dimensional term representations. The sequence is fed into an RNN-based verification model to assist a segment integration process that integrates positively verified segments, i.e., segments that have a high probability of being written by one author. Finally, the resulting segments from the integration process are represented using word2vec or TF-IDF and fed into the identification model. We evaluate Multi-χ with several GitHub projects (Caffe, Facebook’s Folly, TensorFlow, etc.) and show remarkable accuracy. For example, Multi-χ achieves an authorship example-based accuracy (A-EBA) of 86.41% and per-segment authorship identification of 93.18% for identifying 562 programmers. We examine the performance against multiple dimensions and design choices, and demonstrate its effectiveness.
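The segment-and-integrate step described in the abstract can be sketched roughly as follows. This is not the Multi-χ implementation: the RNN-based verification model is replaced here by a toy Jaccard similarity over token sets, and the segment length, the threshold, and all function names are hypothetical choices made only to show the control flow (split into small segments, then merge neighbouring segments that the verifier judges to share an author).

```python
import re


def split_segments(code: str, lines_per_segment: int = 2) -> list[str]:
    """Divide a code sample into small fixed-size segments (assumed size)."""
    lines = code.splitlines()
    return ["\n".join(lines[i:i + lines_per_segment])
            for i in range(0, len(lines), lines_per_segment)]


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity of token sets.

    Assumption: a toy stand-in for the RNN verification model, which
    would instead score how likely two segments share an author.
    """
    ta = set(re.findall(r"\w+|\S", a))
    tb = set(re.findall(r"\w+|\S", b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def integrate(segments: list[str], threshold: float = 0.5) -> list[str]:
    """Merge each segment into its predecessor when the verifier score passes."""
    merged: list[str] = []
    for seg in segments:
        if merged and token_overlap(merged[-1], seg) >= threshold:
            merged[-1] = merged[-1] + "\n" + seg
        else:
            merged.append(seg)
    return merged


pieces = split_segments("a = 1\na = 1\nzzz()", lines_per_segment=1)
integrated = integrate(pieces)
```

In the pipeline the abstract describes, each integrated segment would then be vectorized (word2vec or TF-IDF) and passed to the identification model; that stage is omitted here.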


Zero-Shot Source Code Author Identification: A Lexicon and Layout Independent Approach

ICodeNet - A Hierarchical Neural Network Approach For Source Code Author Identification

Source Code Author Identification Method Combining Semantics and Statistical Features

Author Identification of Software Source Code with Program Dependence Graphs

Author Identification in Imbalanced Sets of Source Code Samples

Deep Neural Networks for Source Code Author Identification

On the Use of Discretized Source Code Metrics for Author Identification

Source code author identification with unsupervised feature learning

Source Code Author Identification Based on N-gram Author Profiles

Source Code Authorship Analysis For Supporting the Cybercrime Investigation Process

Multi-χ: Identifying Multiple Authors from Source Code Files
