Multi-χ: Identifying Multiple Authors from Source Code Files

Most authorship identification schemes assume that code samples are written by a single author. However, real software projects are typically the result of a team effort, making it essential to consider fine-grained multi-author identification within a single code sample, which we address with Multi-χ. Multi-χ leverages a deep learning-based approach for multi-author identification in source code, is lightweight, uses a compact representation for efficiency, and does not require any code parsing, syntax tree extraction, or feature selection. In Multi-χ, code samples are divided into small segments, which are then represented as a sequence of n-dimensional term representations. The sequence is fed into an RNN-based verification model to assist a segment integration process, which merges positively verified segments, i.e., segments that have a high probability of being written by one author. Finally, the segments resulting from the integration process are represented using word2vec or TF-IDF and fed into the identification model. We evaluate Multi-χ with several GitHub projects (Caffe, Facebook’s Folly, TensorFlow, etc.) and show remarkable accuracy. For example, Multi-χ achieves an authorship example-based accuracy (A-EBA) of 86.41% and per-segment authorship identification of 93.18% for identifying 562 programmers. We examine the performance against multiple dimensions and design choices, and demonstrate its effectiveness.
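The first and last stages described in this abstract (splitting a file into small segments, then representing each segment with TF-IDF) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed segment size of a few lines, the regular-expression tokenizer, and the smoothed IDF formula are all assumptions made for illustration, and the RNN-based verification and segment integration steps are not reproduced.

```python
import math
import re
from collections import Counter


def split_into_segments(code, lines_per_segment=10):
    """Split source text into consecutive segments of non-blank lines.

    The fixed segment size is an assumption, not a value from the abstract.
    """
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return [
        "\n".join(lines[i:i + lines_per_segment])
        for i in range(0, len(lines), lines_per_segment)
    ]


def tokenize(segment):
    """Hypothetical tokenizer: identifiers, integers, or single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", segment)


def tfidf_vectors(segments):
    """Return one sparse {token: weight} TF-IDF vector per segment."""
    token_lists = [tokenize(s) for s in segments]
    n_docs = len(token_lists)
    # Document frequency: in how many segments each token appears.
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            tok: (c / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors


code = "int a = 1;\n\nint b = 2;\nfloat c = 3;\nreturn a;\nreturn b;\n"
segments = split_into_segments(code, lines_per_segment=2)
vectors = tfidf_vectors(segments)
```

In a full pipeline, vectors like these would be the input to a classifier that predicts an author per segment; here they only show what a per-segment representation might look like.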

Source Code Authorship Analysis For Supporting the Cybercrime Investigation Process

Handbook of Research on Computational Forensics, Digital Crime, and Investigation - Advances in Digital Crime, Forensics, and Cyber Terrorism ◽

10.4018/978-1-60566-836-9.ch020 ◽

2010 ◽

pp. 470-495 ◽

Cited By ~ 5

Author(s):

Georgia Frantzeskou ◽

Stephen G. MacDonell ◽

Efstathios Stamatatos

Keyword(s):

Credit Card ◽

Source Code ◽

Cyber Attacks ◽

Data Sets ◽

Author Identification ◽

Proposed Model ◽

N Gram ◽

Authorship Identification ◽

Investigation Process ◽

Authorship Analysis

Source code authorship identification has become a major concern in a wide variety of situations. These include authorship disputes, proof of authorship in court, and cyber attacks in the form of viruses, trojan horses, logic bombs, fraud, and credit card cloning. Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-grams in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java, C++, and Common Lisp) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
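As a rough illustration of a byte-level n-gram author profile, the sketch below builds a profile from the most frequent byte n-grams in an author's known code and attributes an unknown sample to the candidate whose profile shares the most n-grams with it. The abstract does not state the similarity measure or parameter values, so the profile-intersection score, the n-gram length of 3, the profile size of 1500, and all function names are assumptions for illustration only.

```python
from collections import Counter


def byte_ngram_profile(source, n=3, top=1500):
    """Return the set of the `top` most frequent byte-level n-grams."""
    data = source.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(top)}


def most_likely_author(unknown, samples_by_author, n=3, top=1500):
    """Pick the candidate whose profile overlaps most with the unknown sample.

    Scoring by the size of the profile intersection is an assumed measure.
    """
    unknown_profile = byte_ngram_profile(unknown, n, top)
    scores = {
        author: len(
            unknown_profile & byte_ngram_profile("\n".join(samples), n, top)
        )
        for author, samples in samples_by_author.items()
    }
    return max(scores, key=scores.get)


# Toy data: two hypothetical authors with different spacing habits.
samples = {
    "alice": ["for (int i = 0; i < n; ++i) {\n    total += values[i];\n}\n"],
    "bob": ["for(int idx=0;idx<n;idx++){\n\tsum+=vals[idx];\n}\n"],
}
unknown = "for (int j = 0; j < m; ++j) {\n    count += items[j];\n}\n"
predicted = most_likely_author(unknown, samples)
```

Because the profile works on raw bytes, layout habits such as spacing and indentation contribute to the score alongside identifiers, and no language-specific parser is needed.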

ICodeNet - A Hierarchical Neural Network Approach For Source Code Author Identification

2021 13th International Conference on Machine Learning and Computing ◽

10.1145/3457682.3457709 ◽

2021 ◽

Author(s):

Pranali Bora ◽

Tulika Awalgaonkar ◽

Himanshu Palve ◽

Raviraj Joshi ◽

Purvi Goel

Keyword(s):

Neural Network ◽

Source Code ◽

Network Approach ◽

Neural Network Approach ◽

Author Identification ◽

Hierarchical Neural Network

A Practical Black-box Attack on Source Code Authorship Identification Classifiers

IEEE Transactions on Information Forensics and Security ◽

10.1109/tifs.2021.3080507 ◽

2021 ◽

pp. 1-1

Author(s):

Qianjun Liu ◽

Shouling Ji ◽

Changchang Liu ◽

Chunming Wu

Keyword(s):

Source Code ◽

Black Box ◽

Authorship Identification

Overview of the PAN@FIRE 2020 Task on the Authorship Identification of SOurce COde

Forum for Information Retrieval Evaluation ◽

10.1145/3441501.3441532 ◽

2020 ◽

Author(s):

Ali Fadel ◽

Husam Musleh ◽

Ibraheem Tuffaha ◽

Mahmoud Al-Ayyoub ◽

Yaser Jararweh ◽

...

Keyword(s):

Source Code ◽

Authorship Identification

A Study of the Relationships between Source Code Metrics and Attractiveness in Free Software Projects

2010 Brazilian Symposium on Software Engineering ◽

10.1109/sbes.2010.27 ◽

2010 ◽

Cited By ~ 18

Author(s):

Paulo Meirelles ◽

Carlos Santos Jr. ◽

Joao Miranda ◽

Fabio Kon ◽

Antonio Terceiro ◽

...

Keyword(s):

Source Code ◽

Free Software ◽

Software Projects ◽

Code Metrics ◽

Source Code Metrics

Synthesis of Code Anomalies: Revealing Design Problems in the Source Code

10.5753/ctd.2016.9131 ◽

2020 ◽

Author(s):

Willian N. Oizumi ◽

Alessandro F. Garcia

Keyword(s):

Software Engineering ◽

Source Code ◽

New Technique ◽

Software Systems ◽

Identification Task ◽

Software Projects ◽

Design Problems ◽

Engineering Community ◽

Synthesis Technique ◽

A New Technique

Design problems affect most software projects and make their maintenance expensive and impeditive. Thus, the identification of potential design problems in the source code – which is very often the only available and up-to-date artifact in a project – becomes essential in long-living software systems. This identification task is challenging because the reification of design problems in the source code tends to be scattered across several code elements. However, state-of-the-art techniques do not provide enough information to effectively help developers in this task. In this work, we address this challenge by proposing a new technique to support developers in revealing design problems. This technique synthesizes information about potential design problems, which are materialized in the implementation in the form of syntactic and semantic anomaly agglomerations. Our evaluation shows that the proposed synthesis technique helps to reveal more than 1200 design problems across 7 industry-strength systems, with a median precision of 71% and a median recall of 78%. The relevance of our work has been widely recognized by the software engineering community through 2 awards and 7 publications in international and national venues.

Tools and Datasets for Mining Libre Software Repositories

Multi-Disciplinary Advancement in Open Source Software and Processes ◽

10.4018/978-1-60960-513-1.ch002 ◽

2011 ◽

pp. 24-42 ◽

Cited By ~ 2

Author(s):

Gregorio Robles ◽

Jesús M. González-Barahona ◽

Daniel Izquierdo-Cortazar ◽

Israel Herraiz

Keyword(s):

Source Code ◽

Data Sources ◽

The Internet ◽

Software Projects ◽

Free Open Source Software ◽

Mailing Lists ◽

Bug Tracking ◽

Libre Software ◽

Free Open Source ◽

Open Nature

Thanks to the open nature of libre (free, open source) software projects, researchers have gained access to a rich set of data related to various aspects of software development. Although it is usually publicly available on the Internet, obtaining and analyzing the data in a convenient way is not an easy task, and many considerations have to be taken into account. In this chapter we introduce the most relevant data sources that can be found in libre software projects and that are commonly studied by scholars: source code releases, source code management systems, mailing lists and issue (bug) tracking systems. The chapter also provides some advice on the problems that can be found when retrieving and preparing the data sources for a later analysis, as well as information about the tools and datasets that support these tasks.

Source Code Author Identification Method Combining Semantics and Statistical Features

Business Intelligence and Information Technology - Lecture Notes on Data Engineering and Communications Technologies ◽

10.1007/978-3-030-92632-8_14 ◽

2021 ◽

pp. 141-151

Author(s):

Xu Sun ◽

Yutong Sun ◽

Leilei Kong ◽

Yong Han ◽

Hui Ning

Keyword(s):

Source Code ◽

Statistical Features ◽

Identification Method ◽

Author Identification

Open Source in Government

Encyclopedia of Digital Government ◽

10.4018/978-1-59140-789-8.ch195 ◽

2011 ◽

pp. 1287-1290 ◽

Cited By ~ 3

Author(s):

D. Berry

Keyword(s):

Open Source ◽

Open Source Software ◽

General Public ◽

Source Code ◽

Computer Software ◽

Free Software ◽

Software Projects ◽

Software Companies ◽

Develop Software ◽

Software Distribution

Open source software (OSS) is computer software that has its underlying source code made available under a licence. This can allow developers and users to adapt and improve it (Raymond, 2001). Computer software can be broadly split into two development models:

• Proprietary, or closed, software, owned by a company or individual. Copies of the binary are made public; the source code is not usually made public.

• Open-source software (OSS), where the source code is released with the binary. Users and developers can be licenced to use and modify the code, and to distribute any improvements they make.

Both OSS and proprietary approaches allow companies to make a profit. Companies developing proprietary software make money by developing software and then selling licences to use the software. For example, Microsoft receives a payment for every copy of Windows sold with a personal computer. OSS companies make their money by providing services, such as advising clients on the GPL licence. In practice, software companies often develop both types of software. OSS is developed by an ongoing, iterative process where people share the ideas expressed in the source code. The aim is that a large community of developers and users can contribute to the development of the code, check it for errors and bugs, and make the improved version available to others. Project management software is used to allow developers to keep track of the various versions. There are two main types of open-source licences (although there are many variants and subtypes developed by other companies):

• Berkeley Software Distribution (BSD) Licence: This permits a licencee to “close” a version (by withholding the most recent modifications to the source code) and sell it as a proprietary product.

• GNU General Public Licence (GNU GPL, or GPL): Under this licence, licencees may not “close” versions. The licencee may modify, copy, and redistribute any derivative version, under the same GPL licence. The licencee can either charge a fee for this service or work free of charge.

Free software first evolved during the 1970s but in the 1990s forked into two movements, namely free software and open source (Berry, 2004). Richard Stallman, an American software developer who believes that sharing source code and ideas is fundamental to freedom of speech, developed a free version of the widely used Unix operating system. The resulting GNU program was released under a specially created General Public Licence (GNU GPL). This was designed to ensure that the source code would remain openly available to all. It was not intended to prevent commercial usage or distribution (Stallman, 2002). This approach was christened free software. In this context, free meant that anyone could modify the software. However, the term “free” was often misunderstood to mean no cost. Hence, during the 1990s, Eric Raymond and others proposed “open-source software” as a less contentious and more business-friendly term. This has become widely accepted within the software and business communities; however, there are still arguments about the most appropriate term to use (Moody, 2002). The OSMs are usually organised into a network of individuals who work collaboratively on the Internet, developing major software projects that sometimes rival commercial software but are always committed to the production of quality alternatives to those produced by commercial companies (Raymond, 2001; Williams, 2002). Groups and individuals develop software to meet their own and others’ needs in a highly decentralised way, likened to a bazaar (Raymond, 2001). These groups often make substantive value claims to support their projects and foster an ethic of community, collaboration, deliberation, and intellectual freedom.

In addition, Lessig (1999) argues that the FLOSS community can offer inspiration through its commitment to transparency in its products and its ability to open up governmental regulation and control through free/libre and open source code.

Zero-Shot Source Code Author Identification: A Lexicon and Layout Independent Approach

2020 International Joint Conference on Neural Networks (IJCNN) ◽

10.1109/ijcnn48605.2020.9207647 ◽

2020 ◽

Author(s):

Pegah Hozhabrierdi ◽

Dunai Fuentes Hitos ◽

Chilukuri K. Mohan

Keyword(s):

Source Code ◽

Author Identification
