Multi-χ: Identifying Multiple Authors from Source Code Files

2020 ◽  
Vol 2020 (3) ◽  
pp. 25-41
Author(s):  
Mohammed Abuhamad ◽  
Tamer Abuhmed ◽  
DaeHun Nyang ◽  
David Mohaisen

Abstract: Most authorship identification schemes assume that code samples are written by a single author. However, real software projects are typically the result of a team effort, making it essential to consider fine-grained multi-author identification within a single code sample, which we address with Multi-χ. Multi-χ takes a deep learning-based approach to multi-author identification in source code. It is lightweight, uses a compact representation for efficiency, and requires no code parsing, syntax tree extraction, or feature selection. In Multi-χ, code samples are divided into small segments, which are then represented as a sequence of n-dimensional term representations. The sequence is fed into an RNN-based verification model that assists a segment integration process, which merges positively verified segments, i.e., segments that have a high probability of being written by one author. Finally, the segments resulting from the integration process are represented using word2vec or TF-IDF and fed into the identification model. We evaluate Multi-χ on several GitHub projects (Caffe, Facebook’s Folly, TensorFlow, etc.) and show high accuracy. For example, Multi-χ achieves an authorship example-based accuracy (A-EBA) of 86.41% and a per-segment authorship identification accuracy of 93.18% when identifying 562 programmers. We examine its performance across multiple dimensions and design choices, and demonstrate its effectiveness.
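The segment-and-integrate step described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed segment size, the 0.5 threshold, the rule of comparing each new segment with the most recently merged one, and the function names are all assumptions made for illustration. The RNN-based verifier is replaced by a toy stand-in, `toy_same_author_prob`, that only compares indentation style.

```python
from typing import Callable, List


def split_into_segments(code: str, lines_per_segment: int = 5) -> List[str]:
    """Divide a source file into consecutive segments of a fixed number of lines."""
    lines = code.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_segment])
        for i in range(0, len(lines), lines_per_segment)
    ]


def integrate_segments(
    segments: List[str],
    same_author_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Merge a segment into the previous merged segment when the verifier
    scores the pair at or above the threshold (assumed merge rule)."""
    merged: List[str] = []
    for segment in segments:
        if merged and same_author_prob(merged[-1], segment) >= threshold:
            merged[-1] = merged[-1] + "\n" + segment
        else:
            merged.append(segment)
    return merged


def toy_same_author_prob(a: str, b: str) -> float:
    """Hypothetical stand-in for the RNN verifier: two segments count as
    same-author when they agree on whether tabs are used."""
    return 1.0 if ("\t" in a) == ("\t" in b) else 0.0


if __name__ == "__main__":
    sample = "a\nb\nc\nd\n\te\n\tf"
    parts = split_into_segments(sample, lines_per_segment=2)
    for block in integrate_segments(parts, toy_same_author_prob):
        print(repr(block))
```

In a real system the merged segments would then be vectorised (the abstract names word2vec or TF-IDF) and passed to the identification model; that stage is omitted here.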

Author(s):  
Georgia Frantzeskou ◽  
Stephen G. MacDonell ◽  
Efstathios Stamatatos

Source code authorship identification has become a major concern in a wide variety of situations. These include authorship disputes, proof of authorship in court, cyber attacks in the form of viruses, Trojan horses, and logic bombs, fraud, and credit card cloning. Source code author identification is the task of identifying the most likely author of a computer program, given a set of predefined author candidates. We present a new approach, called SCAP (Source Code Author Profiles), which uses byte-level n-grams to represent a source code author’s style. Experiments on data sets in different programming languages (Java, C++, and Common Lisp) and of varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of source code authors. We also demonstrate that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition commonly encountered in cyber-crime cases.
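A byte-level n-gram author profile of the kind this abstract describes can be sketched as follows. The abstract does not state how profiles are compared, so this sketch assumes the simplest option: an author's profile is the set of their most frequent n-grams, and the candidate whose profile shares the most n-grams with the sample is chosen. The function names, the n-gram length of 3, and the profile size of 1000 are illustrative assumptions, not values from the paper.

```python
from collections import Counter
from typing import Dict, Set


def ngram_profile(source: bytes, n: int = 3, profile_size: int = 1000) -> Set[bytes]:
    """Return the set of the `profile_size` most frequent byte-level n-grams."""
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(profile_size)}


def profile_overlap(a: Set[bytes], b: Set[bytes]) -> int:
    """Similarity between two profiles: the number of n-grams they share
    (an assumed measure, chosen for simplicity)."""
    return len(a & b)


def most_likely_author(
    sample: bytes,
    author_sources: Dict[str, bytes],
    n: int = 3,
    profile_size: int = 1000,
) -> str:
    """Pick the candidate whose profile overlaps most with the sample's profile."""
    sample_profile = ngram_profile(sample, n, profile_size)
    return max(
        author_sources,
        key=lambda name: profile_overlap(
            sample_profile, ngram_profile(author_sources[name], n, profile_size)
        ),
    )


if __name__ == "__main__":
    # Toy training data; real profiles would be built from many files per author.
    known = {
        "alice": b"for (int i = 0; i < n; i++) { sum += i; }",
        "bob": b"(defun add (a b) (+ a b))",
    }
    unknown = b"for (int j = 0; j < m; j++) { total += j; }"
    print(most_likely_author(unknown, known))
```

Because the profile is built from raw bytes, nothing in the sketch depends on the programming language or on the presence of comments, which matches the language-independence claim in the abstract.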


Author(s):  
Ali Fadel ◽  
Husam Musleh ◽  
Ibraheem Tuffaha ◽  
Mahmoud Al-Ayyoub ◽  
Yaser Jararweh ◽  
...  

Author(s):  
Paulo Meirelles ◽  
Carlos Santos Jr. ◽  
Joao Miranda ◽  
Fabio Kon ◽  
Antonio Terceiro ◽  
...  

2020 ◽  
Author(s):  
Willian N. Oizumi ◽  
Alessandro F. Garcia

Design problems affect most software projects and make their maintenance expensive and impeditive. Thus, the identification of potential design problems in the source code – which is very often the only available and up-to-date artifact in a project – becomes essential in long-living software systems. This identification task is challenging because the reification of design problems in the source code tends to be scattered across several code elements. However, state-of-the-art techniques do not provide enough information to effectively help developers in this task. In this work, we address this challenge by proposing a new technique to support developers in revealing design problems. This technique synthesizes information about potential design problems, which materialize in the implementation in the form of syntactic and semantic anomaly agglomerations. Our evaluation shows that the proposed synthesis technique helps to reveal more than 1200 design problems across 7 industry-strength systems, with a median precision of 71% and a median recall of 78%. The relevance of our work has been recognized by the software engineering community through 2 awards and 7 publications in international and national venues.


Author(s):  
Gregorio Robles ◽  
Jesús M. González-Barahona ◽  
Daniel Izquierdo-Cortazar ◽  
Israel Herraiz

Thanks to the open nature of libre (free, open source) software projects, researchers have gained access to a rich set of data related to various aspects of software development. Although it is usually publicly available on the Internet, obtaining and analyzing the data in a convenient way is not an easy task, and many considerations have to be taken into account. In this chapter we introduce the most relevant data sources that can be found in libre software projects and that are commonly studied by scholars: source code releases, source code management systems, mailing lists and issue (bug) tracking systems. The chapter also provides some advice on the problems that can be found when retrieving and preparing the data sources for a later analysis, as well as information about the tools and datasets that support these tasks.


Author(s):  
D. Berry

Open source software (OSS) is computer software that has its underlying source code made available under a licence. This can allow developers and users to adapt and improve it (Raymond, 2001). Computer software can be broadly split into two development models:
• Proprietary, or closed software, owned by a company or individual. Copies of the binary are made public; the source code is not usually made public.
• Open-source software (OSS), where the source code is released with the binary. Users and developers can be licenced to use and modify the code, and to distribute any improvements they make.
Both OSS and proprietary approaches allow companies to make a profit. Companies developing proprietary software make money by developing software and then selling licences to use the software. For example, Microsoft receives a payment for every copy of Windows sold with a personal computer. OSS companies make their money by providing services, such as advising clients on the GPL licence. The licencee can either charge a fee for this service or work free of charge. In practice, software companies often develop both types of software. OSS is developed by an ongoing, iterative process where people share the ideas expressed in the source code. The aim is that a large community of developers and users can contribute to the development of the code, check it for errors and bugs, and make the improved version available to others. Project management software is used to allow developers to keep track of the various versions. There are two main types of open-source licences (although there are many variants and subtypes developed by other companies):
• Berkeley Software Distribution (BSD) Licence: This permits a licencee to “close” a version (by withholding the most recent modifications to the source code) and sell it as a proprietary product.
• GNU General Public Licence (GNU GPL, or GPL): Under this licence, licencees may not “close” versions. The licencee may modify, copy, and redistribute any derivative version, under the same GPL licence. The licencee can either charge a fee for this service or work free of charge.
Free software first evolved during the 1970s but in the 1990s forked into two movements, namely free software and open source (Berry, 2004). Richard Stallman, an American software developer who believes that sharing source code and ideas is fundamental to freedom of speech, developed a free version of the widely used Unix operating system. The resulting GNU program was released under a specially created General Public Licence (GNU GPL). This was designed to ensure that the source code would remain openly available to all. It was not intended to prevent commercial usage or distribution (Stallman, 2002). This approach was christened free software. In this context, free meant that anyone could modify the software. However, the term “free” was often misunderstood to mean no cost. Hence, during the 1990s, Eric Raymond and others proposed “open-source software” as a less contentious and more business-friendly term. This has become widely accepted within the software and business communities; however, there are still arguments about the most appropriate term to use (Moody, 2002). The OSMs are usually organised into a network of individuals who work collaboratively on the Internet, developing major software projects that sometimes rival commercial software but are always committed to the production of quality alternatives to those produced by commercial companies (Raymond, 2001; Williams, 2002). Groups and individuals develop software to meet their own and others’ needs in a highly decentralised way, likened to a Bazaar (Raymond, 2001). These groups often make substantive value claims to support their projects and foster an ethic of community, collaboration, deliberation, and intellectual freedom. In addition, Lessig (1999) argues that the FLOSS community can offer inspiration through its commitment to transparency in its products and its ability to open up governmental regulation and control through free/libre and open source code.

