Source Code Authorship Analysis For Supporting the Cybercrime Investigation Process

Author(s):  
Georgia Frantzeskou ◽  
Stephen G. MacDonell ◽  
Efstathios Stamatatos

Nowadays, source code authorship identification has become an issue of major concern in a wide variety of situations, including authorship disputes, proof of authorship in court, and cyber attacks in the form of viruses, trojan horses, logic bombs, fraud, and credit card cloning. Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined candidate authors. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-grams, to represent a source code author's style. Experiments on data sets of different programming languages (Java, C++, and Common Lisp) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of source code authors. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cybercrime cases.
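
As a rough illustration of the byte-level n-gram idea, here is a minimal Python sketch of SCAP-style author profiles: each candidate's profile is the L most frequent byte-level n-grams of their known code, and the unknown sample is attributed to the candidate whose profile overlaps it most (a set-intersection similarity in the spirit of the simplified profile intersection reported in the SCAP literature). The values of n and L are illustrative, not the paper's tuned settings.

    from collections import Counter

    def ngram_profile(source: str, n: int = 3, profile_len: int = 1500) -> set:
        """Author profile: the profile_len most frequent byte-level n-grams."""
        data = source.encode("utf-8")
        grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
        return {g for g, _ in grams.most_common(profile_len)}

    def spi(profile_a: set, profile_b: set) -> int:
        """Simplified profile intersection: size of the overlap of two profiles."""
        return len(profile_a & profile_b)

    def identify(unknown_code: str, author_samples: dict, n: int = 3, L: int = 1500) -> str:
        """Attribute unknown_code to the candidate whose profile overlaps it most.
        author_samples maps each candidate name to a list of known source files."""
        unknown = ngram_profile(unknown_code, n, L)
        profiles = {a: ngram_profile("\n".join(files), n, L)
                    for a, files in author_samples.items()}
        return max(profiles, key=lambda a: spi(profiles[a], unknown))

Because profiles are plain sets of byte strings, the comparison never parses the code, which is consistent with the reported robustness to missing comments.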

2020 ◽  
Vol 2020 (3) ◽  
pp. 25-41
Author(s):  
Mohammed Abuhamad ◽  
Tamer Abuhmed ◽  
DaeHun Nyang ◽  
David Mohaisen

Most authorship identification schemes assume that code samples are written by a single author. However, real software projects are typically the result of a team effort, making it essential to consider fine-grained multi-author identification within a single code sample, which we address with Multi-χ. Multi-χ leverages a deep learning-based approach for multi-author identification in source code; it is lightweight, uses a compact representation for efficiency, and does not require any code parsing, syntax tree extraction, or feature selection. In Multi-χ, code samples are divided into small segments, which are then represented as a sequence of n-dimensional term representations. The sequence is fed into an RNN-based verification model to assist a segment integration process, which integrates positively verified segments, i.e., segments that have a high probability of being written by one author. Finally, the resulting segments from the integration process are represented using word2vec or TF-IDF and fed into the identification model. We evaluate Multi-χ on several GitHub projects (Caffe, Facebook's Folly, TensorFlow, etc.) and show remarkable accuracy. For example, Multi-χ achieves an authorship example-based accuracy (A-EBA) of 86.41% and per-segment authorship identification accuracy of 93.18% when identifying 562 programmers. We examine the performance across multiple dimensions and design choices, and demonstrate its effectiveness.
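
The segment-then-identify pipeline can be sketched with off-the-shelf tools. The following Python sketch substitutes a TF-IDF representation and a logistic regression classifier for the paper's RNN-based verification and identification models; the segment length, character n-gram range, and classifier are illustrative assumptions, not Multi-χ's actual components.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def split_segments(source: str, lines_per_segment: int = 10) -> list:
        """Divide a code sample into fixed-size line segments (size is illustrative)."""
        lines = source.splitlines()
        return ["\n".join(lines[i:i + lines_per_segment])
                for i in range(0, len(lines), lines_per_segment)]

    def train_segment_identifier(samples):
        """samples: list of (code, author) pairs from single-author files."""
        segments, labels = [], []
        for code, author in samples:
            for seg in split_segments(code):
                segments.append(seg)
                labels.append(author)
        model = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # char n-grams: an assumption
            LogisticRegression(max_iter=1000),
        )
        model.fit(segments, labels)
        return model

    # Per-segment prediction over a multi-author file:
    #   model.predict(split_segments(multi_author_code))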


Author(s):  
Kyungkoo Jun

Background & Objective: This paper proposes a Fourier-transform-inspired method to classify human activities from time series sensor data. Methods: Our method begins by decomposing a 1D input signal into 2D patterns, motivated by the Fourier transform. The decomposition is aided by a Long Short-Term Memory (LSTM) network, which captures the temporal dependency of the signal and produces encoded sequences. The sequences, once arranged into a 2D array, can represent the fingerprints of the signals. The benefit of such a transformation is that we can exploit recent advances in deep learning models for image classification, such as Convolutional Neural Networks (CNNs). Results: The proposed model is therefore a combination of an LSTM and a CNN. We evaluate the model on two data sets. For the first data set, which is more standardized than the other, our model outperforms or at least equals previous work. For the second data set, we devise schemes to generate training and testing data by varying the window size, the sliding size, and the labeling scheme. Conclusion: The evaluation results show that the accuracy exceeds 95% in some cases. We also analyze the effect of these parameters on performance.
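
A minimal PyTorch sketch of the LSTM-plus-CNN combination follows: the LSTM encodes the 1D signal, the sequence of encodings is treated as a one-channel 2D map, and a small CNN classifies it. All layer sizes and the number of classes are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class LSTMCNNClassifier(nn.Module):
        """Encode a 1D sensor signal with an LSTM, arrange the encoded sequence
        as a 2D map, then classify it with a small CNN (sizes illustrative)."""
        def __init__(self, hidden: int = 64, n_classes: int = 6):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4),
            )
            self.fc = nn.Linear(32 * 4 * 4, n_classes)

        def forward(self, x):            # x: (batch, seq_len, 1)
            encoded, _ = self.lstm(x)    # (batch, seq_len, hidden)
            img = encoded.unsqueeze(1)   # treat encodings as a 1-channel image
            feats = self.cnn(img)
            return self.fc(feats.flatten(1))

    # Example: a batch of 8 windows of 128 samples each.
    logits = LSTMCNNClassifier()(torch.randn(8, 128, 1))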


Computers ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 47
Author(s):  
Fariha Iffath ◽  
A. S. M. Kayes ◽  
Md. Tahsin Rahman ◽  
Jannatul Ferdows ◽  
Mohammad Shamsul Arefin ◽  
...  

A programming contest generally involves the host presenting a set of logical and mathematical problems to the contestants, who are required to write computer programs capable of solving them. An online judge system is used to automate the judging of the programs submitted by the users; online judges are systems designed for the reliable evaluation of submitted source code. Traditional online judging platforms are not ideally suited to programming labs, as they do not support partial scoring or efficient detection of plagiarized code. Considering this, in this paper we present an online judging framework capable of automatically scoring code by efficiently detecting plagiarized content and the level of accuracy of the code. Our system detects plagiarism by extracting fingerprints of programs and comparing the fingerprints instead of the whole files. We use winnowing to select fingerprints from among the k-gram hash values of a source file, generated with the Rabin–Karp algorithm. The proposed system is compared with existing online judging platforms to show its superiority in terms of time efficiency, correctness, and feature availability. In addition, we evaluate our system on large data sets, comparing its run time with that of MOSS, a widely used plagiarism detection technique.
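
The fingerprinting step described here follows the standard k-gram hashing plus winnowing scheme. A Python sketch under illustrative parameter choices (k, window size, and the hash base/modulus are assumptions, not the paper's settings):

    def kgram_hashes(text: str, k: int = 5, base: int = 257, mod: int = (1 << 31) - 1):
        """Rolling Rabin–Karp hashes of all k-grams of text."""
        if len(text) < k:
            return []
        h, top = 0, pow(base, k - 1, mod)
        for ch in text[:k]:
            h = (h * base + ord(ch)) % mod
        hashes = [h]
        for i in range(k, len(text)):
            h = ((h - ord(text[i - k]) * top) * base + ord(text[i])) % mod
            hashes.append(h)
        return hashes

    def winnow(hashes, w: int = 4):
        """Winnowing: keep the rightmost minimum hash of each window of w hashes."""
        fingerprints = set()
        for i in range(len(hashes) - w + 1):
            window = hashes[i:i + w]
            j = max(idx for idx, v in enumerate(window) if v == min(window))
            fingerprints.add((window[j], i + j))   # (hash, position)
        return fingerprints

    def similarity(code_a: str, code_b: str) -> float:
        """Jaccard overlap of fingerprint hashes as a plagiarism score (illustrative)."""
        fa = {h for h, _ in winnow(kgram_hashes(code_a))}
        fb = {h for h, _ in winnow(kgram_hashes(code_b))}
        return len(fa & fb) / max(1, len(fa | fb))

Comparing the selected fingerprints rather than whole files is what makes the check cheap enough to run on every submission.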


2020 ◽  
Vol 70 (4) ◽  
pp. 953-978
Author(s):  
Mustafa Ç. Korkmaz ◽  
G. G. Hamedani

This paper proposes a new extended Lindley distribution, which has more flexible density and hazard rate shapes than the Lindley and Power Lindley distributions, based on a mixture distribution structure, in order to model real data phenomena with new distributional characteristics. Some of its distributional properties, such as the shapes, moments, quantile function, Bonferroni and Lorenz curves, mean deviations, and order statistics, have been obtained. Characterizations based on two truncated moments and on conditional expectation, as well as in terms of the hazard function, are presented. Different estimation procedures have been employed to estimate the unknown parameters, and their performances are compared via Monte Carlo simulations. The flexibility and importance of the proposed model are illustrated with two real data sets.
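
The abstract does not reproduce the extended distribution itself, so as a reference point the following Python sketch works with the classical Lindley distribution, whose density is f(x) = θ²/(1+θ)(1+x)e^(−θx): it samples via the well-known mixture representation (Exp(θ) with probability θ/(1+θ), otherwise Gamma(2, θ)) and checks the closed-form maximum likelihood estimator in a small Monte Carlo experiment of the kind the paper describes. The parameter value and sample sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def rlindley(theta: float, size: int) -> np.ndarray:
        """Sample from Lindley(theta) via its mixture form:
        Exp(theta) with prob theta/(1+theta), else Gamma(2, theta)."""
        exp_part = rng.random(size) < theta / (1 + theta)
        return np.where(exp_part,
                        rng.exponential(1 / theta, size),
                        rng.gamma(2, 1 / theta, size))

    def mle_lindley(x: np.ndarray) -> float:
        """Closed-form MLE for the classical Lindley distribution:
        theta_hat = (-(xbar-1) + sqrt((xbar-1)^2 + 8*xbar)) / (2*xbar)."""
        m = x.mean()
        return (-(m - 1) + np.sqrt((m - 1) ** 2 + 8 * m)) / (2 * m)

    # Monte Carlo check of estimator bias at theta = 1.5, n = 200 (illustrative).
    estimates = [mle_lindley(rlindley(1.5, 200)) for _ in range(1000)]
    print(np.mean(estimates), np.std(estimates))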


Author(s):  
Ali Fadel ◽  
Husam Musleh ◽  
Ibraheem Tuffaha ◽  
Mahmoud Al-Ayyoub ◽  
Yaser Jararweh ◽  
...  

2014 ◽  
Vol 2014 ◽  
pp. 1-7
Author(s):  
Dinesh Verma ◽  
Shishir Kumar

Nowadays, software developers face challenges in minimizing the number of defects introduced during software development. Using the defect density parameter, developers can identify possibilities for improvement in the product. Since the total number of defects depends on module size, there is a need to calculate the optimal module size that minimizes defect density. In this paper, an improved model is formulated that captures the relationship between defect density and variable module size. This relationship can be used to optimize overall defect density through an effective distribution of module sizes. Three available data sets related to this aspect have been examined with the proposed model, taking distinct values of the variables and placing some constraints on the parameters. A curve-fitting method has been used to obtain the module size with minimum defect density, and goodness-of-fit measures have been computed to validate the proposed model on the data sets. Defect density can thus be optimized by an effective distribution of module sizes: larger modules can be broken into smaller ones, and smaller modules can be merged, to minimize overall defect density.
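
The abstract does not give the model's functional form, so the following Python sketch illustrates the curve-fitting idea with a hypothetical U-shaped relationship D(s) = a/s + b·s + c, whose minimum lies at s* = √(a/b); both the model form and the data points are assumptions for illustration only, not the paper's model or data.

    import numpy as np
    from scipy.optimize import curve_fit

    def defect_density(size, a, b, c):
        """Hypothetical U-shaped model: fixed per-module overhead dominates
        small modules (a/size), complexity growth dominates large ones (b*size)."""
        return a / size + b * size + c

    # Illustrative (module size in LOC, defects/KLOC) observations.
    sizes = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
    density = np.array([9.1, 5.4, 3.8, 3.5, 4.6, 7.2])

    (a, b, c), _ = curve_fit(defect_density, sizes, density, p0=(100.0, 0.01, 1.0))
    optimal_size = np.sqrt(a / b)   # analytic minimizer of a/s + b*s + c
    print(f"estimated optimal module size: {optimal_size:.0f} LOC")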

