Supervised Authorship Segmentation of Open Source Code Projects

Abstract: Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine the authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree, and identifying the edges of the AST on which primary authorship changes. We demonstrate that the former is a difficult problem but the latter is much easier. We also demonstrate methods by which we can leverage the easier problem to improve accuracy on the harder problem. We show that while identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and subtrees of at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. While our baseline accuracy for single AST nodes is 20.21% for the validation set and 35.66% for the test set, we present techniques that increase this accuracy to 42.01% and 49.21%, respectively. We further present observations about collaborative code found on GitHub that may drive further research.
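The two problems named in the abstract can be illustrated on a toy tree. The following is a minimal sketch, not the authors' method: it assumes every AST node already carries a known author label (in the paper this is what has to be predicted), and the `Node` class, the majority-vote definition of "primary author", and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # AST node type, e.g. "FunctionDef"
    author: str                     # hypothetical per-node author label
    children: list = field(default_factory=list)


def author_counts(node):
    """Count how many nodes each author wrote in the subtree rooted at node."""
    counts = Counter({node.author: 1})
    for child in node.children:
        counts.update(author_counts(child))
    return counts


def subtree_size(node):
    """Number of nodes in the subtree rooted at node."""
    return sum(author_counts(node).values())


def primary_author(node):
    """Majority author of the subtree; ties broken alphabetically (assumption)."""
    counts = author_counts(node)
    best = max(counts.values())
    return min(a for a, c in counts.items() if c == best)


def change_edges(node):
    """Edges (parent label, child label) where the primary author changes."""
    edges = []
    for child in node.children:
        if primary_author(child) != primary_author(node):
            edges.append((node.label, child.label))
        edges.extend(change_edges(child))
    return edges


# Toy tree: alice wrote the module and one function, bob wrote a class.
root = Node("Module", "alice", [
    Node("FunctionDef", "alice", [Node("Return", "alice")]),
    Node("ClassDef", "bob", [Node("Assign", "bob")]),
])
```

On this tree, `primary_author(root)` is `"alice"` and the only authorship-change edge is the one from `Module` to `ClassDef`. Small subtrees such as the single `Assign` node carry very little evidence, which is consistent with the abstract's observation that accuracy improves with subtree size.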


ABMMRS Eradicator: Improving Accuracy in Recommending Move Methods for Web-based MVC Projects and Libraries Using Method’s External Dependencies

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194020500357 ◽

2020 ◽

Vol 30 (09) ◽

pp. 1289-1307

Author(s):

Atish Kumar Dipongkor ◽

Iftekhar Ahmed ◽

Rayhanul Islam ◽

Nadia Nahar ◽

Abdus Satter ◽

...

Keyword(s):

Open Source ◽

Source Code ◽

Abstract Syntax ◽

Unit Test ◽

Web Based ◽

Abstract Syntax Tree ◽

Regular Class ◽

Initial Move ◽

Improving Accuracy ◽

Action Methods

Move Method Refactoring (MMR) is used to place highly coupled methods in appropriate classes to make source code more cohesive. Like other refactoring techniques, MMR must preserve the application's behavior. However, traditional MMR techniques fail to meet this essential precondition for Action methods in web-based applications and API methods in library projects. The reason is that applying MMR to these methods changes the behavior of the projects by raising Application-breaking issues, for instance, failure of browser requests and compilation errors in client projects. To resolve this problem, developers are advised to manually check Action and API methods while applying MMR. However, manually inspecting thousands of lines of code for these issues is a time-consuming and hectic task. In this paper, an advanced MMR technique is proposed which automatically identifies Application-breaking MMR suggestions. This technique first takes the initial move method suggestions from existing prominent MMR techniques, e.g. JDeodorant. For each suggestion, it parses the source code and constructs an Abstract Syntax Tree to examine two types of usage. One is whether the suggested method is not used in any unit test or Regular Class, and the other is whether the suggested method is used in unit test classes only. If an MMR suggestion is found to have either of these two types of usage, the suggestion is marked as Application-breaking. To evaluate the proposed technique, several experiments were conducted on open source projects. The experimental results show that the proposed technique achieved 96.4% Precision, 90% Recall and 93.1% F-score in detecting Application-breaking MMR suggestions, because it considers the external dependencies of the MMR suggestions.
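The two usage checks described above reduce to a simple decision rule once the usages of a suggested method have been collected from the AST. The following is a minimal sketch under stated assumptions: the `Usage` record and the `is_application_breaking` function are hypothetical names, the usage list is assumed to be already extracted, and the paper's actual AST analysis is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    caller_class: str     # hypothetical: name of the class that calls the method
    is_unit_test: bool    # True if the caller is a unit test class


def is_application_breaking(usages):
    """Flag a move-method suggestion whose method has no regular-class caller.

    Case 1: the method is used by neither unit tests nor regular classes.
    Case 2: the method is used by unit test classes only.
    In both cases the real callers are presumably external (browser requests
    or client projects), so moving the method may break them.
    """
    unused = len(usages) == 0
    test_only = len(usages) > 0 and all(u.is_unit_test for u in usages)
    return unused or test_only
```

For example, `is_application_breaking([])` and `is_application_breaking([Usage("OrderControllerTest", True)])` both return `True`, while a method that is also called from a regular class such as `Usage("OrderService", False)` is not flagged.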


Multi-Agent based Sequence Algorithm for Detecting Plagiarism and Clones in Java Source Code using Abstract Syntax Tree

International Journal of Computer Applications ◽

10.5120/15796-4494 ◽

2014 ◽

Vol 90 (15) ◽

pp. 19-24 ◽

Cited By ~ 1

Author(s):

D. Poongodi ◽

G. Tholkkappia Arasu

Keyword(s):

Source Code ◽

Abstract Syntax ◽

Agent Based ◽

Abstract Syntax Tree ◽

Syntax Tree ◽

Multi Agent


WASTK: A Weighted Abstract Syntax Tree Kernel Method for Source Code Plagiarism Detection

Scientific Programming ◽

10.1155/2017/7809047 ◽

2017 ◽

Vol 2017 ◽

pp. 1-8 ◽

Cited By ~ 12

Author(s):

Deqiang Fu ◽

Yanyan Xu ◽

Haoran Yu ◽

Boyang Yang

Keyword(s):

Kernel Method ◽

Source Code ◽

Detection Methods ◽

Abstract Syntax ◽

Plagiarism Detection ◽

Abstract Syntax Tree ◽

Syntax Tree ◽

Tree Kernel ◽

Document Frequency ◽

Abstract Syntax Trees

In this paper, we introduce a source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education. Unlike other plagiarism detection methods, WASTK takes into account aspects other than the similarity between programs. WASTK first converts the source code of a program to an abstract syntax tree and then obtains the similarity by calculating the tree kernel of two abstract syntax trees. To avoid misjudgment caused by trivial code snippets or frameworks given by instructors, an idea similar to TF-IDF (Term Frequency-Inverse Document Frequency) from the field of information retrieval is applied: each node in an abstract syntax tree is assigned a weight by TF-IDF. WASTK is evaluated on different datasets and performs much better than other popular methods such as Sim and JPlag.
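The TF-IDF weighting idea can be sketched as follows. This is an illustration only, not the WASTK implementation: it assumes Python programs parsed with the standard `ast` module, weights node types rather than individual nodes, uses the plain `tf * log(N / df)` formula, and omits the tree kernel entirely. The function names are hypothetical.

```python
import ast
import math
from collections import Counter


def node_type_counts(source):
    """Count AST node types (e.g. 'Assign', 'While') in one program."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))


def tfidf_weights(sources):
    """Return one {node_type: weight} dict per program.

    Node types that occur in every submission (for instance, code from an
    instructor-provided framework) get weight 0, so they cannot inflate
    the similarity between two programs.
    """
    docs = [node_type_counts(s) for s in sources]
    n_docs = len(docs)
    df = Counter()
    for counts in docs:
        df.update(counts.keys())          # document frequency per node type
    weights = []
    for counts in docs:
        total = sum(counts.values())
        weights.append({
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return weights


submissions = [
    "x = 1\n",
    "x = 1\nwhile x < 3:\n    x += 1\n",
]
weights = tfidf_weights(submissions)
```

Here the `Assign` node type appears in both submissions and receives weight 0, while `While` appears only in the second and receives a positive weight. In a full system these weights would scale each node's contribution inside the tree kernel.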


Specifying and Detecting Behavioral Changes in Source Code Using Abstract Syntax Tree Differencing

Trustworthy Computing and Services - Communications in Computer and Information Science ◽

10.1007/978-3-642-35795-4_59 ◽

2013 ◽

pp. 466-473

Author(s):

Yuankui Li ◽

Linzhang Wang

Keyword(s):

Source Code ◽

Behavioral Changes ◽

Abstract Syntax ◽

Abstract Syntax Tree ◽

Syntax Tree


Abstract Syntax Tree Based Source Code Antiplagiarism System for Large Projects Set

IEEE Access ◽

10.1109/access.2020.3026422 ◽

2020 ◽

Vol 8 ◽

pp. 175347-175359

Author(s):

Michal Duracik ◽

Patrik Hrkut ◽

Emil Krsak ◽

Stefan Toth

Keyword(s):

Source Code ◽

Abstract Syntax ◽

Abstract Syntax Tree ◽

Syntax Tree


An Intelligent Code Search Approach Using Hybrid Encoders

Wireless Communications and Mobile Computing ◽

10.1155/2021/9990988 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

Yao Meng

Keyword(s):

Deep Neural Networks ◽

Source Code ◽

Structural Features ◽

Abstract Syntax ◽

Abstract Syntax Tree ◽

Discriminative Models ◽

Learning Framework ◽

Code Search ◽

Parsing Algorithm ◽

Search Approach

Intelligent code search with natural language queries has become an important research area in software engineering. In this paper, we propose a novel deep learning framework, At-CodeSM, for source code search. The powerful code encoder in At-CodeSM, which is implemented with an abstract syntax tree parsing algorithm (Tree-LSTM) and token-level encoders, maintains both the lexical and structural features of source code during code vectorization. Both the representative and discriminative models are implemented with deep neural networks. Our experiments on the CodeSearchNet dataset show that At-CodeSM yields better performance in the task of intelligent code search than previous approaches.


Code Summarization: Generating Summary of Code Snippets

10.21467/proceedings.114.47 ◽

2021 ◽

Author(s):

Shreya R. Mehta ◽

Sneha S. Patil ◽

Nikita S. Shirguppi ◽

Vahida Attar

Keyword(s):

Natural Language ◽

Long Range ◽

Source Code ◽

Attention Mechanism ◽

Abstract Syntax ◽

Abstract Syntax Tree ◽

Syntax Tree ◽

Alternative Approach ◽

Transformer Model ◽

Java Code

Source Code Summarization refers to the task of creating understandable natural language summaries from a given code snippet. Good-quality and precise source code summaries are needed by numerous companies for a plethora of reasons: training newly joined employees, understanding in brief what a newly imported project does, maintaining precise summaries of the evolution of source code (using git history), categorizing the code, retrieving the code, automatically generating documents, etc. There is a considerable distinction between source code and natural language, since source code is organized and has loops, conditions, structures, classes, and so on. Most models follow an encoder-decoder structure. We propose an alternative approach that uses the UAST (Universal Abstract Syntax Tree) of the source code to generate tokens and then uses the Transformer model for its self-attention mechanism, which, unlike the RNN method, is helpful for capturing long-range dependencies. We have considered Java code snippets for generating code summaries.
