Supervised Authorship Segmentation of Open Source Code Projects

2021 ◽  
Vol 2021 (4) ◽  
pp. 464-479
Author(s):  
Edwin Dauber ◽  
Robert Erbacher ◽  
Gregory Shearer ◽  
Michael Weisman ◽  
Frederica Nelson ◽  
...  

Abstract Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree and identifying on which edges of the AST primary authorship changes. We demonstrate that the former is a difficult problem but the latter is much easier. We also demonstrate methods by which we can leverage the easier problem to improve accuracy for the harder problem. We show that while identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. While our baseline accuracy for single AST nodes is 20.21% for the validation set and 35.66% for the test set, we present techniques by which we can increase this accuracy to 42.01% and 49.21% respectively. We further present observations about collaborative code found on GitHub that may drive further research.
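
The abstract ties attribution accuracy to subtree size (for example, subtrees of at least 25 or 33 nodes). As a rough illustration of what "subtree size" means, the sketch below counts the nodes of every AST subtree. It uses Python's standard `ast` module purely for convenience; the abstract does not say which language or parser the authors worked with, and the function names `subtree_sizes` and `large_subtrees` are hypothetical.

```python
import ast


def subtree_sizes(source: str) -> list[tuple[str, int]]:
    """Return (node type name, node count) for every subtree, in post-order."""
    tree = ast.parse(source)
    sizes: list[tuple[str, int]] = []

    def count(node: ast.AST) -> int:
        # A subtree's size is the node itself plus the sizes of its children.
        total = 1 + sum(count(child) for child in ast.iter_child_nodes(node))
        sizes.append((type(node).__name__, total))
        return total

    count(tree)
    return sizes


def large_subtrees(source: str, min_nodes: int = 25) -> list[tuple[str, int]]:
    """Keep only subtrees with at least min_nodes nodes (threshold is an assumption)."""
    return [(name, size) for name, size in subtree_sizes(source) if size >= min_nodes]
```

Because the traversal is post-order, the last entry is always the root, and in any program most entries are small subtrees, which matches the abstract's observation that small subtrees are abundant.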

Author(s):  
Atish Kumar Dipongkor ◽  
Iftekhar Ahmed ◽  
Rayhanul Islam ◽  
Nadia Nahar ◽  
Abdus Satter ◽  
...  

Move Method Refactoring (MMR) is used to place highly coupled methods in appropriate classes to make source code more cohesive. As with other refactoring techniques, applying MMR must preserve the application's behavior. However, traditional MMR techniques fail to meet this essential precondition for Action methods in web-based applications and API methods in library projects. The reason is that applying MMR to these methods changes the behavior of the projects by raising Application-breaking issues, for instance, failed browser requests and compilation errors in client projects. To resolve this problem, developers are advised to manually check Action and API methods while applying MMR. However, manually inspecting thousands of lines of code for these issues is a time-consuming and tedious task. In this paper, an advanced MMR technique is proposed which automatically identifies Application-breaking MMR suggestions. This technique first takes the initial move method suggestions from existing prominent MMR techniques, e.g. JDeodorant. For each suggestion, it parses the source code and constructs an Abstract Syntax Tree to examine two types of usage. One is whether the suggested method is not used in any unit test or regular class, and the other is whether it is used in unit test classes only. If an MMR suggestion exhibits either of these two types of usage, it is marked as Application-breaking. To evaluate the proposed technique, several experiments have been conducted on open source projects. The experimental results show that the proposed technique achieved 96.4% Precision, 90% Recall and 93.1% F-score in detecting Application-breaking MMR suggestions, because it considers the external dependencies of the MMR suggestions.
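
The two usage checks described above both amount to "the method is never called from a regular class". The sketch below illustrates that reading only; it is not the authors' implementation. It scans Python sources with the standard `ast` module, whereas the abstract does not state the language of the studied projects, and the `test_` file-name prefix used to recognise unit tests is an assumption, as are the function names.

```python
import ast


def calls_method(source: str, method_name: str) -> bool:
    """True if the source contains a call such as name(...) or obj.name(...)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                called = func.attr
            else:
                called = getattr(func, "id", None)  # None for non-name callees
            if called == method_name:
                return True
    return False


def is_application_breaking(files: dict[str, str], method_name: str) -> bool:
    """Flag a suggestion whose method is unused, or used only by unit tests.

    files maps a file name to its source; names starting with "test_"
    are treated as unit tests (a hypothetical convention).
    """
    used_in_regular_code = any(
        calls_method(source, method_name)
        for path, source in files.items()
        if not path.startswith("test_")
    )
    return not used_in_regular_code
```

A method with no internal callers is presumably reached from outside the project (a browser request or a client library), which is why moving it risks breaking external users.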


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Deqiang Fu ◽  
Yanyan Xu ◽  
Haoran Yu ◽  
Boyang Yang

In this paper, we introduce a source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education. Unlike other plagiarism detection methods, WASTK takes into account aspects other than the similarity between programs. WASTK first converts the source code of a program to an abstract syntax tree and then computes the similarity by calculating the tree kernel of two abstract syntax trees. To avoid misjudgment caused by trivial code snippets or frameworks given by instructors, an idea similar to TF-IDF (Term Frequency-Inverse Document Frequency) from the field of information retrieval is applied. Each node in an abstract syntax tree is assigned a weight by TF-IDF. WASTK is evaluated on different datasets and, as a result, performs much better than other popular methods like Sim and JPlag.
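
To make the TF-IDF idea concrete, the sketch below weights AST node types across a set of submissions. It covers only the weighting step, not the tree kernel, and it makes several assumptions not found in the abstract: Python's `ast` module as the parser, the node type name as the "term", and the plain formula tf × log(N / df). The function names are hypothetical.

```python
import ast
import math
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in one program."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def tfidf_weights(sources: list[str]) -> list[dict[str, float]]:
    """Return one {node type: TF-IDF weight} mapping per program."""
    counts = [node_type_counts(source) for source in sources]
    n_docs = len(counts)

    # Document frequency: in how many programs does each node type appear?
    doc_freq: Counter = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            node_type: (k / total) * math.log(n_docs / doc_freq[node_type])
            for node_type, k in c.items()
        })
    return weights
```

With this formula, a node type that occurs in every submission (such as code from an instructor-provided framework) receives weight zero, so it cannot inflate the similarity between two programs.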


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 175347-175359
Author(s):  
Michal Duracik ◽  
Patrik Hrkut ◽  
Emil Krsak ◽  
Stefan Toth

2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Yao Meng

Intelligent code search with natural language queries has become an important research area in software engineering. In this paper, we propose a novel deep learning framework, At-CodeSM, for source code search. The powerful code encoder in At-CodeSM, which is implemented with an abstract syntax tree parsing algorithm (Tree-LSTM) and token-level encoders, maintains both the lexical and structural features of source code in the process of code vectorization. Both the representative and discriminative models are implemented with deep neural networks. Our experiments on the CodeSearchNet dataset show that At-CodeSM yields better performance in the task of intelligent code search than previous approaches.


2021 ◽  
Author(s):  
Shreya R. Mehta ◽  
Sneha S. Patil ◽  
Nikita S. Shirguppi ◽  
Vahida Attar

Source Code Summarization refers to the task of creating understandable natural language summaries from a given code snippet. Good-quality and precise source code summaries are needed by numerous companies for a plethora of reasons: training newly joined employees, understanding in brief what a newly imported project does, maintaining precise summaries of the evolution of source code (using git history), categorizing the code, retrieving the code, automatically generating documents, etc. There is a considerable distinction between source code and natural language, since source code is organized and has loops, conditions, structures, classes, and so on. Most models follow an encoder-decoder structure. We propose an alternative approach that uses the UAST (Universal Abstract Syntax Tree) of the source code to generate tokens and then uses the Transformer model for its self-attention mechanism, which, unlike the RNN method, is helpful for capturing long-range dependencies. We have considered Java code snippets for generating code summaries.


2005 ◽  
Vol 30 (4) ◽  
pp. 1-5 ◽  
Author(s):  
Iulian Neamtiu ◽  
Jeffrey S. Foster ◽  
Michael Hicks
