Syntax tree fingerprinting for source code similarity detection

Source-Code Similarity Measurement: Syntax Tree Fingerprinting for Automated Evaluation

10.1145/3486001.3486228 ◽

2021 ◽

Author(s):

Arjun Verma ◽

Prateksha Udhayanan ◽

Rahul Murali Shankar ◽

Nikhila KN ◽

Sujit Kumar Chakrabarti

Keyword(s):

Source Code ◽

Similarity Measurement ◽

Syntax Tree ◽

Automated Evaluation

Multi-Agent based Sequence Algorithm for Detecting Plagiarism and Clones in Java Source Code using Abstract Syntax Tree

International Journal of Computer Applications ◽

10.5120/15796-4494 ◽

2014 ◽

Vol 90 (15) ◽

pp. 19-24 ◽

Cited By ~ 1

Author(s):

D. Poongodi ◽

G.Tholkkappia Arasu

Keyword(s):

Source Code ◽

Abstract Syntax ◽

Agent Based ◽

Abstract Syntax Tree ◽

Syntax Tree ◽

Multi Agent

Scalable Source Code Similarity Detection in Large Code Repositories

ICST Transactions on Scalable Information Systems ◽

10.4108/eai.13-7-2018.159353 ◽

2019 ◽

Vol 6 (22) ◽

pp. 159353

Author(s):

Firas Alomari ◽

Muhammed Harbi

Keyword(s):

Source Code ◽

Similarity Detection

WASTK: A Weighted Abstract Syntax Tree Kernel Method for Source Code Plagiarism Detection

Scientific Programming ◽

10.1155/2017/7809047 ◽

2017 ◽

Vol 2017 ◽

pp. 1-8 ◽

Cited By ~ 12

Author(s):

Deqiang Fu ◽

Yanyan Xu ◽

Haoran Yu ◽

Boyang Yang

Keyword(s):

Kernel Method ◽

Source Code ◽

Detection Methods ◽

Abstract Syntax ◽

Plagiarism Detection ◽

Abstract Syntax Tree ◽

Syntax Tree ◽

Tree Kernel ◽

Document Frequency ◽

Abstract Syntax Trees

In this paper, we introduce a source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education. Different from other plagiarism detection methods, WASTK takes some aspects other than the similarity between programs into account. WASTK firstly transfers the source code of a program to an abstract syntax tree and then gets the similarity by calculating the tree kernel of two abstract syntax trees. To avoid misjudgment caused by trivial code snippets or frameworks given by instructors, an idea similar to TF-IDF (Term Frequency-Inverse Document Frequency) in the field of information retrieval is applied. Each node in an abstract syntax tree is assigned a weight by TF-IDF. WASTK is evaluated on different datasets and, as a result, performs much better than other popular methods like Sim and JPlag.

Specifying and Detecting Behavioral Changes in Source Code Using Abstract Syntax Tree Differencing

Trustworthy Computing and Services - Communications in Computer and Information Science ◽

10.1007/978-3-642-35795-4_59 ◽

2013 ◽

pp. 466-473

Author(s):

Yuankui Li ◽

Linzhang Wang

Keyword(s):

Source Code ◽

Behavioral Changes ◽

Abstract Syntax ◽

Abstract Syntax Tree ◽

Syntax Tree

Abstract Syntax Tree Based Source Code Antiplagiarism System for Large Projects Set

IEEE Access ◽

10.1109/access.2020.3026422 ◽

2020 ◽

Vol 8 ◽

pp. 175347-175359

Author(s):

Michal Duracik ◽

Patrik Hrkut ◽

Emil Krsak ◽

Stefan Toth

Keyword(s):

Source Code ◽

Abstract Syntax ◽

Abstract Syntax Tree ◽

Syntax Tree

Flow Chart Generation-Based Source Code Similarity Detection Using Process Mining

Scientific Programming ◽

10.1155/2020/8865413 ◽

2020 ◽

Vol 2020 ◽

pp. 1-15

Author(s):

Feng Zhang ◽

Lulu Li ◽

Cong Liu ◽

Qingtian Zeng

Keyword(s):

Computer Programming ◽

Process Mining ◽

Source Code ◽

Plagiarism Detection ◽

Dynamic Features ◽

Code Obfuscation ◽

Loop Unrolling ◽

Similarity Detection ◽

Flow Charts ◽

And Function

Source code similarity detection has extensive applications in computer programming teaching and software intellectual property protection. In the teaching of computer programming courses, students may utilize some complex source code obfuscation techniques, e.g., opaque predicates, loop unrolling, and function inlining and outlining, to reduce the similarity between code fragments and avoid the plagiarism detection. Existing source code similarity detection approaches only consider static features of source code, making it difficult to cope with more complex code obfuscation techniques. In this paper, we propose a novel source code similarity detection approach by considering the dynamic features at runtime of source code using process mining. More specifically, given two pieces of source code, their running logs are obtained by source code instrumentation and execution. Next, process mining is used to obtain the flow charts of the two pieces of source code by analyzing their collected running logs. Finally, similarity of the two pieces of source code is measured by computing the similarity of these two flow charts. Experimental results show that the proposed approach can deal with more complex obfuscation techniques including opaque predicates and loop unrolling as well as function inlining and outlining, which cannot be handled by existing work properly. Therefore, we argue that our approach can defeat commonly used code obfuscation techniques more effectively for source code similarity detection than the existing state-of-the-art approaches.

Source-code Similarity Detection and Detection Tools Used in Academia

ACM Transactions on Computing Education ◽

10.1145/3313290 ◽

2019 ◽

Vol 19 (3) ◽

pp. 1-37 ◽

Cited By ~ 8

Author(s):

Matija Novak ◽

Mike Joy ◽

Dragutin Kermek

Keyword(s):

Source Code ◽

Similarity Detection

Similarity Detection in Large Volume Data using Machine Learning Techniques

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c3987.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 735-739

Keyword(s):

Machine Learning ◽

Big Data ◽

Source Code ◽

Computational Cost ◽

Machine Learning Techniques ◽

Volume Data ◽

Plagiarism Detection ◽

Computationally Efficient ◽

Source Codes ◽

Similarity Detection

When unauthorized copying or stealing of intellectual properties of others happen, it is called plagiarism. Two main approaches are used to counter this problem – external plagiarism detection and intrinsic plagiarism detection. External algorithms compare a suspicious file with numerous sources whereas intrinsic algorithms are allowed to solely inspect the suspicious file in order to predict plagiarism. In this work, the area chosen for detecting plagiarism is with programs or source code files. Copying the entire source code or logic used in a particular program without permissions or copyright is the stealing that happens in the case of source codes. There exist many ways to detect plagiarism in source code files. To perform plagiarism checking for a large dataset, the computational cost is very high and moreover it’s a time consuming job. To achieve a computationally efficient similarity detection in source code files, the Hadoop framework is used where parallel computation is possible for large datasets. But the raw data available to us is not in a suitable form for the existing plagiarism checking tools to work with, as their size is too high and they possess features of big data. Thus a qualifying model is required for the dataset, to be fed into Hadoop so that it could efficiently process them to check for plagiarism in source codes. To generate such a model, machine learning is used which incorporates big data with machine learning

A new model for collaborative learning of programming using source code similarity detection

2015 IEEE Global Engineering Education Conference (EDUCON) ◽

10.1109/educon.2015.7096047 ◽

2015 ◽

Cited By ~ 2

Author(s):

Emil Stankov ◽

Mile Jovanov ◽

Bojan Kostadinov ◽

Ana Madevska Bogdanova

Keyword(s):

Collaborative Learning ◽

Source Code ◽

New Model ◽

Similarity Detection