Inter-Project Functional Clone Detection Toward Building Libraries - An Empirical Study on 13,000 Projects

Author(s):  
Tomoya Ishihara ◽  
Keisuke Hotta ◽  
Yoshiki Higo ◽  
Hiroshi Igaki ◽  
Shinji Kusumoto
2016 ◽  
Vol 2 ◽  
pp. e49 ◽  
Author(s):  
Stefan Wagner ◽  
Asim Abdulkhaleq ◽  
Ivan Bogicevic ◽  
Jan-Peter Ostberg ◽  
Jasmin Ramadani

Background. Today, redundancy in source code, so-called “clones”, caused by copy&paste can be found reliably using clone detection tools. Redundancy can arise also independently, however, not caused by copy&paste. At present, it is not clear how only functionally similar clones (FSC) differ from clones created by copy&paste. Our aim is to understand and categorise the syntactic differences in FSCs that distinguish them from copy&paste clones in a way that helps clone detection research. Methods. We conducted an experiment using known functionally similar programs in Java and C from coding contests. We analysed syntactic similarity with traditional detection tools and explored whether concolic clone detection can go beyond syntax. We ran all tools on 2,800 programs and manually categorised the differences in a random sample of 70 program pairs. Results. We found no FSCs where complete files were syntactically similar. We could detect a syntactic similarity in a part of the files in < 16% of the program pairs. Concolic detection found 1 of the FSCs. The differences between program pairs were in the categories algorithm, data structure, OO design, I/O and libraries. We selected 58 pairs for an openly accessible benchmark representing these categories. Discussion. The majority of differences between functionally similar clones are beyond the capabilities of current clone detection approaches. Yet, our benchmark can help to drive further clone detection research.


2019 ◽  
Vol 4 (12) ◽  
pp. 9-15
Author(s):  
Pallavi Sharma ◽  
Chetanpal Singh

Code clone detection finds duplicated code patterns within a code base. Programmers have reused code for years to save time, either by replication or by plain copy-paste: instead of writing code from scratch, they copy the useful part of existing code. Plagiarism detection also works fairly well for finding duplicated code fragments or text, but it is not applicable to large systems when searching for functional clones, and it is time-consuming even at small scale, which makes it unsuitable as a detection method. In this paper, we propose pattern similarity conditions based on textual similarity for finding code or text clones in large content, on the basis of SVM, Neural Network using Java coding, Neural Network and SimCad. This approach detects code or text clones of an original. The simulation was carried out in the MATLAB environment, and it shows that the approach provides better results. The performance of the proposed algorithm is measured using FRR, FAR and accuracy.
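
The abstract above names FRR, FAR and accuracy as evaluation parameters without defining them. The following is a minimal sketch, assuming the conventional definitions: FAR as the share of non-clone pairs wrongly reported as clones, and FRR as the share of true clone pairs that are missed. The function name `far_frr_accuracy` and its inputs are hypothetical and are not taken from the paper.

```python
def far_frr_accuracy(labels, predictions):
    """Compute FAR, FRR and accuracy for a clone detector.

    labels and predictions are equal-length sequences of booleans,
    where True means "this pair is a clone". The definitions used
    here are the conventional ones and are an assumption; the paper
    does not spell them out.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # clones correctly reported
    tn = sum(1 for y, p in pairs if not y and not p)  # non-clones correctly rejected
    fp = sum(1 for y, p in pairs if not y and p)      # false accepts
    fn = sum(1 for y, p in pairs if y and not p)      # false rejects

    negatives = fp + tn
    positives = fn + tp
    total = positives + negatives

    far = fp / negatives if negatives else 0.0        # false acceptance rate
    frr = fn / positives if positives else 0.0        # false rejection rate
    accuracy = (tp + tn) / total if total else 0.0
    return far, frr, accuracy
```

Calling the function with a list of ground-truth labels and a list of detector outputs returns the three rates as a tuple. A lower FAR and FRR and a higher accuracy indicate a better detector under these definitions.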


2020 ◽  
pp. 1-15
Author(s):  
Wei Hua ◽  
Yulei Sui ◽  
Yao Wan ◽  
Guangzhong Liu ◽  
Guandong Xu


Author(s):  
Hui-Hui Wei ◽  
Ming Li

Software clone detection is an important problem for software maintenance and evolution, and it has attracted much attention. However, existing approaches ignore the fact that people label pairs of code fragments as clones only if they happen to discover them, while a huge number of undiscovered clone pairs and non-clone pairs are left unlabeled. In this paper, we argue that the clone detection task in the real world should be formalized as a Positive-Unlabeled (PU) learning problem. We address this problem by proposing a novel positive and unlabeled learning approach, namely CDPU, to effectively detect software functional clones, i.e., pieces of code with similar functionality that differ at both the syntactic and lexical levels. Adversarial training is employed to improve the robustness of the learned model to non-clone pairs that look extremely similar but behave differently. Experiments on software clone detection benchmarks indicate that the proposed approach together with adversarial training outperforms the state-of-the-art approaches for software functional clone detection.


Author(s):  
Huihui Wei ◽  
Ming Li

Software clone detection, which aims to identify code fragments with similar functionality, has played an important role in software maintenance and evolution. Many clone detection approaches have been proposed. However, most of them represent source code with hand-crafted features using lexical or syntactic information, or with unsupervised deep features, which makes it difficult to detect functional clone pairs, i.e., pieces of code with similar functionality that differ at both the syntactic and lexical levels. In this paper, we address the software functional clone detection problem by learning supervised deep features. We formulate clone detection as a supervised learning-to-hash problem and propose an end-to-end deep feature learning framework called CDLH for functional clone detection. The framework learns hash codes by exploiting lexical and syntactic information for fast computation of functional similarity between code fragments. Experiments on software clone detection benchmarks indicate that the CDLH approach is effective and outperforms the state-of-the-art approaches in software functional clone detection.
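
To illustrate why hash codes allow fast similarity computation, here is a minimal sketch that compares two binary codes by Hamming distance. It assumes, as is common in learning-to-hash, that each code fragment is mapped to a fixed-length bit string and that a small Hamming distance signals functional similarity; the abstract does not state the distance measure. The function names, the integer encoding and the threshold are hypothetical, and the learning step that CDLH uses to produce such codes is not shown.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bit positions in which two hash codes differ.

    The codes are held as non-negative integers of the same bit length.
    XOR leaves a 1 exactly where the two codes disagree.
    """
    return bin(code_a ^ code_b).count("1")


def is_probable_clone(code_a: int, code_b: int, threshold: int) -> bool:
    """Flag a pair as a likely functional clone.

    The threshold is a hypothetical tuning parameter, not a value
    reported for CDLH.
    """
    return hamming_distance(code_a, code_b) <= threshold
```

Because the comparison is a single XOR followed by a bit count, checking one pair costs far less than comparing two syntax trees, which is the practical appeal of representing fragments as hash codes.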


1996 ◽  
Vol 81 (1) ◽  
pp. 76-87 ◽  
Author(s):  
Connie R. Wanberg ◽  
John D. Watt ◽  
Deborah J. Rumsey
