Positive and Unlabeled Learning for Detecting Software Functional Clones with Adversarial Training

Author(s):  
Hui-Hui Wei ◽  
Ming Li

Software clone detection is an important problem for software maintenance and evolution, and it has attracted considerable attention. However, existing approaches ignore the fact that people label a pair of code fragments as a "clone" only if they happen to discover it, leaving a huge number of undiscovered clone pairs and non-clone pairs unlabeled. In this paper, we argue that real-world clone detection should be formalized as a Positive-Unlabeled (PU) learning problem, and we address it by proposing a novel positive and unlabeled learning approach, named CDPU, to effectively detect software functional clones, i.e., pieces of code with similar functionality that differ at both the syntactic and lexical levels. Adversarial training is employed to improve the robustness of the learned model to non-clone pairs that look extremely similar but behave differently. Experiments on software clone detection benchmarks indicate that the proposed approach together with adversarial training outperforms the state-of-the-art approaches for software functional clone detection.
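To make the two ingredients concrete, the following is a minimal sketch of PU risk estimation combined with an adversarial (FGSM-style) perturbation on code-pair embeddings. All names are hypothetical and the loss is a standard non-negative PU surrogate, not necessarily the exact objective used by CDPU.

```python
# Hedged sketch: non-negative PU risk plus adversarial perturbation on
# pair embeddings. PairScorer, nnpu_loss, and adversarial_step are
# illustrative stand-ins, not the CDPU model from the paper.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores a code-pair embedding: >0 suggests 'clone', <0 'non-clone'."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def nnpu_loss(scores_p, scores_u, prior=0.3):
    """Non-negative PU risk with a sigmoid surrogate loss.
    `prior` is the assumed fraction of clones among unlabeled pairs."""
    loss = torch.sigmoid                 # surrogate: l(z) = sigmoid(-z)
    r_p_pos = loss(-scores_p).mean()     # risk on labeled clone pairs
    r_p_neg = loss(scores_p).mean()      # positives treated as negatives
    r_u_neg = loss(scores_u).mean()      # unlabeled treated as negatives
    neg_risk = r_u_neg - prior * r_p_neg
    return prior * r_p_pos + torch.clamp(neg_risk, min=0.0)

def adversarial_step(model, x_p, x_u, eps=0.05):
    """One FGSM-style perturbation in embedding space, hardening the model
    against near-identical pairs that behave differently."""
    x_p = x_p.clone().requires_grad_(True)
    x_u = x_u.clone().requires_grad_(True)
    nnpu_loss(model(x_p), model(x_u)).backward()
    x_p_adv = (x_p + eps * x_p.grad.sign()).detach()
    x_u_adv = (x_u + eps * x_u.grad.sign()).detach()
    return nnpu_loss(model(x_p_adv), model(x_u_adv))  # train on the worst case
```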

Author(s):  
Huihui Wei ◽  
Ming Li

Software clone detection, which aims to identify code fragments with similar functionality, plays an important role in software maintenance and evolution. Many clone detection approaches have been proposed. However, most of them represent source code with hand-crafted features based on lexical or syntactic information, or with unsupervised deep features, which makes it difficult to detect functional clone pairs, i.e., pieces of code with similar functionality that differ at both the syntactic and lexical levels. In this paper, we address the software functional clone detection problem by learning supervised deep features. We formulate clone detection as a supervised learning-to-hash problem and propose an end-to-end deep feature learning framework called CDLH for functional clone detection. The framework learns hash codes by exploiting lexical and syntactic information, enabling fast computation of functional similarity between code fragments. Experiments on software clone detection benchmarks indicate that CDLH is effective and outperforms the state-of-the-art approaches in software functional clone detection.
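The payoff of learning to hash is that similarity checks reduce to bit operations. Below is an illustrative sketch of that final step; the encoder producing the real-valued embeddings is a stand-in, since CDLH learns it end-to-end from lexical and syntactic (AST) information.

```python
# Illustrative sketch of the learning-to-hash idea: binarize learned
# code-fragment embeddings so functional similarity can be compared
# with a cheap Hamming distance. Embeddings here are made up.
import numpy as np

def to_hash_code(embedding: np.ndarray) -> np.ndarray:
    """Binarize a real-valued embedding by sign thresholding."""
    return (embedding > 0).astype(np.uint8)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits; a small distance suggests a clone pair."""
    return int(np.count_nonzero(a != b))

# Usage: two fragments whose (learned) embeddings are close hash alike.
emb1 = np.array([0.9, -0.2, 0.4, -0.7])
emb2 = np.array([0.8, -0.1, 0.3, -0.9])
print(hamming_distance(to_hash_code(emb1), to_hash_code(emb2)))  # -> 0
```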


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Francesca Arcelli Fontana ◽  
Marco Zanoni ◽  
Andrea Ranchetti ◽  
Davide Ranchetti

The literature offers many studies of software clones from different points of view, covering many correlated features and areas that are particularly relevant to software maintenance and evolution. In this paper, we describe our experience with clone detection using three different tools and investigate the impact of clone refactoring on several software quality metrics.


Author(s):  
Xing Hu ◽  
Ge Li ◽  
Xin Xia ◽  
David Lo ◽  
Shuai Lu ◽  
...  

Code summarization, which aims to generate a succinct natural language description of source code, is extremely useful for code search and code comprehension, and it plays an important role in software maintenance and evolution. Previous approaches generate summaries by retrieving them from similar code snippets. However, these approaches rely heavily on whether similar snippets can be retrieved and on how similar they are, and they fail to capture the API knowledge in the source code, which carries vital information about its functionality. In this paper, we propose a novel approach, named TL-CodeSum, which successfully transfers API knowledge learned on a different but related task to code summarization. Experiments on large-scale real-world industry Java projects indicate that our approach is effective and outperforms the state of the art in code summarization.
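The transfer idea can be pictured as reusing an encoder trained on an API-related task inside the summarization model. The sketch below is a hedged illustration; module names, shapes, and the fusion strategy are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of transferring API knowledge into a summarizer:
# an encoder pretrained on a related API task feeds the decoder alongside
# the code encoder. Dimensions and fusion are illustrative only.
import torch
import torch.nn as nn

class ApiEncoder(nn.Module):
    """Encoder assumed to be pretrained on a related API task."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, api_ids):
        _, h = self.rnn(self.embed(api_ids))
        return h  # summary vector carrying API knowledge

class Summarizer(nn.Module):
    """Code encoder plus the transferred API encoder drive one decoder."""
    def __init__(self, code_vocab, api_encoder: ApiEncoder, dim=128):
        super().__init__()
        self.code_embed = nn.Embedding(code_vocab, dim)
        self.code_rnn = nn.GRU(dim, dim, batch_first=True)
        self.api_encoder = api_encoder   # weights learned on the related task
        self.decoder = nn.GRU(dim, 2 * dim, batch_first=True)

    def forward(self, code_ids, api_ids, tgt_embeds):
        _, h_code = self.code_rnn(self.code_embed(code_ids))
        h_api = self.api_encoder(api_ids)
        h0 = torch.cat([h_code, h_api], dim=-1)  # fuse both knowledge sources
        out, _ = self.decoder(tgt_embeds, h0)
        return out
```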


Author(s):  
Chuang Zhang ◽  
Dexin Ren ◽  
Tongliang Liu ◽  
Jian Yang ◽  
Chen Gong

Positive and Unlabeled (PU) learning aims to learn a binary classifier from only positive and unlabeled training data. State-of-the-art methods usually formulate PU learning as a cost-sensitive learning problem, in which every unlabeled example is simultaneously treated as positive and negative with different class weights. However, the ground-truth label of an unlabeled example should be unique, so the existing models inadvertently introduce label noise, which may lead to a biased classifier and deteriorated performance. To solve this problem, this paper proposes a novel algorithm dubbed "Positive and Unlabeled learning with Label Disambiguation" (PULD). We first regard all the unlabeled examples in PU learning as ambiguously labeled, both positive and negative, and then employ a margin-based label disambiguation strategy, which enlarges the margin of the classifier response between the most likely label and the less likely one, to find the unique ground-truth label of each unlabeled example. Theoretically, we derive the generalization error bound of the proposed method by analyzing its Rademacher complexity. Experimentally, we conduct intensive experiments on both benchmark and real-world datasets, and the results clearly demonstrate the superiority of PULD over existing PU learning approaches.
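The margin-enlarging step can be illustrated with a few lines of code. This is a sketch in the spirit of margin-based disambiguation, not the paper's exact formulation.

```python
# Illustrative margin-based disambiguation loss: each unlabeled example
# carries candidate labels {+1, -1}; we treat the label with the larger
# classifier response as the likely one and penalize small margins
# between the two candidates. Names and the hinge form are assumptions.
import torch

def disambiguation_loss(scores_pos, scores_neg, margin=1.0):
    """scores_pos / scores_neg: classifier responses f(x, +1) and f(x, -1)
    for each unlabeled example. Minimizing this loss enlarges the gap
    between the likely and unlikely candidate labels."""
    likely = torch.maximum(scores_pos, scores_neg)
    unlikely = torch.minimum(scores_pos, scores_neg)
    return torch.clamp(margin - (likely - unlikely), min=0.0).mean()
```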


2021 ◽  
Vol 11 (3) ◽  
pp. 1093
Author(s):  
Jeonghyun Lee ◽  
Sangkyun Lee

Convolutional neural networks (CNNs) have achieved tremendous success in solving complex classification problems. Motivated by this success, various compression methods have been proposed for downsizing CNNs to deploy them on resource-constrained embedded systems. However, a new type of vulnerability of compressed CNNs, known as adversarial examples, has recently been discovered; it is critical for security-sensitive systems because adversarial examples can cause CNNs to malfunction and can be crafted easily in many cases. In this paper, we propose a compression framework that produces compressed CNNs robust against such adversarial examples. To achieve this goal, our framework combines pruning and knowledge distillation with adversarial training. We formulate our framework as an optimization problem and provide a solution algorithm based on the proximal gradient method, which is more memory-efficient than the popular ADMM-based compression approaches. In experiments, we show that our framework improves the trade-off between adversarial robustness and compression rate compared to the existing state-of-the-art adversarial pruning approach.
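A proximal-gradient update for sparsity-inducing pruning is simple enough to sketch. The version below, with an L1 proximal operator (soft-thresholding), is illustrative of the general technique and of where the memory advantage over ADMM comes from; it is not the paper's exact algorithm.

```python
# Minimal sketch of one proximal-gradient pruning update: a gradient step
# on the smooth (distillation + adversarial) loss, then the closed-form
# L1 prox, which zeroes small weights without ADMM's auxiliary variables.
import torch

def soft_threshold(w: torch.Tensor, tau: float) -> torch.Tensor:
    """Prox of tau * ||w||_1: shrink weights toward zero, clip small ones."""
    return torch.sign(w) * torch.clamp(w.abs() - tau, min=0.0)

def proximal_step(param: torch.Tensor, grad: torch.Tensor,
                  lr: float = 0.01, l1: float = 0.001) -> torch.Tensor:
    # Gradient step, then the cheap prox; no dual variables are stored,
    # which is the source of the memory savings over ADMM-style methods.
    return soft_threshold(param - lr * grad, lr * l1)
```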


Author(s):  
Yu Zhou ◽  
Yanxiang Tong ◽  
Taolue Chen ◽  
Jin Han

Bug localization is one of the most expensive and time-consuming activities in software maintenance and evolution. To alleviate the workload of developers, numerous methods have been proposed to automate this process and narrow down the scope of buggy files to review. In this paper, we present a novel buggy source-file localization approach that uses information from both bug reports and source files. We leverage the part-of-speech features of bug reports and the invocation relationships among source files, and we integrate an adaptive technique to further optimize performance. The adaptive technique discriminates between Top-1 and Top-N recommendations for a given bug report and consists of two modules: one maximizes the accuracy of the first recommended file, and the other improves the accuracy of the fixed-defect file list. We evaluate our approach on six large-scale open-source projects, i.e., AspectJ, Eclipse, SWT, ZXing, Birt, and Tomcat. Compared to previous work, empirical results show that our approach improves the overall prediction performance in all of these cases. In particular, in terms of Top-1 recommendation accuracy, our approach achieves an improvement from 22.73% to 39.86% for AspectJ, from 24.36% to 30.76% for Eclipse, from 31.63% to 46.94% for SWT, from 40% to 55% for ZXing, from 7.97% to 21.99% for Birt, and from 33.37% to 38.90% for Tomcat.
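One simple way to picture combining textual and structural signals is a weighted score fusion, as in the hypothetical sketch below; the weighting, the use of a max over neighbors, and all names are assumptions for illustration, not the paper's scoring function.

```python
# Hypothetical score fusion for bug localization: combine a report-to-file
# textual similarity with a boost from the file's invocation neighbors.
def localization_score(text_sim: float, invoked_sims: list[float],
                       alpha: float = 0.8) -> float:
    """text_sim: similarity between the bug report and this file
    (e.g., computed over part-of-speech-filtered terms);
    invoked_sims: similarities of files this file invokes or is invoked by."""
    structural = max(invoked_sims, default=0.0)  # strongest related file
    return alpha * text_sim + (1 - alpha) * structural

# Files are then ranked by score and the Top-N candidates recommended.
```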


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Huaping Guo ◽  
Weimei Zhi ◽  
Hongbing Liu ◽  
Mingliang Xu

In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry; the problem concerns the performance of learning algorithms in the presence of data with severe class distribution skew. In this paper, we apply the well-known statistical model of logistic discrimination to this problem and propose a novel method to improve its performance. To fully account for class imbalance, we design a new cost function that takes into account the accuracies of both the positive and negative classes as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing this cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed one performs significantly better on recall, G-mean, F-measure, AUC, and accuracy.
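A cost of this shape can be written down directly with smooth surrogates. The sketch below is one plausible instantiation under stated assumptions (a product of soft per-class accuracies and soft positive precision), not the paper's exact objective.

```python
# Illustrative cost in the spirit described above: reward accuracy on
# both classes plus precision on the positive class, using sigmoid
# surrogates so the cost can be maximized by gradient ascent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X_pos, X_neg, eps=1e-8):
    p_pos = sigmoid(X_pos @ w)           # predicted P(y=+1) on positives
    p_neg = sigmoid(X_neg @ w)           # predicted P(y=+1) on negatives
    acc_pos = p_pos.mean()               # soft positive-class accuracy (recall)
    acc_neg = 1.0 - p_neg.mean()         # soft negative-class accuracy
    tp, fp = p_pos.sum(), p_neg.sum()
    precision = tp / (tp + fp + eps)     # soft positive-class precision
    return acc_pos * acc_neg * precision  # maximize all three jointly
```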


2020 ◽  
Vol 34 (04) ◽  
pp. 3962-3969
Author(s):  
Evrard Garcelon ◽  
Mohammad Ghavamzadeh ◽  
Alessandro Lazaric ◽  
Matteo Pirotta

In many fields, such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better or optimal policy, under the constraint that during the learning process the performance is almost never worse than that of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms on a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
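The conservative play rule common to this family of algorithms is easy to sketch: explore with the optimistic arm only when a lower bound on cumulative reward keeps the budget safe, otherwise fall back to the baseline. The bound computations below are stubbed and the names are illustrative, not CLUCB2's exact rule.

```python
# Sketch of a conservative play rule: play the optimistic (LinUCB) arm
# only if a lower confidence bound on cumulative reward stays within a
# (1 - alpha) fraction of what the baseline would have earned.
def conservative_choice(ucb_arm, baseline_arm, lb_cum_reward_if_ucb,
                        baseline_cum_reward, alpha=0.1):
    """lb_cum_reward_if_ucb: lower confidence bound on total reward if the
    optimistic arm is played now; baseline_cum_reward: reward the baseline
    policy would have accumulated so far."""
    if lb_cum_reward_if_ucb >= (1 - alpha) * baseline_cum_reward:
        return ucb_arm       # safe to explore
    return baseline_arm      # constraint at risk: stay conservative
```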


2020 ◽  
Vol 34 (04) ◽  
pp. 3858-3865
Author(s):  
Huijie Feng ◽  
Chunpeng Wu ◽  
Guoyang Chen ◽  
Weifeng Zhang ◽  
Yang Ning

Recently, smoothing deep neural network based classifiers via isotropic Gaussian perturbation has been shown to be an effective and scalable way to provide a state-of-the-art probabilistic robustness guarantee against ℓ2-norm-bounded adversarial perturbations. However, how to train a good base classifier that is both accurate and robust when smoothed has not been fully investigated. In this work, we derive a new regularized risk in which the regularizer can adaptively encourage the accuracy and robustness of the smoothed counterpart while training the base classifier. It is computationally efficient and can be implemented in parallel with other empirical defense methods. We discuss how to implement it under both the standard (non-adversarial) and adversarial training schemes. We also design a new certification algorithm that can leverage the regularization effect to provide a tighter robustness lower bound that holds with high probability. Extensive experiments demonstrate the effectiveness of the proposed training and certification approaches on the CIFAR-10 and ImageNet datasets.
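For readers unfamiliar with the smoothing construction itself, a minimal Monte Carlo sketch of prediction with a Gaussian-smoothed classifier is shown below. It covers only the standard majority-vote prediction; the paper's regularizer and certification algorithm are out of scope here.

```python
# Minimal Monte Carlo sketch of predicting with an isotropically smoothed
# classifier: the smoothed class is the majority vote of the base
# classifier's predictions under Gaussian input noise.
import torch

def smoothed_predict(base_model, x, sigma=0.25, n_samples=100):
    """x: a single input tensor; returns the majority-vote class index."""
    votes = torch.zeros(0, dtype=torch.long)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)           # isotropic noise
        pred = base_model(noisy.unsqueeze(0)).argmax(dim=-1)
        votes = torch.cat([votes, pred])
    return torch.mode(votes).values.item()                # majority vote
```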

