Code Clone Detection with Hierarchical Attentive Graph Embedding

Author(s):  
Xiujuan Ji ◽  
Lei Liu ◽  
Jingwen Zhu

Code cloning is a common programming practice that reuses existing code to solve similar programming problems; it greatly facilitates software development but also propagates program bugs and increases maintenance costs. Recently, deep learning-based detection approaches have demonstrated their effectiveness in feature representation and detection performance. Among them, approaches based on the abstract syntax tree (AST) build models that rely on node embedding techniques. In an AST, the semantics of nodes are clearly hierarchical, and nodes differ considerably in how much they matter for determining whether two code fragments are clones. However, some existing approaches do not fully consider the hierarchical structure of source code; others ignore the differing importance of nodes when generating code features; and when the tree is very large and deep, many approaches are vulnerable to the vanishing gradient problem during training. To address these challenges, we propose HAG, a hierarchical attentive graph neural network embedding model for code clone detection. First, an attention mechanism is applied to the nodes of the AST to distinguish their importance during model training. In addition, HAG adopts a graph convolutional network (GCN) to propagate code information over the AST graph and then exploits a hierarchical differentiable pooling GCN to capture code semantics at different structural levels. To evaluate the effectiveness of HAG, we conducted extensive experiments on a public clone dataset and compared it with seven state-of-the-art clone detection models. The experimental results demonstrate that HAG achieves superior detection performance over the baselines. In particular, on Moderately Type-3 and Type-4 clones, HAG clearly outperforms the baselines, indicating its strong capability for detecting semantic clones. We also systematically discuss the impact of hierarchical pooling, the attention mechanism, and critical model parameters.
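
A minimal sketch of the kind of layer the abstract describes, not the authors' released code: attention-weighted node features, one GCN propagation step over the AST graph, and a DiffPool-style hierarchical pooling step. All layer sizes and names (e.g. `n_clusters`, `AttentiveGCNPool`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveGCNPool(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, n_clusters: int):
        super().__init__()
        self.attn = nn.Linear(in_dim, 1)             # per-node importance score
        self.gcn = nn.Linear(in_dim, hid_dim)        # shared GCN weight
        self.assign = nn.Linear(in_dim, n_clusters)  # soft cluster assignment

    def forward(self, x, adj):
        # x: (N, in_dim) AST-node features; adj: (N, N) normalized adjacency
        alpha = torch.sigmoid(self.attn(x))          # weight important nodes
        h = F.relu(self.gcn(adj @ (alpha * x)))      # attentive message passing
        s = torch.softmax(self.assign(x), dim=-1)    # (N, n_clusters)
        # DiffPool-style step: coarsen features and adjacency one level up
        x_pool = s.t() @ h                           # (n_clusters, hid_dim)
        adj_pool = s.t() @ adj @ s                   # (n_clusters, n_clusters)
        return x_pool, adj_pool

# toy usage: a 6-node AST graph pooled to 2 clusters
x = torch.randn(6, 16)
adj = torch.eye(6)
layer = AttentiveGCNPool(16, 32, 2)
x2, adj2 = layer(x, adj)  # stacking such layers yields graph-level features
```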

2020 ◽  
Author(s):  
Yuan Yuan ◽  
Lei Lin

Satellite image time series (SITS) classification is a major research topic in remote sensing and is relevant for a wide range of applications. Deep learning approaches have been commonly employed for SITS classification and have provided state-of-the-art performance. However, deep learning methods suffer from overfitting when labeled data is scarce. To address this problem, we propose a novel self-supervised pre-training scheme to initialize a Transformer-based network by utilizing large-scale unlabeled data. Specifically, the model is asked to predict randomly contaminated observations given the entire time series of a pixel. The main idea of our proposal is to leverage the inherent temporal structure of satellite time series to learn general-purpose spectral-temporal representations related to land cover semantics. Once pre-training is completed, the pre-trained network can be further adapted to various SITS classification tasks by fine-tuning all the model parameters on small-scale task-related labeled data. In this way, the general knowledge and representations about SITS can be transferred to a label-scarce task, thereby improving the generalization performance of the model as well as reducing the risk of overfitting. Comprehensive experiments have been carried out on three benchmark datasets over large study areas. Experimental results demonstrate the effectiveness of the proposed method, yielding classification accuracy improvements of 1.91% to 6.69%. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
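
A hedged sketch of the self-supervised objective described above: corrupt a random subset of observations in each pixel time series and train a Transformer encoder to reconstruct them. The shapes, corruption rate, and noise model are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

seq_len, n_bands, d_model = 24, 10, 64          # time steps x spectral bands
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Linear(n_bands, d_model)
head = nn.Linear(d_model, n_bands)              # reconstruct band values

x = torch.randn(32, seq_len, n_bands)           # a batch of pixel time series
mask = torch.rand(32, seq_len) < 0.15           # positions to contaminate
x_corrupt = x.clone()
x_corrupt[mask] += torch.randn_like(x_corrupt[mask])  # inject random noise

pred = head(encoder(embed(x_corrupt)))          # predict original observations
loss = ((pred - x)[mask] ** 2).mean()           # MSE only at corrupted steps
loss.backward()
```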


Author(s):  
Tianyi Zhao ◽  
Yang Hu ◽  
Liang Cheng

Abstract Motivation: Functional changes in genes, RNAs, and proteins are eventually reflected at the metabolic level. A growing number of researchers have studied disease mechanisms, biomarkers, and targeted drugs through metabolites. However, compared with our knowledge of genes, RNAs, and proteins, we still know little about disease-related metabolites. The few existing methods for identifying disease-related metabolites ignore the chemical structure of metabolites, fail to recognize the association patterns between metabolites and diseases, and cannot be applied to isolated diseases and metabolites. Results: In this study, we present a graph deep learning-based method, named Deep-DRM, for identifying disease-related metabolites. First, the chemical structures of metabolites were used to calculate metabolite similarities, and disease similarities were obtained from their functional gene networks and semantic associations, so that both a metabolite network and a disease network could be built. Next, a graph convolutional network (GCN) was applied to encode the features of metabolites and diseases, respectively. The dimensionality of these features was then reduced by principal component analysis (PCA), retaining 99% of the information. Finally, a deep neural network was built to identify true metabolite-disease pairs (MDPs) from these features. Ten-fold cross-validation on three testing setups showed an outstanding AUC (0.952) and AUPR (0.939) for Deep-DRM compared with previous methods and similar approaches. Ten of the top 15 predicted disease-metabolite associations are supported by other studies, suggesting that Deep-DRM is an effective method for identifying MDPs. Contact: [email protected]. Availability and implementation: https://github.com/zty2009/GPDNN-for-Identifying-Disease-related-Metabolites.
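
A minimal sketch of the pipeline the abstract outlines (GCN-style encoding of the two similarity networks, PCA retaining 99% of the variance, then a feed-forward classifier on pair features), not the Deep-DRM release. All sizes, the random similarity matrices, and the placeholder labels are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_met, n_dis, dim = 50, 30, 64

def gcn_encode(sim, feats, w):
    # one propagation step over a row-normalized similarity graph, with ReLU
    adj = sim / sim.sum(axis=1, keepdims=True)
    return np.maximum(adj @ feats @ w, 0.0)

met_emb = gcn_encode(rng.random((n_met, n_met)), rng.random((n_met, dim)),
                     rng.random((dim, dim)))
dis_emb = gcn_encode(rng.random((n_dis, n_dis)), rng.random((n_dis, dim)),
                     rng.random((dim, dim)))

# build pair features for random (metabolite, disease) candidates
pairs = rng.integers(0, [n_met, n_dis], size=(200, 2))
X = np.hstack([met_emb[pairs[:, 0]], dis_emb[pairs[:, 1]]])
y = rng.integers(0, 2, size=200)                 # placeholder MDP labels

X_red = PCA(n_components=0.99).fit_transform(X)  # keep 99% of the variance
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300).fit(X_red, y)
print(clf.predict_proba(X_red[:5]))              # scored candidate pairs
```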


Sensors ◽  
2021 ◽  
Vol 22 (1) ◽  
pp. 157
Author(s):  
Saidrasul Usmankhujaev ◽  
Bunyodbek Ibrokhimov ◽  
Shokhrukh Baydadaev ◽  
Jangwoo Kwon

Deep neural networks (DNNs) have proven efficient in computer vision and data classification, with an increasing number of successful applications. Time series classification (TSC) has been one of the challenging problems in data mining over the last decade, and substantial research has proposed various solutions, including algorithm-based approaches as well as machine and deep learning approaches. This paper focuses on combining two well-known deep learning techniques, the Inception module and the fully convolutional network (FCN). The proposed method proves more efficient than the previous state-of-the-art InceptionTime method. We tested our model on the univariate TSC benchmark (the UCR/UEA archive), which includes 85 time series datasets, and showed that our network outperforms InceptionTime in both training time and overall accuracy on the UCR archive.
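
A hedged sketch of the combination the paper describes: an Inception-style multi-kernel block feeding FCN-style convolutional layers with global average pooling. Kernel sizes, channel counts, and depth are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # parallel convolutions with different receptive fields
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in (9, 19, 39)
        )
        self.bn = nn.BatchNorm1d(out_ch * 3)

    def forward(self, x):
        return torch.relu(self.bn(torch.cat([b(x) for b in self.branches], 1)))

class InceptionFCN(nn.Module):
    def __init__(self, in_ch: int, n_classes: int):
        super().__init__()
        self.inception = InceptionBlock(in_ch, 32)
        self.fcn = nn.Sequential(                 # FCN-style trunk
            nn.Conv1d(96, 128, 8, padding=4), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, 5, padding=2), nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                         # x: (batch, channels, length)
        h = self.fcn(self.inception(x))
        return self.head(h.mean(dim=-1))          # global average pooling

logits = InceptionFCN(1, 5)(torch.randn(8, 1, 128))  # univariate toy series
```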


2020 ◽  
Author(s):  
Aaron Nicolson ◽  
Kuldip K. Paliwal

The estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three main categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) training targets. In this study, we aim to determine which training target produces enhanced/separated speech at the highest quality and intelligibility, and which is most suitable as a front-end for robust ASR. The training targets were evaluated using a temporal convolutional network (TCN) on the DEMAND Voice Bank and Deep Xi datasets, which include real-world non-stationary and coloured noise sources at multiple SNR levels. Seven objective measures were used, including the word error rate (WER) of the Deep Speech ASR system. We find that MMSE training targets produce the highest objective quality scores. We also find that CASA training targets, in particular the ideal ratio mask (IRM), produce the highest intelligibility scores and perform best as a front-end for robust ASR.
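
As a concrete illustration of the CASA category named above, a minimal sketch of the ideal ratio mask (IRM) computed from parallel clean-speech and noise power spectra; the spectrogram dimensions and random signals are illustrative assumptions.

```python
import numpy as np

def ideal_ratio_mask(clean_ps: np.ndarray, noise_ps: np.ndarray) -> np.ndarray:
    # IRM = sqrt(|S|^2 / (|S|^2 + |N|^2)), per time-frequency unit
    return np.sqrt(clean_ps / (clean_ps + noise_ps + 1e-12))

# toy power spectra: (frames, frequency bins)
clean_ps = np.abs(np.random.randn(100, 257)) ** 2
noise_ps = np.abs(np.random.randn(100, 257)) ** 2
irm = ideal_ratio_mask(clean_ps, noise_ps)   # values in [0, 1]
# at inference, an estimated mask scales the noisy MS: |X| * irm ~ |S|
```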


2021 ◽  
Author(s):  
Sujan Kumar Roy ◽  
Aaron Nicolson ◽  
Kuldip K. Paliwal

Current augmented Kalman filter (AKF)-based speech enhancement algorithms utilise a temporal convolutional network (TCN) to estimate the clean speech and noise linear prediction coefficients (LPCs). However, the multi-head attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than TCNs. Motivated by this, we investigate the MHANet for LPC estimation, aiming to produce clean speech and noise LPC parameters with the least bias to date, and in turn enhanced speech of higher quality and intelligibility than any current Kalman filter (KF)- or AKF-based speech enhancement algorithm (SEA). We investigate the MHANet within the DeepLPC framework, a deep learning framework for jointly estimating the clean speech and noise LPC power spectra. DeepLPC is selected because it exhibits significantly less bias than other frameworks, avoiding the use of whitening filters and post-processing. DeepLPC-MHANet is evaluated on the NOIZEUS corpus using subjective AB listening tests, as well as seven objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR), and is compared with five existing deep learning-based methods. Compared to other deep learning approaches, DeepLPC-MHANet produced clean speech LPC estimates with the least amount of bias. DeepLPC-MHANet-AKF also produced higher objective scores than any of the competing methods (an improvement of 0.17 for CSIG, 0.15 for CBAK, 0.19 for COVL, 0.24 for PESQ, 3.70% for STOI, 1.03 dB for SegSNR, and 1.04 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC-MHANet-AKF was also the most preferred amongst ten listeners. By producing LPC estimates with the least amount of bias to date, DeepLPC-MHANet enables the AKF to produce enhanced speech of higher quality and intelligibility than any previous method.
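
A hedged sketch of the core idea: a multi-head attention block over noisy speech frames whose output is mapped jointly to clean speech and noise LPC power spectra, in the spirit of the DeepLPC framework. The frame dimension, head count, and output sizes are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class MHALPCEstimator(nn.Module):
    def __init__(self, n_feats=257, d_model=128, n_heads=8, lpc_bins=257):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # joint outputs: clean speech and noise LPC power spectra per frame
        self.out = nn.Linear(d_model, 2 * lpc_bins)

    def forward(self, noisy_ms):                # (batch, frames, n_feats)
        h = self.proj(noisy_ms)
        a, _ = self.mha(h, h, h)                # long-term frame dependencies
        h = self.norm(h + a)                    # residual + layer norm
        speech_ps, noise_ps = self.out(h).chunk(2, dim=-1)
        return speech_ps, noise_ps              # fed to the AKF downstream

speech_ps, noise_ps = MHALPCEstimator()(torch.randn(4, 50, 257))
```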


2020 ◽  
Author(s):  
Yuan Yuan ◽  
Lei Lin

Satellite image time series (SITS) classification is a major research topic in remote sensing and is relevant for a wide range of applications. Deep learning approaches have been commonly employed for SITS classification and have provided state-of-the-art performance. However, deep learning methods suffer from overfitting when labeled data is scarce. To address this problem, we propose a novel self-supervised pre-training scheme to initialize a Transformer-based network by utilizing large-scale unlabeled data. Specifically, the model is asked to predict randomly contaminated observations given the entire time series of a pixel. The main idea of our proposal is to leverage the inherent temporal structure of satellite time series to learn general-purpose spectral-temporal representations related to land cover semantics. Once pre-training is completed, the pre-trained network can be further adapted to various SITS classification tasks by fine-tuning all the model parameters on small-scale task-related labeled data. In this way, the general knowledge and representations about SITS can be transferred to a label-scarce task, thereby improving the generalization performance of the model as well as reducing the risk of overfitting. Comprehensive experiments have been carried out on three benchmark datasets over large study areas. Experimental results demonstrate the effectiveness of the proposed method, yielding classification accuracy improvements of 2.38% to 5.27%. The code and the pre-trained model will be available at https://github.com/linlei1214/SITS-BERT upon publication. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
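
Complementing the pre-training sketch given after the earlier version of this abstract, a hedged sketch of the fine-tuning stage described here: reuse the pre-trained encoder, attach a classification head, and update all parameters on a small labeled set. The `embed`/`encoder` modules and all sizes follow the earlier illustrative example and are assumptions, not the SITS-BERT release.

```python
import torch
import torch.nn as nn

seq_len, n_bands, d_model, n_classes = 24, 10, 64, 8
embed = nn.Linear(n_bands, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
# in practice embed/encoder would be loaded from a pre-trained checkpoint,
# e.g. encoder.load_state_dict(torch.load("pretrained.pt"))  # hypothetical file
classifier = nn.Linear(d_model, n_classes)      # new task-specific head

params = list(embed.parameters()) + list(encoder.parameters()) \
       + list(classifier.parameters())
opt = torch.optim.Adam(params, lr=1e-4)         # fine-tune ALL parameters

x = torch.randn(16, seq_len, n_bands)           # small labeled batch
y = torch.randint(0, n_classes, (16,))
logits = classifier(encoder(embed(x)).mean(dim=1))  # pool over time steps
loss = nn.functional.cross_entropy(logits, y)
opt.zero_grad(); loss.backward(); opt.step()
```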


2020 ◽  
Vol 11 (2) ◽  
pp. 49-75
Author(s):  
Amandeep Kaur ◽  
Sandeep Sharma ◽  
Munish Saini

Code clones are code snippets that are copied and pasted with or without modification. In recent years, traditional clone detection approaches have been combined with techniques from other domains for better detection. This paper presents a systematic literature review of machine learning techniques used in code clone detection. The study provides insights into the various tools and techniques developed for clone detection using machine learning approaches and into how effectively those tools and techniques identify clones. The authors performed a systematic literature review of studies selected from popular computer science digital online databases, covering January 2004 to January 2020. The software systems and datasets used for evaluating the tools and techniques are described. Neural networks are the machine learning technique most commonly used for clone identification. Clone detection based on program dependency graphs should be explored in the future because it captures semantic information about code fragments.


2021 ◽  
Vol 46 (3) ◽  
pp. 24-25
Author(s):  
Armijn Hemel ◽  
Karl Trygve Kalleberg ◽  
Rob Vermaas ◽  
Eelco Dolstra

Ten years ago, we published the article "Finding Software License Violations through Binary Code Clone Detection" at the MSR 2011 conference. Our paper was motivated by the tendency of embedded hardware vendors to release only binary blobs of their firmware, often violating the licensing terms of open-source software present inside those blobs. The techniques presented in our paper were designed to accurately identify open-source code hidden inside binary blobs. Here, we give our perspectives on the impact of our work, both industrially and academically, and revisit the original problem statement to see what has happened in the field of open-source compliance in the intervening decade.

