A Source Code Similarity Based on Siamese Neural Network

Finding similar code snippets is a fundamental task in software engineering. Several approaches to this task use statistical language models, which focus on the syntax and structure of code rather than the deep semantic information underlying it. In this paper, a Siamese Neural Network is proposed that maps code into continuous-space vectors and tries to capture its semantic meaning. First, an unsupervised pre-training method models code snippets as a weighted series of word vectors, with the weights fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to measure the similarity score between pairs of code snippets. We have implemented our approach on a dataset of functionally similar code. The experimental results show that our method achieves some improvement in performance over a single word embedding method.
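
As a rough illustration of two of the steps named in this abstract (TF-IDF-weighted word vectors and cosine similarity), the following minimal Python sketch may help. It does not include the Siamese network itself. The toy embeddings, the toy corpus, the smoothed IDF formula, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(tokens, corpus):
    """Return a TF-IDF weight for each distinct token of one snippet."""
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for tok, count in tf.items():
        df = sum(1 for doc in corpus if tok in doc)
        # Smoothed IDF; the exact variant is an assumption, not the paper's.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[tok] = (count / len(tokens)) * idf
    return weights


def snippet_vector(tokens, corpus, embeddings):
    """Weighted average of the word vectors of a snippet's tokens."""
    weights = tfidf_weights(tokens, corpus)
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for tok, w in weights.items():
        if tok in embeddings:
            for i, x in enumerate(embeddings[tok]):
                vec[i] += w * x
            total += w
    return [v / total for v in vec] if total else vec


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical 3-dimensional word vectors for a handful of code tokens.
EMBEDDINGS = {
    "for": [1.0, 0.0, 0.0],
    "while": [0.9, 0.1, 0.0],
    "sum": [0.0, 1.0, 0.0],
    "total": [0.1, 0.9, 0.0],
    "print": [0.0, 0.0, 1.0],
}
# Hypothetical tokenized snippets.
CORPUS = [
    ["for", "sum", "print"],
    ["while", "total", "print"],
    ["print", "print"],
]

vec_a = snippet_vector(CORPUS[0], CORPUS, EMBEDDINGS)
vec_b = snippet_vector(CORPUS[1], CORPUS, EMBEDDINGS)
vec_c = snippet_vector(CORPUS[2], CORPUS, EMBEDDINGS)
sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
```

In this toy setup the two loop-and-accumulate snippets end up closer to each other than either is to the print-only snippet, because the rarer tokens receive larger TF-IDF weights than the ubiquitous "print".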


A Hybrid Neural Network BERT-Cap Based on Pre-Trained Language Model and Capsule Network for User Intent Classification

Complexity ◽

10.1155/2020/8858852 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Hai Liu ◽

Yuanxia Liu ◽

Leung-Pun Wong ◽

Lap-Kei Lee ◽

Tianyong Hao

Keyword(s):

Neural Network ◽

Neural Network Model ◽

Question Answering ◽

Semantic Information ◽

Language Model ◽

Dialogue System ◽

Hybrid Neural Network ◽

Question Answering System ◽

User Intent ◽

Vital Component

User intent classification is a vital component of a question-answering system or a task-based dialogue system. In order to understand the goals of users’ questions or discourses, the system categorizes user text into a set of pre-defined user intent categories. User questions or discourses are usually short and lack sufficient context; thus, it is difficult to extract deep semantic information from these types of text, and the accuracy of user intent classification may be affected. To better identify user intents in dialogue, this paper proposes BERT-Cap, a hybrid neural network model with focal loss for user intent classification. The model uses multiple transformer encoder blocks to encode user utterances and initializes the encoder parameters with a pre-trained BERT. After utterance encoding, it extracts essential features using a capsule network with dynamic routing. Experimental results on four publicly available datasets show that our BERT-Cap model achieves an F1 score of 0.967 and an accuracy of 0.967, outperforming a number of baseline methods and indicating its effectiveness in user intent classification.
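
The abstract mentions focal loss without stating its form. A minimal sketch of the commonly used multi-class focal loss, FL = -alpha * (1 - p_t)^gamma * log(p_t), is shown below in Python. The values of gamma and alpha, the example logits, and the function names are assumptions for illustration only; the paper's actual settings are not given here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one example with an integer target class."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    gamma and alpha are assumed values, not the paper's settings.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical logits for a 3-intent classifier; the true intent is class 0.
easy_example = [4.0, 0.0, 0.0]   # confidently correct
hard_example = [0.0, 1.0, 0.0]   # leaning toward the wrong class

easy_ratio = focal_loss(easy_example, 0) / cross_entropy(easy_example, 0)
hard_ratio = focal_loss(hard_example, 0) / cross_entropy(hard_example, 0)
```

The ratio of focal loss to cross-entropy is much smaller for the easy example than for the hard one, which is the intended effect: well-classified examples contribute less to training.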


Prospecting the Effect of Topic Modeling in Information Retrieval

International Journal on Semantic Web and Information Systems ◽

10.4018/ijswis.2021070102 ◽

2021 ◽

Vol 17 (3) ◽

pp. 18-34

Author(s):

Aakanksha Sharaff ◽

Jitesh Kumar Dewangan ◽

Dilip Singh Sisodia

Keyword(s):

Information Retrieval ◽

Topic Modeling ◽

Topic Model ◽

Language Model ◽

High Dimensionality ◽

Retrieval Process ◽

Coherence Measure ◽

Retrieval Task ◽

Inverse Document Frequency ◽

Document Frequency

Enormous volumes of records and data are gathered every day, and organizing this data is a challenging task. Topic modeling provides a way to categorize these documents, but the high dimensionality of the corpus affects the result of the topic model, making it important to apply feature selection or an information retrieval process for dimensionality reduction. Efficient topic modeling requires the removal of unrelated words that might otherwise lead to specious coexistence of unrelated words. This paper proposes an efficient framework for generating better topic coherence, in which term frequency-inverse document frequency (TF-IDF) and a parsimonious language model (PLM) are used for the information retrieval task. PLM extracts the important information and expels the general words from the corpus, whereas TF-IDF re-estimates the weight of each word in the corpus. The work carried out in this paper improved the topic coherence measure, providing a better correlation between the actual topics and the topics generated from PLM.
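
To make the idea of a parsimonious language model more concrete, here is a minimal Python sketch of the usual EM re-estimation, in which probability mass is shifted away from words that the background corpus model already explains well. The mixing weight `lam`, the iteration count, the toy data, and the function name are assumptions for illustration; this is not the framework evaluated in the paper.

```python
from collections import Counter


def parsimonious_lm(doc_tokens, corpus_tokens, lam=0.1, iterations=20):
    """Estimate P(t|D) by EM against a background corpus model P(t|C)."""
    tf = Counter(doc_tokens)
    corpus_tf = Counter(corpus_tokens)
    corpus_len = len(corpus_tokens)
    p_corpus = {t: corpus_tf[t] / corpus_len for t in tf}
    # Start from the maximum-likelihood document model.
    p_doc = {t: c / len(doc_tokens) for t, c in tf.items()}
    for _ in range(iterations):
        # E-step: expected count of each term attributed to the document model.
        expected = {}
        for t, c in tf.items():
            doc_part = lam * p_doc[t]
            background_part = (1.0 - lam) * p_corpus[t]
            denom = doc_part + background_part
            expected[t] = c * doc_part / denom if denom > 0 else 0.0
        # M-step: renormalize the expected counts into probabilities.
        total = sum(expected.values())
        if total == 0:
            break
        p_doc = {t: e / total for t, e in expected.items()}
    return p_doc


# Hypothetical document and background corpus (the corpus includes the document).
doc = ["the", "the", "the", "topic", "model", "the", "coherence"]
others = ["the", "market", "rose", "the", "bank", "fell", "the", "rates", "the"]
corpus = doc + others

mle = {t: c / len(doc) for t, c in Counter(doc).items()}
plm = parsimonious_lm(doc, corpus)
```

After re-estimation, the general word "the" carries less probability than under the plain frequency estimate, while document-specific words such as "coherence" carry more.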


Bert-Enhanced Text Graph Neural Network for Classification

Entropy ◽

10.3390/e23111536 ◽

2021 ◽

Vol 23 (11) ◽

pp. 1536

Author(s):

Yiping Yang ◽

Xiaohui Cui

Keyword(s):

Neural Network ◽

Semantic Information ◽

Structural Information ◽

Text Processing ◽

Language Model ◽

Research Direction ◽

Semantic Features ◽

Structure Information ◽

Single Text ◽

Graph Neural Networks

Text classification is a fundamental research direction that aims to assign tags to text units. Recently, graph neural networks (GNN) have exhibited some excellent properties in textual information processing, and pre-trained language models have also achieved promising results in many tasks. However, many text processing methods either cannot model a single text unit’s structure or ignore its semantic features. To solve these problems and comprehensively utilize the text’s structural and semantic information, we propose a Bert-Enhanced text Graph Neural Network model (BEGNN). For each text, we construct a separate text graph according to the co-occurrence relationship of words and use a GNN to extract text features. Moreover, we employ Bert to extract semantic features. The former part takes the structural information into account, and the latter focuses on modeling the semantic information. Finally, we let these two features of different granularity interact and aggregate them to obtain a more effective representation. Experiments on standard datasets demonstrate the effectiveness of BEGNN.
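
The per-text graph construction described in this abstract can be sketched roughly as follows in Python: each distinct word becomes a node, and two words are linked when they occur within a small sliding window of each other. The window size, the use of edge counts as weights, and the function name are assumptions for illustration; the paper's exact construction may differ.

```python
from collections import defaultdict


def build_text_graph(tokens, window_size=3):
    """Build an undirected word co-occurrence graph for a single text.

    Returns the sorted node list and a dict mapping (word_a, word_b) pairs,
    stored in alphabetical order, to their co-occurrence counts.
    """
    nodes = sorted(set(tokens))
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for neighbor in tokens[i + 1 : i + window_size]:
            if neighbor != word:
                edges[tuple(sorted((word, neighbor)))] += 1
    return nodes, dict(edges)


# Hypothetical tokenized text.
text = "graph neural networks model text structure".split()
nodes, edges = build_text_graph(text, window_size=3)
```

With a window of three tokens, "graph" is linked to "neural" and "networks" but not to "model", so the graph keeps only local structure. A GNN would then propagate features along these edges.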


Comparison of Malware Classification Methods using Convolutional Neural Network based on API Call Stream

International Journal of Network Security & Its Applications ◽

10.5121/ijnsa.2021.13201 ◽

2021 ◽

Vol 13 (2) ◽

pp. 1-19

Author(s):

Matthew Schofield ◽

Gulsum Alicioglu ◽

Bo Sun ◽

Russell Binaco ◽

Paul Turner ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Machine Learning Algorithms ◽

Inverse Document Frequency ◽

Detection Techniques ◽

Malware Classification ◽

Document Frequency ◽

Program Interface ◽

Type Classification

Malicious software is constantly being developed and improved, so detection and classification of malware is an ever-evolving problem. Since traditional malware detection techniques fail to detect new or unknown malware, machine learning algorithms have been used to overcome this disadvantage. We present a Convolutional Neural Network (CNN) for malware type classification based on API (Application Program Interface) calls. This research uses a database of 7107 instances of API call streams and 8 different malware types: Adware, Backdoor, Downloader, Dropper, Spyware, Trojan, Virus, Worm. We used a 1-Dimensional CNN by mapping API calls as categorical and term frequency-inverse document frequency (TF-IDF) vectors and compared the results to other classification techniques. The proposed 1-D CNN outperformed other classification techniques with 91% overall accuracy for both categorical and TF-IDF vectors.
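
One plausible reading of "mapping API calls as categorical vectors" is an integer encoding of each call name followed by padding to a fixed length, so that every stream has the same shape before it reaches the CNN. The Python sketch below shows that reading only. The example streams, the reserved `<pad>` and `<unk>` entries, and the function names are assumptions for illustration and do not come from the paper's dataset.

```python
def build_vocab(streams):
    """Assign an integer id to every API call name seen in the training streams."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids (an assumed convention)
    for stream in streams:
        for call in stream:
            vocab.setdefault(call, len(vocab))
    return vocab


def encode(stream, vocab, max_len):
    """Map a call stream to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(call, vocab["<unk>"]) for call in stream[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Made-up API call streams for illustration only.
train_streams = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["RegOpenKeyExW", "CreateFileW"],
]
vocab = build_vocab(train_streams)

short_stream = encode(["CreateFileW", "WriteFile"], vocab, max_len=4)
unseen_call = encode(["RegOpenKeyExW", "Sleep"], vocab, max_len=3)
```

Calls that never appeared in training fall back to the `<unk>` id, which keeps the encoder usable on new samples at the cost of losing their identity.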


Convolutional Neural Network for Malware Classification Based on API Call Sequence

10.5121/csit.2021.110106 ◽

2021 ◽

Author(s):

Matthew Schofield ◽

Gulsum Alicioglu ◽

Russell Binaco ◽

Paul Turner ◽

Cameron Thatcher ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Machine Learning Algorithms ◽

Accuracy Score ◽

Inverse Document Frequency ◽

Detection Techniques ◽

Document Frequency ◽

Program Interface ◽

Call Sequence

Malicious software is constantly being developed and improved, so detection and classification of malicious applications is an ever-evolving problem. Since traditional malware detection techniques fail to detect new or unknown malware, machine learning algorithms have been used to overcome this disadvantage. We present a Convolutional Neural Network (CNN) for malware type classification based on Windows system API (Application Program Interface) calls. This research uses a database of 5385 instances of API call streams, each labeled with the malware type (one of eight) of its source malicious application. We use a 1-Dimensional CNN, mapping API call streams as categorical vectors and as term frequency-inverse document frequency (TF-IDF) vectors, respectively. We achieved accuracy scores of 98.17% using TF-IDF vectors and 95.40% using categorical vectors. The proposed 1-D CNN outperformed other traditional classification techniques with an overall accuracy score of 91.0%.
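
For readers unfamiliar with 1-Dimensional CNNs, the core operation can be sketched in a few lines of Python: a small filter slides along the input sequence, the responses pass through a ReLU, and global max pooling keeps the strongest response. This is a single-filter, single-channel toy, not the architecture from the paper; the input values, the filter, and the function names are assumptions for illustration.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as in deep learning libraries)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    return max(values)


# Hypothetical per-position feature values for one call stream.
sequence = [0.0, 1.0, 2.0, 1.0, 0.0]
# Hypothetical filter that responds to a falling pattern.
kernel = [1.0, 0.0, -1.0]

feature_map = conv1d(sequence, kernel)
activated = relu(feature_map)
pooled = global_max_pool(activated)
```

In a full model, many such filters would run in parallel and their pooled outputs would feed a classifier over the malware types.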


Using Knowledge Graph and Search Query Click Logs in Statistical Language Model for Speech Recognition

10.21437/interspeech.2017-1790 ◽

2017 ◽

Author(s):

Weiwu Zhu

Keyword(s):

Speech Recognition ◽

Language Model ◽

Knowledge Graph ◽

Search Query ◽

Statistical Language Model


Recurrent Neural Network Language Model with Incremental Updated Context Information Generated Using Bag-of-Words Representation

10.21437/interspeech.2016-375 ◽

2016 ◽

Cited By ~ 1

Author(s):

Md. Akmal Haidar ◽

Mikko Kurimo

Keyword(s):

Neural Network ◽

Recurrent Neural Network ◽

Language Model ◽

Context Information ◽

Bag Of Words ◽

Network Language


Intelligent sentence completion based on global context dependent recurrent neural network language model

Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing - AIIPCC '19 ◽

10.1145/3371425.3371431 ◽

2019 ◽

Author(s):

Tao Yang ◽

Hongli Deng

Keyword(s):

Neural Network ◽

Recurrent Neural Network ◽

Language Model ◽

Sentence Completion ◽

Global Context ◽

Context Dependent ◽

Network Language


Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Database ◽

10.1093/database/baz085 ◽

2019 ◽

Vol 2019 ◽

Author(s):

Peter Brown ◽

Aik-Choon Tan ◽

Mohamed A El-Esawi ◽

Thomas Liehr ◽

Oliver Blanck ◽

...

Keyword(s):

Literature Search ◽

Relevant Literature ◽

Biomedical Literature ◽

Medical Subject Headings ◽

Document Similarity ◽

Inverse Document Frequency ◽

Research Fields ◽

Experience Levels ◽

Document Frequency ◽

Systematic Biases

Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
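
Of the baseline methods named in this abstract, Okapi BM25 is compact enough to sketch. The Python snippet below scores a query against a small document collection using a widely used form of the BM25 formula. The parameter values `k1` and `b`, the IDF variant, the toy documents, and the function name are assumptions for illustration and are not taken from the benchmark's implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumed choice).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized titles.
docs = [
    ["gene", "expression", "cancer"],
    ["protein", "folding", "simulation"],
    ["cancer", "gene", "therapy", "trial"],
]
scores = bm25_scores(["gene", "cancer"], docs)
```

The first and third documents both contain the two query terms once, but the shorter one scores higher because of length normalization, and the unrelated document scores zero.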


Research on the Improved Word2Vec Optimization Strategy Based on Statistical Language Model

2020 International Conference on Information Science, Parallel and Distributed Systems (ISPDS) ◽

10.1109/ispds51347.2020.00082 ◽

2020 ◽

Author(s):

Shi Lei

Keyword(s):

Language Model ◽

Optimization Strategy ◽

Statistical Language Model
