Deep Graph Matching and Searching for Semantic Code Retrieval

2021 ◽  
Vol 15 (5) ◽  
pp. 1-21
Author(s):  
Xiang Ling ◽  
Lingfei Wu ◽  
Saizhuo Wang ◽  
Gaoning Pan ◽  
Tengfei Ma ◽  
...  

Code retrieval is the task of finding, in a large corpus of source code repositories, the code snippet that best matches a natural language query. Recent work mainly uses natural language processing techniques to process both query texts (i.e., human natural language) and code snippets (i.e., machine programming language), but neglects the deep structured features of query texts and source code, both of which contain rich semantic information. In this article, we propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks for the task of semantic code retrieval. To this end, we first represent both natural language query texts and programming language code snippets as unified graph-structured data, and then use the proposed graph matching and searching model to retrieve the best matching code snippet. In particular, DGMS not only captures more structural information for individual query texts or code snippets, but also learns the fine-grained similarity between them through cross-attention based semantic matching operations. We evaluate the proposed DGMS model on two public code retrieval datasets covering two representative programming languages (i.e., Java and Python). Experimental results demonstrate that DGMS outperforms state-of-the-art baseline models by a large margin on both datasets. Moreover, our extensive ablation studies systematically investigate and illustrate the impact of each part of DGMS.
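The cross-attention matching step can be pictured with a short sketch. The following Python (PyTorch) fragment is a minimal illustration, assuming node embeddings have already been produced by a GNN for each graph; the shapes, mean pooling, and cosine scoring are simplifying assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of cross-attention semantic matching between the node
# embeddings of a query-text graph and a code-snippet graph (assumptions,
# not the exact DGMS model).
import torch
import torch.nn.functional as F

def cross_attention_match(text_nodes, code_nodes):
    """text_nodes: (n, d), code_nodes: (m, d) node embeddings from a GNN."""
    # Pairwise attention scores between every text node and every code node.
    scores = text_nodes @ code_nodes.T                   # (n, m)
    # Each text node attends over code nodes, and vice versa.
    text_ctx = F.softmax(scores, dim=1) @ code_nodes     # (n, d)
    code_ctx = F.softmax(scores.T, dim=1) @ text_nodes   # (m, d)
    # Pool each side (mean pooling here) and score with cosine similarity.
    t = torch.cat([text_nodes, text_ctx], dim=1).mean(dim=0)
    c = torch.cat([code_nodes, code_ctx], dim=1).mean(dim=0)
    return F.cosine_similarity(t, c, dim=0)

sim = cross_attention_match(torch.randn(12, 64), torch.randn(30, 64))
```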

2020 ◽  
Vol 10 (17) ◽  
pp. 5804
Author(s):  
Shengwen Li ◽  
Renyao Chen ◽  
Bo Wan ◽  
Junfang Gong ◽  
Lin Yang ◽  
...  

Word embedding is an important foundation for natural language processing tasks; it generates distributed representations of words from large amounts of text data. Recent evidence demonstrates that introducing sememe knowledge is a promising strategy for improving the performance of word embedding. However, previous works ignored the structural information of sememe knowledge. To fill the gap, this study implicitly synthesizes the structural features of sememes into word embedding models using an attention mechanism. Specifically, we propose a novel double attention word-based embedding (DAWE) model that encodes the characteristics of sememes into words through a "double attention" strategy. DAWE is integrated with two specific word training models through context-aware semantic matching techniques. The experimental results show that, on word similarity and word analogy reasoning tasks, the performance of word embedding can be effectively improved by synthesizing the structural information of sememe knowledge. A case study also verifies the power of the DAWE model in the word sense disambiguation task. Furthermore, DAWE is a general framework for encoding sememes into words and can be integrated into other existing word embedding models, providing more options for various downstream natural language processing tasks.
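As a rough illustration of attention over sememes, here is a minimal Python (PyTorch) sketch; the "double attention" wiring shown (context attends over sememes, then arbitrates between the word embedding and the sememe summary) is an assumption for exposition, not the exact DAWE architecture.

```python
# Sketch: a context vector attends over a word's sememe embeddings, then a
# second attention mixes the word embedding with the sememe summary.
import torch
import torch.nn.functional as F

def sememe_attention(context, sememes):
    """context: (d,) context vector; sememes: (k, d) sememe embeddings."""
    weights = F.softmax(sememes @ context, dim=0)   # (k,) attention weights
    return weights @ sememes                        # (d,) weighted sememe summary

def double_attention(context, word_emb, sememes):
    s = sememe_attention(context, sememes)
    # Second attention: the context arbitrates between the word embedding
    # and its sememe-derived representation.
    stack = torch.stack([word_emb, s])              # (2, d)
    alpha = F.softmax(stack @ context, dim=0)       # (2,)
    return alpha @ stack                            # (d,) sememe-aware embedding

emb = double_attention(torch.randn(100), torch.randn(100), torch.randn(5, 100))
```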


AERA Open ◽  
2021 ◽  
Vol 7 ◽  
pp. 233285842110286
Author(s):  
Kylie L. Anglin ◽  
Vivian C. Wong ◽  
Arielle Boguslav

Though there is widespread recognition of the importance of implementation research, evaluators often face intense logistical, budgetary, and methodological challenges in their efforts to assess intervention implementation in the field. This article proposes a set of natural language processing techniques called semantic similarity as an innovative and scalable method of measuring implementation constructs. Semantic similarity methods are an automated approach to quantifying the similarity between texts. By applying semantic similarity to transcripts of intervention sessions, researchers can determine whether an intervention was delivered with adherence to a structured protocol, and the extent to which an intervention was replicated with consistency across sessions, sites, and studies. This article provides an overview of semantic similarity methods, describes their application within the context of educational evaluations, and provides a proof of concept using an experimental study of the impact of a standardized teacher coaching intervention.
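As a concrete illustration of the idea, the following Python sketch scores each session transcript against a protocol script with TF-IDF cosine similarity; the protocol text and transcripts are invented placeholders, and TF-IDF is one simple stand-in among the semantic similarity methods the article surveys.

```python
# Sketch: adherence as semantic similarity between a protocol script and
# session transcripts (TF-IDF cosine similarity as a simple stand-in).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

protocol = "Begin with the warm-up routine, then model the target skill."
transcripts = [
    "We started with the warm-up routine and then I modeled the skill.",
    "Today we skipped ahead to independent practice instead.",
]

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform([protocol] + transcripts)
# Similarity of each session transcript to the protocol (row 0).
adherence_scores = cosine_similarity(matrix[0], matrix[1:])[0]
print(adherence_scores)  # higher = closer to the scripted protocol
```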


Author(s):  
Clifford Nangle ◽  
Stuart McTaggart ◽  
Margaret MacLeod ◽  
Jackie Caldwell ◽  
Marion Bennie

ABSTRACT

Objectives: The Prescribing Information System (PIS) datamart, hosted by NHS National Services Scotland, receives around 90 million electronic prescription messages per year from GP practices across Scotland. Prescription messages contain information including drug name, quantity and strength stored as coded, machine readable data, while prescription dose instructions are unstructured free text and difficult to interpret and analyse in volume. The aim, using Natural Language Processing (NLP), was to extract drug dose amount, unit and frequency metadata from freely typed text in dose instructions to support calculating the intended number of days' treatment. This then allows comparison with actual prescription frequency, treatment adherence and the impact upon prescribing safety and effectiveness.

Approach: An NLP algorithm was developed using the Ciao implementation of Prolog to extract dose amount, unit and frequency metadata from dose instructions held in the PIS datamart for drugs used in the treatment of gastrointestinal, cardiovascular and respiratory disease. Accuracy estimates were obtained by randomly sampling 0.1% of the distinct dose instructions from source records and comparing these with metadata extracted by the algorithm; an iterative approach was used to modify the algorithm to increase accuracy and coverage.

Results: The NLP algorithm was applied to 39,943,465 prescription instructions issued in 2014, consisting of 575,340 distinct dose instructions. For drugs used in the gastrointestinal, cardiovascular and respiratory systems (i.e. chapters 1, 2 and 3 of the British National Formulary (BNF)) the NLP algorithm successfully extracted drug dose amount, unit and frequency metadata from 95.1%, 98.5% and 97.4% of prescriptions respectively. However, instructions containing terms such as 'as directed' or 'as required' reduce the usability of the metadata by making it difficult to calculate the total dose intended for a specific time period: 7.9%, 0.9% and 27.9% of dose instructions contained terms meaning 'as required', while 3.2%, 3.7% and 4.0% contained terms meaning 'as directed', for drugs used in BNF chapters 1, 2 and 3 respectively.

Conclusion: The NLP algorithm developed can extract dose, unit and frequency metadata from text found in prescriptions issued to treat a wide range of conditions, and this information may be used to support calculating treatment durations, medicines adherence and cumulative drug exposure. The presence of terms such as 'as required' and 'as directed' has a negative impact on the usability of the metadata, and further work is required to determine the level of impact this has on calculating treatment durations and cumulative drug exposure.
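The study's extractor was implemented in Ciao Prolog; as a rough illustration of the extraction task itself, the following Python sketch pulls dose amount, unit and frequency out of free-text instructions with a regular expression. The pattern and vocabulary are toy assumptions, nowhere near the coverage of the actual algorithm.

```python
# Sketch: extract dose amount, unit and frequency metadata from free-text
# dose instructions; 'as directed'/'as required' cases yield no metadata.
import re

PATTERN = re.compile(
    r"(?P<amount>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>tablet|capsule|ml|mg|puff)s?\b.*?"
    r"(?P<frequency>once|twice|\d+\s+times)\s+(?:a|per)\s+day",
    re.IGNORECASE,
)

def parse_dose(instruction):
    m = PATTERN.search(instruction)
    return m.groupdict() if m else None

print(parse_dose("take 2 tablets twice a day"))
# {'amount': '2', 'unit': 'tablet', 'frequency': 'twice'}
print(parse_dose("use as directed"))  # None: no computable daily dose
```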


Author(s):  
Iraj Mantegh ◽  
Nazanin S. Darbandi

Robotic alternatives to many manual operations fall short in application because of the difficulty of capturing the manual skill of an expert operator. One of the main problems to be solved if robots are to become flexible enough for various manufacturing needs is end-user programming: an end-user with little or no technical expertise in robotics needs to be able to communicate a manufacturing task to the robot efficiently. This paper proposes a new method for robot task planning using concepts from Artificial Intelligence. Our method is based on a hierarchical knowledge representation and propositional logic, which allows an expert user to incrementally integrate process and geometric parameters with the robot commands. The objective is to provide an intelligent, programmable agent such as a robot with a knowledge base about the attributes of human behaviors in order to facilitate the commanding process. The focus of this work is on robot programming for manufacturing applications, where industrial manipulators are typically driven by low-level programming languages. This work presents a new method based on Natural Language Processing (NLP) that allows a user to generate robot programs using a natural language lexicon and task information, as sketched below. This will enable a manufacturing operator (for example, a painter) who may be unfamiliar with robot programming to easily employ the agent for manufacturing tasks.
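The lexicon-driven idea can be pictured with a minimal Python sketch; the task lexicon and the robot command names below are hypothetical illustrations, not the paper's actual representation.

```python
# Sketch: map a natural-language task description to low-level robot
# primitives via a task lexicon (all names are illustrative placeholders).
TASK_LEXICON = {
    "paint":  ["MOVE_TO(surface)", "SPRAY_ON", "SWEEP(surface)", "SPRAY_OFF"],
    "polish": ["MOVE_TO(surface)", "TOOL_ON", "SWEEP(surface)", "TOOL_OFF"],
}

def plan_from_text(sentence):
    """Pick out a known task verb and expand it into robot primitives."""
    for verb, program in TASK_LEXICON.items():
        if verb in sentence.lower():
            return program
    raise ValueError("no known task verb found")

print(plan_from_text("Please paint the left door panel"))
```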


2020 ◽  
Vol 10 (8) ◽  
pp. 2824
Author(s):  
Yu-Hsiang Su ◽  
Ching-Ping Chao ◽  
Ling-Chien Hung ◽  
Sheng-Feng Sung ◽  
Pei-Ju Lee

Electronic medical records (EMRs) have been used extensively in most medical institutions in Taiwan for more than a decade. However, information overload associated with the rapid accumulation of large amounts of clinical narratives has threatened the effective use of EMRs. This situation is further worsened by "copying and pasting", which leaves a great deal of redundant information in clinical notes. This study aimed to apply natural language processing techniques to address this problem. New information in longitudinal clinical notes was identified based on a bigram language model. The accuracy of automated identification of new information was evaluated using expert annotations as the reference standard. A two-stage cross-over user experiment was conducted to evaluate the impact of highlighting new information on task demands, task performance, and perceived workload. The automated method identified new information with an F1 score of 0.833. The user experiment found a significant decrease in perceived workload together with significantly higher task performance. In conclusion, automated identification of new information in clinical notes is feasible and practical. Highlighting new information enables healthcare professionals to grasp key information from clinical notes with less perceived workload.
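A minimal Python sketch of the bigram-based idea follows, assuming whitespace tokenization and an illustrative overlap threshold; the study's exact language model and thresholds may differ.

```python
# Sketch: flag a span in the latest note as new information when few of its
# word bigrams appear anywhere in the patient's earlier notes.
def bigrams(text):
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def is_new(sentence, prior_notes, threshold=0.5):
    seen = set().union(*(bigrams(n) for n in prior_notes))
    grams = bigrams(sentence)
    if not grams:
        return False
    overlap = len(grams & seen) / len(grams)
    return overlap < threshold  # mostly unseen bigrams -> new information

prior = ["patient admitted with chest pain", "started on aspirin daily"]
print(is_new("new onset atrial fibrillation noted today", prior))  # True
print(is_new("patient admitted with chest pain", prior))           # False
```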


Author(s):  
Nikolina Stanić Loknar ◽  
Diana Bratić ◽  
Ana Agić ◽  
...  

Kinetic typography, or text in motion, is a method of animating characters that takes the form of a video rather than a "static" form such as a picture, poster or book. The most important element in designing kinetic typography is the choice of font; one should also consider the letter cut, the size and color of the characters, and the background color against which the animation takes place. Kinetic typography can be created in various ways, most often using software that applies a multitude of effects to the text or letter characters to create dynamic solutions. The effects range from the simplest, such as "fade-in" and "fade-out" (text entering and exiting the frame), through static characters that expand, narrow, move slowly or rapidly, and grow and change in a variety of ways, to very complex compositions in which the author builds an entire story or promotional video by carefully combining software capabilities. However, every such software package has its limitations, and for this reason the kinetic typography presented in this paper is programmed in code.

From the wide range of available programming languages, Processing was chosen for its simple interface, which does not require advanced programming concepts and gives excellent results in the field of kinetic typography. The Processing programming language is intended for generating and modifying graphics and is based on the Java programming language. The most important difference between Processing and Java is that Processing offers a simple programming interface that does not require advanced constructs such as classes or objects, while still allowing advanced users to employ them. Processing supports a variety of typography rendering approaches, both raster and vector, and allows typography to be programmed and displayed on the Web independently of the user's Web browser and font database. It also enables interactive use of visual elements, including typographic ones: the user is no longer a passive observer but actively participates in the behavior of the application, whose final appearance is not predefined but arises from the actions of each individual user.

For the purposes of this paper, individual letters were created in a font-making program. The letters span various typographic classifications and cuts, and their variety contributes to the attractiveness of the animation. The motion typography in this paper was created in Processing: program code manipulates words, letters, or parts of characters to create visual effects that hold the viewer's attention and convey the desired message or emotion. There are no strict rules or patterns for making kinetic typography; each author determines his or her own rules and methods of production, and no two solutions are the same.
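To give a rough flavor of such a frame-by-frame effect, here is a minimal fade sketch written in Python with the standard tkinter library as a stand-in for a Processing sketch; the text, font, and timings are illustrative placeholders.

```python
# Sketch: a fade-in/fade-out text effect driven by a simple frame loop,
# mirroring the per-frame manipulation typical of kinetic typography.
import tkinter as tk

root = tk.Tk()
canvas = tk.Canvas(root, width=480, height=200, bg="white")
canvas.pack()
text_id = canvas.create_text(240, 100, text="KINETIC", font=("Helvetica", 48))

frame = 0
def tick():
    global frame
    # Fade the text between white (15) and black (0), one step per frame.
    level = abs(15 - frame % 31)
    canvas.itemconfig(text_id, fill=f"#{level:x}{level:x}{level:x}")
    frame += 1
    root.after(60, tick)  # schedule the next frame (~16 fps)

tick()
root.mainloop()
```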


Background/Objectives: In the field of software development, the diversity of programming languages increases dramatically along with their complexity. This leads both programmers and researchers to develop and investigate automated tools to distinguish between programming languages. Various efforts have approached this task using keywords from the source code of these languages. Therefore, instead of relying on keyword classification for recognition, this work investigates the ability to detect the characteristic pattern of a programming language by using NeMo (a high-performance spiking neural network simulator), and tests the ability of this toolkit to provide detailed, analyzable results. Methods/Statistical analysis: These objectives are pursued using a backpropagation neural network via NeMo, based on a pattern recognition methodology. Findings: The results show that the NeMo neural network for pattern recognition can identify and recognize the pattern of the Python programming language with high accuracy. They also show the ability of the NeMo toolkit to represent analyzable results as a percentage of certainty. Improvements/Applications: The results indicate that the NeMo simulator provides a beneficial platform for studying and analyzing the complexity of the backpropagation neural network model.
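NeMo itself is a spiking-network simulator, so the sketch below is only a generic stand-in for the underlying idea: a backpropagation network recognizing a language's pattern from source text. It uses a small scikit-learn MLP over token counts, with predict_proba echoing the "percentage of certainty" output; the snippets and features are toy assumptions.

```python
# Sketch: a backpropagation-trained classifier distinguishing Python from
# Java source snippets by token-frequency patterns (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

snippets = [
    "def main(): print('hi')",                    # Python
    "import os\nfor i in range(3): pass",         # Python
    "public static void main(String[] args) {}",  # Java
    "System.out.println(\"hi\");",                # Java
]
labels = ["python", "python", "java", "java"]

vec = CountVectorizer(token_pattern=r"\w+")
X = vec.fit_transform(snippets)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, labels)

test = vec.transform(["print('hello')"])
print(clf.predict(test))        # predicted language
print(clf.predict_proba(test))  # certainty per class
```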


Author(s):  
Allan Fong ◽  
Nicholas Scoulios ◽  
H. Joseph Blumenthal ◽  
Ryan E. Anderson

Abstract

Background and Objective: The prevalence of value-based payment models has led to an increased use of the electronic health record to capture quality measures, necessitating additional documentation requirements for providers.

Methods: This case study uses text mining and natural language processing techniques to identify the timely completion of diabetic eye exams (DEEs) from 26,203 unique clinician notes for reporting as an electronic clinical quality measure (eCQM). Logistic regression and support vector machine (SVM) models, trained on unbalanced and balanced datasets using the synthetic minority over-sampling technique (SMOTE) algorithm, were evaluated on precision, recall, sensitivity, and F1 score for classifying records positive for DEE. We then integrated a high-precision DEE model to evaluate free-text clinical narratives from our clinical EHR system.

Results: Logistic regression and SVM models had comparable F1 score and specificity metrics, with the models trained and validated without oversampling favoring precision over recall. SVM with and without oversampling achieved the best precision, 0.96, and recall, 0.85, respectively. These two SVM models were applied to the unannotated 31,585 text segments representing 24,823 unique records and 13,714 unique patients. The number of records classified as positive for DEE using the SVM models ranged from 667 to 8,935 (2.7–36% of 24,823, respectively). Unique patients classified as positive for DEE ranged from 3.5 to 41.8%, highlighting the potential utility of these models.

Discussion: We believe the impact of oversampling on SVM model performance is caused by potential overfitting of the SVM SMOTE model on the synthesized data and by the data synthesis process itself. However, the specificities of SVM with and without SMOTE were comparable, suggesting both models were confident in their negative predictions. By prioritizing the SVM model with higher precision over sensitivity or recall in the categorization of DEEs, we can provide a highly reliable pool of results that can be documented through automation, reducing the burden of secondary review. Although the focus of this work was on completed DEEs, this method could be applied to other necessary documentation by extracting information from natural language in clinician notes.

Conclusion: By enabling the capture of data for eCQMs from documentation generated by usual clinical practice, this work represents a case study in how such techniques can be leveraged to drive quality without increasing clinician work.
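A minimal sketch of the modeling setup, assuming scikit-learn and imbalanced-learn: the note texts and labels below are invented placeholders, and the imblearn Pipeline keeps SMOTE inside the training step only (dropping the SMOTE step yields the unbalanced variant).

```python
# Sketch: SVM text classifier for DEE-positive notes, with SMOTE
# oversampling of the minority class applied only during fitting.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

notes = [
    "diabetic eye exam completed, no retinopathy",
    "dilated eye exam performed by ophthalmology",
    "patient declined eye exam this visit",
    "eye exam not performed; referral placed",
    "follow up in three months",
    "reviewed labs; a1c improved",
]
labels = [1, 1, 0, 0, 0, 0]  # 1 = timely DEE documented (placeholder data)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("smote", SMOTE(k_neighbors=1, random_state=0)),  # tiny toy data
    ("svm", SVC(kernel="linear")),
])
pipeline.fit(notes, labels)
print(pipeline.predict(["comprehensive diabetic eye exam done"]))
```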

