Automatic Acquisition of Annotated Training Corpora for Test-Code Generation

Information, 2019, Vol 10 (2), pp. 66
Author(s): Magdalena Kacmajor, John Kelleher

Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine-learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one particularly interesting application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of the parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthesizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. We therefore propose synthesizing parallel corpora composed of parsed test function names, serving as code descriptions, aligned with the corresponding function bodies. We present the results of applying one of the state-of-the-art MT methods to such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.
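To make the corpus-synthesis idea concrete, here is a minimal sketch of parsing a descriptive test method name into a quasi-natural-language description; the JUnit-style `test` prefix and the camel-case splitting heuristic are illustrative assumptions, not the paper's exact parsing rules.

```python
import re

def test_name_to_description(name: str) -> str:
    """Turn a descriptive unit-test name into a quasi-natural-language
    description, e.g. 'testReturnsEmptyListWhenInputIsNull' ->
    'returns empty list when input is null'."""
    name = re.sub(r"^test_?", "", name)  # drop the conventional prefix
    # Split on camelCase boundaries, acronyms, and digit runs.
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return " ".join(w.lower() for w in words)

# Each (description, function body) pair becomes one parallel example.
print(test_name_to_description("testReturnsEmptyListWhenInputIsNull"))
# -> returns empty list when input is null
```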

2020, Vol 11
Author(s): Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, ...

Generative models are becoming a tool of choice for exploring the molecular space. These models are trained on a large dataset and produce novel molecular structures with similar properties. Generated structures can be utilized for virtual screening or for training semi-supervised predictive models in downstream tasks. While there are plenty of generative models, it is unclear how to compare and rank them. In this work, we introduce a benchmarking platform called Molecular Sets (MOSES) to standardize the training and comparison of molecular generative models. MOSES provides training and testing datasets, and a set of metrics to evaluate the quality and diversity of generated structures. We have implemented and compared several molecular generation models, and we suggest using our results as reference points for further advancements in generative chemistry research. The platform and source code are available at https://github.com/molecularsets/moses.
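MOSES ships its own metric implementations; purely as an illustration of what such metrics compute, here is a hedged re-derivation of three standard ones (validity, uniqueness, novelty) using RDKit, not the platform's actual code.

```python
from rdkit import Chem

def validity(smiles):
    """Fraction of generated SMILES that parse into valid molecules."""
    return sum(Chem.MolFromSmiles(s) is not None for s in smiles) / len(smiles)

def uniqueness(smiles):
    """Fraction of valid, canonicalized SMILES that are distinct."""
    canon = [Chem.MolToSmiles(m) for m in map(Chem.MolFromSmiles, smiles)
             if m is not None]
    return len(set(canon)) / len(canon)

def novelty(smiles, train_smiles):
    """Fraction of unique generated molecules absent from the training set."""
    canon = {Chem.MolToSmiles(m) for m in map(Chem.MolFromSmiles, smiles)
             if m is not None}
    return len(canon - set(train_smiles)) / len(canon)
```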


2021, Vol 23 (06), pp. 10-22
Author(s): Ms. Anshika Shukla, Mr. Sanjeev Kumar Shukla

In recent years, various methods for source code classification using deep learning have been proposed. The classification accuracy of such methods is greatly influenced by the training dataset, so a model with higher accuracy can be created by improving how the training dataset is constructed. In this study, we propose a method for dynamically improving the training dataset for source code classification using deep learning. In the proposed method, we first train and validate the source code classification model on the training dataset. Next, we reconstruct the training dataset based on the validation results. By repeating this training and reconstruction, progressively improving the training dataset, we obtain a high-accuracy model. In the evaluation experiment, a source code classification model was trained using the proposed method, and its classification accuracy was compared with three baseline methods. The model trained using the proposed method achieved the highest classification accuracy. We also confirmed that the proposed method improves the classification accuracy of the model from 0.64 to 0.96.
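A minimal sketch of the train-validate-reconstruct loop follows; the stopping rule and the `reconstruct` callback (e.g., dropping or re-weighting misclassified samples) are assumptions, since the paper's exact reconstruction rule is not given here.

```python
def iterative_dataset_improvement(train_set, val_set, build_model,
                                  reconstruct, rounds=5):
    """Repeatedly train, validate, and rebuild the training set.

    `reconstruct` maps (train_set, validation_errors) to a new training
    set, e.g. by removing suspect samples or re-weighting hard ones.
    """
    model, best_acc = None, 0.0
    for _ in range(rounds):
        model = build_model(train_set)              # train on current set
        errors = [(x, y) for x, y in val_set
                  if model.predict(x) != y]         # collect mistakes
        acc = 1 - len(errors) / len(val_set)
        if acc <= best_acc:                         # stop when no gain
            break
        best_acc = acc
        train_set = reconstruct(train_set, errors)  # rebuild the dataset
    return model, best_acc
```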


Entropy, 2021, Vol 23 (9), pp. 1174
Author(s): Chen Yang, Yan Liu, Changqing Yin

Source Code Generation (SCG) is a prevalent research field in the automated software engineering sector that maps specific descriptions to various sorts of executable code. Alongside numerous intensive studies, diverse SCG types that integrate different scenarios and contexts continue to emerge. As the ultimate goal of SCG, Natural Language-based Source Code Generation (NLSCG) is growing into an attractive and challenging field, owing to the expressiveness and extremely high level of abstraction of its input. The booming large-scale datasets generated by open-source code repositories and Q&A resources, innovations in machine learning algorithms, and the growth of computing capacity make the NLSCG field promising and offer more opportunities for model implementation and refinement. Moreover, we have recently observed an increasing stream of NLSCG-relevant studies, representing quite diverse technical schools. However, many studies are bound to specific datasets with customization issues, producing occasional successful solutions with tentative technical methods, and there has been no systematic study to explore and promote the further development of this field. We carried out a systematic literature survey and tool research to find potential improvement directions. First, we position the role of NLSCG among various SCG genres, and specify the generation context empirically via software development domain knowledge and programming experience. Second, we explore the selected studies collected by a carefully designed snowballing process, clarify the NLSCG field, and characterize the NLSCG problem, which lays a foundation for our subsequent investigation. Third, we model the research problems in terms of technical focus and adaptive challenges, and elaborate on insights gained from the NLSCG research backlog. Finally, we summarize the latest technology landscape around the transformation model and depict the critical tactics used in the essential components and their correlations. This research addresses the challenge of bridging the gap between natural language processing and source code analytics, outlines different dimensions of NLSCG research concerns and technical utilities, and delineates a bounded technical context of NLSCG to facilitate future studies in this promising area.


2020, Vol 4 (2), pp. 52-66
Author(s): Anas Hamid Alokla, Walaa Khaled Gad, Mustafa M. Aref, Abdel-Badeeh M. Salem, ...

Source Code Generation (SCG) is a sub-domain of Automatic Programming (AP) that helps programmers program using high-level abstractions. Recently, many researchers have investigated techniques for approaching SCG. The problem is selecting the technique appropriate to the purpose of the generated source code and its inputs. This paper introduces a review and analysis of related SCG techniques. Moreover, comparisons are presented for: technique mapping, Natural Language Processing (NLP), knowledge bases, ontologies, the Specification Configuration Template (SCT) model, and deep learning.


2020, Vol 34 (05), pp. 7756-7763
Author(s): Zuohui Fu, Yikun Xian, Shijie Geng, Yingqiang Ge, Yuting Wang, ...

A number of cross-lingual transfer learning approaches based on neural networks have been proposed for the case when large amounts of parallel text are at our disposal. However, in many real-world settings, the size of parallel annotated training data is restricted. Additionally, prior cross-lingual mapping research has mainly focused on the word level. This raises the question of whether such techniques can also be applied to effortlessly obtain cross-lingually aligned sentence representations. To this end, we propose an Adversarial Bi-directional Sentence Embedding Mapping (ABSent) framework, which learns mappings of cross-lingual sentence representations from limited quantities of parallel data. The experiments show that our method outperforms several technically more powerful approaches, especially under challenging low-resource circumstances. The source code is available from https://github.com/zuohuif/ABSent along with relevant datasets.
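To illustrate the general idea of adversarial alignment of embedding spaces, a minimal PyTorch sketch follows; the embedding size, discriminator shape, and training details are assumptions and do not reproduce ABSent's exact architecture.

```python
import torch
import torch.nn as nn

dim = 512  # assumed sentence-embedding dimensionality

# Linear map from the source-language into the target-language space.
mapper = nn.Linear(dim, dim, bias=False)
# Discriminator tries to tell mapped source embeddings from real targets.
disc = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

opt_m = torch.optim.Adam(mapper.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(src_emb, tgt_emb):
    # 1) Discriminator update: mapped source = fake (0), target = real (1).
    mapped = mapper(src_emb).detach()
    d_loss = bce(disc(tgt_emb), torch.ones(len(tgt_emb), 1)) + \
             bce(disc(mapped), torch.zeros(len(mapped), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Mapper update: fool the discriminator into scoring mapped as real.
    g_loss = bce(disc(mapper(src_emb)), torch.ones(len(src_emb), 1))
    opt_m.zero_grad(); g_loss.backward(); opt_m.step()
    return d_loss.item(), g_loss.item()
```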


2020, Vol 27
Author(s): Zaheer Ullah Khan, Dechang Pi

Background: S-sulfenylation (S-sulphenylation, or sulfenic acid formation) of proteins is a special kind of post-translational modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Complementing existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine (SC) sites. However, the performance of these models has not been satisfactory, owing to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: In this study, our motivation is to establish a strong and novel computational predictor for discriminating sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics feature encoding tool, named DeepSSPred, in which the encoded features are obtained via an n-segmented hybrid feature scheme, and the resampling technique called synthetic minority oversampling (SMOTE) is employed to cope with the severe imbalance between SC sites (minority class) and non-SC sites (majority class). A state-of-the-art 2D convolutional neural network was employed, with a rigorous 10-fold cross-validation technique for model validation and authentication. Results: Following the proposed framework, the strong discrete representation of the feature space, the machine learning engine, and the unbiased presentation of the underlying training data yielded an excellent model that outperforms all existing established studies. The proposed approach is 6% higher in MCC than the first-best method, which failed to provide sufficient details on an independent dataset. Compared with the second-best method, the model obtained increases of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data, and 12.13% in ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset. These empirical analyses show the superlative performance of the proposed model on both the training and independent datasets in comparison with existing literature. Conclusion: In this research, we have developed a novel sequence-based automated predictor for SC sites, called DeepSSPred. The empirical simulation outcomes on a training dataset and an independent validation dataset have revealed the efficacy of the proposed model. The good performance of DeepSSPred is due to several factors, such as the novel discriminative feature encoding scheme, the SMOTE technique, and the careful construction of the prediction model through a tuned 2D-CNN classifier. We believe that our work provides potential insight into the further prediction of S-sulfenylation characteristics and functionalities, and we hope that the developed predictor will be significantly helpful for large-scale discrimination of unknown SC sites in particular, and for designing new pharmaceutical drugs in general.
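As an illustration of the resampling-plus-CNN pipeline, here is a hedged sketch using imbalanced-learn's SMOTE and a small Keras 2D CNN; the feature shape, layer sizes, and placeholder data are assumptions, not DeepSSPred's actual configuration.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from tensorflow import keras

# Placeholder data: flattened hybrid feature vectors per cysteine site,
# later reshaped into a 2D grid for the convolutional layers.
X = np.random.rand(1000, 31 * 20)
y = np.random.randint(0, 2, 1000)  # 1 = SC site (minority), 0 = non-SC

# Balance the minority SC sites against the majority non-SC sites.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
X_res = X_res.reshape(-1, 31, 20, 1)

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu",
                        input_shape=(31, 20, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_res, y_res, epochs=5, batch_size=64, validation_split=0.1)
```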


2020, Vol 12 (9), pp. 1418
Author(s): Runmin Dong, Cong Li, Haohuan Fu, Jie Wang, Weijia Li, ...

Substantial progress has been made in the field of large-area land cover mapping as the spatial resolution of remotely sensed data increases. However, a significant amount of human labor is still required to label images for training and testing purposes, especially in high-resolution (e.g., 3-m) land cover mapping. In this research, we propose a solution that can produce 3-m resolution land cover maps on a national scale without any human labeling effort. First, using public 10-m resolution land cover maps as an imperfect training dataset, we propose a deep-learning-based approach that can effectively transfer the existing knowledge. Then, we improve the efficiency of our method through a network pruning process for national-scale land cover mapping. Our proposed method can take state-of-the-art 10-m resolution land cover maps (with an accuracy of 81.24% for China) as the training data, enable a transfer learning process that produces 3-m resolution land cover maps, and further improve the overall accuracy (OA) to 86.34% for China. We present detailed results obtained over three megacities in China to demonstrate the effectiveness of our proposed approach for 3-m resolution large-area land cover mapping.
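A brief sketch of the pruning step is given below, assuming a PyTorch segmentation backbone; the network choice, the 30% ratio, and the L1 criterion are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.utils.prune as prune
import torchvision

# Stand-in segmentation backbone; the paper's actual network may differ.
model = torchvision.models.segmentation.fcn_resnet50(num_classes=10)

# Remove the 30% lowest-magnitude weights from every conv layer to cut
# inference cost for national-scale mapping; the ratio is an assumption.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent
```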


2015, Vol 32 (7), pp. 1341-1355
Author(s): S. J. Rennie, M. Curtis, J. Peter, A. W. Seed, P. J. Steinle, ...

The Australian Bureau of Meteorology’s operational weather radar network comprises a heterogeneous radar collection covering diverse geography and climate. A naïve Bayes classifier has been developed to identify a range of common echo types observed with these radars. The success of the classifier has been evaluated against its training dataset and by routine monitoring. The training data indicate that more than 90% of precipitation may be identified correctly. The echo types most difficult to distinguish from rainfall are smoke, chaff, and anomalous propagation ground and sea clutter. Their impact depends on their climatological frequency. Small quantities of frequently misclassified persistent echo (like permanent ground clutter or insects) can also cause quality control issues. The Bayes classifier is demonstrated to perform better than a simple threshold method, particularly for reducing misclassification of clutter as precipitation. However, the result depends on finding a balance between excluding precipitation and including erroneous echo. Unlike many single-polarization classifiers that are only intended to extract precipitation echo, the Bayes classifier also discriminates types of nonprecipitation echo. Therefore, the classifier provides the means to utilize clear air echo for applications like data assimilation, and the class information will permit separate data handling of different echo types.
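For readers unfamiliar with the technique, a minimal sketch of naïve Bayes echo-type classification follows; the feature set, class list, and placeholder data are illustrative assumptions, not the Bureau's operational configuration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Illustrative echo classes and placeholder features (e.g. reflectivity
# texture, Doppler velocity, spectrum width); both are assumptions here.
classes = ["precipitation", "ground_clutter", "sea_clutter",
           "chaff", "smoke", "insects"]
rng = np.random.default_rng(0)
X_train = rng.random((600, 3))
y_train = rng.integers(0, len(classes), 600)

clf = GaussianNB().fit(X_train, y_train)

# Per-class posteriors let downstream users keep clear-air echo (e.g.
# insects) for data assimilation instead of discarding it as clutter.
X_new = rng.random((5, 3))
posteriors = clf.predict_proba(X_new)
labels = [classes[i] for i in clf.predict(X_new)]
```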


2012, Vol 263-266, pp. 1961-1968
Author(s): Yong Chao Song, Bu Dan Wu, Jun Liang Chen

Based on the characteristics of JBPM workflow system development, the target code to be generated is determined by analyzing the JBPM workflow development process and the J2EE architecture. The code generation tool generates code by parsing the static form source code and loading code generation templates. The tool greatly shortens the JBPM workflow system development cycle and reduces the cost of software development, while offering good practicality and scalability.
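The tool itself targets Java/J2EE; purely as an illustration of template-driven code generation, here is a small Python sketch with an assumed handler template and invented field names.

```python
from string import Template

# Assumed template for a generated workflow-handler class; the fields
# are illustrative, not the tool's actual template format.
TEMPLATE = Template("""\
public class ${task}Handler {
    public void execute(ProcessContext ctx) {
        ctx.complete("${task}", "${next_task}");
    }
}
""")

def generate_handler(form_fields: dict) -> str:
    """Fill the template with values parsed from the static form source."""
    return TEMPLATE.substitute(form_fields)

print(generate_handler({"task": "ApproveLeave", "next_task": "NotifyHR"}))
```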

