Recent Advances in Intelligent Source Code Generation: A Survey on Natural Language Based Studies

Entropy ◽  
2021 ◽  
Vol 23 (9) ◽  
pp. 1174
Author(s):  
Chen Yang ◽  
Yan Liu ◽  
Changqing Yin

Source Code Generation (SCG) is a prevalent research field in the automated software engineering sector that maps specific descriptions to various sorts of executable code. Alongside numerous intensive studies, diverse SCG types that integrate different scenarios and contexts continue to emerge. As the ultimate purpose of SCG, Natural Language-based Source Code Generation (NLSCG) is growing into an attractive and challenging field, owing to the expressibility and extremely high abstraction of its input end. The booming large-scale datasets generated by open-source code repositories and Q&A resources, innovations in machine learning algorithms, and the growth of computing capacity make the NLSCG field promising and give more opportunities for model implementation and refinement. Moreover, we have recently observed an increasing stream of NLSCG-related studies, representing quite diverse technical schools. However, many studies are bound to specific datasets with customization issues, producing occasionally successful solutions with tentative technical methods. There is no systematic study to explore and promote the further development of this field. We carried out a systematic literature survey and tool research to find potential improvement directions. First, we position the role of NLSCG among various SCG genres and specify the generation context empirically via software development domain knowledge and programming experience; second, we explore the selected studies collected through a carefully designed snowballing process, clarify the NLSCG field, and analyze the NLSCG problem, which lays a foundation for our subsequent investigation. Third, we model the research problems in terms of technical focus and adaptive challenges, and elaborate the insights gained from the NLSCG research backlog. Finally, we summarize the latest technology landscape over the transformation model and depict the critical tactics used in its essential components and their correlations. This research addresses the challenges of bridging the gap between natural language processing and source code analytics, outlines the different dimensions of NLSCG research concerns and technical utilities, and draws a bounded technical context of NLSCG to facilitate future studies in this promising area.
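
The transformation model at the heart of NLSCG is typically a neural encoder-decoder that maps a natural-language description to code. As a minimal, hedged sketch (not taken from the surveyed paper), the snippet below uses the Hugging Face transformers library with the publicly available Salesforce/codet5-base checkpoint purely to illustrate the NL-in, code-out pipeline; the checkpoint name and prompt are illustrative assumptions, and a practical system would be fine-tuned on paired NL/code data.

```python
# Hedged sketch: natural-language-to-code generation with a pretrained
# encoder-decoder. The checkpoint and prompt are illustrative assumptions;
# a real NLSCG system would fine-tune on paired NL/code data.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "Salesforce/codet5-base"  # assumed illustrative checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

description = "return the sum of squares of a list of integers"
inputs = tokenizer(description, return_tensors="pt")

# Decode a candidate code snippet from the natural-language description.
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```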

2019 ◽  
Vol 45 (3) ◽  
pp. 559-601 ◽  
Author(s):  
Edoardo Maria Ponti ◽  
Helen O’Horan ◽  
Yevgeni Berzak ◽  
Ivan Vulić ◽  
Roi Reichart ◽  
...  

Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from a lack of human-labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that, to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due both to intrinsic limitations of the databases (in terms of coverage and feature granularity) and to under-utilization of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.
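
One concrete way to adapt discrete typological categories to the continuous representations used in modern NLP, along the lines the survey advocates, is to encode database-style categorical features as per-language vectors that can be concatenated with learned embeddings. The sketch below is a hedged toy; the feature names and values are illustrative assumptions, not entries from any real typological database.

```python
# Hedged toy: turning discrete typological categories into continuous
# per-language feature vectors. Feature names/values are illustrative
# assumptions, not real database entries.
from sklearn.preprocessing import OneHotEncoder
import numpy as np

languages = ["english", "spanish", "japanese"]
features = np.array([
    ["SVO", "prepositions", "no_case"],
    ["SVO", "prepositions", "no_case"],
    ["SOV", "postpositions", "case_marking"],
])

encoder = OneHotEncoder()
language_vectors = encoder.fit_transform(features).toarray()

# Each row is now a continuous vector that can be concatenated with
# learned word or sentence embeddings to condition a multilingual model.
for lang, vec in zip(languages, language_vectors):
    print(lang, vec)
```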


2020 ◽  
Vol 4 (2) ◽  
pp. 52-66
Author(s):  
Anas Hamid Alokla ◽  
Walaa Khaled Gad ◽  
Mustafa M. Aref ◽  
Abdel-Badeeh M. Salem ◽  
...  

Source Code Generation (SCG) is a sub-domain of Automatic Programming (AP) that helps programmers write programs using high-level abstractions. Recently, many researchers have investigated techniques for achieving SCG. The problem is to choose the appropriate technique for generating source code given its purpose and inputs. This paper introduces a review and analysis of related SCG techniques. Moreover, comparisons are presented for: technique mapping, Natural Language Processing (NLP), knowledge bases, ontologies, the Specification Configuration Template (SCT) model, and deep learning.


2012 ◽  
Vol 2012 ◽  
pp. 1-17 ◽  
Author(s):  
Anne E. Thessen ◽  
Hong Cui ◽  
Dmitry Mozzherin

Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.
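
As an illustration of the kind of information extraction discussed here, the hedged sketch below uses a simple regular expression to flag candidate Latin binomial names (a capitalized genus followed by a lowercase epithet) in free text. Real taxonomic name-finding tools rely on dictionaries and machine-learned models, so this pattern is only a toy heuristic with an invented example sentence.

```python
# Hedged toy: flag candidate Latin binomial names in free text with a
# regular expression. Real name-finding tools (dictionary- and ML-based)
# are far more robust; this only illustrates the extraction step.
import re

# Capitalized genus (or abbreviated, e.g. "E.") followed by a lowercase
# specific epithet of at least three letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

text = ("Specimens of Quercus alba were collected alongside "
        "E. coli cultures described in earlier taxonomic treatments.")

for match in BINOMIAL.finditer(text):
    print(match.group(0))
```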


2021 ◽  
pp. 1063293X2098297
Author(s):  
Ivar Örn Arnarsson ◽  
Otto Frost ◽  
Emil Gustavsson ◽  
Mats Jirstrand ◽  
Johan Malmqvist

Product development companies collect data in the form of Engineering Change Requests for logged design issues, tests, and product iterations. These documents are rich in unstructured data (e.g. free text). Previous research affirms that product developers find that current IT systems lack capabilities to accurately retrieve relevant documents with unstructured data. In this research, we demonstrate a method using Natural Language Processing and document clustering algorithms to find structurally or contextually related documents in databases containing Engineering Change Request documents. The aim is to radically decrease the time needed to effectively search for related engineering documents, organize search results, and create labeled clusters from these documents by utilizing Natural Language Processing algorithms. A domain knowledge expert at the case company evaluated the results and confirmed that the algorithms we applied managed to find relevant document clusters given the queries tested.
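
A hedged sketch of the general approach described here, assuming scikit-learn and a small invented list of Engineering Change Request texts (the documents and cluster count are stand-ins, not the case company's data): vectorize the free text with TF-IDF, cluster with k-means, and label each cluster by its highest-weighted terms.

```python
# Hedged sketch: cluster free-text engineering documents with TF-IDF +
# k-means and label clusters by their top terms. Documents and the
# cluster count are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

documents = [
    "Bracket cracked during vibration test, request redesign of mount",
    "Fastener torque spec updated after vibration test failure",
    "Wiring harness routing conflicts with cooling duct in prototype",
    "Cooling duct clearance issue near harness, update routing drawing",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

# Label each cluster with its three highest-weighted terms.
terms = np.array(vectorizer.get_feature_names_out())
for cluster_id in range(kmeans.n_clusters):
    top = np.argsort(kmeans.cluster_centers_[cluster_id])[::-1][:3]
    print(f"cluster {cluster_id}: {', '.join(terms[top])}")
```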


2021 ◽  
Author(s):  
Xinxu Shen ◽  
Troy Houser ◽  
David Victor Smith ◽  
Vishnu P. Murty

The use of naturalistic stimuli, such as narrative movies, is gaining popularity in many fields characterizing memory, affect, and decision-making. Narrative recall paradigms are often used to capture the complexity and richness of memory for naturalistic events. However, scoring narrative recalls is time-consuming and prone to human biases. Here, we show the validity and reliability of using a natural language processing tool, the Universal Sentence Encoder (USE), to automatically score narrative recall. We compared the reliability of scoring between two independent raters (i.e., hand-scored) and between our automated algorithm and individual raters (i.e., automated) on trial-unique video clips of magic tricks. Study 1 showed that our automated segmentation approaches yielded high reliability and reflected the measures yielded by hand-scoring, and further that the results using USE outperformed another popular natural language processing tool, GloVe. In Study 2, we tested whether our automated approach remained valid when testing individuals varying on clinically relevant dimensions that influence episodic memory: age and anxiety. We found that our automated approach was equally reliable across age groups and anxiety groups, which shows the efficacy of our approach for assessing narrative recall in large-scale individual difference analyses. In sum, these findings suggest that machine learning approaches implementing USE are a promising tool for scoring large-scale narrative recalls and performing individual difference analyses in research using naturalistic stimuli.
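
As a hedged sketch of how USE-based scoring might work (the model URL is the public TensorFlow Hub release; the recall and reference sentences are invented, and the study's actual scoring pipeline may differ): embed a participant's recall and a reference description, then take the cosine similarity as a recall score.

```python
# Hedged sketch: score a narrative recall against a reference description
# with the Universal Sentence Encoder. Sentences are invented examples.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

reference = "The magician placed the coin under the cup and it vanished."
recall = "He hid a coin under a cup and then it disappeared."

vectors = embed([reference, recall]).numpy()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("recall score:", cosine(vectors[0], vectors[1]))
```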


10.29007/pc58 ◽  
2018 ◽  
Author(s):  
Julia Lavid ◽  
Marta Carretero ◽  
Juan Rafael Zamorano

In this paper we set forth an annotation model for dynamic modality in English and Spanish, given its relevance not only for contrastive linguistic purposes but also for practical annotation tasks in the Natural Language Processing (NLP) community. An annotation scheme is proposed which captures both the functional-semantic meanings and the language-specific realisations of dynamic meanings in both languages. The scheme is validated through a reliability study performed on a randomly selected set of one hundred and twenty sentences from the MULTINOT corpus, resulting in a high degree of inter-annotator agreement. We discuss our main findings and pay particular attention to the difficult cases, as these are currently being used to develop detailed guidelines for the large-scale annotation of dynamic modality in English and Spanish.
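
Inter-annotator agreement for a scheme like this is typically quantified with a chance-corrected statistic such as Cohen's kappa. A hedged sketch using scikit-learn follows; the two annotators' labels below are invented, not the MULTINOT annotations.

```python
# Hedged sketch: measure inter-annotator agreement on modality labels
# with Cohen's kappa. The two annotators' labels are invented examples.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["ability", "volition", "ability", "necessity", "volition", "ability"]
annotator_b = ["ability", "volition", "necessity", "necessity", "volition", "ability"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```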


2020 ◽  
Author(s):  
Joshua Conrad Jackson ◽  
Joseph Watts ◽  
Johann-Mattis List ◽  
Ryan Drabble ◽  
Kristen Lindquist

Humans have been using language for thousands of years, but psychologists seldom consider what natural language can tell us about the mind. Here we propose that language offers a unique window into human cognition. After briefly summarizing the legacy of language analyses in psychological science, we show how methodological advances have made these analyses more feasible and insightful than ever before. In particular, we describe how two forms of language analysis—comparative linguistics and natural language processing—are already contributing to how we understand emotion, creativity, and religion, and overcoming methodological obstacles related to statistical power and culturally diverse samples. We summarize resources for learning both of these methods, and highlight the best way to combine language analysis techniques with behavioral paradigms. Applying language analysis to large-scale and cross-cultural datasets promises to provide major breakthroughs in psychological science.


Author(s):  
Kaan Ant ◽  
Ugur Sogukpinar ◽  
Mehmet Fatih Amasyali

The use of databases containing semantic relationships between words is becoming increasingly widespread in efforts to make natural language processing more effective. Unlike the bag-of-words approach, the proposed semantic spaces give the distances between words, but they do not express the types of relations. In this study, it is shown how semantic spaces can be used to find the type of relationship, and the approach is compared with the template method. According to results obtained on a very large scale, semantic spaces are more successful for the is_a and opposite relations, while the template approach is more successful for the at_location, made_of, and non-relational types.
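
A hedged sketch of the two approaches being compared, assuming gensim's downloadable GloVe vectors and a handful of invented seed pairs (the seeds, test inputs, and templates are illustrative assumptions): the semantic-space route classifies a word pair by comparing its vector offset with the average offset of seed pairs for each relation, while the template route matches surface patterns such as "X is a Y".

```python
# Hedged sketch: infer a relation type for a word pair either from a
# semantic space (nearest average vector offset of seed pairs) or from
# surface templates. Seed pairs, test inputs, and templates are invented.
import re
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained space

seed_pairs = {
    "is_a": [("dog", "animal"), ("rose", "flower")],
    "opposite": [("hot", "cold"), ("big", "small")],
}

def offset(pair):
    return vectors[pair[1]] - vectors[pair[0]]

# Average offset per relation acts as a prototype direction.
prototypes = {rel: np.mean([offset(p) for p in pairs], axis=0)
              for rel, pairs in seed_pairs.items()}

def classify_by_space(pair):
    o = offset(pair)
    scores = {rel: float(np.dot(o, proto) /
                         (np.linalg.norm(o) * np.linalg.norm(proto)))
              for rel, proto in prototypes.items()}
    return max(scores, key=scores.get)

def classify_by_template(sentence):
    if re.search(r"\b(\w+) is an? (\w+)\b", sentence):
        return "is_a"
    if re.search(r"\b(\w+) is located in (\w+)\b", sentence):
        return "at_location"
    return "unknown"

print(classify_by_space(("cat", "animal")))        # semantic-space route
print(classify_by_template("a cat is an animal"))  # template route
```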


Author(s):  
Rashida Ali ◽  
Ibrahim Rampurawala ◽  
Mayuri Wandhe ◽  
Ruchika Shrikhande ◽  
Arpita Bhatkar

The Internet provides a medium to connect with individuals of similar or different interests, creating a hub. Since a large number of users participate on these platforms, a user can receive a high volume of messages from different individuals, creating chaos and unwanted messages. These messages sometimes contain true information and sometimes false, which leads to confusion among users and is the first step towards spam messaging. A spam message is an irrelevant and unsolicited message sent by a known or unknown user, which may lead to a sense of insecurity among users. In this paper, different machine learning algorithms were trained and tested with natural language processing (NLP) to classify whether messages are spam or ham.
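
A hedged sketch of the kind of pipeline described here; the training messages are invented, and the specific algorithm shown (multinomial Naive Bayes over TF-IDF features) is just one of the classifiers such a study might compare.

```python
# Hedged sketch: spam/ham text classification with TF-IDF features and
# multinomial Naive Bayes. The training messages are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Congratulations, you won a free prize, click this link now",
    "Lowest prices on medication, limited time offer",
    "Are we still meeting for lunch tomorrow?",
    "Please review the attached project report before Friday",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["Claim your free prize today"]))     # likely "spam"
print(model.predict(["Can you send the report today?"]))  # likely "ham"
```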

