Reusable Component Retrieval from a Large Repository Using Word2Vec with Continuous Bag of Words

2021 ◽  
Vol 26 (5) ◽  
pp. 453-460
Author(s):  
Krishna Chythanya Nagaraju ◽  
Cherku Ramesh Kumar Reddy

A reusable code component is one that can be used with little or no adaptation to fit into the application being developed. The major concern in this process is maintaining these reusable components in one place, called a ‘Repository’, so that they can be effectively identified and reused. Word embeddings allow us to represent textual information numerically. They have become so pervasive that almost all Natural Language Processing projects make use of them. In this work, we use Word2Vec to find vector representations of the features of a reusable component. The features of a reusable component, in the form of a sequence of words, are input to the Word2Vec network. Our method using Word2Vec with Continuous Bag of Words outperforms existing methods in the market. The proposed methodology has shown an accuracy of 94.8% in identifying the existing reusable component.
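
The abstract does not give implementation details, so the following is only a minimal sketch of the general Continuous Bag of Words idea: the averaged vectors of the context words are trained to predict the centre word, and a component is then represented by the average of its feature-word vectors. The toy feature lists, the function names (`train_cbow`, `component_vector`, `cosine`), the full-softmax training loop and all hyperparameters are assumptions made for illustration, not the authors' method.

```python
import math
import random


def train_cbow(sentences, dim=8, window=2, epochs=50, lr=0.05, seed=0):
    """Train a tiny CBOW model with a full softmax (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for sentence in sentences:
            ids = [index[w] for w in sentence]
            for pos, target in enumerate(ids):
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                context = [ids[j] for j in range(lo, hi) if j != pos]
                if not context:
                    continue
                # Hidden layer: mean of the context words' input vectors.
                h = [sum(w_in[c][k] for c in context) / len(context) for k in range(dim)]
                scores = [sum(w_out[v][k] * h[k] for k in range(dim)) for v in range(size)]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                norm = sum(exps)
                probs = [e / norm for e in exps]
                total += -math.log(probs[target])
                # Backpropagate the cross-entropy error (probs - one_hot).
                grad_h = [0.0] * dim
                for v in range(size):
                    err = probs[v] - (1.0 if v == target else 0.0)
                    for k in range(dim):
                        grad_h[k] += err * w_out[v][k]
                        w_out[v][k] -= lr * err * h[k]
                for c in context:
                    for k in range(dim):
                        w_in[c][k] -= lr * grad_h[k] / len(context)
        losses.append(total)
    return index, w_in, losses


def component_vector(words, index, w_in):
    """Average the learned vectors of a component's known feature words."""
    known = [w_in[index[w]] for w in words if w in index]
    dim = len(w_in[0])
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical feature descriptions of four repository components.
features = [
    ["sort", "integer", "array", "ascending"],
    ["sort", "string", "list", "ascending"],
    ["parse", "json", "file", "reader"],
    ["parse", "xml", "file", "reader"],
]
index, w_in, losses = train_cbow(features)
query = component_vector(["sort", "array"], index, w_in)
similarities = [cosine(query, component_vector(f, index, w_in)) for f in features]
```

Ranking the components by `similarities` gives a simple retrieval order. A real system would train on a far larger corpus and would more likely use an established Word2Vec implementation than this hand-written loop.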

2021 ◽  
Vol 11 (6) ◽  
pp. 2663
Author(s):  
Zhengru Shen ◽  
Marco Spruit

The Summary of Product Characteristics from the European Medicines Agency is a reference document on medicines in the EU. It contains textual information for clinical experts on how to safely use medicines, including adverse drug reactions. Using natural language processing (NLP) techniques to automatically extract adverse drug reactions from such unstructured textual information helps clinical experts to use them effectively and efficiently in daily practice. Such techniques have been developed for Structured Product Labels from the Food and Drug Administration (FDA), but there is no research focusing on extraction from the Summary of Product Characteristics. In this work, we built a natural language processing pipeline that automatically scrapes Summaries of Product Characteristics online and then extracts adverse drug reactions from them. We have also made the method and its output publicly available so that it can be reused and further evaluated in clinical practice. In total, we extracted 32,797 common adverse drug reactions for 647 common medicines scraped from the Electronic Medicines Compendium. A manual review of 37 commonly used medicines has indicated good performance, with a recall and precision of 0.99 and 0.934, respectively.
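
As a reminder of how recall and precision figures like those above are typically computed, here is a small sketch that compares a set of extracted terms against a manually reviewed set. The reaction names and the function name `precision_recall` are hypothetical, and the numbers it produces are unrelated to the results reported in the abstract.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: terms found by a pipeline vs. a manual review.
extracted = {"nausea", "headache", "rash", "dizziness"}
gold = {"nausea", "headache", "rash", "fatigue", "insomnia"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the share of extracted terms that are correct, and recall is the share of the reviewed terms that the pipeline found.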


2021 ◽  
Vol 20 (8) ◽  
pp. 1574-1594
Author(s):  
Aleksandr R. NEVREDINOV

Subject. When evaluating enterprises, maximum accuracy and comprehensiveness of analysis are important, although the use of various indicators of an organization’s financial condition and external factors provides sufficiently high forecasting accuracy. Many researchers are increasingly focusing on natural language processing to analyze various text sources. This subject is extremely relevant given the need of companies to analyze their activities quickly and extensively. Objectives. The study aims at exploring natural language processing methods and sources of textual information about companies that can be used in the analysis, and at developing an approach to the analysis of textual information. Methods. The study draws on methods of analysis and synthesis, systematization, formalization, comparative analysis, and theoretical and methodological provisions contained in domestic and foreign scientific works on text analysis, including for purposes of company evaluation. Results. I offer and test an approach to using non-numeric indicators for company analysis. The paper presents a unique model, which is created on the basis of existing developments that have shown their effectiveness. I also substantiate the use of this approach to analyze a company’s condition and to include the analysis results in models for overall assessment of the state of companies. Conclusions. The findings improve scientific and practical understanding of techniques for the analysis of companies and of ways of applying text analysis using machine learning. They can be used to support management decision-making and to automate the analysis of a company’s own condition and that of other companies in the market with which it interacts.


2021 ◽  
Author(s):  
Minoru Yoshida ◽  
Kenji Kita

Both words and numerals are tokens found in almost all documents, but they have different properties. However, relatively little attention has been paid to numerals found in texts, and many systems have treated the numbers found in documents in ad hoc ways, such as regarding them as mere strings in the same way as words, normalizing them to zeros, or simply ignoring them. The recent growth of natural language processing (NLP) research has changed this situation, and more and more attention has been paid to numeracy in documents. In this survey, we provide a quick overview of the history and recent advances of research on mining such relations between numerals and words found in text data.
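
The three ad hoc treatments of numerals mentioned above can be illustrated with a short sketch. The tokenizer pattern, the function names and the sample sentence are assumptions chosen for illustration and are not taken from the survey.

```python
import re

# Simple tokenizer: decimal numbers first, then word characters.
TOKEN = re.compile(r"\d+(?:\.\d+)?|\w+")


def tokenize(text):
    return TOKEN.findall(text.lower())


def is_numeral(token):
    return token[0].isdigit()


def keep_as_strings(tokens):
    """Treat numerals as ordinary vocabulary items."""
    return list(tokens)


def normalize_to_zeros(tokens):
    """Replace every digit with 0 so only the shape of the number remains."""
    return [re.sub(r"\d", "0", t) if is_numeral(t) else t for t in tokens]


def drop_numerals(tokens):
    """Ignore numerals entirely."""
    return [t for t in tokens if not is_numeral(t)]


tokens = tokenize("Revenue rose 12.5 percent in 2021")
print(keep_as_strings(tokens))
print(normalize_to_zeros(tokens))
print(drop_numerals(tokens))
```

All three strategies discard the magnitude of the number, which is the limitation that numeracy-aware approaches try to address.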


Author(s):  
Davide Picca ◽  
Dominique Jaccard ◽  
Gérald Eberlé

In recent decades, Natural Language Processing (NLP) has achieved a high level of success. Interactions between NLP and Serious Games have started, and some Serious Games already include NLP techniques. The objectives of this paper are twofold: on the one hand, providing a simple framework to enable analysis of potential uses of NLP in Serious Games and, on the other hand, applying the NLP framework to existing Serious Games and giving an overview of the use of NLP in pedagogical Serious Games. In this paper we present 11 serious games exploiting NLP techniques. We present them systematically, according to the following structure: first, we highlight possible uses of NLP techniques in Serious Games; second, we describe the type of NLP implemented in each specific Serious Game; and, third, we provide a link to possible purposes of use for the different actors interacting in the Serious Game.


2020 ◽  
Vol 28 (4) ◽  
pp. 1551-1579
Author(s):  
Leevi Rantala ◽  
Mika Mäntylä

Self-admitted technical debt refers to sub-optimal development solutions that are expressed in written code comments or commits. We reproduce and improve on prior work by Yan et al. (2018) on detecting commits that introduce self-admitted technical debt. We use multiple natural language processing methods: Bag-of-Words, topic modeling, and word embedding vectors. We study 5 open-source projects. Our NLP approach uses logistic Lasso regression from Glmnet to automatically select the best predictor words. A manually labeled dataset from prior work that identified self-admitted technical debt from code-level commits serves as ground truth. Our approach achieves +0.15 better area under the ROC curve performance than prior work when comparing only commit message features, and a +0.03 better result overall when replacing manually selected features with automatically selected words. In both cases, the improvement was statistically significant (p < 0.0001). Our work has four main contributions: comparing different NLP techniques for SATD detection, improving results over previous work, showing how to generate generalizable predictor words when using multiple repositories, and producing a list of words correlating with SATD. As a concrete result, we release a list of the predictor words that correlate positively with SATD, as well as the datasets and scripts we used, to enable replication studies and to aid in the creation of future classifiers.
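
The area under the ROC curve used as the comparison metric above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The labels, scores and the function name `roc_auc` are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = commit introduces SATD, 0 = it does not.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

This pairwise form is quadratic in the number of examples; for large datasets a rank-based computation or a library routine would be the usual choice.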


2014 ◽  
Vol 22 (1) ◽  
pp. 73-95 ◽  
Author(s):  
GÁBOR BEREND

Keyphrases are the most important phrases of documents that make them suitable for improving natural language processing tasks, including information retrieval, document classification, document visualization, summarization and categorization. Here, we propose a supervised framework augmented by novel extra-textual information derived primarily from Wikipedia. Wikipedia is utilized in such an advantageous way that – unlike most other methods relying on Wikipedia – a full textual index of all the Wikipedia articles is not required by our approach, as we only exploit the category hierarchy and a list of multiword expressions derived from Wikipedia. This approach is not only less resource intensive, but also produces comparable or superior results compared to previous similar works. Our thorough evaluations also suggest that the proposed framework performs consistently well on multiple datasets, being competitive or even outperforming the results obtained by other state-of-the-art methods. Besides introducing features that incorporate extra-textual information, we also experimented with a novel way of representing features that are derived from the POS tagging of the keyphrase candidates.


2020 ◽  
Vol 9 (05) ◽  
pp. 25039-25046 ◽  
Author(s):  
Rahul C Kore ◽  
Prachi Ray ◽  
Priyanka Lade ◽  
Amit Nerurkar

Reading legal documents is tedious and sometimes requires domain knowledge related to the document. It is hard to read a full legal document without missing key sentences. With the increasing number of legal documents, it would be convenient to get the essential information from a document without having to go through all of it. The purpose of this study is to make it possible to understand a large legal document in a short time. Summarization gives flexibility and convenience to the reader. Using vector representations of words, text ranking algorithms, and similarity techniques, this study gives a way to produce the highest-ranked sentences. Summarization produces a result that covers the most vital information of the document in a concise manner. The paper proposes how different natural language processing concepts can be used to produce the desired result and relieve readers from going through the whole complex document. This study presents the steps required to achieve this aim and elaborates on the algorithms used at each step in the process.
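
The combination of vector representations, sentence similarity and text ranking described above can be sketched as a TextRank-style extractive summarizer. The abstract does not specify the exact algorithms, so everything here is an assumption for illustration: bag-of-words count vectors stand in for word embeddings, the damping factor and iteration count are conventional defaults, and the sample sentences and function names are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words count vector (no stop-word removal in this sketch)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def textrank_summary(sentences, top_k=2, damping=0.85, iterations=50):
    """Rank sentences by weighted PageRank over a similarity graph."""
    vectors = [sentence_vector(s) for s in sentences]
    n = len(sentences)
    sim = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] * scores[j] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(best)]


sentences = [
    "The tenant shall pay rent monthly.",
    "Rent is due on the first day of each month.",
    "The tenant shall pay a late fee if rent is overdue.",
    "Zebras graze quietly.",
]
summary = textrank_summary(sentences, top_k=2)
print(summary)
```

A sentence that shares no words with the rest of the document, like the last one in the toy input, receives only the baseline score and is therefore never selected ahead of connected sentences.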


1990 ◽  
Vol 5 (4) ◽  
pp. 225-249 ◽  
Author(s):  
Ann Copestake ◽  
Karen Sparck Jones

This paper reviews the current state of the art in natural language access to databases. This has been a long-standing area of work in natural language processing. But though some commercial systems are now available, providing front ends has proved much harder than was expected, and the necessary limitations on front ends have to be recognized. The paper discusses the issues, both general to language and task-specific, involved in front end design, and the way these have been addressed, concentrating on the work of the last decade. The focus is on the central process of translating a natural language question into a database query, but other supporting functions are also covered. The points are illustrated by the use of a single example application. The paper concludes with an evaluation of the current state, indicating that future progress will depend on the one hand on general advances in natural language processing, and on the other on expanding the capabilities of traditional databases.


Author(s):  
Laura Felice ◽  
Carmen Leonardi ◽  
Liliana Favre ◽  
Maria Virginia Mauco

Reusability is the ability to use the same software elements for constructing many different applications. Formal specifications can support semiautomatic design processes based on reusable components. However, during the first stages of development, when interaction with the stakeholders is crucial, the use of client-oriented requirements engineering techniques seems necessary to enhance communication between the stakeholders and the software engineers. In this chapter, we propose a systematic reuse approach that integrates natural language requirement specifications with formal specifications in RSL (RAISE Specification Language). On the one hand, some heuristics are described for developing a formal specification in RSL starting from models belonging to the Requirements Baseline. On the other hand, we define a reusable component model that integrates RSL specifications at different levels of abstraction, and we present a process with reuse based on the model.

