A Comprehensive Analytical Study of Traditional and Recent Development in Natural Language Processing

This paper is a comprehensive analytical study of natural language processing (NLP), surveying the most prominent developments in the field over time, together with their most relevant outcomes and directions for future research. It begins with the basic concepts of text cleaning, such as tokenization and the role of stop words, and progresses to topics such as sequence modeling, speech recognition, and the influence of quantum computing concepts on NLP. The ongoing development of deep neural networks, the prevailing trend in artificial intelligence that keeps NLP at the cutting edge, is also covered. Throughout, concepts are explained broadly so that learners and researchers can gain a solid overall understanding of the field.
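As an illustration of the text-cleaning steps the survey opens with, the following is a minimal sketch of tokenization and stop-word removal. The token pattern and the tiny stop-word list are simplified assumptions for illustration, not a production pipeline (real pipelines draw fuller lists from libraries such as NLTK or spaCy).

```python
import re

# A deliberately minimal stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop high-frequency function words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("Tokenization is the first step of text cleaning.")
content = remove_stop_words(tokens)
print(content)  # ['tokenization', 'first', 'step', 'text', 'cleaning']
```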

2021
Vol 11 (7)
pp. 3184
Author(s):
Ismael Garrido-Muñoz
Arturo Montejo-Ráez
Fernando Martínez-Santiago
L. Alfonso Ureña-López

Deep neural networks are the dominant approach in many areas of machine learning, including natural language processing (NLP). Thanks to the availability of large corpora collections and the capability of deep architectures to capture internal language mechanisms through self-supervised learning (also known as “pre-training”), versatile and high-performing models are released continuously for every new network design. These networks learn, in effect, a probability distribution over the words and relations in the training collection, inheriting whatever flaws, inconsistencies and biases that collection contains. As pre-trained models have proved very useful for transfer learning, dealing with bias has become a relevant issue in this new scenario. We introduce bias in a formal way and explore how it has been treated in several networks, in terms of detection and correction. In addition, available resources are identified and a strategy to deal with bias in deep NLP is proposed.
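The abstract does not spell out how bias detection works in practice. One common family of detectors compares how strongly a target word's embedding associates with two contrasting attribute sets (in the spirit of WEAT-style association tests). The sketch below uses tiny hypothetical 3-dimensional vectors purely for illustration; a real probe would load pre-trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def association(word, set_a, set_b):
    """Mean similarity to attribute set A minus mean similarity to set B.
    A positive score means the word leans towards A."""
    sim_a = sum(cosine(word, v) for v in set_a) / len(set_a)
    sim_b = sum(cosine(word, v) for v in set_b) / len(set_b)
    return sim_a - sim_b

# Hypothetical toy embeddings, invented for this example.
emb = {
    "engineer": [0.9, 0.1, 0.2],
    "he":       [1.0, 0.0, 0.1],
    "she":      [0.1, 1.0, 0.1],
}
score = association(emb["engineer"], [emb["he"]], [emb["she"]])
# score > 0: in these toy vectors, "engineer" associates more with "he".
```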


2020
Vol 7 (1)
pp. 54-60
Author(s):
Falia Amalia
Moch Arif Bijaksana

Abstract — The Qur'an is a subject of linguistic research that has not yet been studied by many experts, and so has not gained a prominent place in the field. Yet the Qur'an contains a wealth of words that can be studied with natural language processing techniques such as text classification, document clustering and text summarization; among them are semantic similarity and the distributional semantic model. The purpose of this work is to build an evaluation dataset for the distributional semantic model in Bahasa Indonesia, covering two word classes, nouns and verbs, and to measure the similarity and relatedness of the 500 word pairs provided. We hope this helps the semantic study of the Qur'an grow, especially for its translation into the Indonesian language. The research also produces a dataset, like those of previously published studies, that future work on related topics can reuse. From the 6236 verses of the Qur'an, the system extracted 2193 nouns and 1733 verbs. These were processed with the Sim-Rel vector method and evaluated against a gold standard built from a questionnaire given to 15 respondents; performance measured with the Spearman rank correlation reached 0.909. Keywords — Natural Language Processing; Distributional Semantic Model; Sim-Rel Vector; Spearman Rank
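The Spearman rank correlation used to score the system against the gold standard is the Pearson correlation of the two rank sequences. A minimal sketch follows; the system scores and human ratings are hypothetical illustrations (the resulting 0.9 is a toy value, not the paper's 0.909).

```python
def rank(values):
    """Assign 1-based ranks in ascending order, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank sequences."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical system similarity scores vs. human (gold standard) ratings.
system = [0.9, 0.7, 0.4, 0.2, 0.1]
human  = [4.8, 4.1, 2.5, 0.7, 1.9]
rho = spearman(system, human)
print(rho)  # 0.9
```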


2019
Vol 19 (1)
Author(s):
Simon Geletta
Lendie Follett
Marcia Laugerman

Abstract

Background: This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns within research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research findings have reported that at least 10% of all studies funded by major research funding agencies terminate without yielding useful results. Since it is well known that scientific studies that receive funding from major agencies are carefully planned and rigorously vetted through the peer-review process, it was somewhat surprising to us that study terminations are this prevalent. Moreover, our review of the literature about study terminations suggested that the reasons for them are not well understood. We therefore aimed to address that knowledge gap by seeking to identify the factors that contribute to study failures.

Method: We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics) and unstructured data (the narrative descriptions of the studies). We applied NLP techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with terminated trials and with completed trials. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 “topics” with corresponding sets of probabilities, which we then used to predict study termination with random forest modeling. We fit two distinct models: one using only the structured data as predictors, and another using both the structured data and the 25 text topics derived from the unstructured data.

Results: In this paper, we demonstrate the interpretive and predictive value of LDA as it relates to predicting clinical trial failure. The results also demonstrate that the combined modeling approach yields robust predictive probabilities, in terms of both sensitivity and specificity, relative to a model that uses the structured data alone.

Conclusions: Our study demonstrated that topic modeling with LDA significantly raises the utility of unstructured data in predicting the completion vs. termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.
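The two-stage pipeline the Method describes — LDA topics derived from the narrative text, concatenated with structured covariates, fed to a random forest — can be sketched with scikit-learn. Everything below is a toy assumption: the four repeated records, the single structured covariate (enrollment size) and the reduced topic count stand in for the paper's 25 topics over ClinicalTrials.gov records.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for trial records: narrative text, one structured
# covariate (enrollment size), and a label where 1 = terminated.
descriptions = [
    "randomized trial of drug safety and efficacy in adults",
    "study terminated early due to poor enrollment and funding loss",
    "observational cohort study of long term outcomes",
    "trial stopped after interim analysis showed no benefit",
] * 5
structured = np.array([[120], [15], [300], [40]] * 5)
labels = np.array([0, 1, 0, 1] * 5)

# Stage 1: bag-of-words counts -> per-document topic probabilities.
counts = CountVectorizer().fit_transform(descriptions)
topics = LatentDirichletAllocation(n_components=3,  # paper uses 25
                                   random_state=0).fit_transform(counts)

# Stage 2: combine structured covariates with the topic probabilities
# and fit a random forest on the joint feature matrix.
features = np.hstack([structured, topics])
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, labels)
probs = clf.predict_proba(features)[:, 1]  # per-study risk of termination
```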


Author(s):  
Goran Klepac ◽  
Marko Velić

This chapter covers natural language processing techniques and their application in the development of predictive models. Two case studies are presented. The first describes a project in which textual descriptions of various situations in the call center of a telecommunications company were processed in order to predict churn. The second describes sentiment analysis of business news, along with the practical and testing issues that arise in text mining projects. The two case studies take different approaches and are implemented in different tools. The language of the texts processed in these projects is Croatian, which belongs to the Slavic group of languages and has more complex morphology and grammar rules than English. The chapter concludes with several points on possible future research in this domain.
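The first case study's core step — turning free-text call-center notes into features for a churn model — can be sketched as a bag-of-words classifier. Everything below is a toy assumption: the notes are English stand-ins for the Croatian originals (which, given the morphology the chapter mentions, would also need lemmatization or stemming), and TF-IDF with logistic regression stands in for whatever tools the project actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical call-center notes; label 1 = the customer later churned.
notes = [
    "customer angry about repeated billing errors threatens to cancel",
    "routine question about roaming settings resolved quickly",
    "complained about slow internet wants contract termination",
    "asked about family plan upgrade satisfied with offer",
] * 5
churned = [1, 0, 1, 0] * 5

# TF-IDF over unigrams and bigrams, then a linear classifier.
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(notes)
X = vectorizer.transform(notes)
model = LogisticRegression().fit(X, churned)
scores = model.predict_proba(X)[:, 1]  # churn probability per note
```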


2021
Vol 11 (15)
pp. 7160
Author(s):
Ramon Ruiz-Dolz
Montserrat Nofre
Mariona Taulé
Stella Heras
Ana García-Fornes

The application of the latest Natural Language Processing breakthroughs to computational argumentation has shown promising results, which have raised interest in this area of research. However, the available corpora with argumentative annotations are often limited to a very specific purpose or are not of adequate size to take advantage of state-of-the-art deep learning techniques (e.g., deep neural networks). In this paper, we present VivesDebate, a large, richly annotated and versatile professional debate corpus for computational argumentation research. The corpus has been created from 29 transcripts of a debate tournament in Catalan and has been machine-translated into Spanish and English. The annotation contains argumentative propositions, argumentative relations, debate interactions and professional evaluations of the arguments and argumentation. The presented corpus can be useful for research on a heterogeneous set of underlying computational argumentation tasks such as Argument Mining, Argument Analysis, Argument Evaluation or Argument Generation, among others. All this makes VivesDebate a valuable resource for computational argumentation research within the context of massive corpora aimed at Natural Language Processing tasks.


2021
Vol 6 (1)
Author(s):
Solomon Akinboro
Oluwadamilola Adebusoye
Akintoye Onamade

Offensive content refers to messages that are socially unacceptable, including vulgar or derogatory messages. As the use of social media increases worldwide, social media administrators face the challenge of keeping offensive content out, to ensure clean, non-abusive conversations on the platforms they provide. This work organizes and describes recent techniques for the automated detection of offensive language in social media content, providing a structured overview of previous approaches, including the algorithms, methods and main features used. Selection was from peer-reviewed articles on Google Scholar. Search terms included: profane words, natural language processing, multilingual context, hybrid methods for detecting profane words, and deep learning approaches for detecting profane words. Exclusions were made based on defined criteria. The initial search returned 203 studies, of which only 40 met the inclusion criteria: 6 were on natural language processing, 6 on deep learning approaches, 5 analysed hybrid approaches, 13 addressed multi-level or multilingual classification, and 10 covered other related methods. The limitations of previous efforts to detect offensive content are highlighted to aid future research in this area. Keywords — algorithm, offensive content, profane words, social media, texts

