A Comprehensive Analytical Study of Traditional and Recent Development in Natural Language Processing

This paper is a comprehensive analytical study of natural language processing (NLP), surveying the most prominent developments in the field over time, together with their most relevant outcomes and directions for future research. It begins with the basic concepts of text cleaning, such as tokenization and the role of stop words, and progresses to topics such as sequence modeling, speech recognition, and the influence of quantum computing concepts on NLP. The ongoing development of deep neural networks, the prevailing trend in artificial intelligence that keeps NLP at the cutting edge, is also covered. Throughout, concepts are explained broadly so that learners and researchers can gain a solid overall understanding of the field.
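As an illustration of the text-cleaning steps the survey opens with, the following is a minimal sketch of tokenization and stop-word removal. The token pattern and the tiny stop-word list are simplified assumptions for illustration, not a production pipeline (real pipelines draw fuller lists from libraries such as NLTK or spaCy).

```python
import re

# A deliberately minimal stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop high-frequency function words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("Tokenization is the first step of text cleaning.")
content = remove_stop_words(tokens)
print(content)  # ['tokenization', 'first', 'step', 'text', 'cleaning']
```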

2021
Vol 11 (7)
pp. 3184
Author(s):
Ismael Garrido-Muñoz
Arturo Montejo-Ráez
Fernando Martínez-Santiago
L. Alfonso Ureña-López

Deep neural networks are the dominant approach in many areas of machine learning, including natural language processing (NLP). Thanks to the availability of large corpora collections and the capability of deep architectures to capture internal language mechanisms through self-supervised learning (also known as “pre-training”), versatile and high-performing models are released continuously for every new network design. These networks learn, in effect, a probability distribution over the words and relations in the training collection, inheriting whatever flaws, inconsistencies and biases that collection contains. As pre-trained models have proved very useful for transfer learning, dealing with bias has become a relevant issue in this new scenario. We introduce bias in a formal way and explore how it has been treated in several networks, in terms of detection and correction. In addition, available resources are identified and a strategy to deal with bias in deep NLP is proposed.
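The abstract does not spell out how bias detection works in practice. One common family of detectors compares how strongly a target word's embedding associates with two contrasting attribute sets (in the spirit of WEAT-style association tests). The sketch below uses tiny hypothetical 3-dimensional vectors purely for illustration; a real probe would load pre-trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def association(word, set_a, set_b):
    """Mean similarity to attribute set A minus mean similarity to set B.
    A positive score means the word leans towards A."""
    sim_a = sum(cosine(word, v) for v in set_a) / len(set_a)
    sim_b = sum(cosine(word, v) for v in set_b) / len(set_b)
    return sim_a - sim_b

# Hypothetical toy embeddings, invented for this example.
emb = {
    "engineer": [0.9, 0.1, 0.2],
    "he":       [1.0, 0.0, 0.1],
    "she":      [0.1, 1.0, 0.1],
}
score = association(emb["engineer"], [emb["he"]], [emb["she"]])
# score > 0: in these toy vectors, "engineer" associates more with "he".
```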


2020
Vol 7 (1)
pp. 54-60
Author(s):
Falia Amalia
Moch Arif Bijaksana

Abstract — The Qur'an is a subject of linguistic research that has not yet been studied by many experts, and so has not gained a prominent place in the field. Yet the Qur'an contains a wealth of words that can be studied with natural language processing techniques such as text classification, document clustering and text summarization; among them are semantic similarity and the distributional semantic model. The purpose of this work is to build an evaluation dataset for the distributional semantic model in Bahasa Indonesia, covering two word classes, nouns and verbs, and to measure the similarity and relatedness of the 500 word pairs provided. We hope this helps the semantic study of the Qur'an grow, especially for its translation into the Indonesian language. The research also produces a dataset, like those of previously published studies, that future work on related topics can reuse. From the 6236 verses of the Qur'an, the system extracted 2193 nouns and 1733 verbs. These were processed with the Sim-Rel vector method and evaluated against a gold standard built from a questionnaire given to 15 respondents; performance measured with the Spearman rank correlation reached 0.909. Keywords — Natural Language Processing; Distributional Semantic Model; Sim-Rel Vector; Spearman Rank
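The Spearman rank correlation used to score the system against the gold standard is the Pearson correlation of the two rank sequences. A minimal sketch follows; the system scores and human ratings are hypothetical illustrations (the resulting 0.9 is a toy value, not the paper's 0.909).

```python
def rank(values):
    """Assign 1-based ranks in ascending order, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank sequences."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical system similarity scores vs. human (gold standard) ratings.
system = [0.9, 0.7, 0.4, 0.2, 0.1]
human  = [4.8, 4.1, 2.5, 0.7, 1.9]
rho = spearman(system, human)
print(rho)  # 0.9
```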


2019
Vol 19 (1)
Author(s):
Simon Geletta
Lendie Follett
Marcia Laugerman

Abstract

Background: This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns within research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research findings have reported that at least 10% of all studies funded by major research funding agencies terminate without yielding useful results. Since it is well known that scientific studies that receive funding from major agencies are carefully planned and rigorously vetted through the peer-review process, it was somewhat surprising to us that study terminations are this prevalent. Moreover, our review of the literature about study terminations suggested that the reasons for them are not well understood. We therefore aimed to address that knowledge gap by seeking to identify the factors that contribute to study failures.

Method: We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics) and unstructured data (the narrative descriptions of the studies). We applied NLP techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with terminated trials and with completed trials. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 “topics” with corresponding sets of probabilities, which we then used to predict study termination with random forest modeling. We fit two distinct models: one using only the structured data as predictors, and another using both the structured data and the 25 text topics derived from the unstructured data.

Results: In this paper, we demonstrate the interpretive and predictive value of LDA as it relates to predicting clinical trial failure. The results also demonstrate that the combined modeling approach yields robust predictive probabilities, in terms of both sensitivity and specificity, relative to a model that uses the structured data alone.

Conclusions: Our study demonstrated that topic modeling with LDA significantly raises the utility of unstructured data in predicting the completion vs. termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.
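The two-stage pipeline the Method describes — LDA topics derived from the narrative text, concatenated with structured covariates, fed to a random forest — can be sketched with scikit-learn. Everything below is a toy assumption: the four repeated records, the single structured covariate (enrollment size) and the reduced topic count stand in for the paper's 25 topics over ClinicalTrials.gov records.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for trial records: narrative text, one structured
# covariate (enrollment size), and a label where 1 = terminated.
descriptions = [
    "randomized trial of drug safety and efficacy in adults",
    "study terminated early due to poor enrollment and funding loss",
    "observational cohort study of long term outcomes",
    "trial stopped after interim analysis showed no benefit",
] * 5
structured = np.array([[120], [15], [300], [40]] * 5)
labels = np.array([0, 1, 0, 1] * 5)

# Stage 1: bag-of-words counts -> per-document topic probabilities.
counts = CountVectorizer().fit_transform(descriptions)
topics = LatentDirichletAllocation(n_components=3,  # paper uses 25
                                   random_state=0).fit_transform(counts)

# Stage 2: combine structured covariates with the topic probabilities
# and fit a random forest on the joint feature matrix.
features = np.hstack([structured, topics])
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, labels)
probs = clf.predict_proba(features)[:, 1]  # per-study risk of termination
```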


Author(s):  
Goran Klepac ◽  
Marko Velić

This chapter covers natural language processing techniques and their application in the development of predictive models. Two case studies are presented. The first describes a project in which textual descriptions of various situations in the call center of a telecommunications company were processed in order to predict churn. The second describes sentiment analysis of business news, along with the practical and testing issues that arise in text mining projects. The two case studies take different approaches and are implemented in different tools. The language of the texts processed in these projects is Croatian, which belongs to the Slavic group of languages and has more complex morphology and grammar rules than English. The chapter concludes with several points on possible future research in this domain.
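The first case study's core step — turning free-text call-center notes into features for a churn model — can be sketched as a bag-of-words classifier. Everything below is a toy assumption: the notes are English stand-ins for the Croatian originals (which, given the morphology the chapter mentions, would also need lemmatization or stemming), and TF-IDF with logistic regression stands in for whatever tools the project actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical call-center notes; label 1 = the customer later churned.
notes = [
    "customer angry about repeated billing errors threatens to cancel",
    "routine question about roaming settings resolved quickly",
    "complained about slow internet wants contract termination",
    "asked about family plan upgrade satisfied with offer",
] * 5
churned = [1, 0, 1, 0] * 5

# TF-IDF over unigrams and bigrams, then a linear classifier.
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(notes)
X = vectorizer.transform(notes)
model = LogisticRegression().fit(X, churned)
scores = model.predict_proba(X)[:, 1]  # churn probability per note
```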


2021
Vol 11 (15)
pp. 7160
Author(s):
Ramon Ruiz-Dolz
Montserrat Nofre
Mariona Taulé
Stella Heras
Ana García-Fornes

The application of the latest Natural Language Processing breakthroughs to computational argumentation has shown promising results, which have raised interest in this area of research. However, the available corpora with argumentative annotations are often limited to a very specific purpose or are not of adequate size to take advantage of state-of-the-art deep learning techniques (e.g., deep neural networks). In this paper, we present VivesDebate, a large, richly annotated and versatile professional debate corpus for computational argumentation research. The corpus has been created from 29 transcripts of a debate tournament in Catalan and has been machine-translated into Spanish and English. The annotation contains argumentative propositions, argumentative relations, debate interactions and professional evaluations of the arguments and argumentation. The presented corpus can be useful for research on a heterogeneous set of underlying computational argumentation tasks such as Argument Mining, Argument Analysis, Argument Evaluation or Argument Generation, among others. All this makes VivesDebate a valuable resource for computational argumentation research within the context of massive corpora aimed at Natural Language Processing tasks.


2021
Vol 6 (1)
Author(s):
Solomon Akinboro
Oluwadamilola Adebusoye
Akintoye Onamade

Offensive content refers to messages that are socially unacceptable, including vulgar or derogatory messages. As the use of social media increases worldwide, social media administrators face the challenge of keeping offensive content out, to ensure clean, non-abusive conversations on the platforms they provide. This work organizes and describes recent techniques for the automated detection of offensive language in social media content, providing a structured overview of previous approaches, including the algorithms, methods and main features used. Selection was from peer-reviewed articles on Google Scholar. Search terms included: profane words, natural language processing, multilingual context, hybrid methods for detecting profane words, and deep learning approaches for detecting profane words. Exclusions were made based on defined criteria. The initial search returned 203 studies, of which only 40 met the inclusion criteria: 6 were on natural language processing, 6 on deep learning approaches, 5 analysed hybrid approaches, 13 addressed multi-level or multilingual classification, and 10 covered other related methods. The limitations of previous efforts to detect offensive content are highlighted to aid future research in this area. Keywords — algorithm, offensive content, profane words, social media, texts

