Promises of text processing: natural language processing meets AI

2002 ◽  
Vol 7 (19) ◽  
pp. 992-993 ◽  
Author(s):  
Jeffrey T Chang ◽  
Russ B Altman


Author(s):  
Ayush Srivastav ◽  
Hera Khan ◽  
Amit Kumar Mishra

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models used over time for Natural Language Processing are discussed along with their applications to specific tasks. The chapter begins with the fundamental concepts of regex and tokenization. It provides insight into text preprocessing and its methodologies, such as stemming and lemmatization and stop-word removal, followed by part-of-speech tagging and Named Entity Recognition. Further, the chapter elaborates on the concept of word embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by neural networks and their advanced forms, such as Recursive Neural Networks and seq2seq models, that are used in computational linguistics. A brief description of chatbots and Memory Networks concludes the chapter.
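The preprocessing steps the chapter opens with (regex tokenization, stop-word removal, stemming) can be sketched in a few lines of plain Python. The regex, suffix list, and stop-word set below are illustrative assumptions, not the chapter's own rules; a production pipeline would use a real stemmer such as Porter's and a full stop-word list:

```python
import re

def tokenize(text):
    # Regex tokenization: pull out runs of letters, lowercased.
    return re.findall(r"[a-z]+", text.lower())

# Illustrative suffix list; a real stemmer (e.g. Porter) applies ordered rules.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(token):
    # Crude suffix stripping, always keeping at least three leading characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Tiny illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "a", "an", "is", "of"}

def preprocess(text):
    # Tokenize, drop stop words, then stem what remains.
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The tagger is tagging the tokens"))  # ['tagger', 'tagg', 'token']
```

Note how the naive stemmer conflates "tagging" to "tagg" rather than "tag"; lemmatization, which the chapter treats alongside stemming, instead maps tokens to dictionary forms.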


Author(s):  
Marina Sokolova ◽  
Stan Szpakowicz

This chapter presents applications of machine learning techniques to traditional problems in natural language processing, including part-of-speech tagging, entity recognition and word-sense disambiguation. People usually solve such problems without difficulty, or at least do a very good job. Linguistics may suggest labour-intensive ways of manually constructing rule-based systems. It is, however, the ready availability of large collections of texts that has made machine learning the method of choice for processing volumes of data well beyond human capacity. One of the main purposes of text processing is information extraction and knowledge extraction from such large collections. The machine learning methods discussed in this chapter have stimulated wide-ranging research in natural language processing and helped build applications with serious deployment potential.
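The contrast the chapter draws between hand-built rules and learning from large text collections can be illustrated with the simplest learned tagger: a most-frequent-tag baseline. The toy corpus below is invented for illustration; real training data would be a large annotated collection such as a treebank:

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs, invented for illustration only.
TAGGED = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
          ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]

def train_baseline(tagged):
    # Learn, for each word, the tag it carries most often in the training data.
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

model = train_baseline(TAGGED)
print(model["the"])  # DET
```

Despite its simplicity, a most-frequent-tag baseline learned from a large corpus is a standard starting point against which full statistical taggers are measured.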


2016 ◽  
Vol 25 (01) ◽  
pp. 224-233 ◽  
Author(s):  
N. Elhadad ◽  
D. Demner-Fushman

Summary Objectives: This paper reviews work over the past two years in Natural Language Processing (NLP) applied to clinical and consumer-generated texts. Methods: We included any application or methodological publication that leverages text to facilitate healthcare and address the health-related needs of consumers and populations. Results: Many important developments in clinical text processing, both foundational and task-oriented, were addressed in community-wide evaluations and discussed in corresponding special issues that are referenced in this review. These focused issues, and in-depth reviews of several other active research areas such as pharmacovigilance and summarization, allowed us to discuss in greater depth disease modeling and predictive analytics using clinical texts; text analysis in social media for healthcare quality assessment; trends towards online interventions based on rapid analysis of health-related posts; and consumer health question answering, among other issues. Conclusions: Our analysis shows that although clinical NLP continues to advance towards practical applications and more NLP methods are used in large-scale live health information applications, more needs to be done to make NLP use in clinical applications a routine, widespread reality. Progress in clinical NLP is mirrored by developments in social media text analysis: the research is moving from capturing trends to addressing individual health-related posts, thus showing potential to become a tool for precision medicine and a valuable addition to the standard healthcare quality evaluation tools.


2017 ◽  
Vol 26 (01) ◽  
pp. 214-227 ◽  
Author(s):  
G. Gonzalez-Hernandez ◽  
A. Sarker ◽  
K. O’Connor ◽  
G. Savova

Summary Background: Natural Language Processing (NLP) methods are increasingly being used to mine knowledge from unstructured health-related texts. Recent advances in noisy-text processing techniques are enabling researchers and medical domain experts to go beyond the information encapsulated in published texts (e.g., clinical trials and systematic reviews) and structured questionnaires, and to obtain perspectives from other unstructured sources such as Electronic Health Records (EHRs) and social media posts. Objectives: To review the recently published literature discussing the application of NLP techniques for mining health-related information from EHRs and social media posts. Methods: The literature review covered research published over the last five years, based on searches of PubMed, conference proceedings, and the ACM Digital Library, as well as relevant publications referenced in papers. We particularly focused on the techniques employed on EHR and social media data. Results: A set of 62 studies involving EHRs and 87 studies involving social media matched our criteria and were included in this paper. We present the purposes of these studies, outline the key NLP contributions, and discuss the general trends observed in the field, the current state of research, and important outstanding problems. Conclusions: Over recent years, there has been a continuing transition from lexical and rule-based systems to learning-based approaches, driven by the growth of annotated data sets and advances in data science. For EHRs, publicly available annotated data is still scarce, and this acts as an obstacle to research progress. By contrast, research on social media mining has seen rapid growth, particularly because the large amount of unlabeled data available via this resource compensates for the uncertainty inherent in the data.
Effective mechanisms for filtering out noise and for mapping social media expressions to standard medical concepts remain crucial open research problems. Shared tasks and other competitive challenges have been driving factors behind the implementation of open systems, and they are likely to play an important role in the development of future systems.
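The concept-mapping problem highlighted in the conclusions can be illustrated with a minimal lexicon-lookup sketch. The phrases and concept names below are invented examples; real normalization systems map to controlled vocabularies such as UMLS or MedDRA and must handle misspellings and creative phrasing that a literal lookup misses:

```python
# Hypothetical lexicon from colloquial phrases to standard concept names.
LEXICON = {
    "cant sleep": "insomnia",
    "threw up": "vomiting",
    "head is pounding": "headache",
}

def normalize(post):
    # Return the standard concepts whose colloquial forms occur in the post.
    text = post.lower()
    return sorted({concept for phrase, concept in LEXICON.items() if phrase in text})

print(normalize("Head is pounding and I threw up twice"))  # ['headache', 'vomiting']
```

The gap between this literal matching and robust coverage of noisy social media language is exactly the open problem the review points to.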


2019 ◽  
Vol 48 (3) ◽  
pp. 432-445 ◽  
Author(s):  
Laszlo Toth ◽  
Laszlo Vidacs

Software systems are developed to meet the expectations of customers, and these expectations are expressed in natural language. To design software that meets the needs of the customer and the stakeholders, intentions, feedback and reviews have to be understood accurately and without ambiguity. These textual inputs often contain inaccuracies and contradictions, and are seldom given in a well-structured form. Such issues frequently result in a program that does not satisfy the stakeholders' expectations. Non-functional requirements in particular are rarely emphasized by clients as much as would be justified. Identifying, classifying and reconciling requirements is one of the main duties of the system analyst, a task which, without proper tool support, can be very demanding and time-consuming. Tools that support text processing are expected to improve the accuracy of identifying and classifying requirements even in an unstructured set of inputs. System analysts can also use them in document archeology tasks where many documents, regulations, standards, etc. have to be processed. Methods elaborated in natural language processing and machine learning offer a solid basis; however, their usability, and the possibility of improving their performance using domain-specific knowledge from software engineering, have to be examined thoroughly. In this paper, we present the results of our work adapting natural language processing and machine learning methods to handle and transform the textual inputs of software development. The major contribution of our work is a comparison of the performance and applicability of state-of-the-art natural language processing and machine learning techniques in software engineering. Based on the results of our experiments, tools can be designed to support system analysts working on textual inputs.
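A minimal sketch of the requirement-classification task the paper addresses, assuming a hand-written list of quality-attribute cues; the paper's tools learn such distinctions from annotated requirements rather than relying on fixed keyword lists:

```python
# Keyword cues are assumptions for illustration; learned classifiers would
# replace this hand-written list with features induced from labeled examples.
NFR_CUES = {"performance", "secure", "security", "available", "usability",
            "response", "scalable"}

def classify_requirement(sentence):
    # Flag a requirement as non-functional if it mentions a quality attribute.
    words = set(sentence.lower().replace(".", "").split())
    return "non-functional" if words & NFR_CUES else "functional"

print(classify_requirement("The system shall respond within 2 seconds to ensure performance."))
```

A learned model generalizes where this lookup cannot, e.g. to requirements that describe latency or reliability without using any cue word verbatim.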


2020 ◽  
Vol 7 (6) ◽  
pp. 1121
Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

<p class="Abstrak">Madurese (Bahasa Madura) is a regional language that, besides being used on Madura Island, is also spoken in other areas such as Jember, Pasuruan, and Probolinggo. As a regional language, Madurese is increasingly being abandoned, especially among young people. Among the causes are a sense of prestige and the difficulty of learning Madurese, which has a variety of dialects and language levels. The declining use of Madurese could lead to its extinction as one of Indonesia's regional languages. Therefore, efforts are needed to maintain and preserve Madurese. One such effort is conducting research on Madurese in the field of <em>Natural Language Processing</em>, so that in the future Madurese can be learned through digital media. <em>Part-of-Speech</em> (POS) <em>tagging</em> is a foundation of <em>text processing</em> research, so a Madurese POS tagging application needs to be built for use in other <em>Natural Language Processing</em> research. In this study, POS tagging was built using the Brill Tagger algorithm with a <em>corpus</em> containing 10,535 Madurese words. POS tagging with the Brill Tagger can assign the appropriate word class to a word using lexical and contextual rules. The Brill Tagger is the algorithm with the best accuracy when applied to English, Indonesian, and several other languages. A series of experiments with several <em>threshold</em> values, without considering OOV <em>(Out Of Vocabulary)</em> words, showed an average accuracy of more than 80%, with a highest accuracy of 86.67%; testing that takes OOV into account reached an average accuracy of 67.74%. It can therefore be concluded that the Brill Tagger can be used for Madurese with a good level of accuracy.</p><p class="Abstrak"> </p><p class="Judul2"><strong><em>Abstract</em></strong></p><p class="Judul2"><em>Bahasa Madura is a regional language which is used not only on Madura Island but also in other areas such as several regions in Jember, Pasuruan, and Probolinggo. Today, Bahasa Madura has begun to be abandoned, especially among young people. One reason is a sense of pride; it is also quite difficult to learn Bahasa Madura because it has a variety of dialects and language levels. The reduced use of Bahasa Madura can lead to its extinction as one of the regional languages of Indonesia. Therefore, there needs to be an effort to maintain the Madurese language. One of them is conducting research on the Madurese language in the field of Natural Language Processing, so that in the future learning about Madurese can be done through digital media. Part-of-Speech (POS) tagging is the basis of text-processing research, so a Madura-language POS tagging application needs to be made for use in other Natural Language Processing research. This study uses the Brill Tagger with a corpus containing 10,535 words. POS tagging with the Brill Tagger algorithm can assign the appropriate word class to a word using lexical and contextual rules. The reason for using the Brill Tagger is that it is the algorithm with the best accuracy when implemented in English, Indonesian and several other languages. The experimental results with the Brill Tagger show that the average accuracy without OOV (Out Of Vocabulary) words is 86.6%, with a highest accuracy of 86.94%, while the average accuracy for OOV words reached 67.22%. So it can be concluded that the Brill Tagger algorithm can also be used for Bahasa Madura with a good degree of accuracy.</em></p>
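The two passes of the Brill Tagger described above, a lexical pass followed by contextual transformation rules, can be sketched as follows. The lexicon, tag set, and single rule are invented English examples, not entries from the paper's Madurese corpus:

```python
# Toy lexicon mapping words to their most likely tag; unknown words default to NOUN.
LEXICON = {"can": "MODAL", "the": "DET", "fish": "NOUN", "swim": "VERB"}

# Contextual rule: change `from_tag` to `to_tag` when the previous tag is `prev`.
RULES = [("MODAL", "NOUN", "DET")]  # in "the can", "can" is a noun

def brill_tag(words):
    # Lexical pass: assign each word its most likely tag.
    tags = [LEXICON.get(word, "NOUN") for word in words]
    # Contextual pass: apply the ordered transformation rules.
    for from_tag, to_tag, prev in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev:
                tags[i] = to_tag
    return list(zip(words, tags))

print(brill_tag(["the", "can"]))  # [('the', 'DET'), ('can', 'NOUN')]
```

In training, Brill's algorithm induces such rules automatically, repeatedly adding the rule that most reduces tagging errors until the improvement falls below a threshold, which is the threshold parameter varied in the experiments above.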


2020 ◽  
pp. 071-080
Author(s):  
O.P. Zhezherun ◽  
O.R. Smysh

The article focuses on developing a software solution for solving planimetry problems written in Ukrainian. We discuss the current state of, and available resources for, Ukrainian natural language processing. We present a comprehensive analysis of the different ways a problem can be described, which reveals regularities in the formulation and structure of the textual representation of problems. We also demonstrate the similarities in how a problem is stated not only in Ukrainian but also in the Belarusian, English, and Russian languages. The final result of the paper is a system that uses a morphosyntactic analyzer to process a problem's text and provide the answer to it. Ukrainian natural language processing is growing rapidly and showing impressive results, and the recently developed Gold-standard annotated corpus for the Ukrainian language opens up substantial possibilities. The created architecture is flexible: new geometric figures and their properties, as well as additional logic, can be added to the program. With a little reformatting, the developed system can be used with other natural languages, such as English, Belarusian or Russian, as the text-processing algorithm is universal thanks to the globally accepted conventions for presenting such mathematical problems. Therefore, further development of the system is possible.


2017 ◽  
Vol 23 (5) ◽  
pp. 709-731 ◽  
Author(s):  
ZUHAITZ BELOKI ◽  
XABIER ARTOLA ◽  
AITOR SOROA

Abstract Computational power needs have greatly increased in recent years, and this is also the case in the Natural Language Processing (NLP) area, where thousands of documents must be processed, i.e., linguistically analyzed, in a reasonable time frame. These computing needs have implied a radical change in the computing architectures and large-scale text-processing techniques used in NLP. In this paper, we present a scalable architecture for distributed language processing. The architecture uses Storm to combine diverse NLP modules into a processing chain, which carries out the linguistic analysis of documents. Scalability requires designing solutions that are able to run distributed programs in parallel across large machine clusters. Using the architecture presented here, it is possible to integrate a set of third-party NLP modules into a single processing chain which can be deployed onto a distributed environment, i.e., a cluster of machines, thus allowing the language-processing modules to run in parallel. No restrictions are placed a priori on the NLP modules apart from being able to consume and produce linguistic annotations following a given format. We show the feasibility of our approach by integrating two linguistic processing chains, for English and Spanish. Moreover, we provide several scripts that allow building from scratch a whole distributed architecture that can then easily be installed and deployed onto a cluster of machines. The scripts and the NLP modules used in the paper are publicly available and distributed under free licenses. In the paper, we also describe a series of experiments carried out in the context of the NewsReader project with the goal of testing how the system behaves in different scenarios.
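The module contract the abstract describes (consume and produce linguistic annotations in a given format) can be mirrored by simple function composition. This single-process Python sketch only illustrates the chaining contract; in the paper the modules run as distributed Storm components across a cluster:

```python
# Each "module" consumes a document dict and returns it with added annotations,
# mirroring the consume/produce contract placed on third-party NLP modules.
def tokenizer(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def toy_lemmatizer(doc):
    # Stand-in for a real lemmatizer; a module may read any earlier annotation.
    doc["lemmas"] = [t.lower().strip(".,") for t in doc["tokens"]]
    return doc

def run_chain(modules, doc):
    # Sequential composition shows only the interface; the Storm deployment
    # distributes these stages over a cluster so documents flow in parallel.
    for module in modules:
        doc = module(doc)
    return doc

result = run_chain([tokenizer, toy_lemmatizer], {"text": "Dogs bark."})
print(result["lemmas"])  # ['dogs', 'bark']
```

Because no module knows about any other, only about the annotation format, swapping in a different tagger or a chain for another language leaves the rest of the pipeline untouched, which is the property the architecture exploits.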


Author(s):  
Miss. Aliya Anam Shoukat Ali

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that allows machines to understand human language. Its goal is to build systems that can make sense of text and automatically perform tasks like translation, spell checking, or topic classification. NLP has recently gained much attention for representing and analysing human language computationally. Its applications have spread across various fields such as computational linguistics, email spam detection, information extraction, summarization, medicine, and question answering. The goal of Natural Language Processing is to design and build software systems that can analyze, understand, and generate the languages that humans use naturally, so that you may be able to address your computer as if you were addressing another person. As one of the oldest areas of research in machine learning, it is employed in major fields such as speech recognition and text processing within artificial intelligence. Natural language processing has brought major breakthroughs in the fields of computation and AI.

