Predicting Catastrophic Events Using Machine Learning Models for Natural Language Processing

In recent years, there has been widespread improvement in communication technologies. Social media applications like Twitter have made it much easier for people to send and receive information. A direct application of this can be seen in the cases of disaster prediction and crisis. With people being able to share their observations, they can help spread the message of caution. However, the identification of warnings and analyzing the seriousness of text is not an easy task. Natural language processing (NLP) is one way that can be used to analyze various tweets for the same. Over the years, various NLP models have been developed that are capable of providing high accuracy when it comes to data prediction. In the chapter, the authors will analyze various NLP models like logistic regression, naive bayes, XGBoost, LSTM, and word embedding technologies like GloVe and transformer encoder like BERT for the purpose of predicting disaster warnings from the scrapped tweets. The authors focus on finding the best disaster prediction model that can help in warning people and the government.

Download Full-text

Triage and diagnosis of COVID-19 from medical social media (Preprint)

10.2196/preprints.30397 ◽

2021 ◽

Author(s):

Abul Hasan ◽

Mark Levene ◽

David Weston ◽

Renate Fromson ◽

Nicolas Koslover ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Learning Models ◽

Rule Based ◽

Additional Information ◽

Processing Pipeline ◽

Machine Learning Models

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources, in order to assist decision makers. Social media is important in this respect, however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report that Macro- and Micro-averaged F_{1\ }scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.

Download Full-text

Government plans in the 2016 and 2021 Peruvian presidential elections: A natural language processing analysis of the health chapters

Wellcome Open Research ◽

10.12688/wellcomeopenres.16867.3 ◽

2021 ◽

Vol 6 ◽

pp. 177

Author(s):

Rodrigo M. Carrillo-Larco ◽

Manuel Castillo-Cara ◽

Jesús Lovón-Melgarejo

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Universal Health Coverage ◽

Presidential Elections ◽

Clinical Medicine ◽

Health Coverage ◽

Keywords And Phrases ◽

The Government ◽

Universal Health

Background: While clinical medicine has exploded, electronic health records for Natural Language Processing (NLP) analyses, public health, and health policy research have not yet adopted these algorithms. We aimed to dissect the health chapters of the government plans of the 2016 and 2021 Peruvian presidential elections, and to compare different NLP algorithms. Methods: From the government plans (18 in 2016; 19 in 2021) we extracted each sentence from the health chapters. We used five NLP algorithms to extract keywords and phrases from each plan: Term Frequency–Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), TextRank, Keywords Bidirectional Encoder Representations from Transformers (KeyBERT), and Rapid Automatic Keywords Extraction (Rake). Results: In 2016 we analysed 630 sentences, whereas in 2021 there were 1,685 sentences. The TF-IDF algorithm showed that in 2016, 22 terms appeared with a frequency of 0.05 or greater, while in 2021 27 terms met this criterion. The LDA algorithm defined two groups. The first included terms related to things the population would receive (e.g., ’insurance’), while the second included terms about the health system (e.g., ’capacity’). In 2021, most of the government plans belonged to the second group. The TextRank analysis provided keywords showing that ’universal health coverage’ appeared frequently in 2016, while in 2021 keywords about the COVID-19 pandemic were often found. The KeyBERT algorithm provided keywords based on the context of the text. These keywords identified some underlying characteristics of the political party (e.g., political spectrum such as left-wing). The Rake algorithm delivered phrases, in which we found ’universal health coverage’ in 2016 and 2021. Conclusion: The NLP analysis could be used to inform on the underlying priorities in each government plan. NLP analysis could also be included in research of health policies and politics during general elections and provide informative summaries for the general population.

Download Full-text

Application of Natural Language Processing (NLP) Techniques in E–Governance

E-Government Development and Diffusion ◽

10.4018/978-1-60566-713-3.ch008 ◽

2009 ◽

pp. 122-132 ◽

Cited By ~ 2

Author(s):

Siddhartha Ghosh

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Information And Communication Technologies ◽

Citizen Participation ◽

Language Processing ◽

Good Governance ◽

Communication Technologies ◽

The Public ◽

Information And Communication ◽

The Masses

E-governance is the public sector’s use of information and communication technologies (ICT) with the aim of improving information and service delivery, encouraging citizen participation in the decision-making process and making government more accountable, transparent, and effective. Effective and efficient e-governments deploy information and communication technology systems to deliver services through multiple channels that are accessible, fast, secure, reliable, seamless, and coherent. To implement better government-to-government (G2G), government-to-business (G2B), government-to-enterprise (G2E) and government-to-citizen (G2C) services a good governance should not only utilize ICT, it has to be also serious about implementing natural language processing (NLP) Techniques to reach up to the masses and make e-governance successful one. This chapter shows the need of applying NLP technologies in the field of e-governance and also tries to focus on the issues, which can be resolved very easily with the help of these modern technologies. It also shows the advantages of applying NLP in e-governance.

Download Full-text

Estimating Nonfatal Gunshot Injury Locations With Natural Language Processing and Machine Learning Models

JAMA Network Open ◽

10.1001/jamanetworkopen.2020.20664 ◽

2020 ◽

Vol 3 (10) ◽

pp. e2020664

Author(s):

Susan T. Parker

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gunshot Injury ◽

Learning Models ◽

Machine Learning Models

Download Full-text

Klasifikasi Laporan Keluhan Pelayanan Publik Berdasarkan Instansi Menggunakan Metode LDA-SVM

Jurnal Teknologi Informasi dan Ilmu Komputer ◽

10.25126/jtiik.2021863768 ◽

2021 ◽

Vol 8 (6) ◽

pp. 1265

Author(s):

Muhammad Alkaff ◽

Andreyan Rizky Baskara ◽

Irham Maulani

Keyword(s):

Support Vector Machine ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Latent Dirichlet Allocation ◽

Confusion Matrix ◽

Support Vector ◽

Topic Distribution ◽

The Government ◽

Dirichlet Allocation

Sebuah sistem layanan untuk menyampaikan aspirasi dan keluhan masyarakat terhadap layanan pemerintah Indonesia, bernama Lapor! Pemerintah sudah lama memanfaatkan sistem tersebut untuk menjawab permasalahan masyarakat Indonesia terkait permasalahan birokrasi. Namun, peningkatan volume laporan dan pemilahan laporan yang dilakukan oleh operator dengan membaca setiap keluhan yang masuk melalui sistem menyebabkan sering terjadi kesalahan dimana operator meneruskan laporan tersebut ke instansi yang salah. Oleh karena itu, diperlukan suatu solusi yang dapat menentukan konteks laporan secara otomatis dengan menggunakan teknik Natural Language Processing. Penelitian ini bertujuan untuk membangun klasifikasi laporan secara otomatis berdasarkan topik laporan yang ditujukan kepada instansi yang berwenang dengan menggabungkan metode Latent Dirichlet Allocation (LDA) dan Support Vector Machine (SVM). Proses pemodelan topik untuk setiap laporan dilakukan dengan menggunakan metode LDA. Metode ini mengekstrak laporan untuk menemukan pola tertentu dalam dokumen yang akan menghasilkan keluaran dalam nilai distribusi topik. Selanjutnya, proses klasifikasi untuk menentukan laporan agensi tujuan dilakukan dengan menggunakan SVM berdasarkan nilai topik yang diekstraksi dengan metode LDA. Performa model LDA-SVM diukur dengan menggunakan confusion matrix dengan menghitung nilai akurasi, presisi, recall, dan F1 Score. Hasil pengujian menggunakan teknik split train-test dengan skor 70:30 menunjukkan bahwa model menghasilkan kinerja yang baik dengan akurasi 79,85%, presisi 79,98%, recall 72,37%, dan Skor F1 74,67%. AbstractA service system to convey aspirations and complaints from the public against Indonesia's government services, named Lapor! The Government has used the Government for a long time to answer the problems of the Indonesian people related to bureaucratic problems. However, the increasing volume of reports and the sorting of reports carried out by operators by reading every complaint that comes through the system cause frequent errors where operators forward the reports to the wrong agencies. Therefore, we need a solution that can automatically determine the report's context using Natural Language Processing techniques. This study aims to build automatic report classifications based on report topics addressed to authorized agencies by combining Latent Dirichlet Allocation (LDA) and Support Vector Machine (SVM). The topic-modeling process for each report was carried out using the LDA method. This method extracts reports to find specific patterns in documents that will produce output in topic distribution values. Furthermore, the classification process to determine the report's destination agency carried out using the SVM based on the value of the topics extracted by the LDA method. The LDA-SVM model's performance is measured using a confusion matrix by calculating the value of accuracy, precision, recall, and F1 Score. The test results using the train-test split technique with a 70:30 show that the model produces good performance with 79.85% accuracy, 79.98% precision, 72.37% recall, and 74.67% F1 Score

Download Full-text

A SYSTEMATIC READING IN STATISTICAL TRANSLATION: FROM THE STATISTICAL MACHINE TRANSLATION TO THE NEURAL TRANSLATION MODELS.

Journal of Information and Communication Technology ◽

10.32890/jict2017.16.2.8239 ◽

2017 ◽

Author(s):

Zakaria El Maazouzi ◽

Badr Eddine EL Mohajir ◽

Mohammed Al Achhab

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Machine Translation ◽

Language Processing ◽

State Of The Art ◽

Statistical Machine Translation ◽

High Accuracy ◽

Neural Machine Translation ◽

Translation Quality ◽

Automatic Translation

Achieving high accuracy in automatic translation tasks has been one of the challenging goals for researchers in the area of machine translation since decades. Thus, the eagerness of exploring new possible ways to improve machine translation was always the matter for researchers in the field. Automatic translation as a key application in the natural language processing domain has developed many approaches, namely statistical machine translation and recently neural machine translation that improved largely the translation quality especially for Latin languages. They have even made it possible for the translation of some language pairs to approach human translation quality. In this paper, we present a survey of the state of the art of statistical translation, where we describe the different existing methodologies, and we overview the recent research studies while pointing out the main strengths and limitations of the different approaches.

Download Full-text

Standardization of Featureless Variables for Machine Learning Models Using Natural Language Processing

Lecture Notes in Computer Science - Computational Science – ICCS 2018 ◽

10.1007/978-3-319-93701-4_18 ◽

2018 ◽

pp. 234-246

Author(s):

Kourosh Modarresi ◽

Abdurrahman Munir

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Learning Models ◽

Machine Learning Models

Download Full-text

Government plans in the 2016 and 2021 Peruvian presidential elections: A natural language processing analysis of the health chapters

Wellcome Open Research ◽

10.12688/wellcomeopenres.16867.1 ◽

2021 ◽

Vol 6 ◽

pp. 177

Author(s):

Rodrigo M. Carrillo-Larco ◽

Manuel Castillo-Cara ◽

Jesús Lovón-Melgarejo

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Universal Health Coverage ◽

Presidential Elections ◽

Clinical Medicine ◽

Health Coverage ◽

Keywords And Phrases ◽

The Government ◽

Universal Health

Background: While clinical medicine has exploded, electronic health records for Natural Language Processing (NLP) analyses, public health, and health policy research have not yet adopted these algorithms. We aimed to dissect the health chapters of the government plans of the 2016 and 2021 Peruvian presidential elections, and to compare different NLP algorithms. Methods: From the government plans (18 in 2016; 19 in 2021) we extracted each sentence from the health chapters. We used five NLP algorithms to extract keywords and phrases from each plan: Term Frequency–Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), TextRank, Keywords Bidirectional Encoder Representations from Transformers (KeyBERT), and Rapid Automatic Keywords Extraction (Rake). Results: In 2016 we analysed 630 sentences, whereas in 2021 there were 1,685 sentences. The TF-IDF algorithm showed that in 2016, nine terms appeared with a frequency of 0.10 or greater, while in 2021 43 terms met this criterion. The LDA algorithm defined two groups. The first included terms related to things the population would receive (e.g., ’insurance’), while the second included terms about the health system (e.g., ’capacity’). In 2021, most of the government plans belonged to the second group. The TextRank analysis provided keywords showing that ’universal health coverage’ appeared frequently in 2016, while in 2021 keywords about the COVID-19 pandemic were often found. The KeyBERT algorithm provided keywords based on the context of the text. These keywords identified some underlying characteristics of the political party (e.g., political spectrum such as left-wing). The Rake algorithm delivered phrases, in which we found ’universal health coverage’ in 2016 and 2021. Conclusion: The NLP analysis could be used to inform on the underlying priorities in each government plan. NLP analysis could also be included in research of health policies and politics during general elections and provide informative summaries for the general population.

Download Full-text

Citation Classification Using Natural Language Processing and Machine Learning Models

Advances in Smart Technologies Applications and Case Studies - Lecture Notes in Electrical Engineering ◽

10.1007/978-3-030-53187-4_39 ◽

2020 ◽

pp. 357-365

Author(s):

Syyab Rahi ◽

Iqra Safder ◽

Sehrish Iqbal ◽

Saeed-Ul Hassan ◽

Iain Reid ◽

...

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Learning Models ◽

Machine Learning Models

Download Full-text

Natural language processing for abstraction of cancer treatment toxicities: accuracy versus human experts

JAMIA Open ◽

10.1093/jamiaopen/ooaa064 ◽

2020 ◽

Author(s):

Julian C Hong ◽

Andrew T Fairchild ◽

Jarred P Tanksley ◽

Manisha Palta ◽

Jessica D Tenenbaum

Keyword(s):

Radiation Therapy ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Analysis ◽

High Accuracy ◽

Radiation Dermatitis ◽

Clinical Text ◽

Clinical Notes ◽

Oncology Research

Abstract Objectives Expert abstraction of acute toxicities is critical in oncology research but is labor-intensive and variable. We assessed the accuracy of a natural language processing (NLP) pipeline to extract symptoms from clinical notes compared to physicians. Materials and Methods Two independent reviewers identified present and negated National Cancer Institute Common Terminology Criteria for Adverse Events (CTCAE) v5.0 symptoms from 100 randomly selected notes for on-treatment visits during radiation therapy with adjudication by a third reviewer. A NLP pipeline based on Apache clinical Text Analysis Knowledge Extraction System was developed and used to extract CTCAE terms. Accuracy was assessed by precision, recall, and F1. Results The NLP pipeline demonstrated high accuracy for common physician-abstracted symptoms, such as radiation dermatitis (F1 0.88), fatigue (0.85), and nausea (0.88). NLP had poor sensitivity for negated symptoms. Conclusion NLP accurately detects a subset of documented present CTCAE symptoms, though is limited for negated symptoms. It may facilitate strategies to more consistently identify toxicities during cancer therapy.

Download Full-text