Mapping the plague through natural language processing

Author(s):  
Fabienne Krauer ◽  
Boris V. Schmid

Abstract Plague has caused three major pandemics with millions of casualties over the past centuries. There is a substantial amount of historical and modern primary and secondary literature about the spatial and temporal extent of epidemics, the circumstances of transmission, and symptoms and treatments. Many quantitative analyses rely on structured data, but the extraction of specific information such as the time and place of outbreaks is a tedious process. Machine learning algorithms for natural language processing (NLP) can potentially facilitate the establishment of such datasets, but their use in plague research has not yet been explored much. We investigated the performance of five pre-trained NLP libraries (Google NLP, Stanford CoreNLP, spaCy, germaNER and Geoparser.io) for the extraction of location data from a German plague treatise published in 1908, compared to the gold standard of manual annotation. Of all tested algorithms, Stanford CoreNLP had the best overall performance, but spaCy showed the highest sensitivity. Moreover, we demonstrate how word associations can be extracted and displayed with simple text mining techniques in order to gain a quick insight into salient topics. Finally, we compared our newly digitised plague dataset to a re-digitised version of the famous Biraben plague list and update the spatio-temporal extent of second pandemic plague mentions. We conclude that all NLP tools have their limitations, but they are potentially useful for accelerating the collection of data and the generation of a global plague outbreak database.
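To illustrate the kind of extraction being benchmarked, here is a minimal sketch of pulling location mentions from German text with spaCy. It is not the paper's actual pipeline or evaluation; the model name and example sentence are assumptions.

```python
# Minimal sketch: extracting location mentions from German text with spaCy.
# Assumes `pip install spacy` and `python -m spacy download de_core_news_sm`.
import spacy

nlp = spacy.load("de_core_news_sm")

text = (
    "Im Jahre 1348 wütete die Pest in Florenz und erreichte "
    "bald darauf auch Wien und Prag."
)
doc = nlp(text)

# German spaCy models tag place names as LOC (other models may use GPE).
locations = [ent.text for ent in doc.ents if ent.label_ == "LOC"]
print(locations)  # e.g. ['Florenz', 'Wien', 'Prag']
```

Comparing such automatically extracted locations against a manually annotated gold standard yields the sensitivity and precision figures reported in the abstract.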

2019 ◽  
Vol 8 (4) ◽  
pp. 10289-10293

Sentiment analysis is a tool for determining the polarity or emotion of a sentence. It is a field of natural language processing which focuses on the study of opinions. In this study, the researchers addressed one key challenge in sentiment analysis: taking into account the ending punctuation marks present in a sentence. Ending punctuation marks play a significant role in emotion recognition and intensity level recognition. The research made use of tweets expressing opinions about Philippine President Rodrigo Duterte. These downloaded tweets served as the inputs. They were first subjected to a pre-processing stage to prepare the sentences for processing. A language model was created to serve as the classifier for determining the scores of the tweets; the scores give the polarity of each sentence. Accuracy is very important in sentiment analysis. To increase the chance of correctly identifying the polarity of the tweets, the input underwent intensity level recognition, which determines the intensifiers and negations within the sentences. The system was evaluated with an overall performance of 80.27%.
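The following toy sketch shows how ending punctuation, intensifiers and negations might feed into a lexicon-based polarity score, loosely mirroring the pipeline described above. The lexicon, weights and multipliers are illustrative assumptions, not the study's actual language model.

```python
# Toy lexicon-based polarity scoring with negation handling and
# ending-punctuation intensity. All word lists and weights are illustrative.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "no", "never"}
INTENSIFIERS = {"very": 1.5, "really": 1.5}

def score(tweet: str) -> float:
    words = tweet.lower().rstrip("!?.").split()
    total, weight, negate = 0.0, 1.0, False
    for w in words:
        if w in NEGATIONS:
            negate = True
        elif w in INTENSIFIERS:
            weight *= INTENSIFIERS[w]
        elif w in LEXICON:
            value = LEXICON[w] * weight
            total += -value if negate else value
            weight, negate = 1.0, False
    # An ending exclamation mark raises the intensity level of the sentence.
    if tweet.rstrip().endswith("!"):
        total *= 1.5
    return total

print(score("The new policy is really good!"))  # positive, intensified
print(score("This is not good."))               # negation flips polarity
```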


Author(s):  
Rashida Ali ◽  
Ibrahim Rampurawala ◽  
Mayuri Wandhe ◽  
Ruchika Shrikhande ◽  
Arpita Bhatkar

The Internet provides a medium to connect individuals of similar or different interests, creating a hub. Since a huge number of users participate on these platforms, a user can receive a high volume of messages from different individuals, creating chaos and unwanted messages. These messages sometimes contain true information and sometimes false, which creates confusion in the minds of users and is the first step towards spam messaging. A spam message is an irrelevant and unsolicited message sent by a known or unknown user, which may lead to a sense of insecurity among users. In this paper, different machine learning algorithms were trained and tested with natural language processing (NLP) to classify whether messages are spam or ham.
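A minimal sketch of one such spam/ham classifier follows: TF-IDF features with multinomial naive Bayes via scikit-learn. The tiny inline dataset is illustrative only, and the paper's specific algorithms and corpus are not reproduced.

```python
# Sketch: spam/ham text classification with TF-IDF + multinomial Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "WIN a FREE prize now, click this link!",
    "Are we still meeting for lunch today?",
    "Congratulations, you have been selected for a cash reward",
    "Can you send me the report by tomorrow?",
]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),  # NLP features
    MultinomialNB(),                                        # classifier
)
model.fit(messages, labels)

print(model.predict(["Free reward, click now"]))  # likely ['spam']
```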


IoT ◽  
2020 ◽  
Vol 1 (2) ◽  
pp. 218-239 ◽  
Author(s):  
Ravikumar Patel ◽  
Kalpdrum Passi

In the derived approach, an analysis is performed on Twitter data for World Cup soccer 2014, held in Brazil, to detect the sentiment of people throughout the world using machine learning techniques. By filtering and analyzing the data using natural language processing techniques, sentiment polarity was calculated based on the emotion words detected in the user tweets. The dataset was normalized for use by machine learning algorithms and prepared using natural language processing techniques like word tokenization, stemming and lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), and parsing to extract emotions from the textual data of each tweet. This approach is implemented using the Python programming language and the Natural Language Toolkit (NLTK). A derived algorithm extracts emotional words using WordNet with the POS of each word in a sentence that has a meaning in the current context, and assigns sentiment polarity using the SentiWordNet dictionary or a lexicon-based method. The resulting polarity assignments are further analyzed using naïve Bayes, support vector machine (SVM), K-nearest neighbor (KNN), and random forest machine learning algorithms and visualized on the Weka platform. Naïve Bayes gives the best accuracy of 88.17%, whereas random forest gives the best area under the receiver operating characteristic curve (AUC) of 0.97.
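The SentiWordNet lookup step described above can be prototyped with NLTK as follows. This is a simplified sketch (it takes each word's most common sense rather than disambiguating by context, as the derived algorithm does).

```python
# Sketch: POS-aware SentiWordNet polarity lookup with NLTK.
# Requires nltk.download() of 'punkt', 'averaged_perceptron_tagger',
# 'wordnet' and 'sentiwordnet'.
import nltk
from nltk.corpus import sentiwordnet as swn

def wn_pos(tag):
    """Map Penn Treebank tags to the POS codes used by SentiWordNet."""
    if tag.startswith("J"): return "a"   # adjective
    if tag.startswith("V"): return "v"   # verb
    if tag.startswith("R"): return "r"   # adverb
    return "n"                           # default: noun

def tweet_polarity(tweet):
    score = 0.0
    for word, tag in nltk.pos_tag(nltk.word_tokenize(tweet)):
        synsets = list(swn.senti_synsets(word, wn_pos(tag)))
        if synsets:  # simplification: take the first (most common) sense
            score += synsets[0].pos_score() - synsets[0].neg_score()
    return score

print(tweet_polarity("Brazil played a brilliant and exciting match"))
```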


Author(s):  
Anurag Langan

Grading student answers is a tedious and time-consuming task. One study found that, on average, around 25% of a teacher's time is spent scoring students' answer sheets. This time could be put to much better use if computer technology could score the answers instead. This system aims to grade student answers using the various natural language processing techniques and machine learning algorithms available today.
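One plausible baseline for such a system (not necessarily the authors' method) is to score a student answer by its TF-IDF cosine similarity to a model answer, scaled to the maximum mark:

```python
# Baseline sketch: grade an answer by TF-IDF cosine similarity to a model
# answer. A real grader would add semantic and coverage features on top.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def grade(student_answer: str, model_answer: str, max_marks: float) -> float:
    tfidf = TfidfVectorizer().fit_transform([model_answer, student_answer])
    similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return round(similarity * max_marks, 1)

model_answer = "Photosynthesis converts light energy into chemical energy in plants."
print(grade("Plants convert light into chemical energy.", model_answer, 5.0))
```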


Essay writing examinations are a commonly used learning activity at all levels of education and across disciplines. They are advantageous for evaluating students' learning outcomes because they give students the chance to exhibit their knowledge and skills freely. For these reasons, many researchers have turned their interest to automated essay scoring (AES), one of the most remarkable innovations in text mining using natural language processing and machine learning algorithms. The purpose of this study is to develop an automated essay scoring system that uses ontology and natural language processing. Different learning algorithms showed agreeing prediction outcomes, but a regression algorithm incorporating the proper features may produce more accurate essay scores. This study aims to increase the accuracy, reliability and validity of AES by implementing gradient boosting regression with a domain ontology and other features. Linear regression, linear lasso regression and ridge regression were also used in conjunction with the different features that were extracted. The features extracted are the domain concepts, average word length, orthography (spelling mistakes), grammar and sentiment score. The first dataset, the ASAP dataset from the Kaggle website, was used to train and test the different machine learning algorithms (linear regression, linear lasso regression, ridge regression and gradient boosting regression) together with the identified features. The second dataset was extracted from students' essay exams in a Human Computer Interaction course. The results show that gradient boosting regression has the highest variance and kappa scores, although the linear, ridge and lasso regressions performed similarly on the ASAP dataset. Furthermore, the results were evaluated using the Cohen Weighted Kappa (CWA) score to compare agreement with the human raters. The CWA result of 0.659 can be interpreted as a strong level of agreement between the human grader and the automated essay score. Therefore, the proposed AES has a 64-81% reliability level.
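A hedged sketch of the regression-and-kappa workflow follows: hand-crafted essay features fed to gradient boosting regression, with predicted scores evaluated by a quadratic-weighted kappa. The feature extraction here is heavily simplified (no ontology, spelling or grammar checking), and the two-essay dataset is a placeholder for ASAP.

```python
# Sketch: essay features -> gradient boosting regression -> weighted kappa.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import cohen_kappa_score

def essay_features(text: str):
    words = text.split()
    avg_word_len = np.mean([len(w) for w in words]) if words else 0.0
    n_sentences = max(text.count("."), 1)
    return [len(words), avg_word_len, n_sentences]

# Placeholder (essay, human score) pairs; real training data would be ASAP.
essays = ["Short essay.",
          "A longer, more developed essay with several supporting ideas. " * 3]
scores = [1, 4]

X = np.array([essay_features(e) for e in essays])
model = GradientBoostingRegressor().fit(X, scores)

pred = np.rint(model.predict(X)).astype(int)  # round to integer marks
print(cohen_kappa_score(scores, pred, weights="quadratic"))
```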


JAMIA Open ◽  
2019 ◽  
Vol 2 (1) ◽  
pp. 139-149 ◽  
Author(s):  
Meijian Guan ◽  
Samuel Cho ◽  
Robin Petro ◽  
Wei Zhang ◽  
Boris Pasche ◽  
...  

Abstract Objectives Natural language processing (NLP) and machine learning approaches were used to build classifiers to identify genomic-related treatment changes in the free-text visit progress notes of cancer patients. Methods We obtained 5889 deidentified progress reports (2439 words on average) for 755 cancer patients who had undergone clinical next-generation sequencing (NGS) testing at Wake Forest Baptist Comprehensive Cancer Center for our data analyses. An NLP system was implemented to process the free-text data and extract NGS-related information. Three types of recurrent neural networks (RNNs), namely the gated recurrent unit (GRU), long short-term memory (LSTM), and bidirectional LSTM (LSTM_Bi), were applied to classify documents into the treatment-change and no-treatment-change groups. Further, we compared the performance of the RNNs to five machine learning algorithms: naive Bayes, K-nearest neighbor, support vector machine for classification, random forest, and logistic regression. Results Our results suggested that, overall, the RNNs outperformed the traditional machine learning algorithms, and LSTM_Bi showed the best performance among the RNNs in terms of accuracy, precision, recall, and F1 score. In addition, pretrained word embeddings can improve the accuracy of LSTM by 3.4% and reduce the training time by more than 60%. Discussion and Conclusion NLP and RNN-based text mining solutions have demonstrated advantages in information retrieval and document classification tasks for unstructured clinical progress notes.
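For orientation, here is a minimal Keras sketch of a bidirectional LSTM document classifier of the kind compared above. Vocabulary size, sequence length and layer widths are placeholders, not the paper's configuration.

```python
# Sketch: bidirectional LSTM for binary document classification
# (treatment change vs. no treatment change). Hyperparameters are placeholders.
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 20000, 500, 100

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,), dtype="int32"),    # token-id sequences
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # could load pretrained vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # P(treatment change)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
model.summary()
```

Initializing the Embedding layer with pretrained word vectors is the step the abstract credits with a 3.4% accuracy gain and a large reduction in training time.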


2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Krishnadas Nanath ◽  
Supriya Kaitheri ◽  
Sonia Malik ◽  
Shahid Mustafa

Purpose The purpose of this paper is to examine the factors that significantly affect the prediction of fake news from the virality theory perspective. The paper looks at a mix of emotion-driven content, sentimental resonance, topic modeling and linguistic features of news articles to predict the probability of fake news. Design/methodology/approach A data set of over 12,000 articles was chosen to develop a model for fake news detection. Machine learning algorithms and natural language processing techniques were used to handle big data with efficiency. Lexicon-based emotion analysis provided eight kinds of emotions used in the article text. The cluster of topics was extracted using topic modeling (five topics), while sentiment analysis provided the resonance between the title and the text. Linguistic features were added to the coding outcomes to develop a logistic regression predictive model for testing the significant variables. Other machine learning algorithms were also executed and compared. Findings The results revealed that positive emotions in a text lower the probability of news being fake. It was also found that sensational content like illegal activities and crime-related content was associated with fake news. A news title and text exhibiting similar sentiments were found to have a lower chance of being fake. News titles with more words and content with fewer words were found to impact fake news detection significantly. Practical implications Several systems and social media platforms today are trying to implement fake news detection methods to filter the content. This research provides exciting parameters from a virality theory perspective that could help develop automated fake news detectors. Originality/value While several studies have explored fake news detection, this study uses a new perspective on virality theory. It also introduces new parameters like sentimental resonance that could help predict fake news. This study deals with an extensive data set and uses advanced natural language processing to automate the coding techniques in developing the prediction model.
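The sentimental-resonance idea can be sketched as follows, using NLTK's VADER as a stand-in for the study's lexicon-based analysis. The feature set is a small subset of the paper's (emotion and topic features are omitted), and the two inline articles are illustrative only.

```python
# Sketch: title/text sentiment resonance + length features -> logistic regression.
# Requires: nltk.download('vader_lexicon'). VADER is an assumption here, a
# stand-in for the study's own lexicon-based analysis.
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.linear_model import LogisticRegression

sia = SentimentIntensityAnalyzer()

def features(title: str, text: str):
    t, b = sia.polarity_scores(title), sia.polarity_scores(text)
    resonance = 1.0 - abs(t["compound"] - b["compound"])  # similar sentiment -> high
    return [resonance, b["pos"], len(title.split()), len(text.split())]

# Illustrative (title, body, label) triples with 1 = fake, 0 = real.
articles = [
    ("Shocking crime wave hits city", "Illegal activities are everywhere...", 1),
    ("Council approves new park budget", "The city council voted to fund...", 0),
]
X = np.array([features(t, b) for t, b, _ in articles])
y = [label for _, _, label in articles]

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```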


2021 ◽  
Author(s):  
Nathan Ji ◽  
Yu Sun

The digital age gives us access to a multitude of information and of mediums through which to interpret it. Much of the time, people find interpreting such information difficult because the medium may not be as user friendly as it could be. This project examined how one can identify specific information in a given text based on a question, with the aim of streamlining one's ability to determine the relevance of a given text to one's objective. The project achieved an overall 80% success rate across 10 articles, with three questions asked per article. This success rate indicates that the project is likely applicable to those asking content-level questions about an article.
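One way to prototype this kind of question-driven text lookup (not necessarily the authors' system) is with an off-the-shelf extractive question-answering model from Hugging Face transformers:

```python
# Sketch: extractive QA over an article with a default SQuAD-tuned model.
# Requires: pip install transformers
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default QA model

article = "The Eiffel Tower was completed in 1889 and stands 330 metres tall."
result = qa(question="When was the Eiffel Tower completed?", context=article)
print(result["answer"], f"(score: {result['score']:.2f})")  # e.g. 1889
```

The answer's confidence score offers a rough proxy for how relevant the article is to the question, which matches the project's stated goal.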

