Linguistic typology in natural language processing

2016 ◽  
Vol 20 (3) ◽  
Author(s):  
Emily M. Bender

Abstract This paper explores the ways in which the field of natural language processing (NLP) can and does benefit from work in linguistic typology. I describe the recent increase in interest in multilingual natural language processing and give a high-level overview of the field. I then turn to a discussion of how linguistic knowledge in general is incorporated in NLP technology before describing how typological results in particular are used. I consider both rule-based and machine learning approaches to NLP and review literature on predicting typological features as well as that which leverages such features.

2021 ◽  
Author(s):  
Abul Hasan ◽  
Mark Levene ◽  
David Weston ◽  
Renate Fromson ◽  
Nicolas Koslover ◽  
...  

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, natural language processing methods are needed to make sense of the textual information it provides and to automate the processing of large amounts of data. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human-labelled data.
Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
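
To make the reported metrics concrete, the following minimal sketch computes macro- and micro-averaged F1 for a three-class triage task. The class names and the toy gold/predicted labels are hypothetical, invented for illustration; they are not taken from the study, and the function is not the authors' evaluation code.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 values (every class counts equally).
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical triage labels, invented for illustration only.
gold = ["stay_home", "stay_home", "see_gp", "see_gp", "emergency", "stay_home"]
pred = ["stay_home", "see_gp", "see_gp", "see_gp", "emergency", "stay_home"]
macro_f1, micro_f1 = f1_scores(gold, pred)
print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

Macro averaging weights rare classes as heavily as frequent ones, which is one reason the two figures can differ noticeably on imbalanced data.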


2020 ◽  
pp. 1-10
Author(s):  
Roser Morante ◽  
Eduardo Blanco

Abstract Negation is a complex linguistic phenomenon present in all human languages. It can be seen as an operator that transforms an expression into another expression whose meaning is in some way opposed to the original expression. In this article, we survey previous work on negation with an emphasis on computational approaches. We start by defining negation and two important concepts: scope and focus of negation. Then, we survey work in natural language processing that considers negation primarily as a means to improve the results in some task. We also provide information about corpora containing negation annotations in English and other languages, which usually include a combination of annotations of negation cues, scopes, foci, and negated events. We continue the survey with a description of automated approaches to process negation, ranging from early rule-based systems to systems built with traditional machine learning and neural networks. Finally, we conclude with some reflections on current progress and future directions.
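
As a rough illustration of what an early rule-based approach to negation cues and scopes might look like, the sketch below marks a small set of cue words and treats the following few tokens, up to the next punctuation mark, as the scope. The cue list, the window size, and the function name are assumptions made for this example and do not reproduce any specific system covered by the survey.

```python
import re

# Hypothetical cue list; real systems use larger, curated lexicons.
CUES = {"no", "not", "never", "without"}


def negation_scopes(sentence, window=4):
    """Return a list of (cue, scope_tokens) pairs found in the sentence."""
    # Split into word tokens and single punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    results = []
    for i, tok in enumerate(tokens):
        if tok not in CUES:
            continue
        scope = []
        # Scope heuristic: up to `window` tokens to the right of the cue,
        # stopping early at the first punctuation token.
        for nxt in tokens[i + 1 : i + 1 + window]:
            if not nxt[0].isalnum():
                break
            scope.append(nxt)
        results.append((tok, scope))
    return results


print(negation_scopes("The patient has no fever, but reports cough."))
```

A fixed right-hand window is a deliberately crude heuristic: it cannot capture scopes that precede the cue, and it says nothing about the focus of negation.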


Author(s):  
Ayushi Mitra

Sentiment analysis, also known as opinion mining or emotion artificial intelligence, is an active field that uses natural language processing and text analysis to extract, quantify, and study emotional states from a given piece of information or text data set. It remains an area of ongoing work within text mining. Many corporations use sentiment analysis to review products and social media comments; at its simplest, it is used to check whether a text is positive, negative, or neutral. In this research work we adopt rule-based approaches, which define a set of rules over inputs produced by classic natural language processing techniques (stemming, tokenization, part-of-speech tagging, and parsing), together with machine learning for sentiment analysis, to be implemented in Python.
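
A minimal sketch of a rule-based sentiment classifier of the kind described above is given here. The word lists, the negation rule, and the function names are hypothetical and chosen only for illustration; the abstract does not specify the actual rules used.

```python
import re

# Hypothetical mini-lexicon; a real system would use a much larger resource.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}
NEGATORS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify(text):
    """Label a text as 'positive', 'negative', or 'neutral'."""
    score = 0
    negate = False
    for tok in tokenize(text):
        if tok in NEGATORS:
            negate = True
            continue
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        if polarity:
            score += -polarity if negate else polarity
        # A negator only affects the token immediately after it.
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sample in ["The battery life is great", "This is not good", "It arrived on Monday"]:
    print(sample, "->", classify(sample))
```

Stemming and part-of-speech tagging are omitted to keep the example short; in a fuller pipeline they would run between tokenization and the lexicon lookup.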


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
P Brekke ◽  
I Pilan ◽  
H Husby ◽  
T Gundersen ◽  
F.A Dahl ◽  
...  

Abstract Background Syncope is a commonly occurring presenting symptom in emergency departments. While the majority of episodes are benign, syncope is associated with worse prognosis in hypertrophic cardiomyopathy, arrhythmia syndromes, heart failure, aortic stenosis and coronary heart disease. Flagging documented syncope in these patients may be crucial to management decisions. Previous studies show that the International Classification of Diseases (ICD) codes for syncope have a sensitivity of around 0.63, leading to a large number of false negatives if patient identification is based on administrative codes. Thus, in order to provide data-driven, clinical decision support, and to improve identification of patient cohorts for research, better tools are needed. A recent study manually annotated more than 30,000 patient records in order to develop a natural language processing (NLP) tool, which achieved a sensitivity of 92.2%. Since access to medical records and annotation resources is limited, we aimed to investigate whether an unsupervised machine learning and NLP approach with no manual input could achieve similar performance. Methods Our data were admission notes for adult patients admitted between 2005 and 2016 at a large university hospital in Norway. 500 records from patients with, and 500 without a “R55 Syncope” ICD code at discharge were drawn at random. The R55 code was considered “ground truth”. Headers containing information about tentative diagnoses were removed from the notes, when present, using regular expressions. The dataset was divided into 70%/15%/15% subsets for training, validation and testing. Baseline identification was calculated by a simple lexical matching using the term “synkope”. We evaluated two linear classifiers, a Support Vector Machine (SVM) and a Linear Regression (LR) model, with a term frequency–inverse document frequency vectorizer, using a bag-of-words approach.
In addition, we evaluated a simple convolutional neural network (CNN) consisting of a convolutional layer concatenating filter sizes of 3–5, max pooling and a dropout of 0.5 with randomly initialised word embeddings of 300 dimensions. Results Even a baseline regular expression model achieved a sensitivity of 78% and a specificity of 91% when classifying admission notes as belonging to the syncope class or not. The SVM model and the LR model achieved a sensitivity of 91% and 89%, respectively, and a specificity of 89% and 91%. The CNN model had a sensitivity of 95% and a specificity of 84%. Conclusion With a limited non-English dataset, common NLP and machine learning approaches were able to achieve approximately 90–95% sensitivity for the identification of admission notes related to syncope. Linear classifiers outperformed a CNN model in terms of specificity, as expected in this small dataset. The study demonstrates the feasibility of training document classifiers based on diagnostic codes in order to detect important clinical events. [Figure: ROC curves for SVM and LR models] Funding Acknowledgement Type of funding source: Public grant(s) – National budget only. Main funding source(s): The Research Council of Norway
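
The lexical baseline and the two reported metrics can be sketched as follows. The example notes are invented Norwegian snippets and the function names are hypothetical; only the search term “synkope” comes from the abstract, and this is not the study's code.

```python
def lexical_baseline(note, term="synkope"):
    """Flag a note as syncope-related if it contains the search term."""
    return term in note.lower()


def sensitivity_specificity(gold, pred):
    """Compute (sensitivity, specificity) from boolean gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    return tp / (tp + fn), tn / (tn + fp)


# Invented notes paired with a gold label (True = has the R55 code).
notes = [
    ("Pasient innlagt etter synkope hjemme.", True),
    ("Brystsmerter siden i går.", False),
    ("Besvimte på badet, ingen prodromer.", True),   # missed: term not used
    ("Ingen synkope, kun svimmelhet.", False),        # false alarm: negated mention
    ("Feber og hoste i tre dager.", False),
]
gold = [label for _, label in notes]
pred = [lexical_baseline(text) for text, _ in notes]
sens, spec = sensitivity_specificity(gold, pred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

The two annotated notes show the typical failure modes of plain term matching: synonyms lower sensitivity, and negated mentions lower specificity.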


Author(s):  
Filipe R Lucini ◽  
Karla D Krewulak ◽  
Kirsten M Fiest ◽  
Sean M Bagshaw ◽  
Danny J Zuege ◽  
...  

Abstract Objective To apply natural language processing (NLP) techniques to identify individual events and modes of communication between healthcare professionals and families of critically ill patients from electronic medical records (EMR). Materials and Methods Retrospective cohort study of 280 randomly selected adult patients admitted to 1 of 15 intensive care units (ICU) in Alberta, Canada from June 19, 2012 to June 11, 2018. Individual events and modes of communication were independently abstracted using NLP and manual chart review (reference standard). Preprocessing techniques and 2 NLP approaches (rule-based and machine learning) were evaluated using sensitivity, specificity, and area under the receiver operating characteristic curves (AUROC). Results Over 2700 combinations of NLP methods and hyperparameters were evaluated for each mode of communication using a holdout subset. The rule-based approach had the highest AUROC in 65 datasets compared to the machine learning approach in 21 datasets. Both approaches had similar performance in 17 datasets. The rule-based AUROCs for the grouped categories of patient documented to have family or friends (0.972, 95% CI 0.934–1.000), visit by family/friend (0.882, 95% CI 0.820–0.943) and phone call with family/friend (0.975, 95% CI 0.952–0.998) were high. Discussion We report an automated method to quantify communication between healthcare professionals and family members of adult patients from free-text EMRs. A rule-based NLP approach had better overall operating characteristics than a machine learning approach. Conclusion NLP can automatically and accurately measure frequency and mode of documented family visitation and communication from unstructured free-text EMRs, to support patient- and family-centered care initiatives.
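
For readers unfamiliar with AUROC, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are made up for illustration, and the function name is an assumption rather than part of the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up reference labels (1 = event documented) and classifier scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc = auroc(labels, scores)
print(f"AUROC = {auc:.3f}")
```

The same function applies to a rule-based system that outputs only 0 or 1; in that case the many tied scores are each counted as one half.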


2018 ◽  
Author(s):  
Sunyang Fu ◽  
Lester Y Leung ◽  
Yanshan Wang ◽  
Anne-Olivia Raulli ◽  
David F Kallmes ◽  
...  

BACKGROUND Silent brain infarction (SBI) is defined as the presence of 1 or more brain lesions, presumed to be because of vascular occlusion, found by neuroimaging (magnetic resonance imaging or computed tomography) in patients without clinical manifestations of stroke. It is more common than stroke and can be detected in 20% of healthy elderly people. Early detection of SBI may mitigate the risk of stroke by offering preventative treatment plans. Natural language processing (NLP) techniques offer an opportunity to systematically identify SBI cases from electronic health records (EHRs) by extracting, normalizing, and classifying SBI-related incidental findings interpreted by radiologists from neuroimaging reports. OBJECTIVE This study aimed to develop NLP systems to determine individuals with incidentally discovered SBIs from neuroimaging reports at 2 sites: Mayo Clinic and Tufts Medical Center. METHODS Both rule-based and machine learning approaches were adopted in developing the NLP system. The rule-based system was implemented using the open source NLP pipeline MedTagger, developed by Mayo Clinic. Features for rule-based systems, including significant words and patterns related to SBI, were generated using pointwise mutual information. The machine learning models adopted convolutional neural network (CNN), random forest, support vector machine, and logistic regression. The performance of the NLP algorithm was compared with a manually created gold standard. RESULTS A total of 5 reports were removed due to invalid scan types. The interannotator agreements across Mayo and Tufts neuroimaging reports were 0.87 and 0.91, respectively. The rule-based system yielded the best performance of predicting SBI with an accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of 0.991, 0.925, 1.000, 1.000, and 0.990, respectively. 
The CNN achieved the best score on predicting white matter disease (WMD) with an accuracy, sensitivity, specificity, PPV, and NPV of 0.994, 0.994, 0.994, 0.994, and 0.994, respectively. CONCLUSIONS We adopted a standardized data abstraction and modeling process to develop NLP techniques (rule-based and machine learning) to detect incidental SBIs and WMDs from annotated neuroimaging reports. Validation statistics suggested a high feasibility of detecting SBIs and WMDs from EHRs using NLP.
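
The abstract mentions pointwise mutual information (PMI) as the way significant words for the rule-based system were found. One plausible document-level formulation, PMI(w, c) = log2(P(w, c) / (P(w) P(c))), is sketched below. The formulation, the toy report snippets, and the function name are assumptions for illustration; the abstract does not give the exact procedure.

```python
import math


def pmi(docs, labels, word, target):
    """Document-level PMI between a word's presence and a target class."""
    n = len(docs)
    has_word = [word in set(d.lower().split()) for d in docs]
    n_word = sum(has_word)
    n_class = sum(1 for y in labels if y == target)
    n_joint = sum(1 for h, y in zip(has_word, labels) if h and y == target)
    if n_joint == 0:
        return float("-inf")  # word never co-occurs with the class
    # log2( (n_joint/n) / ((n_word/n) * (n_class/n)) ), simplified.
    return math.log2(n_joint * n / (n_word * n_class))


# Invented report snippets with hypothetical labels.
docs = [
    "chronic lacunar infarct in the left thalamus",
    "old infarct in the right cerebellum",
    "no acute intracranial abnormality",
    "no infarct or hemorrhage",
]
labels = ["SBI", "SBI", "none", "none"]

for w in ["lacunar", "infarct", "hemorrhage"]:
    print(w, pmi(docs, labels, w, "SBI"))
```

In this toy data "lacunar" scores higher than "infarct" because "infarct" also appears in a negated, non-SBI report, which hints at why high-PMI terms can make useful rule candidates.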

