Data Science and Natural Language Processing to Extract Information in Clinical Domain

2022 ◽  
Author(s):  
V.G.Vinod Vydiswaran ◽  
Xinyan Zhao ◽  
Deahan Yu


1998 ◽
Vol 37 (04/05) ◽  
pp. 334-344 ◽  
Author(s):  
G. Hripcsak ◽  
C. Friedman

Abstract
Evaluating natural language processing (NLP) systems in the clinical domain is a difficult task that is important for the advancement of the field. A number of NLP systems have been reported that extract information from free-text clinical reports, but few of these systems have been evaluated. Those that were evaluated reported good performance measures, but the results were often weakened by ineffective evaluation methods. In this paper we describe a set of criteria aimed at improving the quality of NLP evaluation studies. We present an overview of NLP evaluations in the clinical domain and also discuss the Message Understanding Conferences (MUC) [1-4]. Although these conferences constitute a series of NLP evaluation studies performed outside of the clinical domain, some of the results are relevant within medicine. In addition, we discuss a number of factors that contribute to the complexity inherent in the task of evaluating natural language systems.
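Evaluations of extraction systems of the kind discussed above typically rest on precision, recall, and F1 against a reference standard. A minimal sketch follows; the function and the toy report findings are illustrative assumptions, not material from the paper:

```python
# Hypothetical illustration: scoring an extraction system against a
# reference standard using precision, recall, and F1 over sets of items.
def prf1(predicted, gold):
    """Compute precision, recall, and F1 for sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # items the system got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: findings "extracted" from an imaginary chest X-ray report
system_output = {"pneumonia", "effusion", "cardiomegaly"}
reference = {"pneumonia", "effusion", "atelectasis"}
p, r, f = prf1(system_output, reference)
```

With two of three items matching the reference, precision, recall, and F1 all come out to 2/3 here, which is the kind of per-measure breakdown the criteria above ask evaluation studies to report.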


2018 ◽  
Vol 2 (3) ◽  
pp. 22 ◽  
Author(s):  
Jeffrey Ray ◽  
Olayinka Johnny ◽  
Marcello Trovati ◽  
Stelios Sotiriadis ◽  
Nik Bessis

The continuous creation of data has posed new research challenges due to its complexity, diversity, and volume. Consequently, Big Data has increasingly become a fully recognised scientific field. This article provides an overview of current research efforts in Big Data science, with particular emphasis on its applications as well as its theoretical foundations.


Author(s):  
Yanshan Wang ◽  
Sunyang Fu ◽  
Feichen Shen ◽  
Sam Henry ◽  
Ozlem Uzuner ◽  
...  

BACKGROUND
Semantic textual similarity is a common task in the general English domain to assess the degree to which the underlying semantics of 2 text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the semantic textual similarity task in the clinical domain that attempts to measure the degree of semantic equivalence between 2 snippets of clinical text. Due to the frequent use of templates in electronic health record systems, a large amount of redundant text exists in clinical notes, making ClinicalSTS crucial for the secondary use of clinical text in downstream clinical natural language processing applications, such as clinical text summarization, clinical semantics extraction, and clinical information retrieval.
OBJECTIVE
Our objective was to release ClinicalSTS data sets and to motivate the natural language processing and biomedical informatics communities to tackle semantic textual similarity tasks in the clinical domain.
METHODS
We organized the first BioCreative/OHNLP ClinicalSTS shared task in 2018 by making available a real-world ClinicalSTS data set. We continued the shared task in 2019 in collaboration with National NLP Clinical Challenges (n2c2) and the Open Health Natural Language Processing (OHNLP) consortium and organized the 2019 n2c2/OHNLP ClinicalSTS track. We released a larger ClinicalSTS data set comprising 1642 clinical sentence pairs, including 1068 pairs from the 2018 shared task and new pairs from 2 electronic health record systems, GE and Epic. We released 80% (1642/2054) of the data to participating teams to develop and fine-tune the semantic textual similarity systems and used the remaining 20% (412/2054) as blind testing to evaluate their systems. The workshop was held in conjunction with the American Medical Informatics Association 2019 Annual Symposium.
RESULTS
Of the 78 international teams that signed on to the n2c2/OHNLP ClinicalSTS shared task, 33 produced a total of 87 valid system submissions. The top 3 systems were generated by IBM Research, the National Center for Biotechnology Information, and the University of Florida, with Pearson correlations of r=.9010, r=.8967, and r=.8864, respectively. Most top-performing systems used state-of-the-art neural language models, such as BERT and XLNet, and state-of-the-art training schemas in deep learning, such as the pretraining and fine-tuning schema and multitask learning. Overall, the participating systems performed better on the Epic sentence pairs than on the GE sentence pairs, despite a much larger portion of the training data being GE sentence pairs.
CONCLUSIONS
The 2019 n2c2/OHNLP ClinicalSTS shared task focused on computing semantic similarity for clinical text sentences generated from clinical notes in the real world. It attracted a large number of international teams. The ClinicalSTS shared task could continue to serve as a venue for researchers in the natural language processing and medical informatics communities to develop and improve semantic textual similarity techniques for clinical text.
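The Pearson correlation used to rank systems in the shared task compares system-predicted similarity scores against human judgments. A minimal self-contained sketch; the score lists below are invented for illustration and are not task data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [4.5, 2.0, 3.5, 0.5, 5.0]   # hypothetical human similarity judgments (0-5)
pred = [4.2, 2.5, 3.0, 1.0, 4.8]   # hypothetical system-predicted similarities
r = pearson(gold, pred)            # close to 1.0 when the system tracks the gold scores
```

A system whose predictions rise and fall with the human scores, as in this toy case, earns a correlation near 1, which is the scale on which the r=.90 results above sit.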


2017 ◽  
Vol 26 (01) ◽  
pp. 214-227 ◽  
Author(s):  
G. Gonzalez-Hernandez ◽  
A. Sarker ◽  
K. O’Connor ◽  
G. Savova

Summary
Background: Natural Language Processing (NLP) methods are increasingly being utilized to mine knowledge from unstructured health-related texts. Recent advances in noisy text processing techniques are enabling researchers and medical domain experts to go beyond the information encapsulated in published texts (e.g., clinical trials and systematic reviews) and structured questionnaires, and obtain perspectives from other unstructured sources such as Electronic Health Records (EHRs) and social media posts.
Objectives: To review the recently published literature discussing the application of NLP techniques for mining health-related information from EHRs and social media posts.
Methods: The literature review included research published over the last five years, based on searches of PubMed, conference proceedings, and the ACM Digital Library, as well as on relevant publications referenced in papers. We particularly focused on the techniques employed on EHRs and social media data.
Results: A set of 62 studies involving EHRs and 87 studies involving social media matched our criteria and were included in this paper. We present the purposes of these studies, outline the key NLP contributions, and discuss the general trends observed in the field, the current state of research, and important outstanding problems.
Conclusions: Over recent years, there has been a continuing transition from lexical and rule-based systems to learning-based approaches, because of the growth of annotated data sets and advances in data science. For EHRs, publicly available annotated data is still scarce, and this acts as an obstacle to research progress. In contrast, research on social media mining has seen rapid growth, particularly because the large amount of unlabeled data available via this resource compensates for the uncertainty inherent to the data. Effective mechanisms to filter out noise and to map social media expressions to standard medical concepts are crucial open research problems. Shared tasks and other competitive challenges have been driving factors behind the implementation of open systems, and they are likely to play an imperative role in the development of future systems.


2015 ◽  
Vol 1 (1) ◽  
Author(s):  
Keith W. Kintigh

Abstract
To address archaeology’s most pressing substantive challenges, researchers must discover, access, and extract information contained in the reports and articles that codify so much of archaeology’s knowledge. These efforts will require application of existing and emerging natural language processing technologies to extensive digital corpora. Automated classification can enable development of metadata needed for the discovery of relevant documents. Although it is even more technically challenging, automated extraction of and reasoning with information from texts can provide urgently needed access to contextualized information within documents. Effective automated translation is needed for scholars to benefit from research published in other languages.


2021 ◽  
Author(s):  
Rohan Singh ◽  
Sunil Nagpal ◽  
Nishal Kumar Pinna ◽  
Sharmila S Mande

Genomes have an inherent context dictated by the order in which nucleotides and higher-order genomic elements are arranged in the DNA/RNA. Learning this context is a daunting task, governed by the combinatorial complexity of interactions possible between the ordered elements of genomes. Can natural language processing be employed on these orderly, complex, and evolving datatypes (genomic sequences) to reveal the latent patterns or context of genomic elements (e.g., mutations)? Here we present an approach to understanding the mutational landscape of COVID-19 by treating the temporally changing (continuously mutating) SARS-CoV-2 genomes as documents. We demonstrate how this analogous interpretation of evolving genomes as temporal literature corpora provides an opportunity to use dynamic topic modeling (DTM) and temporal Word2Vec models to delineate mutation signatures corresponding to different Variants of Concern and to track the semantic drift of Mutations of Concern (MoCs). We identified and studied characteristic mutations affiliated with COVID-19 infection severity and tracked their relationship with MoCs. Our groundwork on the utility of such temporal NLP models in genomics could not only supplement ongoing efforts to understand the COVID-19 pandemic but also provide alternative strategies for studying dynamic phenomena in the biological sciences through data science (especially NLP and AI/ML).
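A common first step in treating genomic sequences as documents, as described above, is tokenizing each sequence into overlapping k-mer "words" that models such as Word2Vec or topic models can consume. A minimal sketch with an invented toy sequence; the choice of k=3 is an assumption for illustration, not necessarily the authors' pipeline:

```python
# Sketch: turn a genome sequence into a "document" of overlapping k-mer
# tokens, the word-like units that downstream NLP models operate on.
def kmer_tokenize(sequence, k=3):
    """Slide a window of size k over the sequence, yielding one token per position."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy sequence (not real SARS-CoV-2 data)
doc = kmer_tokenize("ATGCGTA", k=3)
# doc is now a list of 3-mers that a Word2Vec or topic model would treat as words
```

Once genomes are lists of such tokens, a corpus of genomes sampled at different times is formally a timestamped document collection, which is exactly the input that dynamic topic models and temporal Word2Vec expect.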


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Krzysztof Celuch

Purpose
In search of creating an extraordinary experience for customers, services have gone beyond the means of a transaction between buyers and sellers. In the event industry, where purchasing tickets online is a common procedure, it remains unclear how to enhance this multifaceted experience. This study aims to offer a snapshot of the aspects most valued by consumers and to uncover consumers' feelings toward their experience of purchasing event tickets on third-party ticketing platforms.
Design/methodology/approach
This is a cross-disciplinary study that applies knowledge from both data science and services marketing. Using natural language processing, latent Dirichlet allocation (LDA) topic modeling and sentiment analysis were applied to interpret the embedded meanings in online reviews.
Findings
The findings conceptualized ten dimensions valued by eventgoers: technical issues, value of the core product and service, word-of-mouth, trustworthiness, professionalism and knowledgeability, customer support, information transparency, additional fees, prior experience, and after-sales service. Among these aspects, consumers rated the value of the core product and service as the most positive experience, whereas additional fees were considered the least positive.
Originality/value
Drawing on the intersection of natural language processing and the status quo of the event industry, this study offers a better understanding of eventgoers' experiences when purchasing online event tickets. It also provides a hands-on guide for marketers to stage memorable experiences in the era of digitalization.
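The sentiment-analysis half of the pipeline above can be illustrated at its simplest with a lexicon-based score. The lexicons and reviews below are invented for illustration and are far cruder than the models actually used in the study:

```python
# Tiny hypothetical sentiment lexicons (not from the study)
POSITIVE = {"easy", "great", "helpful", "fast", "reliable"}
NEGATIVE = {"fee", "hidden", "slow", "confusing", "crash"}

def sentiment_score(review):
    """Net sentiment: positive lexicon hits minus negative hits over the tokens."""
    tokens = review.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

# Invented ticket-purchase reviews
reviews = [
    "great site easy checkout",
    "hidden fee at checkout very confusing",
]
scores = [sentiment_score(r) for r in reviews]
```

Even this toy version captures the study's headline contrast: reviews about the core product and service score positively, while reviews about additional fees score negatively.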


AERA Open ◽  
2020 ◽  
Vol 6 (3) ◽  
pp. 233285842094031 ◽
Author(s):  
Li Lucy ◽  
Dorottya Demszky ◽  
Patricia Bromley ◽  
Dan Jurafsky

Cutting-edge data science techniques can shed new light on fundamental questions in educational research. We apply techniques from natural language processing (lexicons, word embeddings, topic models) to 15 U.S. history textbooks widely used in Texas between 2015 and 2017, studying their depiction of historically marginalized groups. We find that Latinx people are rarely discussed, and the most common famous figures are nearly all White men. Lexicon-based approaches show that Black people are described as performing actions associated with low agency and power. Word embeddings reveal that women tend to be discussed in the contexts of work and the home. Topic modeling highlights the higher prominence of political topics compared with social ones. We also find that more conservative counties tend to purchase textbooks with less representation of women and Black people. Building on a rich tradition of textbook analysis, we release our computational toolkit to support new research directions.
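The lexicon- and embedding-based association analyses described above ultimately rest on which context words co-occur with terms for a given group. A toy co-occurrence counter over invented sentences makes the idea concrete; it is a simplified stand-in, not the study's released toolkit:

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count words appearing within `window` tokens of `target` across sentences."""
    counts = Counter()
    for s in sentences:
        tokens = s.lower().split()
        for i, t in enumerate(tokens):
            if t == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

# Invented example sentences, not textbook text
sentences = [
    "women worked in the home",
    "women worked in factories during the war",
]
counts = context_counts(sentences, "women")
```

Aggregating such counts over a large corpus, and comparing them across group terms, is the raw signal that word embeddings compress; skews in the counts surface as the contextual associations the study reports.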

