Collecting specialty-related medical terms: Development and evaluation of a resource for Spanish

Abstract Background Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain such as the UMLS metathesaurus or SNOMED CT are widely used for this purpose, but with limitations such as lexical ambiguity of clinical terms. However, most of them are unambiguous within text limited to a given clinical specialty. This is one rationale besides others to classify clinical text by the clinical specialty to which they belong. Results This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is biomedical NLP tasks benefit from collections of domain terms that are specific to clinical subdomains. We use PubMed queries that generate sub-domain specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The generated term set, called Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem obtaining improvements of 6 percentage points in the F-measure compared to the baseline using Multilayer Perceptron, thus demonstrating the hypothesis that a specialized term set improves NLP tasks. Conclusion The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity when compared to a specialty-independent and broad-scope vocabulary.

Download Full-text

A deep database of medical abbreviations and acronyms for natural language processing

Scientific Data ◽

10.1038/s41597-021-00929-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Lisa Grossman Liu ◽

Raymond H. Grossman ◽

Elliot G. Mitchell ◽

Chunhua Weng ◽

Karthik Natarajan ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

American English ◽

Substantial Improvement ◽

Future Application ◽

Multiple Sources ◽

High Coverage ◽

Clinical Text ◽

Automated Quality Control

AbstractThe recognition, disambiguation, and expansion of medical abbreviations and acronyms is of upmost importance to prevent medically-dangerous misinterpretation in natural language processing. To support recognition, disambiguation, and expansion, we present the Medical Abbreviation and Acronym Meta-Inventory, a deep database of medical abbreviations. A systematic harmonization of eight source inventories across multiple healthcare specialties and settings identified 104,057 abbreviations with 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning reduced redundancy, which simplifies future application. Additional features include semi-automated quality control to remove errors. The Meta-Inventory demonstrated high completeness or coverage of abbreviations and senses in new clinical text, a substantial improvement over the next largest repository (6–14% increase in abbreviation coverage; 28–52% increase in sense coverage). To our knowledge, the Meta-Inventory is the most complete compilation of medical abbreviations and acronyms in American English to-date. The multiple sources and high coverage support application in varied specialties and settings. This allows for cross-institutional natural language processing, which previous inventories did not support. The Meta-Inventory is available at https://bit.ly/github-clinical-abbreviations.

Download Full-text

Quality assurance and enrichment of biological and biomedical ontologies and terminologies

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-020-01342-4 ◽

2020 ◽

Vol 20 (S10) ◽

Author(s):

Ankur Agrawal ◽

Licong Cui

Keyword(s):

Quality Assurance ◽

Cancer Registries ◽

Supplement Issue ◽

Biomedical Ontologies ◽

Snomed Ct ◽

Unified Medical Language System ◽

Domain Specific ◽

Healthcare Settings ◽

Domain Specific Knowledge ◽

Enrichment Techniques

AbstractBiological and biomedical ontologies and terminologies are used to organize and store various domain-specific knowledge to provide standardization of terminology usage and to improve interoperability. The growing number of such ontologies and terminologies and their increasing adoption in clinical, research and healthcare settings call for effective and efficient quality assurance and semantic enrichment techniques of these ontologies and terminologies. In this editorial, we provide an introductory summary of nine articles included in this supplement issue for quality assurance and enrichment of biological and biomedical ontologies and terminologies. The articles cover a range of standards including SNOMED CT, National Cancer Institute Thesaurus, Unified Medical Language System, North American Association of Central Cancer Registries and OBO Foundry Ontologies.

Download Full-text

Early prediction of diagnostic-related groups and estimation of hospital cost by processing clinical notes

npj Digital Medicine ◽

10.1038/s41746-021-00474-9 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Jinghui Liu ◽

Daniel Capurro ◽

Anthony Nguyen ◽

Karin Verspoor

Keyword(s):

Language Processing ◽

Hospital Cost ◽

Diagnosis Related Groups ◽

Characteristic Curve ◽

Healthcare Providers ◽

Post Discharge ◽

Clinical Text ◽

Case Mix Index ◽

Operational Decision Making ◽

Diagnostic Related Groups

AbstractAs healthcare providers receive fixed amounts of reimbursement for given services under DRG (Diagnosis-Related Groups) payment, DRG codes are valuable for cost monitoring and resource allocation. However, coding is typically performed retrospectively post-discharge. We seek to predict DRGs and DRG-based case mix index (CMI) at early inpatient admission using routine clinical text to estimate hospital cost in an acute setting. We examined a deep learning-based natural language processing (NLP) model to automatically predict per-episode DRGs and corresponding cost-reflecting weights on two cohorts (paid under Medicare Severity (MS) DRG or All Patient Refined (APR) DRG), without human coding efforts. It achieved macro-averaged area under the receiver operating characteristic curve (AUC) scores of 0·871 (SD 0·011) on MS-DRG and 0·884 (0·003) on APR-DRG in fivefold cross-validation experiments on the first day of ICU admission. When extended to simulated patient populations to estimate average cost-reflecting weights, the model increased its accuracy over time and obtained absolute CMI error of 2·40 (1·07%) and 12·79% (2·31%), respectively on the first day. As the model could adapt to variations in admission time, cohort size, and requires no extra manual coding efforts, it shows potential to help estimating costs for active patients to support better operational decision-making in hospitals.

Download Full-text

Systematic review of current natural language processing methods and applications in cardiology

Heart ◽

10.1136/heartjnl-2021-319769 ◽

2021 ◽

pp. heartjnl-2021-319769

Author(s):

Meghan Reading Turchioe ◽

Alexander Volodarskiy ◽

Jyotishman Pathak ◽

Drew N Wright ◽

James Enlou Tcheng ◽

...

Keyword(s):

Systematic Review ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Care ◽

Real World Data ◽

Clinical Text ◽

Clinical Notes ◽

Artery Disease ◽

Automated Methods

Natural language processing (NLP) is a set of automated methods to organise and evaluate the information contained in unstructured clinical notes, which are a rich source of real-world data from clinical care that may be used to improve outcomes and understanding of disease in cardiology. The purpose of this systematic review is to provide an understanding of NLP, review how it has been used to date within cardiology and illustrate the opportunities that this approach provides for both research and clinical care. We systematically searched six scholarly databases (ACM Digital Library, Arxiv, Embase, IEEE Explore, PubMed and Scopus) for studies published in 2015–2020 describing the development or application of NLP methods for clinical text focused on cardiac disease. Studies not published in English, lacking a description of NLP methods, non-cardiac focused and duplicates were excluded. Two independent reviewers extracted general study information, clinical details and NLP details and appraised quality using a checklist of quality indicators for NLP studies. We identified 37 studies developing and applying NLP in heart failure, imaging, coronary artery disease, electrophysiology, general cardiology and valvular heart disease. Most studies used NLP to identify patients with a specific diagnosis and extract disease severity using rule-based NLP methods. Some used NLP algorithms to predict clinical outcomes. A major limitation is the inability to aggregate findings across studies due to vastly different NLP methods, evaluation and reporting. This review reveals numerous opportunities for future NLP work in cardiology with more diverse patient samples, cardiac diseases, datasets, methods and applications.

Download Full-text

Similarity of Computations Across Domains Does Not Imply Shared Implementation: The Case of Language Comprehension

Current Directions in Psychological Science ◽

10.1177/09637214211046955 ◽

2021 ◽

Vol 30 (6) ◽

pp. 526-534

Author(s):

Evelina Fedorenko ◽

Cory Shain

Keyword(s):

Language Processing ◽

Language Comprehension ◽

Fluid Intelligence ◽

Linguistic Knowledge ◽

Domain Specific ◽

Cognitive Operations ◽

High Level ◽

General Circuits ◽

The Mind ◽

Language Network

Understanding language requires applying cognitive operations (e.g., memory retrieval, prediction, structure building) that are relevant across many cognitive domains to specialized knowledge structures (e.g., a particular language’s lexicon and syntax). Are these computations carried out by domain-general circuits or by circuits that store domain-specific representations? Recent work has characterized the roles in language comprehension of the language network, which is selective for high-level language processing, and the multiple-demand (MD) network, which has been implicated in executive functions and linked to fluid intelligence and thus is a prime candidate for implementing computations that support information processing across domains. The language network responds robustly to diverse aspects of comprehension, but the MD network shows no sensitivity to linguistic variables. We therefore argue that the MD network does not play a core role in language comprehension and that past findings suggesting the contrary are likely due to methodological artifacts. Although future studies may reveal some aspects of language comprehension that require the MD network, evidence to date suggests that those will not be related to core linguistic processes such as lexical access or composition. The finding that the circuits that store linguistic knowledge carry out computations on those representations aligns with general arguments against the separation of memory and computation in the mind and brain.

Download Full-text

GRAPHICAL INDUCTION OF QUALIFIED MEDICAL KNOWLEDGE

International Journal of Semantic Computing ◽

10.1142/s1793351x13400126 ◽

2013 ◽

Vol 07 (04) ◽

pp. 377-405 ◽

Cited By ~ 1

Author(s):

TRAVIS GOODWIN ◽

SANDA M. HARABAGIU

Keyword(s):

Language Processing ◽

Clinical Data ◽

Medical Knowledge ◽

Knowledge Bases ◽

Clinical Knowledge ◽

Biomedical Knowledge ◽

Clinical Text ◽

Relational Information ◽

Processing Techniques ◽

Medical Concepts

The introduction of electronic medical records (EMRs) enabled the access of unprecedented volumes of clinical data, both in structured and unstructured formats. A significant amount of this clinical data is expressed within the narrative portion of the EMRs, requiring natural language processing techniques to unlock the medical knowledge referred to by physicians. This knowledge, derived from the practice of medical care, complements medical knowledge already encoded in various structured biomedical ontologies. Moreover, the clinical knowledge derived from EMRs also exhibits relational information between medical concepts, derived from the cohesion property of clinical text, which is an attractive attribute that is currently missing from the vast biomedical knowledge bases. In this paper, we describe an automatic method of generating a graph of clinically related medical concepts by considering the belief values associated with those concepts. The belief value is an expression of the clinician's assertion that the concept is qualified as present, absent, suggested, hypothetical, ongoing, etc. Because the method detailed in this paper takes into account the hedging used by physicians when authoring EMRs, the resulting graph encodes qualified medical knowledge wherein each medical concept has an associated assertion (or belief value) and such qualified medical concepts are spanned by relations of different strengths, derived from the clinical contexts in which concepts are used. In this paper, we discuss the construction of a qualified medical knowledge graph (QMKG) and treat it as a BigData problem addressed by using MapReduce for deriving the weighted edges of the graph. To be able to assess the value of the QMKG, we demonstrate its usage for retrieving patient cohorts by enabling query expansion that produces greatly enhanced results against state-of-the-art methods.

Download Full-text

Evaluation of lexicon- and syntax-based negation detection algorithms using clinical text data

Bio-Algorithms and Med-Systems ◽

10.1515/bams-2017-0016 ◽

2017 ◽

Vol 13 (4) ◽

Cited By ~ 1

Author(s):

J. Manimaran ◽

T. Velmurugan

Keyword(s):

Language Processing ◽

Patient Discharge ◽

Computational Time ◽

Practical Implementation ◽

Automatic Identification ◽

Kappa Statistics ◽

Research Approach ◽

Clinical Text ◽

Medical Reports ◽

Negation Detection

AbstractBackground:Clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing (NLP) system. In recent development modules of cTAKES, a negation detection (ND) algorithm is used to improve annotation capabilities and simplify automatic identification of negative context in large clinical documents. In this research, the two types of ND algorithms used are lexicon and syntax, which are analyzed using a database made openly available by the National Center for Biomedical Computing. The aim of this analysis is to find the pros and cons of these algorithms.Methods:Patient medical reports were collected from three institutions included the 2010 i2b2/VA Clinical NLP Challenge, which is the input data for this analysis. This database includes patient discharge summaries and progress notes. The patient data is fed into five ND algorithms: NegEx, ConText, pyConTextNLP, DEEPEN and Negation Resolution (NR). NegEx, ConText and pyConTextNLP are lexicon-based, whereas DEEPEN and NR are syntax-based. The results from these five ND algorithms are post-processed and compared with the annotated data. Finally, the performance of these ND algorithms is evaluated by computing standard measures including F-measure, kappa statistics and ROC, among others, as well as the execution time of each algorithm.Results:This research is tested through practical implementation based on the accuracy of each algorithm’s results and computational time to evaluate its performance in order to find a robust and reliable ND algorithm.Conclusions:The performance of the chosen ND algorithms is analyzed based on the results produced by this research approach. The time and accuracy of each algorithm are calculated and compared to suggest the best method.

Download Full-text

Text Simplification

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.52 ◽

2018 ◽

Cited By ~ 1

Author(s):

Horacio Saggion

Keyword(s):

Language Processing ◽

Language Resources ◽

The Past ◽

Text Simplification ◽

Text Readability ◽

Target User ◽

Evaluation Approaches ◽

Linguistic Impairment ◽

Automatic Text ◽

The Web

Over the past decades, information has been made available to a broad audience thanks to the availability of texts on the Web. However, understanding the wealth of information contained in texts can pose difficulties for a number of people including those with poor literacy, cognitive or linguistic impairment, or those with limited knowledge of the language of the text. Text simplification was initially conceived as a technology to simplify sentences so that they would be easier to process by natural-language processing components such as parsers. However, nowadays automatic text simplification is conceived as a technology to transform a text into an equivalent which is easier to read and to understand by a target user. Text simplification concerns both the modification of the vocabulary of the text (lexical simplification) and the modification of the structure of the sentences (syntactic simplification). In this chapter, after briefly introducing the topic of text readability, we give an overview of past and recent methods to address these two problems. We also describe simplification applications and full systems also outline language resources and evaluation approaches.

Download Full-text

Uncovering interpretable potential confounders in electronic medical records

10.1101/2021.02.03.21251034 ◽

2021 ◽

Author(s):

Jiaming Zeng ◽

Michael F. Gensheimer ◽

Daniel L. Rubin ◽

Susan Athey ◽

Ross D. Shachter

Keyword(s):

Electronic Medical Records ◽

Selection Bias ◽

Language Processing ◽

Medical Records ◽

Randomized Clinical Trials ◽

Research Database ◽

High Stake ◽

Medical Decisions ◽

Clinical Text ◽

Using Data

AbstractIn medicine, randomized clinical trials (RCT) are the gold standard for informing treatment decisions. Observational comparative effectiveness research (CER) is often plagued by selection bias, and expert-selected covariates may not be sufficient to adjust for confounding. We explore how the unstructured clinical text in electronic medical records (EMR) can be used to reduce selection bias and improve medical practice. We develop a method based on natural language processing to uncover interpretable potential confounders from the clinical text. We validate our method by comparing the hazard ratio (HR) from survival analysis with and without the confounders against the results from established RCTs. We apply our method to four study cohorts built from localized prostate and lung cancer datasets from the Stanford Cancer Institute Research Database and show that our method adjusts the HR estimate towards the RCT results. We further confirm that the uncovered terms can be interpreted by an oncologist as potential confounders. This research helps enable more credible causal inference using data from EMRs, offers a transparent way to improve the design of observational CER, and could inform high-stake medical decisions. Our method can also be applied to studies within and beyond medicine to extract important information from observational data to support decisions.

Download Full-text

Preparing Clinical Text for Use in Biomedical Research

Medical Informatics ◽

10.4018/978-1-60566-050-9.ch159 ◽

2011 ◽

pp. 2085-2095

Author(s):

John P. Pestian ◽

Lukasz Itert ◽

Charlotte Andersen

Keyword(s):

Language Processing ◽

Medical Center ◽

Clinical Care ◽

Academic Medical Center ◽

Clinical Text ◽

Delivery Of Care ◽

Radiology Reports ◽

Academic Medical ◽

Textual Data ◽

Regulatory Actions

Approximately 57 different types of clinical annotations construct a patient’s medical record. These annotations include radiology reports, discharge summaries, and surgical and nursing notes. Hospitals typically produce millions of text-based medical records over the course of a year. These records are essential for the delivery of care, but many are underutilized or not utilized at all for clinical research. The textual data found in these annotations is a rich source of insights into aspects of clinical care and the clinical delivery system. Recent regulatory actions, however, require that, in many cases, data not obtained through informed consent or data not related to the delivery of care must be made anonymous (as referred to by regulators as harmless), before they can be used. This article describes a practical approach with which Cincinnati Children’s Hospital Medical Center (CCHMC), a large pediatric academic medical center with more than 761,000 annual patient encounters, developed open source software for making pediatric clinical text harmless without losing its rich meaning. Development of the software dealt with many of the issues that often arise in natural language processing, such as data collection, disambiguation, and data scrubbing.

Download Full-text