Ambiguous and Incomplete: Natural Language Processing Reveals Problematic Reporting Styles in Thyroid Ultrasound Reports

Methods of Information in Medicine ◽

10.1055/s-0041-1740493 ◽

2022 ◽

Author(s):

Priya H. Dedhia ◽

Kallie Chen ◽

Yiqiang Song ◽

Eric LaRose ◽

Joseph R. Imbus ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Medical Center ◽

Academic Medical Center ◽

Institutional Setting ◽

Thyroid Ultrasound ◽

Test Set ◽

Regional Health Care

Abstract Objective Natural language processing (NLP) systems convert unstructured text into analyzable data. Here, we describe the performance measures of NLP to capture granular details on nodules from thyroid ultrasound (US) reports and reveal critical issues with reporting language. Methods We iteratively developed NLP tools using clinical Text Analysis and Knowledge Extraction System (cTAKES) and thyroid US reports from 2007 to 2013. We incorporated nine nodule features for NLP extraction. Next, we evaluated the precision, recall, and accuracy of our NLP tools using a separate set of US reports from an academic medical center (A) and a regional health care system (B) during the same period. Two physicians manually annotated each test-set report. A third physician then adjudicated discrepancies. The adjudicated “gold standard” was then used to evaluate NLP performance on the test set. Results A total of 243 thyroid US reports contained 6,405 data elements. Inter-annotator agreement for all elements was 91.3%. Compared with the gold standard, overall recall of the NLP tool was 90%. NLP recall for thyroid lobe or isthmus characteristics was: laterality 96% and size 95%. NLP accuracy for nodule characteristics was: laterality 92%, size 92%, calcifications 76%, vascularity 65%, echogenicity 62%, contents 76%, and borders 40%. NLP recall for presence or absence of lymphadenopathy was 61%. Reporting style accounted for 18% of errors. For example, the word “heterogeneous” interchangeably referred to nodule contents or echogenicity. While nodule dimensions and laterality were often described, US reports described contents, echogenicity, vascularity, calcifications, borders, and lymphadenopathy only 46, 41, 17, 15, 9, and 41% of the time, respectively. Most nodule characteristics were as likely to be described at hospital A as at hospital B. Conclusions NLP can automate extraction of critical information from thyroid US reports. However, ambiguous and incomplete reporting language hinders performance of NLP systems regardless of institutional setting. Standardized or synoptic thyroid US reports could improve NLP performance.
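
The evaluation described above compares extracted data elements with an adjudicated gold standard. The following is a minimal sketch of how precision and recall could be computed for such a comparison. It assumes each data element is represented as a (report, feature, value) tuple and that only exact matches count; the function names, the tuple layout, and the sample values are illustrative assumptions, not the authors' tooling.

```python
from collections import defaultdict


def precision_recall(gold, predicted):
    """Score predicted data elements against the adjudicated gold standard.

    Each element is a (report_id, feature, value) tuple; an element counts as
    a true positive only when all three parts match exactly (an assumption).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def recall_by_feature(gold, predicted):
    """Recall computed separately for each feature (e.g. size, borders)."""
    predicted = set(predicted)
    found, total = defaultdict(int), defaultdict(int)
    for element in set(gold):
        feature = element[1]
        total[feature] += 1
        found[feature] += element in predicted  # True adds 1, False adds 0
    return {feature: found[feature] / total[feature] for feature in total}


# Hypothetical toy data: one echogenicity value is extracted incorrectly and
# one laterality value is missed.
gold = [
    ("r1", "laterality", "left"),
    ("r1", "size", "1.2 cm"),
    ("r1", "echogenicity", "hypoechoic"),
    ("r2", "laterality", "right"),
]
predicted = [
    ("r1", "laterality", "left"),
    ("r1", "size", "1.2 cm"),
    ("r1", "echogenicity", "heterogeneous"),
]

precision, recall = precision_recall(gold, predicted)
per_feature = recall_by_feature(gold, predicted)
```

Reporting recall per feature, as the abstract does, makes it visible when one feature performs much worse than the overall figure suggests.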

Utility of Natural Language Processing for Clinical Quality Measures Reporting

Online Journal of Public Health Informatics ◽

10.5210/ojphi.v9i1.7605 ◽

2017 ◽

Vol 9 (1) ◽

Author(s):

Dino P. Rumoro ◽

Shital C. Shah ◽

Gillian S. Gibbs ◽

Marilyn M. Hallock ◽

Gordon M. Trenholme ◽

...

Keyword(s):

Health Care ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Medical Center ◽

Academic Medical Center ◽

Quality Measures ◽

Surveillance Program ◽

Free Text ◽

Influenza Immunization

Objective To explain the utility of using an automated syndromic surveillance program with advanced natural language processing (NLP) to improve clinical quality measures reporting for influenza immunization. Introduction Clinical quality measures (CQMs) are tools that help measure and track the quality of health care services. Measuring and reporting CQMs helps to ensure that our health care system is delivering effective, safe, efficient, patient-centered, equitable, and timely care. The CQM for influenza immunization measures the percentage of patients aged 6 months and older seen for a visit between October 1 and March 31 who received (or reported previous receipt of) an influenza immunization. The Centers for Disease Control and Prevention recommends that everyone 6 months of age and older receive an influenza immunization every season, which can reduce influenza-related morbidity, mortality, and hospitalizations. Methods Patients at a large academic medical center who had a visit to an affiliated outpatient clinic during June 1 - 8, 2016 were initially identified using their electronic medical record (EMR). The 2,543 patients who were selected did not have documentation of influenza immunization in a discrete field of the EMR. All free text notes for these patients between August 1, 2015 and March 31, 2016 were retrieved and analyzed using the sophisticated NLP built within Geographic Utilization of Artificial Intelligence in Real-Time for Disease Identification and Alert Notification (GUARDIAN), a syndromic surveillance program, to identify any mention of influenza immunization. The goal was to identify additional cases that met the CQM measure for influenza immunization and to distinguish documented exceptions. The patients with influenza immunization mentioned were further categorized by GUARDIAN NLP into Received, Recommended, Refused, Allergic, and Unavailable. If more than one category was applicable for a patient, they were independently counted in their respective categories. A descriptive analysis was conducted, along with manual review of a sample of cases per each category. Results For the 2,543 patients who did not have influenza immunization documentation in a discrete field of the EMR, a total of 78,642 free text notes were processed using GUARDIAN. Four hundred fifty-three (17.8%) patients had some mention of influenza immunization within the notes, which could potentially be utilized to meet the CQM influenza immunization requirement. Twenty-two percent (n=101) of patients mentioned already having received the immunization while 34.7% (n=157) of patients refused it during the study time frame. There were 27 patients with a mention of influenza immunization who could not be differentiated into a specific category. The number of patients placed into a single category of influenza immunization was 351 (77.5%), while 75 (16.6%) were classified into more than one category. See Table 1. Conclusions Using GUARDIAN’s NLP can identify additional patients who may meet the CQM measure for influenza immunization or who may be exempt. This tool can be used to improve CQM reporting and improve overall influenza immunization coverage by using it to alert providers. Next steps involve further refinement of influenza immunization categories, automating the process of using the NLP to identify and report additional cases, as well as using the NLP for other CQMs.

Table 1. Categorization of influenza immunization documentation within free text notes of 453 patients using NLP
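
The multi-category counting described in the Methods can be illustrated with a small keyword-based sketch. This is not GUARDIAN's NLP, whose internals the abstract does not describe: the regular expressions, the sentence splitting, and the "Uncategorized" label are hypothetical stand-ins, and negation (for example "did not receive") is deliberately not handled. The sketch only shows how a patient can be counted independently in every category that applies.

```python
import re
from collections import Counter

# Hypothetical trigger for an influenza immunization mention.
FLU_MENTION = re.compile(
    r"\b(flu|influenza)\s+(shot|vaccine|vaccination|immunization)\b", re.I
)

# Category names come from the abstract; the patterns are assumptions.
CATEGORY_PATTERNS = {
    "Received": re.compile(r"\b(received|was given|got|administered)\b", re.I),
    "Recommended": re.compile(r"\b(recommend(ed)?|advised)\b", re.I),
    "Refused": re.compile(r"\b(refus(ed|es)|declin(ed|es))\b", re.I),
    "Allergic": re.compile(r"\ballerg(y|ic)\b", re.I),
    "Unavailable": re.compile(r"\b(unavailable|not available|out of stock)\b", re.I),
}


def categorize(notes):
    """Return (mentioned, categories) for one patient's free text notes."""
    mentioned, categories = False, set()
    for note in notes:
        # Naive sentence split so a cue only counts near the flu mention.
        for sentence in re.split(r"(?<=[.!?])\s+", note):
            if not FLU_MENTION.search(sentence):
                continue
            mentioned = True
            for name, pattern in CATEGORY_PATTERNS.items():
                if pattern.search(sentence):
                    categories.add(name)
    return mentioned, categories


def tally(patients):
    """Count each patient once in every category that applies to them."""
    counts = Counter()
    for notes in patients.values():
        mentioned, categories = categorize(notes)
        if mentioned and not categories:
            counts["Uncategorized"] += 1
        counts.update(categories)
    return counts


# Hypothetical notes for three patients.
patients = {
    "p1": ["Patient declined the flu shot today.",
           "Influenza vaccine recommended at next visit."],
    "p2": ["Discussed flu vaccine."],
    "p3": ["No complaints."],
}
counts = tally(patients)
```

Here patient p1 is counted under both Refused and Recommended, p2 has a mention that fits no category, and p3 has no mention at all.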

Natural language processing for automated annotation of medication mentions in primary care visit conversations

JAMIA Open ◽

10.1093/jamiaopen/ooab071 ◽

2021 ◽

Vol 4 (3) ◽

Author(s):

Craig H Ganoe ◽

Weiyi Wu ◽

Paul J Barr ◽

William Haslett ◽

Michelle D Dannenberg ◽

...

Keyword(s):

Primary Care ◽

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Primary Care Visit ◽

Data Set ◽

Test Set ◽

Medication Information ◽

Care Visit

Abstract Objectives The objective of this study is to build and evaluate a natural language processing approach to identify medication mentions in primary care visit conversations between patients and physicians. Materials and Methods Eight clinicians contributed to a data set of 85 clinic visit transcripts, and 10 transcripts were randomly selected from this data set as a development set. Our approach utilizes Apache cTAKES and the Unified Medical Language System controlled vocabulary to generate a list of medication candidates in the transcribed text and then applies multiple customized filters to exclude common false positives from this list while including some additional common mentions of supplements and immunizations. Results Sixty-five transcripts with 1121 medication mentions were randomly selected as an evaluation set. Our proposed method achieved an F-score of 85.0% for identifying the medication mentions in the test set, significantly outperforming existing medication information extraction systems for medical records, which had F-scores ranging from 42.9% to 68.9% on the same test set. Discussion Our medication information extraction approach for primary care visit conversations showed promising results, extracting about 27% more medication mentions from our evaluation set while eliminating many false positives in comparison to existing baseline systems. We made our approach publicly available on the web as open-source software. Conclusion Integration of our annotation system with clinical recording applications has the potential to improve patients’ understanding and recall of key information from their clinic visits, and, in turn, to positively impact health outcomes.
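
The two-stage idea in the Methods, generating candidates from a vocabulary and then filtering common false positives, can be sketched as follows. This sketch does not call cTAKES or the UMLS: the tiny vocabulary, the false-positive context rule for "iron", and the function names are hypothetical placeholders for those resources. The F-score helper uses the standard harmonic mean of precision and recall.

```python
import re

# Hypothetical stand-in for a controlled vocabulary, extended with a
# supplement and an immunization as the abstract describes.
MEDICATION_VOCAB = {"metformin", "lisinopril", "aspirin", "iron",
                    "vitamin d", "flu shot"}

# Hypothetical filter: "iron" followed by these words is the verb, not the drug.
FALSE_POSITIVE_FOLLOWERS = {"iron": re.compile(r"\s+(out|the|my)\b")}


def find_medication_mentions(utterance):
    """Return vocabulary terms found in the utterance, in order of appearance."""
    lowered = utterance.lower()
    hits = []
    for term in MEDICATION_VOCAB:
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            follower = FALSE_POSITIVE_FOLLOWERS.get(term)
            # Drop the candidate when the text right after it signals a
            # non-medication sense.
            if follower and follower.match(lowered[match.end():]):
                continue
            hits.append((match.start(), term))
    return [term for _, term in sorted(hits)]


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


mentions = find_medication_mentions(
    "I take metformin and aspirin, then I iron the shirts."
)
```

Conversational transcripts contain many everyday words that collide with drug names, which is why the filtering stage matters more here than it would for clinical notes.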

LIS4: Lesk Inspired Sense Specific Semantic Similarity using WordNet

Journal of Information & Knowledge Management ◽

10.1142/s0219649221500064 ◽

2021 ◽

pp. 2150006

Author(s):

Saravanakumar Kandasamy ◽

Aswani Kumar Cherukuri

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Gold Standard ◽

Question Answering ◽

Knowledge Based ◽

Benchmark Datasets ◽

Processing Information

Semantic similarity quantification between concepts is an indispensable part of domains such as Natural Language Processing, Information Retrieval, and Question Answering, where it helps to understand texts and their relationships better. Over the last few decades, many measures have been proposed that incorporate various corpus-based and knowledge-based resources. WordNet and Wikipedia are two such knowledge-based resources. WordNet's contribution to these domains is enormous because of the richness with which it defines a word and all of its relationships with other words. In this paper, we propose an approach to quantifying the similarity between concepts that exploits the synsets and the gloss definitions of different concepts in WordNet. Our method considers the gloss definitions, the contextual words that help define a word, the synsets of those contextual words, and the confidence of occurrence of a word in another word’s definition when calculating similarity. Evaluation on different gold standard benchmark datasets shows the efficiency of our system in comparison with other existing taxonomical and definitional measures.
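
The core intuition behind Lesk-style measures, that related concepts share words in their definitions, can be shown with a toy gloss-overlap score. This is not the LIS4 measure, which according to the abstract also uses synsets of contextual words and a confidence term; it is a plain Jaccard overlap over definition words. The glosses below are hand-written for illustration rather than taken from WordNet, and the stopword list is an assumption.

```python
# Hypothetical hand-written glosses standing in for WordNet definitions.
GLOSSES = {
    "car": "a motor vehicle with four wheels used to carry passengers on roads",
    "bus": "a large motor vehicle used to carry many passengers on roads",
    "banana": "a long curved yellow fruit that grows on a tropical plant",
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "on", "to", "with", "that", "used", "in"}


def gloss_tokens(concept):
    """Content words of a concept's gloss."""
    return {word for word in GLOSSES[concept].split() if word not in STOPWORDS}


def gloss_overlap_similarity(concept_a, concept_b):
    """Jaccard overlap between the content words of two glosses, in [0, 1]."""
    tokens_a, tokens_b = gloss_tokens(concept_a), gloss_tokens(concept_b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


car_bus = gloss_overlap_similarity("car", "bus")
car_banana = gloss_overlap_similarity("car", "banana")
```

With these glosses, "car" and "bus" share five of nine distinct content words, while "car" and "banana" share none.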

Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems

Database ◽

10.1093/database/bay110 ◽

2018 ◽

Vol 2018 ◽

Cited By ~ 8

Author(s):

Wasila Dahdul ◽

Prashanti Manda ◽

Hong Cui ◽

James P Balhoff ◽

T Alexander Dececchi ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard

Textual entailment graphs

Natural Language Engineering ◽

10.1017/s1351324915000108 ◽

2015 ◽

Vol 21 (5) ◽

pp. 699-724 ◽

Cited By ~ 6

Author(s):

LILI KOTLERMAN ◽

IDO DAGAN ◽

BERNARDO MAGNINI ◽

LUISA BENTIVOGLI

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

State Of The Art ◽

Text Analytics ◽

Joint Work ◽

Gold Standard Dataset ◽

Textual Entailment ◽

Interesting Task

Abstract In this work, we present a novel type of graph for natural language processing (NLP), namely textual entailment graphs (TEGs). We describe the complete methodology we developed for the construction of such graphs and provide some baselines for this task by evaluating relevant state-of-the-art technology. We situate our research in the context of text exploration, since it was motivated by joint work with industrial partners in the text analytics area. Accordingly, we present our motivating scenario and the first gold-standard dataset of TEGs. However, while our own motivation and the dataset focus on the text exploration setting, we suggest that TEGs can have different usages and that automatic creation of such graphs is an interesting task for the community.

Assessing the Performance of Clinical Natural Language Processing Systems: Development of an Evaluation Methodology

JMIR Medical Informatics ◽

10.2196/20492 ◽

2021 ◽

Vol 9 (7) ◽

pp. e20492

Author(s):

Lea Canales ◽

Sebastian Menke ◽

Stephanie Marchesseau ◽

Ariel D’Agostino ◽

Carlos del Rio-Bermudez ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Performance Metrics ◽

Evaluation Methodology ◽

Free Text ◽

Use Case ◽

Five Phases ◽

Clinical Natural Language Processing

Background Clinical natural language processing (cNLP) systems are of crucial importance due to their increasing capability in extracting clinically important information from free text contained in electronic health records (EHRs). The conversion of a nonstructured representation of a patient’s clinical history into a structured format enables medical doctors to generate clinical knowledge at a level that was not possible before. Finally, the interpretation of the insights provided by cNLP systems has great potential in driving decisions about clinical practice. However, carrying out robust evaluations of those cNLP systems is a complex task that is hindered by a lack of standard guidance on how to systematically approach them. Objective Our objective was to offer natural language processing (NLP) experts a methodology for the evaluation of cNLP systems to assist them in carrying out this task. By following the proposed phases, the robustness and representativeness of the performance metrics of their own cNLP systems can be assured. Methods The proposed evaluation methodology comprised five phases: (1) the definition of the target population, (2) the statistical document collection, (3) the design of the annotation guidelines and annotation project, (4) the external annotations, and (5) the cNLP system performance evaluation. We presented the application of all phases to evaluate the performance of a cNLP system called “EHRead Technology” (developed by Savana, an international medical company), applied in a study on patients with asthma. As part of the evaluation methodology, we introduced the Sample Size Calculator for Evaluations (SLiCE), a software tool that calculates the number of documents needed to achieve a statistically useful and resourceful gold standard. Results The application of the proposed evaluation methodology on a real use-case study of patients with asthma revealed the benefit of the different phases for cNLP system evaluations. By using SLiCE to adjust the number of documents needed, a meaningful and resourceful gold standard was created. In the presented use case, using as few as 519 EHRs, it was possible to evaluate the performance of the cNLP system and obtain performance metrics for the primary variable within the expected CIs. Conclusions We showed that our evaluation methodology can offer guidance to NLP experts on how to approach the evaluation of their cNLP systems. By following the five phases, NLP experts can assure the robustness of their evaluation and avoid unnecessary investment of human and financial resources. Besides the theoretical guidance, we offer SLiCE as an easy-to-use, open-source Python library.
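
To give a feel for how a gold-standard size can be tied to a target confidence interval, here is a sketch based on the textbook normal-approximation formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2. The abstract does not state which formula SLiCE implements, so this should be read as a generic illustration rather than a reimplementation; the function and parameter names are assumptions, and the formula gives the number of relevant items (for example, annotated mentions), not necessarily documents.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(expected, margin, confidence=0.95):
    """Items needed so a metric near `expected` has a CI half-width of `margin`.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2, rounded up.
    `expected` is the anticipated precision or recall, as a fraction.
    """
    if not 0 < expected < 1:
        raise ValueError("expected must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * expected * (1 - expected) / margin ** 2)


worst_case = sample_size_for_proportion(expected=0.5, margin=0.05)
high_recall = sample_size_for_proportion(expected=0.9, margin=0.05)
```

The required size is largest when the expected metric is 0.5 and shrinks as the expected value moves toward 0 or 1, which is one reason an informed prior estimate can save annotation effort.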

Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing

Journal of Medical Internet Research ◽

10.2196/jmir.2426 ◽

2013 ◽

Vol 15 (4) ◽

pp. e73 ◽

Cited By ~ 41

Author(s):

Haijun Zhai ◽

Todd Lingren ◽

Louise Deleger ◽

Qi Li ◽

Megan Kaiser ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Web 2.0 ◽

Language Processing ◽

Gold Standard ◽

High Quality ◽

Clinical Natural Language Processing ◽

Standard Development

A Natural Language Processing Tool to Extract Quantitative Smoking Status from Clinical Narratives

10.1101/2020.10.30.20223511 ◽

2020 ◽

Author(s):

Xi Yang ◽

Hanyuan Yang ◽

Tianchen Lyu ◽

Shuang Yang ◽

Yi Guo ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Smoking Status ◽

Lung Cancer Screening ◽

Rule Engine ◽

Training Set ◽

Test Set ◽

Clinical Notes ◽

Natural Language Processing Tool

Abstract This study presents a natural language processing (NLP) tool to extract quantitative smoking information (e.g., Pack-Year, Quit Year, Smoking Year, and Pack per Day) from clinical notes and standardize it into Pack-Year units. We annotated a corpus of 200 clinical notes from patients who had low-dose CT imaging procedures for lung cancer screening and developed an NLP system using a two-layer rule-engine structure. We divided the 200 notes into a training set and a test set and developed the NLP system using only the training set. The experimental results on the test set showed that our NLP system achieved best F1 scores of 0.963 and 0.946 for lenient and strict evaluation, respectively. Note Accepted as a presentation at the 2020 IEEE International Conference on Healthcare Informatics (ICHI) Workshop on Health Natural Language Processing (HealthNLP 2020). https://ohnlp.github.io/HealthNLP2020/healthnlp2020#.
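
Standardizing to pack-years relies on the usual definition: packs smoked per day multiplied by years smoked. The sketch below shows that conversion with a few regular expressions. It is a simplified single-pass illustration, not the two-layer rule engine the abstract describes; the patterns and the function name are assumptions, and the sketch takes the first year count it finds, so phrases such as "quit 5 years ago" would mislead it.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"

# Hypothetical patterns for three ways smoking quantity may be written.
PACK_YEARS = re.compile(NUMBER + r"\s*[- ]?pack[- ]?years?", re.I)
PACKS_PER_DAY = re.compile(NUMBER + r"\s*(?:ppd|packs?\s*(?:per|a|/)\s*day)", re.I)
YEARS = re.compile(NUMBER + r"\s*(?:years?|yrs?)", re.I)


def pack_years(text):
    """Return smoking exposure in pack-years, or None if it cannot be derived."""
    direct = PACK_YEARS.search(text)
    if direct:
        # The note already states pack-years explicitly.
        return float(direct.group(1))
    packs = PACKS_PER_DAY.search(text)
    years = YEARS.search(text)
    if packs and years:
        # pack-years = packs per day x years smoked
        return float(packs.group(1)) * float(years.group(1))
    return None


derived = pack_years("Smoked 1.5 packs per day for 20 years.")
stated = pack_years("30 pack-year smoking history.")
```

In the first example the value is derived from the rate and duration; in the second it is read directly from the note.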

On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?

Proceedings of the 8th International Conference on Agents and Artificial Intelligence ◽

10.5220/0005824805400545 ◽

2016 ◽

Cited By ~ 1

Author(s):

Vojtěch Kovář ◽

Miloš Jakubíček ◽

Aleš Horák

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Evaluation Methodology

Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus

JCO Clinical Cancer Informatics ◽

10.1200/cci.19.00147 ◽

2020 ◽

pp. 383-391 ◽

Cited By ~ 1

Author(s):

Yalun Li ◽

Yung-Hung Luo ◽

Jason A. Wampfler ◽

Samuel M. Rubinstein ◽

Firat Tiryaki ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Data Extraction ◽

Complete Response ◽

Time Interval ◽

Data Set ◽

Clinical Notes ◽

Standard Data

PURPOSE Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous, and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analyses for timely, reliable results. In particular, research dealing with clinical notes relevant to patient care and outcome has seldom been conducted, due to the complexity of data extraction and accurate annotation. RECIST is a set of widely accepted research criteria to evaluate tumor response in patients undergoing antineoplastic therapy. The aim of this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for development of natural language processing tools. METHODS We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, and time interval and note-type filters for identifying RECIST information and established a gold standard data set for patient outcome research. RESULTS Keywords reduced clinical notes to 37,406, and using four note types within 12 months postdiagnosis further reduced the number of notes to 5,005 that were manually annotated, which covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation purposes) contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors. CONCLUSION We have established a gold standard data set to accommodate development of biomedical informatics tools in accelerating research into antineoplastic therapeutic response.
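
The note-reduction procedure in the Methods and Results is a cascade of three filters: keywords, note type, and time since diagnosis. A minimal sketch of such a cascade is shown below. The study's actual keyword list and its four note types are not given in the abstract, so the keywords (borrowed from the four response end points) and the note-type names here are hypothetical, as are the `Note` class and the function name.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical filter settings; the study's real lists are not in the abstract.
KEYWORDS = ("complete response", "partial response",
            "stable disease", "progressive disease")
NOTE_TYPES = {"oncology consult", "progress note"}


@dataclass
class Note:
    patient_id: str
    note_type: str
    written: date
    text: str


def select_notes(notes, diagnosis_dates, window_days=365):
    """Keep notes passing the keyword, note-type, and time-window filters."""
    selected = []
    for note in notes:
        text = note.text.lower()
        if not any(keyword in text for keyword in KEYWORDS):
            continue  # filter 1: no response keyword present
        if note.note_type not in NOTE_TYPES:
            continue  # filter 2: note type outside the chosen set
        diagnosed = diagnosis_dates.get(note.patient_id)
        if diagnosed is None:
            continue  # no diagnosis date, so the window cannot be checked
        if not diagnosed <= note.written <= diagnosed + timedelta(days=window_days):
            continue  # filter 3: outside the post-diagnosis window
        selected.append(note)
    return selected


# Hypothetical example: only the first note passes all three filters.
diagnosis_dates = {"p1": date(2015, 1, 10)}
notes = [
    Note("p1", "progress note", date(2015, 6, 1), "CT shows partial response."),
    Note("p1", "nursing note", date(2015, 6, 2), "Stable disease per oncology."),
    Note("p1", "progress note", date(2017, 3, 1), "Progressive disease noted."),
    Note("p1", "progress note", date(2015, 7, 1), "Patient tolerating therapy."),
]
kept = select_notes(notes, diagnosis_dates)
```

Applying the cheapest filter first keeps the amount of text that needs manual annotation small, which is the point of the reduction from 55,120 notes to 5,005.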
