On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?

Assessing the Performance of Clinical Natural Language Processing Systems: Development of an Evaluation Methodology

JMIR Medical Informatics ◽

10.2196/20492 ◽

2021 ◽

Vol 9 (7) ◽

pp. e20492

Author(s):

Lea Canales ◽

Sebastian Menke ◽

Stephanie Marchesseau ◽

Ariel D’Agostino ◽

Carlos del Rio-Bermudez ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Performance Metrics ◽

Evaluation Methodology ◽

Free Text ◽

Use Case ◽

Five Phases ◽

Clinical Natural Language Processing

Background Clinical natural language processing (cNLP) systems are of crucial importance due to their increasing capability in extracting clinically important information from free text contained in electronic health records (EHRs). The conversion of a nonstructured representation of a patient’s clinical history into a structured format enables medical doctors to generate clinical knowledge at a level that was not possible before. Finally, the interpretation of the insights gained provided by cNLP systems has a great potential in driving decisions about clinical practice. However, carrying out robust evaluations of those cNLP systems is a complex task that is hindered by a lack of standard guidance on how to systematically approach them. Objective Our objective was to offer natural language processing (NLP) experts a methodology for the evaluation of cNLP systems to assist them in carrying out this task. By following the proposed phases, the robustness and representativeness of the performance metrics of their own cNLP systems can be assured. Methods The proposed evaluation methodology comprised five phases: (1) the definition of the target population, (2) the statistical document collection, (3) the design of the annotation guidelines and annotation project, (4) the external annotations, and (5) the cNLP system performance evaluation. We presented the application of all phases to evaluate the performance of a cNLP system called “EHRead Technology” (developed by Savana, an international medical company), applied in a study on patients with asthma. As part of the evaluation methodology, we introduced the Sample Size Calculator for Evaluations (SLiCE), a software tool that calculates the number of documents needed to achieve a statistically useful and resourceful gold standard. Results The application of the proposed evaluation methodology on a real use-case study of patients with asthma revealed the benefit of the different phases for cNLP system evaluations. By using SLiCE to adjust the number of documents needed, a meaningful and resourceful gold standard was created. In the presented use-case, using as little as 519 EHRs, it was possible to evaluate the performance of the cNLP system and obtain performance metrics for the primary variable within the expected CIs. Conclusions We showed that our evaluation methodology can offer guidance to NLP experts on how to approach the evaluation of their cNLP systems. By following the five phases, NLP experts can assure the robustness of their evaluation and avoid unnecessary investment of human and financial resources. Besides the theoretical guidance, we offer SLiCE as an easy-to-use, open-source Python library.

Get full-text (via PubEx)

LIS4: Lesk Inspired Sense Specific Semantic Similarity using WordNet

Journal of Information & Knowledge Management ◽

10.1142/s0219649221500064 ◽

2021 ◽

pp. 2150006

Author(s):

Saravanakumar Kandasamy ◽

Aswani Kumar Cherukuri

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Gold Standard ◽

Question Answering ◽

Knowledge Based ◽

Benchmark Datasets ◽

Processing Information

Semantic similarity quantification between concepts is one of the inevitable parts in domains like Natural Language Processing, Information Retrieval, Question Answering, etc. to understand the text and their relationships better. Last few decades, many measures have been proposed by incorporating various corpus-based and knowledge-based resources. WordNet and Wikipedia are two of the Knowledge-based resources. The contribution of WordNet in the above said domain is enormous due to its richness in defining a word and all of its relationship with others. In this paper, we proposed an approach to quantify the similarity between concepts that exploits the synsets and the gloss definitions of different concepts using WordNet. Our method considers the gloss definitions, contextual words that are helping in defining a word, synsets of contextual word and the confidence of occurrence of a word in other word’s definition for calculating the similarity. The evaluation based on different gold standard benchmark datasets shows the efficiency of our system in comparison with other existing taxonomical and definitional measures.

Get full-text (via PubEx)

Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems

Database ◽

10.1093/database/bay110 ◽

2018 ◽

Vol 2018 ◽

Cited By ~ 8

Author(s):

Wasila Dahdul ◽

Prashanti Manda ◽

Hong Cui ◽

James P Balhoff ◽

T Alexander Dececchi ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard

Get full-text (via PubEx)

Textual entailment graphs

Natural Language Engineering ◽

10.1017/s1351324915000108 ◽

2015 ◽

Vol 21 (5) ◽

pp. 699-724 ◽

Cited By ~ 6

Author(s):

LILI KOTLERMAN ◽

IDO DAGAN ◽

BERNARDO MAGNINI ◽

LUISA BENTIVOGLI

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

State Of The Art ◽

Text Analytics ◽

Joint Work ◽

Gold Standard Dataset ◽

Textual Entailment ◽

Interesting Task

AbstractIn this work, we present a novel type of graphs for natural language processing (NLP), namely textual entailment graphs (TEGs). We describe the complete methodology we developed for the construction of such graphs and provide some baselines for this task by evaluating relevant state-of-the-art technology. We situate our research in the context of text exploration, since it was motivated by joint work with industrial partners in the text analytics area. Accordingly, we present our motivating scenario and the first gold-standard dataset of TEGs. However, while our own motivation and the dataset focus on the text exploration setting, we suggest that TEGs can have different usages and suggest that automatic creation of such graphs is an interesting task for the community.

Get full-text (via PubEx)

An Evaluation Methodology for Natural Language Processing Systems

10.21236/ada263301 ◽

1992 ◽

Author(s):

Jeannette G. Neal ◽

Elissa L. Feit ◽

Douglas J. Funke ◽

Christine A. Montgomery

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Evaluation Methodology

Get full-text (via PubEx)

Ambiguous and Incomplete: Natural Language Processing Reveals Problematic Reporting Styles in Thyroid Ultrasound Reports

Methods of Information in Medicine ◽

10.1055/s-0041-1740493 ◽

2022 ◽

Author(s):

Priya H. Dedhia ◽

Kallie Chen ◽

Yiqiang Song ◽

Eric LaRose ◽

Joseph R. Imbus ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Medical Center ◽

Academic Medical Center ◽

Institutional Setting ◽

Thyroid Ultrasound ◽

Test Set ◽

Regional Health Care

Abstract Objective Natural language processing (NLP) systems convert unstructured text into analyzable data. Here, we describe the performance measures of NLP to capture granular details on nodules from thyroid ultrasound (US) reports and reveal critical issues with reporting language. Methods We iteratively developed NLP tools using clinical Text Analysis and Knowledge Extraction System (cTAKES) and thyroid US reports from 2007 to 2013. We incorporated nine nodule features for NLP extraction. Next, we evaluated the precision, recall, and accuracy of our NLP tools using a separate set of US reports from an academic medical center (A) and a regional health care system (B) during the same period. Two physicians manually annotated each test-set report. A third physician then adjudicated discrepancies. The adjudicated “gold standard” was then used to evaluate NLP performance on the test-set. Results A total of 243 thyroid US reports contained 6,405 data elements. Inter-annotator agreement for all elements was 91.3%. Compared with the gold standard, overall recall of the NLP tool was 90%. NLP recall for thyroid lobe or isthmus characteristics was: laterality 96% and size 95%. NLP accuracy for nodule characteristics was: laterality 92%, size 92%, calcifications 76%, vascularity 65%, echogenicity 62%, contents 76%, and borders 40%. NLP recall for presence or absence of lymphadenopathy was 61%. Reporting style accounted for 18% errors. For example, the word “heterogeneous” interchangeably referred to nodule contents or echogenicity. While nodule dimensions and laterality were often described, US reports only described contents, echogenicity, vascularity, calcifications, borders, and lymphadenopathy, 46, 41, 17, 15, 9, and 41% of the time, respectively. Most nodule characteristics were equally likely to be described at hospital A compared with hospital B. Conclusions NLP can automate extraction of critical information from thyroid US reports. However, ambiguous and incomplete reporting language hinders performance of NLP systems regardless of institutional setting. Standardized or synoptic thyroid US reports could improve NLP performance.

Get full-text (via PubEx)

Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing

Journal of Medical Internet Research ◽

10.2196/jmir.2426 ◽

2013 ◽

Vol 15 (4) ◽

pp. e73 ◽

Cited By ~ 41

Author(s):

Haijun Zhai ◽

Todd Lingren ◽

Louise Deleger ◽

Qi Li ◽

Megan Kaiser ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Web 2.0 ◽

Language Processing ◽

Gold Standard ◽

High Quality ◽

Clinical Natural Language Processing ◽

Standard Development

Get full-text (via PubEx)

Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus

JCO Clinical Cancer Informatics ◽

10.1200/cci.19.00147 ◽

2020 ◽

pp. 383-391 ◽

Cited By ~ 1

Author(s):

Yalun Li ◽

Yung-Hung Luo ◽

Jason A. Wampfler ◽

Samuel M. Rubinstein ◽

Firat Tiryaki ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Data Extraction ◽

Complete Response ◽

Time Interval ◽

Data Set ◽

Clinical Notes ◽

Standard Data

PURPOSE Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous, and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analyses for timely, reliable results. Particularly, research dealing with clinical notes relevant to patient care and outcome is seldom conducted, due to the complexity of data extraction and accurate annotation in the past. RECIST is a set of widely accepted research criteria to evaluate tumor response in patients undergoing antineoplastic therapy. The aim for this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for development of natural language processing tools. METHODS We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, and time interval and note-type filters for identifying RECIST information and established a gold standard data set for patient outcome research. RESULTS Key words reduced clinical notes to 37,406, and using four note types within 12 months postdiagnosis further reduced the number of notes to 5,005 that were manually annotated, which covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation purpose), contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors. CONCLUSION We have established a gold standard data set to accommodate development of biomedical informatics tools in accelerating research into antineoplastic therapeutic response.

Get full-text (via PubEx)

Comparing Bayesian Models of Annotation

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00040 ◽

2018 ◽

Vol 6 ◽

pp. 571-585 ◽

Cited By ~ 3

Author(s):

Silviu Paun ◽

Bob Carpenter ◽

Jon Chamberlain ◽

Dirk Hovy ◽

Udo Kruschwitz ◽

...

Keyword(s):

Natural Language Processing ◽

Model Selection ◽

Natural Language ◽

Language Processing ◽

Gold Standard ◽

Predictive Accuracy ◽

Item Difficulty ◽

Bayesian Models ◽

Majority Voting ◽

Model Based Analysis

The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item difficulties and error patterns. Traditionally, majority voting was used for 1, and coefficients of agreement for 2 and 3. Lately, model-based analysis of corpus annotations have proven better at all three tasks. But there has been relatively little work comparing them on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.

Get full-text (via PubEx)

117. Natural Language Processing: An Automated Alternative to Determining Inappropriate Group A Streptococcal Testing

Open Forum Infectious Diseases ◽

10.1093/ofid/ofaa439.162 ◽

2020 ◽

Vol 7 (Supplement_1) ◽

pp. S72-S72

Author(s):

Brian R Lee ◽

Alaina Linafelter ◽

Alaina Burns ◽

Allison Burris ◽

Heather Jones ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sore Throat ◽

Language Processing ◽

Gold Standard ◽

Chart Review ◽

Clinical Symptoms ◽

Group A ◽

Sensitivity Specificity

Abstract Background Acute pharyngitis is one of the most common causes of pediatric health care visits, accounting for approximately 12 million ambulatory care visits each year. Rapid antigen detection tests (RADTs) for Group A Streptococcus (GAS) are one of the most commonly ordered tests in the ambulatory settings. Approximately 40–60% of RADTs are estimated to be inappropriate. Determination of inappropriate RADT frequently requires time-intensive chart reviews. The purpose of this study was to determine if natural language processing (NLP) can provide an accurate and automated alternative for assessing RADT inappropriateness. Methods Patients ≥ 3 years of age who received an RADT while evaluated in our EDs/UCCs between April 2018 and September 2018 were identified. A manual chart review was completed on a 10% random sample to determine the presence of sore throat or viral symptoms (i.e., conjunctivitis, rhinorrhea, cough, diarrhea, hoarse voice, and viral exanthema). Inappropriate RADT was defined as either absence of sore throat or reporting 2 or more viral symptoms. An NLP algorithm was developed independently to assign the presence/absence of symptoms and RADT inappropriateness. The NLP sensitivity/specificity was calculated using the manual chart review sample as the gold standard. Results Manual chart review was completed on 720 patients, of which 320 (44.4%) were considered to have an inappropriate RADT. When compared to the manual review, the NLP approach showed high sensitivity (se) and specificity (sp) when assigning inappropriateness (88.4% and 90.0%, respectively). Optimal sensitivity/specificity was also observed for select symptoms, including sore throat (se: 92.9%, sp: 92.5%), cough (se: 94.5%, sp: 96.5%), and rhinorrhea (se: 86.1%, sp: 95.3%). The prevalence of clinical symptoms was similar when running NLP on subsequent, independent validation sets. After validating the NLP algorithm, a long term monthly trend report was developed. Figure Inappropriate GAS RADTs Determined by NLP, June 2018-May 2020 Conclusion An NLP algorithm can accurately identify inappropriate RADT when compared to a gold standard. Manual chart review requires dozens of hours to complete. In contrast, NLP requires only a couple of minutes and offers the potential to calculate valid metrics that are easily scaled-up to help monitor comprehensive, long-term trends. Disclosures Brian R. Lee, MPH, PhD, Merck (Grant/Research Support)

Get full-text (via PubEx)