Methods of Information in Medicine
Latest Publications


TOTAL DOCUMENTS: 3226 (five years: 100)

H-INDEX: 57 (five years: 4)

Published by Thieme (Methods of Information in Medicine)

ISSN: 2511-705X, 0026-1270

Author(s): Priya H. Dedhia, Kallie Chen, Yiqiang Song, Eric LaRose, Joseph R. Imbus, ...

Abstract

Objective: Natural language processing (NLP) systems convert unstructured text into analyzable data. Here, we describe the performance of NLP tools that capture granular details on nodules from thyroid ultrasound (US) reports and reveal critical issues with reporting language.

Methods: We iteratively developed NLP tools using the clinical Text Analysis and Knowledge Extraction System (cTAKES) and thyroid US reports from 2007 to 2013. We incorporated nine nodule features for NLP extraction. Next, we evaluated the precision, recall, and accuracy of our NLP tools using a separate set of US reports from an academic medical center (A) and a regional health care system (B) during the same period. Two physicians manually annotated each test-set report, and a third physician adjudicated discrepancies. The adjudicated “gold standard” was then used to evaluate NLP performance on the test set.

Results: A total of 243 thyroid US reports contained 6,405 data elements. Inter-annotator agreement for all elements was 91.3%. Compared with the gold standard, overall recall of the NLP tool was 90%. NLP recall for thyroid lobe or isthmus characteristics was: laterality 96% and size 95%. NLP accuracy for nodule characteristics was: laterality 92%, size 92%, calcifications 76%, vascularity 65%, echogenicity 62%, contents 76%, and borders 40%. NLP recall for presence or absence of lymphadenopathy was 61%. Reporting style accounted for 18% of errors; for example, the word “heterogeneous” referred interchangeably to nodule contents or echogenicity. While nodule dimensions and laterality were described consistently, US reports described contents, echogenicity, vascularity, calcifications, borders, and lymphadenopathy only 46, 41, 17, 15, 9, and 41% of the time, respectively. Most nodule characteristics were equally likely to be described at hospital A and hospital B.

Conclusions: NLP can automate extraction of critical information from thyroid US reports. However, ambiguous and incomplete reporting language hinders the performance of NLP systems regardless of institutional setting. Standardized or synoptic thyroid US reports could improve NLP performance.
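The per-feature scoring against an adjudicated gold standard described above can be sketched as follows. This is an illustrative reconstruction, not the authors' cTAKES pipeline; the function name, data shapes, and the convention of counting a wrong extracted value as a missed extraction are our assumptions.

```python
# Hedged sketch: scoring NLP extractions for one nodule feature (e.g.,
# laterality) against gold-standard annotations. None means "not extracted".
def score_feature(gold, predicted):
    """gold, predicted: dicts mapping report ID -> extracted value or None.
    Returns (precision, recall, accuracy)."""
    tp = fp = fn = tn = 0
    for report_id, g in gold.items():
        p = predicted.get(report_id)
        if g is not None and p == g:
            tp += 1          # value extracted and correct
        elif g is not None:
            fn += 1          # value present but missed or wrong
        elif p is not None:
            fp += 1          # value extracted where none exists
        else:
            tn += 1          # correctly extracted nothing
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return precision, recall, accuracy
```

In practice each of the nine nodule features would be scored separately, which is how per-characteristic figures such as "laterality 92%" arise.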


Author(s): Michael Merry, Patricia Jean Riddle, Jim Warren

Abstract

Background: Receiver operating characteristic (ROC) analysis is commonly used for comparing models and humans; however, the exact analytical techniques vary and some are flawed.

Objectives: The aim of this study is to identify common flaws in ROC analysis of human versus model performance and to address them.

Methods: We review current use and identify common errors. We also review the ROC analysis literature for more appropriate techniques.

Results: We identify concerns in three techniques: (1) using mean human sensitivity and specificity; (2) assuming humans can be approximated by ROC curves; and (3) matching sensitivity and specificity. We identify a technique from Provost et al. using dominance tables and cost-prevalence gradients that can be adapted to address these concerns.

Conclusion: Dominance tables and cost-prevalence gradients provide far greater detail when comparing the performance of models and humans, and address common failings in other approaches. This should be the standard method for such analyses moving forward.
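The dominance idea underlying the table-based comparison above can be sketched with the generic Pareto-dominance notion on (sensitivity, specificity) operating points. This is only an illustration of the concept, not the full Provost et al. method (which also involves cost-prevalence gradients); the function names are ours.

```python
# Sketch: one operating point dominates another when it is at least as good
# on both sensitivity and specificity, and strictly better on at least one.
def dominates(a, b):
    """a, b: (sensitivity, specificity) pairs."""
    sens_a, spec_a = a
    sens_b, spec_b = b
    return (sens_a >= sens_b and spec_a >= spec_b
            and (sens_a > sens_b or spec_a > spec_b))

def dominance_table(model_points, human_points):
    """For each human operating point, record whether any model point
    dominates it, and vice versa. A richer comparison than matching a
    single mean sensitivity/specificity pair."""
    return {
        "model_dominates_human": [
            any(dominates(m, h) for m in model_points) for h in human_points
        ],
        "human_dominates_model": [
            any(dominates(h, m) for h in human_points) for m in model_points
        ],
    }
```

A table like this makes explicit which individual human operating points a model actually beats, rather than collapsing readers into a single averaged point.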


Author(s): Nils-Hendrik Benning, Petra Knaup, Rüdiger Rupp

Abstract

Background: The level of physical activity (PA) of people with spinal cord injury (SCI) has an impact on long-term complications. Currently, PA is mostly assessed by interviews. Wearable activity trackers are promising tools to objectively measure PA under everyday conditions. The only off-the-shelf wearable activity tracker with specific measures for wheelchair users is the Apple Watch.

Objectives: This study analyzes the measurement performance of the Apple Watch Series 4 for wheelchair users and compares it with an earlier generation of the device.

Methods: Fifteen participants with subacute SCI during their first in-patient phase followed a test course using their wheelchairs. The number of wheelchair pushes was counted manually by visual inspection and with the Apple Watch. The difference between the Apple Watch and the rater was analyzed with the mean absolute percent error (MAPE) and a Bland–Altman plot. To compare the measurement error of Series 4 with an older generation of the device, a t-test was calculated using data for Series 1 from a former study.

Results: The average of the differences was 12.33 pushes (n = 15), whereas participants pushed the wheelchair 138.4 times on average (range 86–271 pushes). The range of differences and the Bland–Altman plot indicate an overestimation by the Apple Watch. The MAPE is 9.20%, and the t-test, testing for an effect of Series 4 on the percentage error compared with Series 1, was significant with p < 0.05.

Conclusion: Series 4 shows a significant improvement in measurement performance compared with Series 1. Series 4 can be considered a promising data source for capturing the number of wheelchair pushes on even ground. Future research should analyze the long-term measurement performance of Series 4 under everyday conditions.
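The MAPE metric used above is straightforward to compute; a minimal sketch follows. The example counts are invented for illustration and are not the study's data.

```python
# Sketch: mean absolute percent error (MAPE) of device-counted pushes
# against manually counted (reference) pushes.
def mape(manual, device):
    """manual, device: equal-length sequences of push counts per trial."""
    errors = [abs(d - m) / m for m, d in zip(manual, device)]
    return 100.0 * sum(errors) / len(errors)

manual_counts = [100, 150, 200]   # rater counts (hypothetical)
device_counts = [110, 150, 190]   # watch counts (hypothetical)
```

With these hypothetical counts the per-trial percent errors are 10%, 0%, and 5%, giving a MAPE of 5%; the study's reported MAPE of 9.20% is computed the same way over its 15 participants.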


2021, Vol. 60 (S 02), pp. e111–e119
Author(s): Linyi Li, Adela Grando, Abeed Sarker

Abstract

Background: Value sets are lists of terms (e.g., opioid medication names) and their corresponding codes from standard clinical vocabularies (e.g., RxNorm), created with the intent of supporting health information exchange and research. Value sets are manually created and often exhibit errors.

Objectives: The aim of this study is to develop a semi-automatic, data-centric natural language processing (NLP) method to assess medication-related value set correctness and to evaluate it on a set of opioid medication value sets.

Methods: We developed an NLP algorithm that utilizes value sets containing mostly true positives and true negatives to learn lexical patterns associated with the true positives, and then employs these patterns to identify potential errors in unseen value sets. We evaluated the algorithm on a set of opioid medication value sets, using the recall, precision, and F1-score metrics. We applied the trained model to assess the correctness of unseen opioid value sets based on recall. To replicate the application of the algorithm in real-world settings, a domain expert manually conducted an error analysis to identify potential system and value set errors.

Results: Thirty-eight value sets were retrieved from the Value Set Authority Center, and six (two opioid, four non-opioid) were used to develop and evaluate the system. Average precision, recall, and F1-score were 0.932, 0.904, and 0.909, respectively, on uncorrected value sets, and 0.958, 0.953, and 0.953, respectively, after manual correction of the same value sets. On 20 unseen opioid value sets, the algorithm obtained an average recall of 0.89. Error analyses revealed that the main source of system misclassifications was differences in how opioids were coded in the value sets: while the training value sets contained mostly generic names, some of the unseen value sets contained new trade names and ingredients.

Conclusion: The proposed approach is data-centric, reusable, customizable, and not resource intensive. It may help domain experts easily validate value sets.
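The lexical-pattern idea above can be caricatured very simply: learn the vocabulary of known-good terms, then flag unseen terms that share nothing with it. This is a deliberately minimal sketch under our own assumptions (token overlap as the "pattern"), not the authors' algorithm, and the drug names are invented examples.

```python
# Sketch: flag value-set terms sharing no token with learned true positives.
def learn_tokens(true_positive_terms):
    """Collect the token vocabulary of known true-positive terms."""
    tokens = set()
    for term in true_positive_terms:
        tokens.update(term.lower().split())
    return tokens

def flag_candidates(value_set, learned_tokens):
    """Return terms with no token overlap with the learned patterns;
    these are candidates for expert review, not confirmed errors."""
    return [t for t in value_set
            if not set(t.lower().split()) & learned_tokens]
```

Note the failure mode this sketch shares with the real system: terms coded with new trade names or ingredients absent from the training value sets will be flagged even when correct, which matches the error analysis reported above.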


Author(s): Ryan J. Urbanowicz, John H. Holmes, Dina Appleby, Vanamala Narasimhan, Stephen Durborow, ...

Abstract

Objective: Data harmonization is essential to integrate individual participant data from multiple sites, time periods, and trials for meta-analysis. The process of mapping terms and phrases to an ontology is complicated by typographic errors, abbreviations, truncation, and plurality. We sought to harmonize medical history (MH) and adverse event (AE) term records across 21 randomized clinical trials in pulmonary arterial hypertension and chronic thromboembolic pulmonary hypertension.

Methods: We developed and applied a semi-automated harmonization pipeline for use with domain-expert annotators, resolving ambiguous term mappings using exact and fuzzy matching. We summarized MH and AE term mapping success, including map quality measures, and imputation of a generalizing term hierarchy as defined by the applied Medical Dictionary for Regulatory Activities (MedDRA) ontology standard.

Results: Over 99.6% of both MH (N = 37,105) and AE (N = 58,170) records were successfully mapped to MedDRA low-level terms. Automated exact matching accounted for 74.9% of MH and 85.5% of AE mappings. Term recommendations from fuzzy matching in the pipeline facilitated annotator mapping of the remaining 24.9% of MH and 13.8% of AE records. Imputation of the generalized MedDRA term hierarchy was unambiguous in 85.2% of high-level terms, 99.4% of high-level group terms, and 99.5% of system organ classes in MH, and in 75% of high-level terms, 98.3% of high-level group terms, and 98.4% of system organ classes in AE.

Conclusion: This pipeline dramatically reduced the burden of manual annotation for MH and AE term harmonization and could be adapted to other data integration efforts.
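The exact-then-fuzzy mapping step described above can be sketched with the standard library's difflib as a stand-in for the study's fuzzy matcher. The dictionary terms and the 0.8 cutoff are illustrative assumptions, not the study's configuration.

```python
# Sketch: map a raw MH/AE term to a dictionary term, exact match first,
# fuzzy candidates second (to be offered to a human annotator).
import difflib

def map_term(raw, dictionary, cutoff=0.8):
    """Returns (term, 'exact'), (candidates, 'fuzzy'), or (None, 'unmapped')."""
    key = raw.strip().lower()
    lookup = {t.lower(): t for t in dictionary}
    if key in lookup:
        return lookup[key], "exact"
    candidates = difflib.get_close_matches(key, lookup, n=3, cutoff=cutoff)
    if candidates:
        return [lookup[c] for c in candidates], "fuzzy"
    return None, "unmapped"
```

Case-normalizing before matching already absorbs much of the variation (plurality and truncation are what the fuzzy stage is for), which is consistent with exact matching alone covering roughly three-quarters or more of records in the study.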


Author(s): Asa Adadey, Robert Giannini, Lorraine B. Possanza

Abstract

Background: Patient safety event reports provide valuable insight into systemic safety issues, but deriving insights from these reports requires computational tools to efficiently parse through large volumes of qualitative data. Natural language processing (NLP) combined with predictive learning provides an automated approach to evaluating these data and supporting the work of patient safety analysts.

Objectives: The objective of this study was to use NLP and machine learning techniques to develop a generalizable, scalable, and reliable approach to classifying event reports for the purpose of driving improvements in the safety and quality of patient care.

Methods: Datasets for 14 different labels (themes) were vectorized using a bag-of-words, tf-idf, or document-embeddings approach and then applied to a series of classification algorithms via a hyperparameter grid search to derive an optimized model. Reports were also analyzed for terms strongly associated with each theme using an adjusted F-score calculation.

Results: The F1 score for each optimized model ranged from 0.951 (“Fall”) to 0.544 (“Environment”). The bag-of-words approach proved optimal for 12 of the 14 labels. The naïve Bayes algorithm performed best for nine labels, linear support vector machine for three, and XGBoost for four. Labels with more distinctly associated terms performed better than less distinct themes, as shown by a Pearson's correlation coefficient of 0.634.

Conclusions: We were able to demonstrate an analytical pipeline that broadly applies NLP and predictive modeling to categorize patient safety reports from multiple facilities. This pipeline allows analysts to more rapidly identify and structure information contained in patient safety data, which can enhance the evaluation and use of this information over time.
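The tf-idf vectorization step mentioned above is shown below as a minimal standard-library sketch; the study would have used standard toolkits, and this version uses the plain log(N/df) inverse-document-frequency convention (conventions vary between libraries).

```python
# Sketch: tf-idf weights for a small corpus of tokenized reports.
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists.
    Returns one dict per document mapping token -> tf-idf weight."""
    n = len(documents)
    df = Counter()                      # document frequency per token
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n / df[tok])
            for tok, count in tf.items()
        })
    return weights
```

A token appearing in every report gets weight zero under this convention, which is exactly why tf-idf can down-weight boilerplate phrasing that bag-of-words counts would keep.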


Author(s): Kerstin Denecke, Alaa Abd-Alrazaq, Mowafa Househ, Jim Warren

Abstract

Background: In recent years, an increasing number of health chatbots have been published in app stores and described in the research literature. Given the sensitive data they process and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of these systems are reported inconsistently and without a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent comparisons of systems, and this may hamper acceptability since their reliability is unclear.

Objectives: The objective of this paper is to take an important step toward developing a health-specific chatbot evaluation framework by finding consensus on relevant metrics.

Methods: We used an adapted Delphi study design to verify and select potential metrics that we initially retrieved from a scoping review. We invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework over three survey rounds. We distinguished metrics scored relevant with high, moderate, and low consensus. The initial set comprised 26 metrics (categorized as global metrics, metrics related to response generation, metrics related to response understanding, and aesthetics).

Results: Twenty-eight experts joined the first round, and 22 (75%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set of our framework comprises mainly global metrics (e.g., ease of use, security, content accuracy), metrics related to response generation (e.g., appropriateness of responses), and metrics related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon: only moderate or low consensus was achieved for those metrics.

Conclusion: The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.


Author(s): Mengting Ji, Xiaoyun Chen, Georgi Z. Genchev, Mingyue Wei, Guangjun Yu

Abstract

Background: AI-enabled clinical decision support systems (AI + CDSSs) have been heralded as contributing greatly to the advancement of health care services. Increasing monetary funds and technical expertise are being invested in projects and proposals targeting the building and implementation of such systems. Understanding the actual implementation status of these systems in clinical practice is therefore imperative.

Objectives: The aims of this study are to understand (1) the current situation of AI + CDSS clinical implementations in Chinese hospitals and (2) concerns regarding current and future AI + CDSS implementations.

Methods: We investigated 160 tertiary hospitals from six provinces and province-level cities. Descriptive analysis, the two-sided Fisher exact test, and the Mann-Whitney U-test were utilized for analysis.

Results: Thirty-eight of the surveyed hospitals (23.75%) had implemented AI + CDSSs. There were statistical differences in grade, scale, and medical volume between the two groups of hospitals (implemented vs. not implemented AI + CDSSs, p < 0.05). On the 5-point Likert scale, 81.58% (31/38) of respondents rated their overall satisfaction with the systems from “just neutral” to “satisfied.” The three most common concerns were improvement of system functions and integration into the clinical process, data quality and availability, and methodological bias.

Conclusion: While AI + CDSSs are not yet widespread in Chinese clinical settings, professionals recognize the potential benefits and challenges of in-hospital AI + CDSSs.


Author(s): Lucía Prieto Santamaría, David Fernández Lobón, Antonio Jesús Díaz-Honrubia, Ernestina Menasalvas Ruiz, Sokratis Nifakos, ...

Abstract

Objectives: The aim of this study is to design an ontology model for the representation of assets and their features in distributed health care environments, and to allow the interchange of information about these assets through specific vocabularies based on ontologies.

Methods: Ontologies are a formal way to represent knowledge by means of triples composed of a subject, a predicate, and an object. Given the sensitivity of network assets in health care institutions, this work complies with the FAIR principles by using an ontology-based representation of information. Federated queries to the ontology systems allow users to obtain data from multiple sources (i.e., several hospitals belonging to the same public body). This representation therefore makes it possible for network administrators in health care institutions to have a clear understanding of possible threats that may emerge in the network.

Results: As a result of this work, the “Software Defined Networking Description Language—CUREX Asset Discovery Tool Ontology” (SDNDL-CAO) has been developed. This ontology uses the main concepts of network assets to represent the knowledge extracted from distributed health care environments: interface, device, port, service, etc.

Conclusion: The developed SDNDL-CAO ontology makes it possible to represent the aforementioned knowledge about distributed health care environments. Network administrators of these institutions will benefit as they will be able to monitor emerging threats in real time, something critical when managing personal medical information.
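The subject-predicate-object representation described above can be illustrated with a toy in-memory triple store and pattern query. A real deployment would use an RDF store queried with SPARQL; the asset names and `sdndl:` predicates below are invented placeholders, not terms from the actual SDNDL-CAO ontology.

```python
# Toy sketch: network assets as (subject, predicate, object) triples.
triples = [
    ("device:fw01", "rdf:type", "sdndl:Device"),
    ("device:fw01", "sdndl:hasInterface", "iface:eth0"),
    ("iface:eth0", "sdndl:hasPort", "port:443"),
    ("port:443", "sdndl:runsService", "service:https"),
]

def query(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    mimicking a basic graph pattern in a SPARQL-style query."""
    return [(ts, tp, to) for ts, tp, to in store
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]
```

Federated querying, as described above, amounts to running the same pattern against the stores of several institutions and merging the results, which is what lets administrators see threats across a whole public body's network.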


Author(s): Richard H. Epstein, Yuel-Kai Jean, Roman Dudaryk, Robert E. Freundlich, Jeremy P. Walco, ...

Abstract

Background: Interpretations of the electrocardiogram (ECG) are often prepared using software outside the electronic health record (EHR) and imported via an interface as a narrative note. Thus, natural language processing is required to create a computable representation of the findings. Challenges include misspellings, nonstandard abbreviations, jargon, and equivocation in diagnostic interpretations.

Objectives: Our objective was to develop an algorithm to reliably and efficiently extract such information and map it to the standardized ECG ontology developed jointly by the American Heart Association, the American College of Cardiology Foundation, and the Heart Rhythm Society. The algorithm was designed to be easily modifiable for use with EHRs and ECG reporting systems other than the ones studied.

Methods: An algorithm using natural language processing techniques was developed in structured query language to extract and map quantitative and diagnostic information from ECG narrative reports to the cardiology societies' standardized ECG ontology. The algorithm was developed using a training dataset of 43,861 ECG reports and applied to a test dataset of 46,873 reports.

Results: Accuracy, precision, recall, and the F1-measure were all 100% in the test dataset for the extraction of quantitative data (e.g., PR and QTc intervals, atrial and ventricular heart rates). Performance for matches in each diagnostic category in the standardized ECG ontology was above 99% in the test dataset. The processing speed was approximately 20,000 reports per minute. We externally validated the algorithm at another institution that used a different ECG reporting system and found similar performance.

Conclusion: The developed algorithm had high performance for creating a computable representation of ECG interpretations. Software and lookup tables are provided that can easily be modified for local customization and for use with other EHR and ECG reporting systems. This algorithm has utility for research and for clinical decision support where incorporation of ECG findings is desired.
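The quantitative-extraction step described above can be sketched with regular expressions; the study's algorithm is implemented in SQL against a full ontology, so this Python sketch is only illustrative, and the report text, field names, and patterns are our own assumptions.

```python
# Sketch: pull PR and QTc intervals out of a narrative ECG report.
import re

PATTERNS = {
    "pr_ms": re.compile(r"\bPR\s*(?:interval)?\s*[:=]?\s*(\d{2,3})\s*ms", re.I),
    "qtc_ms": re.compile(r"\bQTc\s*[:=]?\s*(\d{3})\s*ms", re.I),
}

def extract_intervals(report_text):
    """Return a dict of the quantitative fields found in the report;
    fields that do not appear are simply omitted."""
    out = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(report_text)
        if match:
            out[field] = int(match.group(1))
    return out
```

The tolerant whitespace and optional punctuation in the patterns hint at why the real system needs lookup tables and normalization: reporting systems vary in abbreviation and layout, and a portable extractor must absorb that variation.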

