Do Neural Information Extraction Algorithms Generalize Across Institutions?

PURPOSE Natural language processing (NLP) techniques have been adopted to reduce the curation costs of electronic health records. However, studies have questioned whether such techniques can be applied to data from previously unseen institutions. We investigated the performance of a common neural NLP algorithm on data from both known and heldout (ie, institutions whose data were withheld from the training set and only used for testing) hospitals. We also explored how diversity in the training data affects the system’s generalization ability. METHODS We collected 24,881 breast pathology reports from seven hospitals and manually annotated them with nine key attributes that describe types of atypia and cancer. We trained a convolutional neural network (CNN) on annotations from either only one (CNN1), only two (CNN2), or only four (CNN4) hospitals. The trained systems were tested on data from five organizations, including both known and heldout ones. For every setting, we provide the accuracy scores as well as the learning curves that show how much data are necessary to achieve good performance and generalizability. RESULTS The system achieved a cross-institutional accuracy of 93.87% when trained on reports from only one hospital (CNN1). Performance improved to 95.7% and 96%, respectively, when the system was trained on reports from two (CNN2) and four (CNN4) hospitals. The introduction of diversity during training did not lead to improvements on the known institutions, but it boosted performance on the heldout institutions. When tested on reports from heldout hospitals, CNN4 outperformed CNN1 and CNN2 by 2.13% and 0.3%, respectively. CONCLUSION Real-world scenarios require that neural NLP approaches scale to data from previously unseen institutions. We show that a common neural NLP algorithm for information extraction can achieve this goal, especially when diverse data are used during training.

Download Full-text

Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity

JAMIA Open ◽

10.1093/jamiaopen/ooab085 ◽

2021 ◽

Vol 4 (3) ◽

Author(s):

Briton Park ◽

Nicholas Altieri ◽

John DeNero ◽

Anobel Y Odisho ◽

Bin Yu

Keyword(s):

Natural Language ◽

Information Extraction ◽

Transfer Learning ◽

Language Processing ◽

Training Data ◽

Accurate Information ◽

Pathology Report ◽

Learning Methods ◽

String Similarity ◽

Pathology Reports

Abstract Objective We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations which give both location-based information and document level labels for each pathology report. Materials and Methods Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state-of-the-art including conventional machine learning methods as well as deep learning methods. Results For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. Conclusions Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.

Download Full-text

Adapting SVM for data sparseness and imbalance: a case study in information extraction

Natural Language Engineering ◽

10.1017/s1351324908004968 ◽

2009 ◽

Vol 15 (2) ◽

pp. 241-271 ◽

Cited By ~ 31

Author(s):

YAOYONG LI ◽

KALINA BONTCHEVA ◽

HAMISH CUNNINGHAM

Keyword(s):

Active Learning ◽

Language Learning ◽

Information Extraction ◽

Language Processing ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Training Data ◽

Support Vector ◽

Passive Learning ◽

Wide Range

AbstractSupport Vector Machines (SVM) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVM more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark data sets used for evaluation and comparison of machine learning algorithms for IE. In addition, we also evaluate the token based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications1.

Download Full-text

Pathologic findings in reduction mammoplasty procedures identified by natural language processing of breast pathology reports: A surrogate for the population incidence of cancer and high risk lesions.

Journal of Clinical Oncology ◽

10.1200/jco.2018.36.15_suppl.e13569 ◽

2018 ◽

Vol 36 (15_suppl) ◽

pp. e13569-e13569 ◽

Cited By ~ 2

Author(s):

Francisco Acevedo ◽

Rong Tang ◽

Suzanne Coopey ◽

Adam Yala ◽

Regina Barzilay ◽

...

Keyword(s):

Natural Language Processing ◽

High Risk ◽

Natural Language ◽

Language Processing ◽

Reduction Mammoplasty ◽

Breast Pathology ◽

Incidence Of Cancer ◽

Population Incidence ◽

Pathologic Findings ◽

Pathology Reports

Download Full-text

The feasibility of using natural language processing to extract clinical information from breast pathology reports

Journal of Pathology Informatics ◽

10.4103/2153-3539.97788 ◽

2012 ◽

Vol 3 (1) ◽

pp. 23 ◽

Cited By ~ 41

Author(s):

KevinS Hughes ◽

JullietteM Buckley ◽

SuzanneB Coopey ◽

John Sharko ◽

Fernanda Polubriaginof ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Information ◽

Breast Pathology ◽

Pathology Reports

Download Full-text

Can We Survive without Labelled Data in NLP? Transfer Learning for Open Information Extraction

Applied Sciences ◽

10.3390/app10175758 ◽

2020 ◽

Vol 10 (17) ◽

pp. 5758

Author(s):

Injy Sarhan ◽

Marco Spruit

Keyword(s):

Information Extraction ◽

Transfer Learning ◽

Language Processing ◽

Relation Extraction ◽

Training Data ◽

Open Information Extraction ◽

Comparable Performance ◽

Medical Domain ◽

Inductive Transfer Learning ◽

F Measure

Various tasks in natural language processing (NLP) suffer from lack of labelled training data, which deep neural networks are hungry for. In this paper, we relied upon features learned to generate relation triples from the open information extraction (OIE) task. First, we studied how transferable these features are from one OIE domain to another, such as from a news domain to a bio-medical domain. Second, we analyzed their transferability to a semantically related NLP task, namely, relation extraction (RE). We thereby contribute to answering the question: can OIE help us achieve adequate NLP performance without labelled data? Our results showed comparable performance when using inductive transfer learning in both experiments by relying on a very small amount of the target data, wherein promising results were achieved. When transferring to the OIE bio-medical domain, we achieved an F-measure of 78.0%, only 1% lower when compared to traditional learning. Additionally, transferring to RE using an inductive approach scored an F-measure of 67.2%, which was 3.8% lower than training and testing on the same task. Hereby, our analysis shows that OIE can act as a reliable source task.

Download Full-text

Development of an algorithm to identify metastatic prostate cancer in electronic medical records using natural language processing.

Journal of Clinical Oncology ◽

10.1200/jco.2014.32.30_suppl.164 ◽

2014 ◽

Vol 32 (30_suppl) ◽

pp. 164-164 ◽

Cited By ~ 1

Author(s):

Lauren P. Wallner ◽

Julia R. Dibello ◽

Bonnie H. Li ◽

Chengyi Zheng ◽

Wei Yu ◽

...

Keyword(s):

Prostate Cancer ◽

Natural Language Processing ◽

Language Processing ◽

Metastatic Prostate Cancer ◽

Metastatic Cancer ◽

Quality Care ◽

Training Set ◽

High Quality ◽

Radiology Reports ◽

Pathology Reports

164 Background: Prostate cancer patients who develop metastases are a difficult population to identify through administrative diagnostic codes, due to their protracted time to metastases, limited survival and the inconsistent use of specific codes. As a result, research that is needed to inform the delivery of high-quality care in this setting is limited. Therefore, the goal of this study was to develop an algorithm, which utilizes EMR data to identify men who progress to metastatic prostate cancer after diagnosis using natural language processing (NLP). Methods: An electronic algorithm was developed to search unstructured text using NLP to identify progression to metastases among men with a diagnosis of prostate cancer between 1992 and 2010 in a large, diverse cohort of men who were part of an ongoing study focused on prostate cancer mortality. A training set of 449 men who were diagnosed as early stage prostate cancer was used for development. Pathology, radiology and clinic notes were searched from diagnosis until death or loss to follow-up. Pathology reports were searched for mention of adenocarcinoma in the metastatic lesion, radiology reports were searched for abnormal findings consistent with metastases, and clinic notes were searched for mentions of increasing pain or narcotic use related to metastases. Each NLP component was validated against manual review of the corresponding records. Results: Of the 449 men in the training set, 40 (8.9%) were found to have metastatic prostate cancer. The majority of cases had evidence of metastases in their clinic notes (98%). Radiology reports identified 18% of cases, and pathology reports identified 5%. Of the 40 cases identified, 25% did not have a corresponding ICD-9 codes for metastatic cancer. However, 7.5% used ADT, 37.5% had increasing oncology visits and 22.5% had rapidly rising PSA levels. Conclusions: Our results suggest that NLP can be used to identify men with metastatic prostate cancer in the EMR more accurately than diagnosis codes alone. The automated identification of patients with metastatic cancer facilitates quality of care research in this setting to ensure the delivery of appropriate and high-quality care.

Download Full-text

Deep learning for natural language processing of free-text pathology reports: a comparison of learning curves

BMJ Innovations ◽

10.1136/bmjinnov-2019-000410 ◽

2020 ◽

Vol 6 (4) ◽

pp. 192-198

Author(s):

Joeky T Senders ◽

David J Cote ◽

Alireza Mehrtash ◽

Robert Wiemann ◽

William B Gormley ◽

...

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Learning Curve ◽

Language Processing ◽

Learning Curves ◽

Free Text ◽

Open Source Framework ◽

Training Examples ◽

Pathology Reports

IntroductionAlthough clinically derived information could improve patient care, its full potential remains unrealised because most of it is stored in a format unsuitable for traditional methods of analysis, free-text clinical reports. Various studies have already demonstrated the utility of natural language processing algorithms for medical text analysis. Yet, evidence on their learning efficiency is still lacking. This study aimed to compare the learning curves of various algorithms and develop an open-source framework for text mining in healthcare.MethodsDeep learning and regressions-based models were developed to determine the histopathological diagnosis of patients with brain tumour based on free-text pathology reports. For each model, we characterised the learning curve and the minimal required training examples to reach the area under the curve (AUC) performance thresholds of 0.95 and 0.98.ResultsIn total, we retrieved 7000 reports on 5242 patients with brain tumour (2316 with glioma, 1412 with meningioma and 1514 with cerebral metastasis). Conventional regression and deep learning-based models required 200–400 and 800–1500 training examples to reach the AUC performance thresholds of 0.95 and 0.98, respectively. The deep learning architecture utilised in the current study required 100 and 200 examples, respectively, corresponding to a learning capacity that is two to eight times more efficient.ConclusionsThis open-source framework enables the development of high-performing and fast learning natural language processing models. The steep learning curve can be valuable for contexts with limited training examples (eg, rare diseases and events or institutions with lower patient volumes). The resultant models could accelerate retrospective chart review, assemble clinical registries and facilitate a rapid learning healthcare system.

Download Full-text

Clinical accuracy of information extracted from prostate needle biopsy pathology reports using natural language processing.

Journal of Clinical Oncology ◽

10.1200/jco.2021.39.15_suppl.1557 ◽

2021 ◽

Vol 39 (15_suppl) ◽

pp. 1557-1557

Author(s):

Risa Liang Wong ◽

Medha Sagar ◽

Jacob Hoffman ◽

Claire Huang ◽

Angelica Lerma ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Needle Biopsy ◽

Proof Of Concept ◽

Rule Based ◽

Prostate Needle Biopsy ◽

Clinical Accuracy ◽

Pathology Reports

1557 Background: Patients with prostate cancer are diagnosed through a prostate needle biopsy (PNB). Information contained in PNB pathology reports is critical for informing clinical risk stratification and treatment; however, patient comprehension of PNB pathology reports is low, and formats vary widely by institution. Natural language processing (NLP) models trained to automatically extract key information from unstructured PNB pathology reports could be used to generate personalized educational materials for patients in a scalable fashion and expedite the process of collecting registry data or screening patients for clinical trials. As proof of concept, we trained and tested four NLP models for accuracy of information extraction. Methods: Using 403 positive PNB pathology reports from over 80 institutions, we converted portable document formats (PDFs) into text using the Tesseract optical character recognition (OCR) engine, removed protected health information using the Philter open-source tool, cleaned the text with rule-based methods, and annotated clinically relevant attributes as well as structural attributes relevant to information extraction using the Brat Rapid Annotation Tool. Text pre-processing for classification and extraction was done using Scispacy and rule-based methods. Using a 75:25 train:test split (N = 302, 101), we tested conditional random field (CRF), support vector machine (SVM), bidirectional long-short term memory network (Bi-LSTM), and Bi-LSTM-CRF models, reserving 46 training reports as a validation subset for the latter two models. Model-extracted variables were compared with values manually obtained from the unprocessed PDF reports for clinical accuracy. Results: Clinical accuracy of model-extracted variables is reported in the Table. CRF was the highest performing model, with accuracies of 97% for Gleason grade, 82% for percentage of positive cores ( < 50% vs. ≥50%), 90% for perineural or lymphovascular invasion, and 100% for presence of non-acinar carcinoma histology. On manual review of inaccurate results, model performance was limited by PDF image quality, errors in OCR processing of tables or columns, and practice variability in reporting number of biopsy cores. Conclusions: Our results demonstrate successful proof of concept for the use of NLP models in accurately extracting information from PNB pathology reports, though further optimization is needed before use in clinical practice.[Table: see text]

Download Full-text

2. Unlocking information in electronic health records using natural language processing: a case study in medication information extraction

Text Mining of Web-based Medical Content ◽

10.1515/9781614513902.33 ◽

2014 ◽

Author(s):

Hua Xu ◽

C. Denny Joshua

Keyword(s):

Natural Language Processing ◽

Electronic Health Records ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Health Records ◽

Medication Information ◽

Electronic Health

Download Full-text

A Comprehensive Typing System for Information Extraction from Clinical Narratives

10.1101/19009118 ◽

2019 ◽

Author(s):

J. Harry Caufield ◽

Yichao Zhou ◽

Yunsheng Bai ◽

David A. Liem ◽

Anders O. Garlid ◽

...

Keyword(s):

Natural Language Processing ◽

Information Extraction ◽

Language Processing ◽

Real World ◽

Case Reports ◽

Free Text ◽

Text Documents ◽

Health Records ◽

Clinical Text ◽

Typing System

AbstractWe have developed ACROBAT (Annotation for Case Reports using Open Biomedical Annotation Terms), a typing system for detailed information extraction from clinical text. This resource supports detailed identification and categorization of entities, events, and relations within clinical text documents, including clincal case reports (CCRs) and the free-text components of electronic health records. Using ACROBAT and the text of 200 CCRs, we annotated a wide variety of real-world clinical disease presentations. The resulting dataset, MACCROBAT2018, is a rich collection of annotated clinical language appropriate for training biomedical natural language processing systems.

Download Full-text