Cabernet: A Question-and-Answer System to Extract Data from Free-Text Pathology Reports (Preprint)

2021 ◽  
Author(s):  
Joseph Ross Mitchell ◽  
Phillip Szepietowski ◽  
Rachel Howard ◽  
Phillip Reisman ◽  
Jennie D. Jones ◽  
...  

BACKGROUND Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, Bidirectional Encoder Representations from Transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering and named entity recognition. OBJECTIVE To develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text pathology reports. METHODS We pursued three specific aims: 1) extract accurate tumor site and histology descriptions from free-text pathology reports; 2) accommodate the diverse terminology used to indicate the same pathology; and 3) provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients, comprising 121 million words. Next, we trained a Q&A “head” that would connect to, and work with, the pathology language model to answer pathology questions. Our Q&A system was designed to search each pathology report for the answers to two predefined questions: 1) “What organ contains the tumor?”; and 2) “What is the kind of tumor or carcinoma?”. This involved supervised training on 8,197 pathology reports, each with ground-truth answers to these two questions determined by Certified Tumor Registrars. The dataset included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict ICD-O-3 site and histology codes. 
This involved fine-tuning two additional BERT models: one to predict site codes, and a second to predict histology codes. Our final system is a network of three BERT-based models, which we call caBERTnet (pronounced “Cabernet”). We evaluated caBERTnet using a sequestered test dataset of 2,050 pathology reports with ground-truth answers determined by Certified Tumor Registrars. RESULTS caBERTnet’s accuracies for predicting group-level site and histology codes were 93.5% and 97.7%, respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training dataset were 93.6% and 95.4%, respectively. CONCLUSIONS This is the first time an NLP system has achieved expert-level performance predicting ICD-O-3 codes across a broad range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.
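The three-stage flow described above (Q&A phrase extraction followed by site- and histology-code prediction) can be sketched schematically. In this sketch the fine-tuned BERT models are replaced by toy keyword searches and lookup tables, so the phrases and ICD-O-3 codes shown are purely illustrative, not caBERTnet's actual behavior:

```python
def answer_questions(report_text):
    """Stage 1 (Q&A model): extract site and histology phrases.
    The real system runs a fine-tuned BERT Q&A head; a keyword
    search stands in for it here."""
    text = report_text.lower()
    site = "lung" if "lung" in text else "unknown"
    histology = "adenocarcinoma" if "adenocarcinoma" in text else "unknown"
    return site, histology

# Stages 2 and 3: map the extracted phrases to ICD-O-3 codes.
# These dictionaries are illustrative stand-ins for the two
# fine-tuned code-prediction models.
SITE_CODES = {"lung": "C34.9", "unknown": None}
HISTOLOGY_CODES = {"adenocarcinoma": "8140/3", "unknown": None}

def caBERTnet_sketch(report_text):
    site, histology = answer_questions(report_text)
    return SITE_CODES.get(site), HISTOLOGY_CODES.get(histology)

site_code, hist_code = caBERTnet_sketch(
    "Right lung, upper lobe, biopsy: invasive adenocarcinoma."
)
```

The point of the sketch is the architecture: phrase extraction and code prediction are separate models, so each can be trained and evaluated independently.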

2021 ◽  
Vol 21 (S2) ◽  
Author(s):  
Feihong Yang ◽  
Xuwen Wang ◽  
Hetong Ma ◽  
Jiao Li

Abstract Background Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To lower the barrier to using transformer-based models in medical language understanding and to extend the scikit-learn toolkit into deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. Methods In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models on the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is generated automatically by transformers-sklearn from the annotated corpus. Newcomers only need to prepare the dataset; the model framework and training methods are predefined in transformers-sklearn. Results We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average code size of our scripts is 45 lines/task, one-sixth the size of the corresponding transformers scripts. 
The experimental results show that transformers-sklearn, based on pretrained BERT models, achieved macro F1 scores of 0.8225, 0.8703 and 0.6908 on the TrialClassification, BC5CDR and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, consistent with the results of transformers. Conclusions The proposed toolkit could help newcomers address medical language understanding tasks using the familiar scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn.
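The fit/score/predict interface described above can be illustrated with a scikit-learn-style skeleton. The class and parameter names below follow the abstract, but the internals are a trivial majority-class stand-in, not the toolkit's actual BERT fine-tuning:

```python
from collections import Counter

class BERTologyClassifierSketch:
    """Sketch of the three-method interface transformers-sklearn exposes.
    A real instance would load and fine-tune the transformer named by
    model_name_or_path; here fit() just memorizes the majority label."""

    def __init__(self, model_name_or_path="bert-base-uncased", model_type="bert"):
        self.model_name_or_path = model_name_or_path
        self.model_type = model_type
        self._majority = None

    def fit(self, X, y):
        # Stand-in for fine-tuning on the training dataset (X, y).
        self._majority = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Stand-in for inference on the test dataset.
        return [self._majority for _ in X]

    def score(self, X, y):
        # Accuracy of the fitted model on a labeled dataset.
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

clf = BERTologyClassifierSketch().fit(["text a", "text b", "text c"], [1, 1, 0])
```

The appeal of this design is that swapping a keyword model for a transformer changes nothing in the calling code, which is exactly what lets newcomers reuse their scikit-learn habits.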


2022 ◽  
Author(s):  
John Caskey ◽  
Iain L McConnell ◽  
Madeline Oguss ◽  
Dmitriy Dligach ◽  
Rachel Kulikoff ◽  
...  

BACKGROUND In Wisconsin, COVID-19 case interview forms contain free-text fields that need to be mined to identify potential outbreaks for targeted policy making. We developed an automated pipeline that ingests the free text into a pre-trained neural language model to identify businesses and facilities as potential outbreak sites. OBJECTIVE We aim to examine the performance of our pipeline. METHODS Data on cases of COVID-19 were extracted from the Wisconsin Electronic Disease Surveillance System (WEDSS) for Dane County between July 1, 2020, and June 30, 2021. Features from the case interview forms were fed into a Bidirectional Encoder Representations from Transformers (BERT) model that was fine-tuned for named entity recognition (NER). We also developed a novel location-mapping tool to provide addresses for the relevant named entities. The pipeline was validated against known outbreaks that had already been investigated and confirmed. RESULTS There were 46,898 cases of COVID-19, with 4,183,273 total BERT tokens and 15,051 unique tokens. The recall and precision of the NER tool were 0.67 (95% CI: 0.66-0.68) and 0.55 (95% CI: 0.54-0.57), respectively. For the location-mapping tool, the recall and precision were both 0.93 (95% CI: 0.92-0.95). Across monthly intervals, the NER tool identified more potential clusters than were confirmed in the WEDSS system. CONCLUSIONS We developed a novel pipeline of tools that identified existing outbreaks and novel clusters with associated addresses. Our pipeline ingests data from a statewide database and may be deployed to assist local health departments with targeted interventions. CLINICALTRIAL Not applicable
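The location-mapping step can be pictured as a normalized lookup from a recognized facility name to an address. The address-book entries below are invented for illustration; the authors' actual tool resolves NER output for Dane County against real address data:

```python
# Illustrative stand-in for the location-mapping tool: a normalized
# facility-name -> address table. Entries are made up for this sketch.
ADDRESS_BOOK = {
    "acme grocery": "123 Main St, Madison, WI",
    "lakeview gym": "45 Shore Rd, Madison, WI",
}

def map_location(entity):
    """Map a recognized facility entity to an address, tolerating
    case and whitespace differences in the extracted span."""
    key = " ".join(entity.lower().split())
    return ADDRESS_BOOK.get(key)
```

Even this toy version shows why normalization matters: NER spans extracted from free text rarely match a reference table character-for-character.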


Author(s):  
Keno K Bressem ◽  
Lisa C Adams ◽  
Robert A Gaudin ◽  
Daniel Tröltzsch ◽  
Bernd Hamm ◽  
...  

Abstract Motivation The development of deep, bidirectional transformers such as Bidirectional Encoder Representations from Transformers (BERT) has led to new state-of-the-art results on several Natural Language Processing (NLP) benchmarks. In radiology especially, large amounts of free-text data are generated in the daily clinical workflow. These report texts could be of particular use for generating labels for machine learning, especially for image classification. However, as report texts are mostly unstructured, advanced NLP methods are needed to enable accurate text classification. While neural networks can be used for this purpose, they must first be trained on large amounts of manually labelled data to achieve good results. In contrast, BERT models can be pre-trained on unlabelled data and then require fine-tuning on only a small amount of manually labelled data to achieve even better results. Results Using BERT to identify the most important findings in intensive care chest radiograph reports, we achieve areas under the receiver operating characteristic curve of 0.98 for congestion, 0.97 for effusion, 0.97 for consolidation and 0.99 for pneumothorax, surpassing the accuracy of previous approaches with comparatively little annotation effort. Our approach could therefore help to improve information extraction from free-text medical reports. Availability and implementation We make the source code for fine-tuning the BERT models freely available at https://github.com/fast-raidiology/bert-for-radiology. Supplementary information Supplementary data are available at Bioinformatics online.
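The reported AUROC values can be read as the probability that a randomly chosen positive report receives a higher score than a randomly chosen negative one. A minimal self-contained computation of that quantity (a sketch of the metric itself, not the authors' evaluation code) is:

```python
def auroc(y_true, scores):
    """Area under the ROC curve, computed directly as the probability
    that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise formulation is O(n^2) but makes the metric's meaning transparent; production code would use a sorted-rank implementation instead.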


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. Here we present our work on the application of Natural Language Processing (NLP) techniques to analyze the sentiment of users who answered two questions from the CSQ-8 questionnaire in raw Spanish free text. Their responses relate to mindfulness, a novel technique used to control stress and anxiety caused by different factors in daily life. To this end, we offered an online course applying this method to improve the quality of life of health care professionals during the COVID-19 pandemic. We also evaluated the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To perform this task automatically, we used NLP models such as Swivel embeddings, neural networks, and transfer learning to classify the inputs into three categories: negative, neutral, and positive. Because of the limited amount of data available (86 records for the first question and 68 for the second), transfer learning techniques were required. The length of the text was unrestricted from the user’s standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, against ground truth labeled by three experts. Finally, we proposed a complementary analysis using graphical text representations based on word frequency, to help researchers identify relevant information in the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that applying NLP techniques with transfer learning to small amounts of data can achieve sufficient accuracy for sentiment analysis and text classification.
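The word-frequency text representation mentioned above can be approximated with a simple term counter that surfaces the most salient words in the free-text answers. The stop-word list and sample responses below are illustrative only:

```python
from collections import Counter
import re

# Tiny illustrative stop-word list (a few English and Spanish function words);
# a real analysis would use a full stop-word lexicon for the corpus language.
STOP_WORDS = {"the", "a", "and", "of", "to", "el", "la", "y", "de"}

def top_terms(responses, n=3):
    """Return the n most frequent non-stop-word terms across responses,
    the raw material for a word-frequency visualization."""
    counts = Counter()
    for text in responses:
        for word in re.findall(r"[a-záéíóúñ]+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return [w for w, _ in counts.most_common(n)]
```

Frequency lists like this complement the classifier: they show *what* respondents talked about, while the sentiment model shows *how* they felt about it.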


2019 ◽  
pp. 1-8 ◽  
Author(s):  
Tomasz Oliwa ◽  
Steven B. Maron ◽  
Leah M. Chase ◽  
Samantha Lomnicki ◽  
Daniel V.T. Catenacci ◽  
...  

PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.
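The named-entity step for specimen labels can be approximated with pattern matching. The accession-number and block-identifier formats below are hypothetical examples (institutional formats vary, and the authors' actual pipeline uses trained NER models rather than fixed patterns):

```python
import re

# Hypothetical pattern: 1-3 letter prefix, 2-digit year, dash, serial number,
# e.g. "S18-4321". Block sublabels like "block A1" are matched separately.
ACCESSION_RE = re.compile(r"\b([A-Z]{1,3}\d{2}-\d{3,6})\b")
BLOCK_RE = re.compile(r"\bblock\s+([A-Z]\d*)\b", re.IGNORECASE)

def extract_specimen_info(note):
    """Pull candidate specimen labels and block sublabels from a
    semistructured pathology note."""
    return {
        "labels": ACCESSION_RE.findall(note),
        "blocks": BLOCK_RE.findall(note),
    }
```

A rule-based pass like this is a useful proofreading complement to learned NER: disagreements between the two flag notes worth human review.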


2019 ◽  
Author(s):  
Auss Abbood ◽  
Alexander Ullrich ◽  
Rüdiger Busche ◽  
Stéphane Ghozzi

Abstract According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of epidemiologists sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and entering key information into a database is a time-consuming process. To support EBS, and also to gain insight into what makes an article and the event it describes relevant, we developed a natural-language-processing framework for automated information extraction and relevance scoring. First, we scraped sources relevant for EBS as done at RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many candidates for each field, and a naive Bayes classifier, trained with RKI’s EBS database as labels, selected the single most likely one. Then, for relevance scoring, we defined two classes to which any article might belong: an article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using document and word embeddings. Two of the tested algorithms stood out: the multilayer perceptron performed best overall, with a precision of 0.19, recall of 0.50, specificity of 0.89, F1 of 0.28, and the highest tested index balanced accuracy of 0.46. The support-vector machine, on the other hand, had the highest recall (0.88), which can be of greater interest to epidemiologists. Finally, we integrated these functionalities into a web application called EventEpi, where relevant sources are automatically analyzed and entered into a database. 
The user can also provide any URL or text, which will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code is publicly available at https://github.com/aauss/EventEpi.
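The precision, recall, specificity, and F1 figures quoted above all derive from the four confusion-matrix counts; a minimal sketch of those formulas (not the authors' evaluation code) is:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, specificity, and F1 from confusion-matrix
    counts. Assumes each denominator is nonzero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1
```

The abstract's numbers illustrate why all four are reported: with rare relevant articles, a model can pair low precision (0.19) with useful recall (0.50), and specificity shows how well the bulk of irrelevant articles is filtered.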


Blood ◽  
2008 ◽  
Vol 112 (11) ◽  
pp. 2394-2394
Author(s):  
Bill Long ◽  
Aliyah Rahemtullah ◽  
Christiana E. Toomey ◽  
Adam Ackerman ◽  
Jeremy S. Abramson ◽  
...  

Abstract Introduction: Epidemiologic research in hematologic malignancies has depended on three sources of data: SEER data, patients accrued to large clinical trials, and databases generated by individual providers or departments. Each of these data sources has major limitations. In theory, the adoption of computerized databases by pathology departments and the adoption of electronic medical records should have greatly expanded the ability to identify patients with particular types of malignancies. In practice, the need to manually classify tens of thousands of free-text pathology reports has made this resource unavailable. We have developed a computer program to extract and codify the final diagnoses described in free-text pathology reports according to the World Health Organization (WHO) classification of hematologic malignancies. This enables us to identify sets of cases of desired pathologies for further study. Methods: A medical records review protocol was approved by the Dana-Farber/Harvard Cancer Center Institutional Review Board. Using our clinical lymphoma database, we collected records of patients at the Massachusetts General Hospital (MGH) Cancer Center who carried a diagnosis of follicular lymphoma (FL) or diffuse large B-cell lymphoma (DLBCL). The program is modified from one developed to extract diagnoses from discharge summaries in a different context [Long, AMIA 2007]. The approach is to use punctuation and a few words (conjunctions and some common verbs) to divide the text into phrases, and then use a search procedure to find the most specific matching concepts in the UMLS (Unified Medical Language System). The search uses all of the alternate phrases for each concept included in the UMLS normalized-string table, matched against normalized subphrases from the text. We mapped UMLS concepts to the desired WHO concepts. 
To ensure a complete list of UMLS concepts, we used the hierarchical relations in the UMLS and examined both more specific and more general concepts for possible inclusion in the search. Since not all diseases mentioned in a report are part of the diagnosis, we developed pattern-matching procedures to identify the parts of the pathology report containing the final diagnosis (as opposed to a “note” or “clinical data” section that might mention patient-specific clinical information not diagnosed in the current pathology report). The program also uses a strategy for identifying modifiers that change the sense of the diagnosis (“suggestive of”, “rule out”, “no”, etc.) to exclude diseases that are absent or only possible. Results: We used the system to identify cases of FL and DLBCL and compared the results to lists of cases generated manually. The current program was 90% accurate in automatically classifying pathology reports as describing follicular lymphoma. Of 150 cases of FL, the program found 133 (eg, MALIGNANT LYMPHOMA, FOLLICLE CENTER), and 3 were in reports not available to the program (133/147; 90%). Of 100 DLBCL cases, 76 were available and 63 were found (83%) (eg, HISTIOCYTE-RICH LARGE B-CELL LYMPHOMA). There were several reasons for the missed cases. Most commonly (13 cases), the diagnosis was not in the identified diagnosis section, either because the section was divided by a note or because the diagnosis was in an addendum. Only 3 cases had phrases not found in the UMLS. For 2 others, the program missed the diagnosis because the phrase was not contiguous (eg, B-CELL LYMPHOMA, CONSISTENT WITH FOLLICLE CENTER CELL TYPE). In 5 of the DLBCL cases the stated final diagnosis was more general than the desired diagnoses, and 2 of the FL cases were listed as “strongly suggestive”, which the program concluded was not a definite diagnosis. 
Discussion: This program is useful for identifying desired cohorts of cases and can be improved with better identification of the sections containing the diagnosis, the addition of a few missing phrases as they are discovered, and the addition of techniques for handling common discontinuities in disease descriptions (eg, allowing “consistent with” in the phrase). We have shown that very simple natural language techniques are sufficient to extract most of the desired disease descriptions from free text reports, enabling automatic selection of cases and greatly enhancing the usefulness of large repositories of pathology reports.
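The phrase-splitting and modifier-handling strategy described above can be sketched in a few lines. The concept table below is a tiny stand-in for the UMLS normalized-string lookup, and the modifier list is abbreviated:

```python
import re

# Illustrative stand-in for the UMLS -> WHO concept mapping.
CONCEPTS = {
    "follicular lymphoma": "FL",
    "diffuse large b-cell lymphoma": "DLBCL",
}
# Abbreviated list of modifiers that negate or weaken a diagnosis.
NEGATORS = ("no ", "rule out", "suggestive of")

def extract_diagnoses(text):
    """Split the text into phrases on punctuation and a couple of common
    conjunctions, drop phrases carrying negating modifiers, and look the
    rest up in the concept table."""
    phrases = re.split(r"[;,.\n]|\band\b|\bwith\b", text.lower())
    found = []
    for phrase in phrases:
        phrase = phrase.strip()
        if any(neg in phrase for neg in NEGATORS):
            continue
        for name, code in CONCEPTS.items():
            if name in phrase:
                found.append(code)
    return found
```

The sketch also reproduces the failure mode the Discussion notes: splitting on "with" severs phrases like "consistent with follicle center cell type" from their head noun, which is why the authors propose allowing such discontinuities.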


2020 ◽  
Vol 10 (18) ◽  
pp. 6429
Author(s):  
SungMin Yang ◽  
SoYeop Yoo ◽  
OkRan Jeong

Along with studies on artificial intelligence technology, research is also being actively carried out in the field of natural language processing to understand and process human language, that is, natural language. For computers to learn on their own, the ability to understand natural language is very important. The field of natural language processing comprises a wide variety of tasks, but we focus here on named entity recognition and relation extraction, which are considered the most important for understanding sentences. We propose DeNERT-KG, a model that can extract subjects, objects, and relationships, to grasp the meaning inherent in a sentence. Based on the BERT language model and a Deep Q-Network, a named entity recognition (NER) model for extracting subjects and objects is established, and a knowledge graph is applied for relation extraction. Using the DeNERT-KG model, it is possible to extract the subject, type of subject, object, type of object, and relationship from a sentence, and we verify this model through experiments.
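The kind of typed (subject, relation, object) output DeNERT-KG produces can be illustrated with a toy extractor. The dictionary-based "NER" and single-verb split below are stand-ins for the BERT and Deep Q-Network models, and the entities are invented examples:

```python
# Illustrative entity-type table standing in for a learned NER model.
ENTITY_TYPES = {"Marie Curie": "PERSON", "polonium": "CHEMICAL"}

def extract_triple(sentence, relation_verb):
    """Split a sentence around a known relation verb and attach entity
    types to the subject and object, yielding a typed knowledge-graph
    triple: ((subj, type), relation, (obj, type))."""
    subj, _, obj = sentence.rstrip(".").partition(" " + relation_verb + " ")
    return (
        (subj, ENTITY_TYPES.get(subj, "UNKNOWN")),
        relation_verb,
        (obj, ENTITY_TYPES.get(obj, "UNKNOWN")),
    )
```

Triples in this shape are what make a knowledge graph queryable: the types let downstream code ask, for example, for all CHEMICAL objects discovered by a PERSON.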


Author(s):  
A. Evtushenko

Machine learning language models are combinations of algorithms and neural networks designed to process text written in natural language (Natural Language Processing, NLP). In 2020, the largest language model to date, GPT-3, was released by the artificial intelligence research company OpenAI, with up to 175 billion parameters. Increasing the model’s parameter count by more than 100-fold made it possible to improve the quality of generated texts to a level that is hard to distinguish from human-written text. Notably, this model was trained on a dataset collected mainly from open sources on the Internet, with a volume estimated at 570 GB. This article discusses the problem of memorization of critical information, in particular the personal data of individuals, during the training of large language models (GPT-2/3 and derivatives). It also describes an algorithmic approach to solving this problem, which consists of additional preprocessing of the training dataset and refinement of the model’s inference so that it generates pseudo-personal data, embedding this into the results of summarization, text generation, question answering, and other seq2seq tasks.
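The preprocessing half of this approach (removing personal data from the training corpus before model training) can be sketched as placeholder-token substitution. The two patterns below are illustrative only; real pipelines use far richer detectors than a pair of regular expressions:

```python
import re

# Illustrative detectors: an email pattern and a loose phone-number pattern.
# Each match is replaced by a placeholder token before training.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d ()-]{7,}\d"), "<PHONE>"),
]

def scrub(text):
    """Replace obvious personal-data spans with placeholder tokens,
    so the language model learns the token rather than the datum."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Training on placeholder tokens is what enables the second half of the approach: at inference time the model emits the tokens, which can then be filled with generated pseudo-personal data instead of memorized real data.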


2006 ◽  
Vol 130 (12) ◽  
pp. 1825-1829 ◽  
Author(s):  
Manjula Murari ◽  
Rakesh Pandey

Abstract Context.—Advances in information technology have made electronic systems productive tools for pathology report generation. Structured data formats are recommended for better understanding of pathology reports by clinicians and for retrieval of pathology reports. Suitable formats need to be developed to include structured data elements for report generation in electronic systems. Objective.—To conform to the requirement of protocol-based reporting and to provide uniform and standardized data entry and retrieval, we developed a synoptic reporting system for generation of bone marrow cytology and histology reports for incorporation into our hospital information system. Design.—A combination of macro text, short preformatted templates of tabular data entry sheets, and canned files was developed using a text editor enabling protocol-based input. The system is flexible and has facility for appending free text entry. It also incorporates SNOMED coding and codes for teaching, research, and internal auditing. Results.—This synoptic reporting system is easy to use and adaptable. Features and advantages include pick-up text with defined choices, flexibility for appending free text, facility for data entry for protocol-based reports for research use, standardized and uniform format of reporting, comparable follow-up reports, minimized typographical and transcription errors, and saving on reporting time, thus helping shorten the turnaround time. Conclusions.—Simple structured pathology report templates are a powerful means for supporting uniformity in reporting as well as subsequent data viewing and extraction, particularly suitable to computerized reporting.
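Template-based synoptic reporting with defined choices plus an appended free-text comment can be sketched as follows. The field names and choices are invented for illustration and are not the authors' actual bone-marrow template:

```python
# Illustrative pick-up text: a defined-choice field maps short entry codes
# to standardized report wording, mimicking a preformatted template.
CELLULARITY_CHOICES = {
    "hypo": "Hypocellular",
    "normo": "Normocellular",
    "hyper": "Hypercellular",
}

def bone_marrow_report(cellularity, blasts_pct, comment=""):
    """Generate a standardized synoptic report from structured inputs,
    with an optional appended free-text comment."""
    if cellularity not in CELLULARITY_CHOICES:
        raise ValueError("cellularity must be one of: " + ", ".join(CELLULARITY_CHOICES))
    lines = [
        "BONE MARROW, ASPIRATE AND BIOPSY:",
        f"  Cellularity: {CELLULARITY_CHOICES[cellularity]}",
        f"  Blasts: {blasts_pct}%",
    ]
    if comment:  # free-text entry appended to the structured fields
        lines.append("  Comment: " + comment)
    return "\n".join(lines)
```

Constraining entries to defined choices is what delivers the benefits the abstract lists: uniform wording, comparable follow-up reports, and no transcription variants of the same finding.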

