Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models (Preprint)

10.2196/preprints.19735 ◽

2020 ◽

Author(s):

Xi Yang ◽

Xing He ◽

Hansi Zhang ◽

Yinghan Ma ◽

Jiang Bian ◽

...

Keyword(s):

Language Processing ◽

Clinical Training ◽

Pearson Correlation ◽

Ensemble Methods ◽

English Text ◽

Shared Task ◽

Training Set ◽

Clinical Text ◽

Community Effort ◽

Semantic Textual Similarity

BACKGROUND Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS. OBJECTIVE This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS. METHODS In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models. RESULTS Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010). CONCLUSIONS This study demonstrated the efficiency of utilizing transformer-based models to measure semantic similarity for clinical text. Our models can be applied to clinical applications such as clinical text deduplication and summarization.
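As a rough illustration of the evaluation and ensembling mentioned in this abstract, the following Python sketch computes a Pearson correlation between predicted and gold similarity scores and averages the predictions of several models. All scores and the function names `pearson` and `average_ensemble` are hypothetical. The abstract does not say which ensemble methods were used, so plain averaging is an assumption here, not the authors' method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

def average_ensemble(model_predictions):
    """Average per-pair predictions across models (assumed simple mean ensemble)."""
    return [sum(scores) / len(scores) for scores in zip(*model_predictions)]

# Hypothetical gold similarity scores and predictions from three models.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
preds_a = [0.4, 1.2, 2.8, 4.0, 4.9]
preds_b = [0.1, 2.0, 3.3, 4.4, 4.6]
preds_c = [0.6, 1.4, 2.5, 4.8, 5.0]

ensemble = average_ensemble([preds_a, preds_b, preds_c])
for name, preds in [("a", preds_a), ("b", preds_b), ("c", preds_c), ("ensemble", ensemble)]:
    print(f"{name}: r = {pearson(preds, gold):.4f}")
```

Averaging is only one option; weighted averaging or a learned combiner over the model outputs would slot into `average_ensemble` in the same way.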

Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models

JMIR Medical Informatics ◽

10.2196/19735 ◽

2020 ◽

Vol 8 (11) ◽

pp. e19735

Author(s):

Xi Yang ◽

Xing He ◽

Hansi Zhang ◽

Yinghan Ma ◽

Jiang Bian ◽

...

Keyword(s):

Language Processing ◽

Clinical Training ◽

Pearson Correlation ◽

Ensemble Methods ◽

English Text ◽

Shared Task ◽

Training Set ◽

Clinical Text ◽

Community Effort ◽

Semantic Textual Similarity

Background Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS. Objective This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS. Methods In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models. Results Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010). Conclusions This study demonstrated the efficiency of utilizing transformer-based models to measure semantic similarity for clinical text. Our models can be applied to clinical applications such as clinical text deduplication and summarization.

The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview (Preprint)

10.2196/preprints.23375 ◽

2020 ◽

Cited By ~ 1

Author(s):

Yanshan Wang ◽

Sunyang Fu ◽

Feichen Shen ◽

Sam Henry ◽

Ozlem Uzuner ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Shared Task ◽

Data Set ◽

Clinical Text ◽

Clinical Notes ◽

Clinical Domain ◽

Semantic Textual Similarity

BACKGROUND Semantic textual similarity is a common task in the general English domain to assess the degree to which the underlying semantics of 2 text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the semantic textual similarity task in the clinical domain that attempts to measure the degree of semantic equivalence between 2 snippets of clinical text. Due to the frequent use of templates in the Electronic Health Record system, a large amount of redundant text exists in clinical notes, making ClinicalSTS crucial for the secondary use of clinical text in downstream clinical natural language processing applications, such as clinical text summarization, clinical semantics extraction, and clinical information retrieval. OBJECTIVE Our objective was to release ClinicalSTS data sets and to motivate natural language processing and biomedical informatics communities to tackle semantic text similarity tasks in the clinical domain. METHODS We organized the first BioCreative/OHNLP ClinicalSTS shared task in 2018 by making available a real-world ClinicalSTS data set. We continued the shared task in 2019 in collaboration with National NLP Clinical Challenges (n2c2) and the Open Health Natural Language Processing (OHNLP) consortium and organized the 2019 n2c2/OHNLP ClinicalSTS track. We released a larger ClinicalSTS data set comprising 1642 clinical sentence pairs, including 1068 pairs from the 2018 shared task and 1006 new pairs from 2 electronic health record systems, GE and Epic. We released 80% (1642/2054) of the data to participating teams to develop and fine-tune the semantic textual similarity systems and used the remaining 20% (412/2054) as blind testing to evaluate their systems. The workshop was held in conjunction with the American Medical Informatics Association 2019 Annual Symposium. 
RESULTS Of the 78 international teams that signed on to the n2c2/OHNLP ClinicalSTS shared task, 33 produced a total of 87 valid system submissions. The top 3 systems were generated by IBM Research, the National Center for Biotechnology Information, and the University of Florida, with Pearson correlations of r=.9010, r=.8967, and r=.8864, respectively. Most top-performing systems used state-of-the-art neural language models, such as BERT and XLNet, and state-of-the-art training schemas in deep learning, such as pretraining and fine-tuning schema, and multitask learning. Overall, the participating systems performed better on the Epic sentence pairs than on the GE sentence pairs, despite a much larger portion of the training data being GE sentence pairs. CONCLUSIONS The 2019 n2c2/OHNLP ClinicalSTS shared task focused on computing semantic similarity for clinical text sentences generated from clinical notes in the real world. It attracted a large number of international teams. The ClinicalSTS shared task could continue to serve as a venue for researchers in natural language processing and medical informatics communities to develop and improve semantic textual similarity techniques for clinical text.

Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study (Preprint)

10.2196/preprints.23357 ◽

2020 ◽

Author(s):

Ying Xiong ◽

Shuai Chen ◽

Qingcai Chen ◽

Jun Yan ◽

Buzhou Tang

Keyword(s):

Language Processing ◽

Pearson Correlation ◽

Language Model ◽

Model Performance ◽

Clinical Text ◽

Sentence Level ◽

Level Information ◽

Semantically Enhanced ◽

Copy And Paste ◽

Semantic Textual Similarity

BACKGROUND With the popularity of electronic health records (EHRs), the quality of health care has been improved. However, there are also some problems caused by EHRs, such as the growing use of copy-and-paste and templates, resulting in EHRs of low quality in content. In order to minimize data redundancy in different documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity among clinical text snippets. OBJECTIVE In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results. METHODS We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules to encode clinical text snippet pairs at different levels: (1) character-level representation module based on convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) sentence-level representation module that adopts a pretrained language model bidirectional encoder representation from transformers (BERT) to encode clinical text snippet pairs; and (3) entity-level representation module to model clinical entity information in clinical text snippets. In the case of entity-level representation, we compare 2 methods. One encodes entities by the entity-type label sequence corresponding to text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II). RESULTS We conduct experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge for model performance evaluation. The model only using BERT for text snippet pair encoding achieved a Pearson correlation coefficient (PCC) of 0.848. 
When character-level representation and entity-level representation are individually added into our model, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation are added into our model, the PCC further increased to 0.861 (entity I) and 0.868 (entity II). CONCLUSIONS Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model.

The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview

JMIR Medical Informatics ◽

10.2196/23375 ◽

2020 ◽

Vol 8 (11) ◽

pp. e23375 ◽

Cited By ~ 2

Author(s):

Yanshan Wang ◽

Sunyang Fu ◽

Feichen Shen ◽

Sam Henry ◽

Ozlem Uzuner ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Shared Task ◽

Data Set ◽

Clinical Text ◽

Clinical Notes ◽

Clinical Domain ◽

Semantic Textual Similarity

Background Semantic textual similarity is a common task in the general English domain to assess the degree to which the underlying semantics of 2 text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the semantic textual similarity task in the clinical domain that attempts to measure the degree of semantic equivalence between 2 snippets of clinical text. Due to the frequent use of templates in the Electronic Health Record system, a large amount of redundant text exists in clinical notes, making ClinicalSTS crucial for the secondary use of clinical text in downstream clinical natural language processing applications, such as clinical text summarization, clinical semantics extraction, and clinical information retrieval. Objective Our objective was to release ClinicalSTS data sets and to motivate natural language processing and biomedical informatics communities to tackle semantic text similarity tasks in the clinical domain. Methods We organized the first BioCreative/OHNLP ClinicalSTS shared task in 2018 by making available a real-world ClinicalSTS data set. We continued the shared task in 2019 in collaboration with National NLP Clinical Challenges (n2c2) and the Open Health Natural Language Processing (OHNLP) consortium and organized the 2019 n2c2/OHNLP ClinicalSTS track. We released a larger ClinicalSTS data set comprising 1642 clinical sentence pairs, including 1068 pairs from the 2018 shared task and 1006 new pairs from 2 electronic health record systems, GE and Epic. We released 80% (1642/2054) of the data to participating teams to develop and fine-tune the semantic textual similarity systems and used the remaining 20% (412/2054) as blind testing to evaluate their systems. The workshop was held in conjunction with the American Medical Informatics Association 2019 Annual Symposium. 
Results Of the 78 international teams that signed on to the n2c2/OHNLP ClinicalSTS shared task, 33 produced a total of 87 valid system submissions. The top 3 systems were generated by IBM Research, the National Center for Biotechnology Information, and the University of Florida, with Pearson correlations of r=.9010, r=.8967, and r=.8864, respectively. Most top-performing systems used state-of-the-art neural language models, such as BERT and XLNet, and state-of-the-art training schemas in deep learning, such as pretraining and fine-tuning schema, and multitask learning. Overall, the participating systems performed better on the Epic sentence pairs than on the GE sentence pairs, despite a much larger portion of the training data being GE sentence pairs. Conclusions The 2019 n2c2/OHNLP ClinicalSTS shared task focused on computing semantic similarity for clinical text sentences generated from clinical notes in the real world. It attracted a large number of international teams. The ClinicalSTS shared task could continue to serve as a venue for researchers in natural language processing and medical informatics communities to develop and improve semantic textual similarity techniques for clinical text.

Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study

JMIR Medical Informatics ◽

10.2196/23357 ◽

2020 ◽

Vol 8 (12) ◽

pp. e23357

Author(s):

Ying Xiong ◽

Shuai Chen ◽

Qingcai Chen ◽

Jun Yan ◽

Buzhou Tang

Keyword(s):

Language Processing ◽

Pearson Correlation ◽

Language Model ◽

Model Performance ◽

Clinical Text ◽

Sentence Level ◽

Level Information ◽

Semantically Enhanced ◽

Copy And Paste ◽

Semantic Textual Similarity

Background With the popularity of electronic health records (EHRs), the quality of health care has been improved. However, there are also some problems caused by EHRs, such as the growing use of copy-and-paste and templates, resulting in EHRs of low quality in content. In order to minimize data redundancy in different documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity among clinical text snippets. Objective In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results. Methods We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules to encode clinical text snippet pairs at different levels: (1) character-level representation module based on convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) sentence-level representation module that adopts a pretrained language model bidirectional encoder representation from transformers (BERT) to encode clinical text snippet pairs; and (3) entity-level representation module to model clinical entity information in clinical text snippets. In the case of entity-level representation, we compare 2 methods. One encodes entities by the entity-type label sequence corresponding to text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II). Results We conduct experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge for model performance evaluation. The model only using BERT for text snippet pair encoding achieved a Pearson correlation coefficient (PCC) of 0.848. 
When character-level representation and entity-level representation are individually added into our model, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation are added into our model, the PCC further increased to 0.861 (entity I) and 0.868 (entity II). Conclusions Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model.

Extending BERT for Clinical Semantic Textual Similarity (Preprint)

10.2196/preprints.22795 ◽

2020 ◽

Author(s):

Klaus Kades ◽

Jan Sellner ◽

Gregor Koehler ◽

Peter M. Full ◽

T.Y. Emmy Lai ◽

...

Keyword(s):

Correlation Coefficient ◽

Pearson Correlation ◽

Language Models ◽

Training Dataset ◽

Pearson Correlation Coefficient ◽

Text Data ◽

Test Dataset ◽

Clinical Text ◽

Starting Point ◽

Semantic Textual Similarity

BACKGROUND Natural Language Understanding enables automatic extraction of relevant information from clinical text data, which are acquired every day in hospitals. In 2018, the language model BERT was introduced, generating new state-of-the-art results on several downstream tasks. The National NLP Clinical Challenges (n2c2) was initiated to tackle such downstream tasks on clinical text data, where domain-adapted methods might be a way to further improve language models like BERT. OBJECTIVE To optimally leverage BERT for the task of semantic textual similarity on clinical text data. METHODS We used BERT as an initial baseline and analysed its results, which we used as a starting point to develop three different approaches where we (1) added additional, handcrafted sentence similarity features to the classifier token of BERT and combined the results with more features in multiple regression estimators, (2) incorporated a built-in ensembling method, M-Heads, into BERT by duplicating the regression head and applying an adapted training strategy to facilitate the focus of the heads on different input patterns of the medical sentences, and (3) developed a graph-based similarity approach for medications which allows extrapolating similarities across known entities from the training set. The approaches were evaluated with the Pearson correlation coefficient between the predicted scores and ground truth on the official training and test dataset. RESULTS We improve the performance of BERT on the test dataset from a Pearson correlation coefficient of 0.859 to 0.883 using a combination of the M-Heads and the graph-based similarity approach. We also show differences between the test and training dataset and how they influence the results. CONCLUSIONS We found that using a graph-based similarity approach has the potential to extrapolate domain-specific knowledge to unseen sentences. 
For the evaluation, we observed that it is easy to be misled by results on the test dataset, especially when the distribution of the data samples differs between the training and test datasets.

Incorporating Domain Knowledge Into Language Models Using Graph Convolutional Networks for Clinical Semantic Textual Similarity (Preprint)

10.2196/preprints.23101 ◽

2020 ◽

Author(s):

David Chang ◽

Eric Lin ◽

Cynthia Brandt ◽

Richard Andrew Taylor

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Domain Knowledge ◽

Data Augmentation ◽

Pearson Correlation ◽

Clinical Documentation ◽

Convolutional Networks ◽

Knowledge Distillation ◽

Semantic Textual Similarity

BACKGROUND While electronic health record systems have facilitated clinical documentation in healthcare, they also introduce new challenges such as the proliferation of redundant information through copy-and-paste commands or templates. One approach to trim down bloated clinical documentation and improve clinical summarization is to identify highly similar text snippets for the goal of removing such text. OBJECTIVE We develop a natural language processing system for the task of clinical semantic textual similarity that assigns scores to pairs of clinical text snippets based on their clinical semantic similarity. METHODS We leverage recent advances in natural language processing and graph representation learning to create a model that combines linguistic and domain knowledge information from the MedSTS dataset to assess clinical semantic textual similarity. We use Bidirectional Encoder Representation from Transformers (BERT)-based models as text encoders for the sentence pairs in the dataset and graph convolutional networks (GCNs) as graph encoders for corresponding concept graphs constructed based on the sentences. We also explore techniques including data augmentation, ensembling, and knowledge distillation to improve the performance as measured by Pearson correlation. RESULTS Fine-tuning BERT-base and ClinicalBERT on the MedSTS dataset provided a strong baseline (0.842 and 0.848 Pearson correlation, respectively) compared to the previous year’s submissions. Our data augmentation techniques yielded moderate gains in performance, and adding a GCN-based graph encoder to incorporate the concept graphs also boosted performance, especially when the node features were initialized with pretrained knowledge graph embeddings of the concepts (0.868). 
As expected, ensembling improved performance, and multi-source ensembling using different language model variants, conducting knowledge distillation on the multi-source ensemble model, and taking a final ensemble of the distilled models further improved the system’s performance (0.875, 0.878, and 0.882, respectively). CONCLUSIONS We develop a system for the MedSTS clinical semantic textual similarity benchmark task by combining BERT-based text encoders and GCN-based graph encoders in order to incorporate domain knowledge into the natural language processing pipeline. We also experiment with other techniques involving data augmentation, pretrained concept embeddings, ensembling, and knowledge distillation to further increase our performance.
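To make the distillation step a little more concrete, here is a minimal Python sketch of one common formulation for a regression task: a student model is trained against a blend of the gold score and the mean prediction of a teacher ensemble. This is not taken from the paper. The blending weight `alpha`, the function names, and all numbers are assumptions for illustration only.

```python
def ensemble_mean(teacher_predictions):
    """Mean prediction per sentence pair across several teacher models."""
    return [sum(scores) / len(scores) for scores in zip(*teacher_predictions)]

def distillation_targets(gold, teacher_mean, alpha=0.5):
    """Blend gold scores with teacher outputs; alpha is the weight on the gold score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * g + (1.0 - alpha) * t for g, t in zip(gold, teacher_mean)]

def mse(predictions, targets):
    """Mean squared error, the loss a student regressor would minimise."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical data: gold scores, two teacher models, and one student model.
gold = [1.0, 3.0, 5.0]
teachers = [[1.5, 3.0, 4.5], [0.5, 2.0, 4.5]]
student = [1.0, 2.5, 4.5]

targets = distillation_targets(gold, ensemble_mean(teachers), alpha=0.5)
print("soft targets:", targets)
print("student loss:", mse(student, targets))
```

With `alpha=1.0` the student sees only gold scores (no distillation); with `alpha=0.0` it imitates the teacher ensemble alone.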

Developing the Persian Wordnet of Verbs Using Supervised Learning

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3450969 ◽

2021 ◽

Vol 20 (4) ◽

pp. 1-18

Author(s):

Zahra Mousavi ◽

Heshaam Faili

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Language Processing ◽

Supervised Classification ◽

Word Sense ◽

Direct Influence ◽

Training Set ◽

Bilingual Dictionary ◽

Automated Method ◽

Princeton Wordnet

Nowadays, wordnets are extensively used as a major resource in natural language processing and information retrieval tasks. Therefore, the accuracy of wordnets has a direct influence on the performance of the involved applications. This paper presents a fully-automated method for extending a previously developed Persian wordnet to cover more comprehensive and accurate verbal entries. At first, by using a bilingual dictionary, some Persian verbs are linked to Princeton WordNet synsets. A feature set related to the semantic behavior of compound verbs as the majority of Persian verbs is proposed. This feature set is employed in a supervised classification system to select the proper links for inclusion in the wordnet. We also benefit from a pre-existing Persian wordnet, FarsNet, and a similarity-based method to produce a training set. This is the largest automatically developed Persian wordnet with more than 27,000 words, 28,000 PWN synsets and 67,000 word-sense pairs that substantially outperforms the previous Persian wordnet with about 16,000 words, 22,000 PWN synsets and 38,000 word-sense pairs.

A deep database of medical abbreviations and acronyms for natural language processing

Scientific Data ◽

10.1038/s41597-021-00929-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Lisa Grossman Liu ◽

Raymond H. Grossman ◽

Elliot G. Mitchell ◽

Chunhua Weng ◽

Karthik Natarajan ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

American English ◽

Substantial Improvement ◽

Future Application ◽

Multiple Sources ◽

High Coverage ◽

Clinical Text ◽

Automated Quality Control

The recognition, disambiguation, and expansion of medical abbreviations and acronyms is of utmost importance to prevent medically dangerous misinterpretation in natural language processing. To support recognition, disambiguation, and expansion, we present the Medical Abbreviation and Acronym Meta-Inventory, a deep database of medical abbreviations. A systematic harmonization of eight source inventories across multiple healthcare specialties and settings identified 104,057 abbreviations with 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning reduced redundancy, which simplifies future application. Additional features include semi-automated quality control to remove errors. The Meta-Inventory demonstrated high completeness or coverage of abbreviations and senses in new clinical text, a substantial improvement over the next largest repository (6–14% increase in abbreviation coverage; 28–52% increase in sense coverage). To our knowledge, the Meta-Inventory is the most complete compilation of medical abbreviations and acronyms in American English to date. The multiple sources and high coverage support application in varied specialties and settings. This allows for cross-institutional natural language processing, which previous inventories did not support. The Meta-Inventory is available at https://bit.ly/github-clinical-abbreviations.

Report on the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries at SIGIR 2019

ACM SIGIR Forum ◽

10.1145/3458553.3458554 ◽

2019 ◽

Vol 53 (2) ◽

pp. 3-10

Author(s):

Muthu Kumar Chandrasekaran ◽

Philipp Mayr

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Research And Development ◽

Language Processing ◽

Digital Libraries ◽

State Of The Art ◽

Shared Task ◽

Processing Information ◽

Joint Workshop

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.
