scholarly journals Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis (Preprint)

2020 ◽  
Author(s):  
Mark Ormerod ◽  
Jesús Martínez del Rincón ◽  
Barry Devereux

BACKGROUND Semantic textual similarity (STS) is a natural language processing (NLP) task that involves assigning a similarity score to 2 snippets of text based on their meaning. This task is particularly difficult in the domain of clinical text, which often features specialized language and the frequent use of abbreviations. OBJECTIVE We created an NLP system to predict similarity scores for sentence pairs as part of the Clinical Semantic Textual Similarity track in the 2019 n2c2/OHNLP Shared Task on Challenges in Natural Language Processing for Clinical Data. We subsequently sought to analyze the intermediary token vectors extracted from our models while processing a pair of clinical sentences to identify where and how representations of semantic similarity are built in transformer models. METHODS Given a clinical sentence pair, we take the average predicted similarity score across several independently fine-tuned transformers. In our model analysis we investigated the relationship between the final model’s loss and surface features of the sentence pairs and assessed the decodability and representational similarity of the token vectors generated by each model. RESULTS Our model achieved a correlation of 0.87 with the ground-truth similarity score, reaching 6th place out of 33 teams (with a first-place score of 0.90). In detailed qualitative and quantitative analyses of the model’s loss, we identified the system’s failure to correctly model semantic similarity when both sentence pairs contain details of medical prescriptions, as well as its general tendency to overpredict semantic similarity given significant token overlap. The token vector analysis revealed divergent representational strategies for predicting textual similarity between bidirectional encoder representations from transformers (BERT)–style models and XLNet. We also found that a large amount information relevant to predicting STS can be captured using a combination of a classification token and the cosine distance between sentence-pair representations in the first layer of a transformer model that did not produce the best predictions on the test set. CONCLUSIONS We designed and trained a system that uses state-of-the-art NLP models to achieve very competitive results on a new clinical STS data set. As our approach uses no hand-crafted rules, it serves as a strong deep learning baseline for this task. Our key contribution is a detailed analysis of the model’s outputs and an investigation of the heuristic biases learned by transformer models. We suggest future improvements based on these findings. In our representational analysis we explore how different transformer models converge or diverge in their representation of semantic signals as the tokens of the sentences are augmented by successive layers. This analysis sheds light on how these “black box” models integrate semantic similarity information in intermediate layers, and points to new research directions in model distillation and sentence embedding extraction for applications in clinical NLP.

10.2196/23099 ◽  
2021 ◽  
Vol 9 (5) ◽  
pp. e23099
Author(s):  
Mark Ormerod ◽  
Jesús Martínez del Rincón ◽  
Barry Devereux

Background Semantic textual similarity (STS) is a natural language processing (NLP) task that involves assigning a similarity score to 2 snippets of text based on their meaning. This task is particularly difficult in the domain of clinical text, which often features specialized language and the frequent use of abbreviations. Objective We created an NLP system to predict similarity scores for sentence pairs as part of the Clinical Semantic Textual Similarity track in the 2019 n2c2/OHNLP Shared Task on Challenges in Natural Language Processing for Clinical Data. We subsequently sought to analyze the intermediary token vectors extracted from our models while processing a pair of clinical sentences to identify where and how representations of semantic similarity are built in transformer models. Methods Given a clinical sentence pair, we take the average predicted similarity score across several independently fine-tuned transformers. In our model analysis we investigated the relationship between the final model’s loss and surface features of the sentence pairs and assessed the decodability and representational similarity of the token vectors generated by each model. Results Our model achieved a correlation of 0.87 with the ground-truth similarity score, reaching 6th place out of 33 teams (with a first-place score of 0.90). In detailed qualitative and quantitative analyses of the model’s loss, we identified the system’s failure to correctly model semantic similarity when both sentence pairs contain details of medical prescriptions, as well as its general tendency to overpredict semantic similarity given significant token overlap. The token vector analysis revealed divergent representational strategies for predicting textual similarity between bidirectional encoder representations from transformers (BERT)–style models and XLNet. We also found that a large amount information relevant to predicting STS can be captured using a combination of a classification token and the cosine distance between sentence-pair representations in the first layer of a transformer model that did not produce the best predictions on the test set. Conclusions We designed and trained a system that uses state-of-the-art NLP models to achieve very competitive results on a new clinical STS data set. As our approach uses no hand-crafted rules, it serves as a strong deep learning baseline for this task. Our key contribution is a detailed analysis of the model’s outputs and an investigation of the heuristic biases learned by transformer models. We suggest future improvements based on these findings. In our representational analysis we explore how different transformer models converge or diverge in their representation of semantic signals as the tokens of the sentences are augmented by successive layers. This analysis sheds light on how these “black box” models integrate semantic similarity information in intermediate layers, and points to new research directions in model distillation and sentence embedding extraction for applications in clinical NLP.


Author(s):  
Yanshan Wang ◽  
Sunyang Fu ◽  
Feichen Shen ◽  
Sam Henry ◽  
Ozlem Uzuner ◽  
...  

BACKGROUND Semantic textual similarity is a common task in the general English domain to assess the degree to which the underlying semantics of 2 text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the semantic textual similarity task in the clinical domain that attempts to measure the degree of semantic equivalence between 2 snippets of clinical text. Due to the frequent use of templates in the Electronic Health Record system, a large amount of redundant text exists in clinical notes, making ClinicalSTS crucial for the secondary use of clinical text in downstream clinical natural language processing applications, such as clinical text summarization, clinical semantics extraction, and clinical information retrieval. OBJECTIVE Our objective was to release ClinicalSTS data sets and to motivate natural language processing and biomedical informatics communities to tackle semantic text similarity tasks in the clinical domain. METHODS We organized the first BioCreative/OHNLP ClinicalSTS shared task in 2018 by making available a real-world ClinicalSTS data set. We continued the shared task in 2019 in collaboration with National NLP Clinical Challenges (n2c2) and the Open Health Natural Language Processing (OHNLP) consortium and organized the 2019 n2c2/OHNLP ClinicalSTS track. We released a larger ClinicalSTS data set comprising 1642 clinical sentence pairs, including 1068 pairs from the 2018 shared task and 1006 new pairs from 2 electronic health record systems, GE and Epic. We released 80% (1642/2054) of the data to participating teams to develop and fine-tune the semantic textual similarity systems and used the remaining 20% (412/2054) as blind testing to evaluate their systems. The workshop was held in conjunction with the American Medical Informatics Association 2019 Annual Symposium. RESULTS Of the 78 international teams that signed on to the n2c2/OHNLP ClinicalSTS shared task, 33 produced a total of 87 valid system submissions. The top 3 systems were generated by IBM Research, the National Center for Biotechnology Information, and the University of Florida, with Pearson correlations of <i>r</i>=.9010, <i>r</i>=.8967, and <i>r</i>=.8864, respectively. Most top-performing systems used state-of-the-art neural language models, such as BERT and XLNet, and state-of-the-art training schemas in deep learning, such as pretraining and fine-tuning schema, and multitask learning. Overall, the participating systems performed better on the Epic sentence pairs than on the GE sentence pairs, despite a much larger portion of the training data being GE sentence pairs. CONCLUSIONS The 2019 n2c2/OHNLP ClinicalSTS shared task focused on computing semantic similarity for clinical text sentences generated from clinical notes in the real world. It attracted a large number of international teams. The ClinicalSTS shared task could continue to serve as a venue for researchers in natural language processing and medical informatics communities to develop and improve semantic textual similarity techniques for clinical text.


Author(s):  
Mourad Oussalah ◽  
Muhidin Mohamed

AbstractDetermining the extent to which two text snippets are semantically equivalent is a well-researched topic in the areas of natural language processing, information retrieval and text summarization. The sentence-to-sentence similarity scoring is extensively used in both generic and query-based summarization of documents as a significance or a similarity indicator. Nevertheless, most of these applications utilize the concept of semantic similarity measure only as a tool, without paying importance to the inherent properties of such tools that ultimately restrict the scope and technical soundness of the underlined applications. This paper aims to contribute to fill in this gap. It investigates three popular WordNet hierarchical semantic similarity measures, namely path-length, Wu and Palmer and Leacock and Chodorow, from both algebraical and intuitive properties, highlighting their inherent limitations and theoretical constraints. We have especially examined properties related to range and scope of the semantic similarity score, incremental monotonicity evolution, monotonicity with respect to hyponymy/hypernymy relationship as well as a set of interactive properties. Extension from word semantic similarity to sentence similarity has also been investigated using a pairwise canonical extension. Properties of the underlined sentence-to-sentence similarity are examined and scrutinized. Next, to overcome inherent limitations of WordNet semantic similarity in terms of accounting for various Part-of-Speech word categories, a WordNet “All word-To-Noun conversion” that makes use of Categorial Variation Database (CatVar) is put forward and evaluated using a publicly available dataset with a comparison with some state-of-the-art methods. The finding demonstrates the feasibility of the proposal and opens up new opportunities in information retrieval and natural language processing tasks.


10.2196/23375 ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. e23375 ◽  
Author(s):  
Yanshan Wang ◽  
Sunyang Fu ◽  
Feichen Shen ◽  
Sam Henry ◽  
Ozlem Uzuner ◽  
...  

Background Semantic textual similarity is a common task in the general English domain to assess the degree to which the underlying semantics of 2 text segments are equivalent to each other. Clinical Semantic Textual Similarity (ClinicalSTS) is the semantic textual similarity task in the clinical domain that attempts to measure the degree of semantic equivalence between 2 snippets of clinical text. Due to the frequent use of templates in the Electronic Health Record system, a large amount of redundant text exists in clinical notes, making ClinicalSTS crucial for the secondary use of clinical text in downstream clinical natural language processing applications, such as clinical text summarization, clinical semantics extraction, and clinical information retrieval. Objective Our objective was to release ClinicalSTS data sets and to motivate natural language processing and biomedical informatics communities to tackle semantic text similarity tasks in the clinical domain. Methods We organized the first BioCreative/OHNLP ClinicalSTS shared task in 2018 by making available a real-world ClinicalSTS data set. We continued the shared task in 2019 in collaboration with National NLP Clinical Challenges (n2c2) and the Open Health Natural Language Processing (OHNLP) consortium and organized the 2019 n2c2/OHNLP ClinicalSTS track. We released a larger ClinicalSTS data set comprising 1642 clinical sentence pairs, including 1068 pairs from the 2018 shared task and 1006 new pairs from 2 electronic health record systems, GE and Epic. We released 80% (1642/2054) of the data to participating teams to develop and fine-tune the semantic textual similarity systems and used the remaining 20% (412/2054) as blind testing to evaluate their systems. The workshop was held in conjunction with the American Medical Informatics Association 2019 Annual Symposium. Results Of the 78 international teams that signed on to the n2c2/OHNLP ClinicalSTS shared task, 33 produced a total of 87 valid system submissions. The top 3 systems were generated by IBM Research, the National Center for Biotechnology Information, and the University of Florida, with Pearson correlations of r=.9010, r=.8967, and r=.8864, respectively. Most top-performing systems used state-of-the-art neural language models, such as BERT and XLNet, and state-of-the-art training schemas in deep learning, such as pretraining and fine-tuning schema, and multitask learning. Overall, the participating systems performed better on the Epic sentence pairs than on the GE sentence pairs, despite a much larger portion of the training data being GE sentence pairs. Conclusions The 2019 n2c2/OHNLP ClinicalSTS shared task focused on computing semantic similarity for clinical text sentences generated from clinical notes in the real world. It attracted a large number of international teams. The ClinicalSTS shared task could continue to serve as a venue for researchers in natural language processing and medical informatics communities to develop and improve semantic textual similarity techniques for clinical text.


2020 ◽  
pp. 3-17
Author(s):  
Peter Nabende

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies on Natural Language Processing applications for many indigenous East African languages. As a contribution to covering the current gap of knowledge, this paper focuses on evaluating the application of well-established machine translation methods for one heavily under-resourced indigenous East African language called Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba including both rule-based and data-driven methods. Then we apply a state of the art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than the recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually associated with the source language input.


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. In this sense, we present our work related to the application of Natural Language Processing techniques, as a tool to analyze the sentiment perception of users who answered two questions from the CSQ-8 questionnaires with raw Spanish free-text. Their responses are related to mindfulness, which is a novel technique used to control stress and anxiety caused by different factors in daily life. As such, we proposed an online course where this method was applied in order to improve the quality of life of health care professionals in COVID 19 pandemic times. We also carried out an evaluation of the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To automatically perform this task, we used Natural Language Processing (NLP) models such as swivel embedding, neural networks, and transfer learning, so as to classify the inputs into the following three categories: negative, neutral, and positive. Due to the limited amount of data available—86 registers for the first and 68 for the second—transfer learning techniques were required. The length of the text had no limit from the user’s standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis, using computer graphic text representation based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that the application of NLP techniques in small amounts of data using transfer learning is able to obtain enough accuracy in sentiment analysis and text classification stages.


AERA Open ◽  
2021 ◽  
Vol 7 ◽  
pp. 233285842110286
Author(s):  
Kylie L. Anglin ◽  
Vivian C. Wong ◽  
Arielle Boguslav

Though there is widespread recognition of the importance of implementation research, evaluators often face intense logistical, budgetary, and methodological challenges in their efforts to assess intervention implementation in the field. This article proposes a set of natural language processing techniques called semantic similarity as an innovative and scalable method of measuring implementation constructs. Semantic similarity methods are an automated approach to quantifying the similarity between texts. By applying semantic similarity to transcripts of intervention sessions, researchers can use the method to determine whether an intervention was delivered with adherence to a structured protocol, and the extent to which an intervention was replicated with consistency across sessions, sites, and studies. This article provides an overview of semantic similarity methods, describes their application within the context of educational evaluations, and provides a proof of concept using an experimental study of the impact of a standardized teacher coaching intervention.


2021 ◽  
Author(s):  
Monique B. Sager ◽  
Aditya M. Kashyap ◽  
Mila Tamminga ◽  
Sadhana Ravoori ◽  
Christopher Callison-Burch ◽  
...  

BACKGROUND Reddit, the fifth most popular website in the United States, boasts a large and engaged user base on its dermatology forums where users crowdsource free medical opinions. Unfortunately, much of the advice provided is unvalidated and could lead to inappropriate care. Initial testing has shown that artificially intelligent bots can detect misinformation on Reddit forums and may be able to produce responses to posts containing misinformation. OBJECTIVE To analyze the ability of bots to find and respond to health misinformation on Reddit’s dermatology forums in a controlled test environment. METHODS Using natural language processing techniques, we trained bots to target misinformation using relevant keywords and to post pre-fabricated responses. By evaluating different model architectures across a held-out test set, we compared performances. RESULTS Our models yielded data test accuracies ranging from 95%-100%, with a BERT fine-tuned model resulting in the highest level of test accuracy. Bots were then able to post corrective pre-fabricated responses to misinformation. CONCLUSIONS Using a limited data set, bots had near-perfect ability to detect these examples of health misinformation within Reddit dermatology forums. Given that these bots can then post pre-fabricated responses, this technique may allow for interception of misinformation. Providing correct information, even instantly, however, does not mean users will be receptive or find such interventions persuasive. Further work should investigate this strategy’s effectiveness to inform future deployment of bots as a technique in combating health misinformation. CLINICALTRIAL N/A


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Craig H Ganoe ◽  
Weiyi Wu ◽  
Paul J Barr ◽  
William Haslett ◽  
Michelle D Dannenberg ◽  
...  

Abstract Objectives The objective of this study is to build and evaluate a natural language processing approach to identify medication mentions in primary care visit conversations between patients and physicians. Materials and Methods Eight clinicians contributed to a data set of 85 clinic visit transcripts, and 10 transcripts were randomly selected from this data set as a development set. Our approach utilizes Apache cTAKES and Unified Medical Language System controlled vocabulary to generate a list of medication candidates in the transcribed text and then performs multiple customized filters to exclude common false positives from this list while including some additional common mentions of the supplements and immunizations. Results Sixty-five transcripts with 1121 medication mentions were randomly selected as an evaluation set. Our proposed method achieved an F-score of 85.0% for identifying the medication mentions in the test set, significantly outperforming existing medication information extraction systems for medical records with F-scores ranging from 42.9% to 68.9% on the same test set. Discussion Our medication information extraction approach for primary care visit conversations showed promising results, extracting about 27% more medication mentions from our evaluation set while eliminating many false positives in comparison to existing baseline systems. We made our approach publicly available on the web as an open-source software. Conclusion Integration of our annotation system with clinical recording applications has the potential to improve patients’ understanding and recall of key information from their clinic visits, and, in turn, to positively impact health outcomes.


10.2196/27386 ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. e27386
Author(s):  
Qingyu Chen ◽  
Alex Rankine ◽  
Yifan Peng ◽  
Elaheh Aghaarabi ◽  
Zhiyong Lu

Background Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 in an official test set during 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved a second rank. Objective Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. Methods We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. Results Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications. Conclusions Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness.


Sign in / Sign up

Export Citation Format

Share Document