Spectral Learning of Semantic Units in a Sentence Pair to Evaluate Semantic Textual Similarity

Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis

JMIR Medical Informatics, DOI 10.2196/23099, 2021, Vol 9 (5), pp. e23099

Author(s): Mark Ormerod, Jesús Martínez del Rincón, Barry Devereux

Keyword(s): Natural Language Processing, Natural Language, Semantic Similarity, Language Processing, Ground Truth, Similarity Score, General Tendency, Sentence Pair, Data Set, Semantic Textual Similarity

Background: Semantic textual similarity (STS) is a natural language processing (NLP) task that involves assigning a similarity score to 2 snippets of text based on their meaning. This task is particularly difficult in the domain of clinical text, which often features specialized language and the frequent use of abbreviations.

Objective: We created an NLP system to predict similarity scores for sentence pairs as part of the Clinical Semantic Textual Similarity track in the 2019 n2c2/OHNLP Shared Task on Challenges in Natural Language Processing for Clinical Data. We subsequently sought to analyze the intermediary token vectors extracted from our models while processing a pair of clinical sentences to identify where and how representations of semantic similarity are built in transformer models.

Methods: Given a clinical sentence pair, we take the average predicted similarity score across several independently fine-tuned transformers. In our model analysis we investigated the relationship between the final model’s loss and surface features of the sentence pairs and assessed the decodability and representational similarity of the token vectors generated by each model.

Results: Our model achieved a correlation of 0.87 with the ground-truth similarity score, reaching 6th place out of 33 teams (with a first-place score of 0.90). In detailed qualitative and quantitative analyses of the model’s loss, we identified the system’s failure to correctly model semantic similarity when both sentences in a pair contain details of medical prescriptions, as well as its general tendency to overpredict semantic similarity given significant token overlap. The token vector analysis revealed divergent representational strategies for predicting textual similarity between bidirectional encoder representations from transformers (BERT)-style models and XLNet. We also found that a large amount of information relevant to predicting STS can be captured using a combination of a classification token and the cosine distance between sentence-pair representations in the first layer of a transformer model that did not produce the best predictions on the test set.

Conclusions: We designed and trained a system that uses state-of-the-art NLP models to achieve very competitive results on a new clinical STS data set. As our approach uses no hand-crafted rules, it serves as a strong deep learning baseline for this task. Our key contribution is a detailed analysis of the model’s outputs and an investigation of the heuristic biases learned by transformer models. We suggest future improvements based on these findings. In our representational analysis we explore how different transformer models converge or diverge in their representation of semantic signals as the tokens of the sentences are augmented by successive layers. This analysis sheds light on how these “black box” models integrate semantic similarity information in intermediate layers, and points to new research directions in model distillation and sentence embedding extraction for applications in clinical NLP.
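
The ensembling step described in the Methods (averaging the predicted similarity score over several fine-tuned models) and the correlation-based evaluation can be illustrated with a minimal sketch. The scores below are hypothetical, the function names are illustrative, and the use of Pearson correlation is an assumption, since the abstract only says "correlation"; this is not the authors' implementation.

```python
from math import sqrt
from statistics import mean


def ensemble_average(per_model_scores):
    """Average the predictions of several models for each sentence pair."""
    return [mean(scores) for scores in zip(*per_model_scores)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical similarity predictions from three fine-tuned models
# for the same three sentence pairs (not data from the paper).
model_a = [4.0, 1.0, 3.0]
model_b = [5.0, 2.0, 3.0]
model_c = [3.0, 0.0, 3.0]
gold = [4.5, 0.5, 3.0]  # hypothetical ground-truth scores

averaged = ensemble_average([model_a, model_b, model_c])
score = pearson(averaged, gold)
print(averaged, round(score, 3))
```

Averaging is the simplest way to combine independently trained models; a weighted average or a learned combiner would be alternatives, but the abstract does not indicate that either was used.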

Semantic Classification of Scientific Sentence Pair Using Recurrent Neural Network

2020 7th International Conference on Electrical Engineering, Computer Sciences and Informatics (EECSI), DOI 10.23919/eecsi50503.2020.9251897, 2020

Author(s): Agung Besti, Ridwan Ilyas, Fatan Kasyidi, Esmeralda Contessa Djamal

Keyword(s): Neural Network, Recurrent Neural Network, Sentence Pair, Semantic Classification, Scientific Sentence

Meerkat Mafia: Multilingual and Cross-Level Semantic Textual Similarity Systems

DOI 10.3115/v1/s14-2072, 2014, Cited By ~ 7

Author(s): Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, ...

Keyword(s): Semantic Textual Similarity

Semantic Textual Similarity using Machine Learning and Conceptual Relatedness

SSRN Electronic Journal, DOI 10.2139/ssrn.3576366, 2020

Author(s): Shivam Varshney, Priyanka Sharma, Hira Javed

Keyword(s): Machine Learning, Semantic Textual Similarity

Semantic Textual Similarity of Sentences with Emojis

Companion Proceedings of the Web Conference 2020, DOI 10.1145/3366424.3383758, 2020, Cited By ~ 1

Author(s): Alok Debnath, Nikhil Pinnaparaju, Manish Shrivastava, Vasudeva Varma, Isabelle Augenstein

Keyword(s): Semantic Textual Similarity

Evaluating semantic textual similarity in clinical sentences using deep learning and sentence embeddings

Proceedings of the 35th Annual ACM Symposium on Applied Computing, DOI 10.1145/3341105.3373987, 2020, Cited By ~ 1

Author(s): Rui Antunes, João Figueira Silva, Sérgio Matos

Keyword(s): Deep Learning, Semantic Textual Similarity

Self-spectral learning with GAN based spectral-spatial target detection for hyperspectral image

Neural Networks, DOI 10.1016/j.neunet.2021.05.029, 2021

Author(s): Weiying Xie, Jiaqing Zhang, Jie Lei, Yunsong Li, Xiuping Jia

Keyword(s): Target Detection, Hyperspectral Image, Spectral Learning

Technological troubleshooting based on sentence embedding with deep transformers

Journal of Intelligent Manufacturing, DOI 10.1007/s10845-021-01797-w, 2021

Author(s): Antonio L. Alfeo, Mario G. C. A. Cimino, Gigliola Vaglini

Keyword(s): Technical Assistance, State Of The Art, Semantic Context, Retrieval Performance, Private Company, Textual Data, Context Knowledge, Development And Management, Preparation Module, Semantic Textual Similarity

Abstract: In today's manufacturing, each technical assistance operation is digitally tracked. This results in a huge amount of textual data that can be exploited as a knowledge base to improve these operations. For instance, an ongoing problem can be addressed by retrieving potential solutions from those used to cope with similar problems during past operations. To be effective, most approaches to semantic textual similarity need to be supported by a structured semantic context (e.g. an industry-specific ontology), resulting in high development and management costs. We overcome this limitation with a textual similarity approach featuring three functional modules. The data preparation module performs punctuation and stop-word removal and word lemmatization. The pre-processed sentences are passed to the sentence embedding module, which is based on Sentence-BERT (Bidirectional Encoder Representations from Transformers) and transforms the sentences into fixed-length vectors. Their cosine similarity is processed by the scoring module to match the expected similarity between the two original sentences. Finally, this similarity measure is employed to retrieve the most suitable recorded solutions for the ongoing problem. The effectiveness of the proposed approach is tested (i) against a state-of-the-art competitor and two well-known textual similarity approaches, and (ii) with two case studies, i.e. private company technical assistance reports and a benchmark dataset for semantic textual similarity. Compared with the state of the art, the proposed approach achieves comparable retrieval performance at a significantly lower management cost: 30-min questionnaires are sufficient to obtain the semantic context knowledge to be injected into our textual search engine.
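
The three-module pipeline in this abstract (data preparation, sentence embedding, cosine-similarity scoring followed by retrieval) can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, lemmatization is omitted, and the fixed-length vectors are hand-written toy values standing in for Sentence-BERT embeddings, which the sketch does not compute. All function names are hypothetical.

```python
import math
import string

# Tiny hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "is", "to", "of", "and"}


def preprocess(sentence):
    """Lowercase, strip punctuation, and drop stop words (no lemmatization)."""
    table = str.maketrans("", "", string.punctuation)
    tokens = sentence.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, records, top_k=1):
    """Return the texts of the top_k records most similar to the query."""
    ranked = sorted(
        records,
        key=lambda rec: cosine_similarity(query_vec, rec[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


# Toy vectors standing in for sentence embeddings of past solutions.
records = [
    ("Replace the worn drive belt", [0.9, 0.1, 0.0]),
    ("Recalibrate the temperature sensor", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # toy embedding of the ongoing problem

print(preprocess("The belt is worn, replace it!"))
print(retrieve(query_vec, records))
```

In a real deployment the toy vectors would be replaced by the output of a sentence encoder, and the scoring module described in the abstract would map the raw cosine value to the expected similarity scale before ranking.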

Semantic Textual Similarity in Bengali Text

2018 International Conference on Bangla Speech and Language Processing (ICBSLP), DOI 10.1109/icbslp.2018.8554940, 2018, Cited By ~ 2

Author(s): Md Shajalal, Masaki Aono

Keyword(s): Semantic Textual Similarity
