ALBERT-based Self-ensemble Model with Semi-supervised Learning and Data Augmentation for Clinical Semantic Textual Similarity Calculation: Algorithm Validation Study (Preprint)

2020 ◽  
Author(s):  
Junyi Li ◽  
Xuejie Zhang ◽  
Xiaobing Zhou

BACKGROUND In recent years, with the growth in the amount of information and the importance of information screening, increasing attention has been paid to the calculation of textual semantic similarity. In the medical field, with the rapid increase in electronic medical data, electronic medical records and medical research documents have become important data resources for clinical research, and medical textual semantic similarity calculation has become an urgent problem to solve. The 2019 N2C2/OHNLP shared task Track on Clinical Semantic Textual Similarity is one of the significant tasks for medical textual semantic similarity calculation. OBJECTIVE This research aims to solve two problems: 1) medical datasets are small, so models cannot learn from and understand them sufficiently; 2) information is lost during long-distance propagation, so models fail to grasp key information. METHODS This paper combines a text data augmentation method with a self-ensemble ALBERT model under semi-supervised learning to perform clinical textual semantic similarity calculation. RESULTS Compared with the methods submitted to the 2019 N2C2/OHNLP Track 1 ClinicalSTS competition, our method achieves a state-of-the-art result with a Pearson correlation coefficient of 0.92, surpassing the previous best result by 2 percentage points. CONCLUSIONS When a medical dataset is small, data augmentation and improved semi-supervised learning can increase the size of the dataset and boost the learning efficiency of the model. Additionally, self-ensembling improves model performance significantly. These results show that our method performs well and has great potential for related medical problems.
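
The core of the method can be pictured as averaging the predictions of several fine-tuned ALBERT checkpoints. The following is a minimal sketch of that self-ensemble idea, assuming the Hugging Face transformers API; the function and checkpoint names are illustrative, not the authors' code.

```python
# Minimal sketch of a prediction-level self-ensemble for STS regression with ALBERT.
# Assumes Hugging Face `transformers`; checkpoint directories hold models fine-tuned
# on the (augmented) ClinicalSTS data, e.g. snapshots saved at different epochs.
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

def ensemble_similarity(sent_a, sent_b, checkpoint_dirs):
    """Average the regression outputs of several fine-tuned ALBERT checkpoints."""
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
    scores = []
    for ckpt in checkpoint_dirs:  # hypothetical paths to saved checkpoints
        model = AlbertForSequenceClassification.from_pretrained(ckpt, num_labels=1)
        model.eval()
        with torch.no_grad():
            scores.append(model(**inputs).logits.squeeze().item())
    return sum(scores) / len(scores)  # self-ensemble = mean of member predictions
```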

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Sebastian Otálora ◽  
Niccolò Marini ◽  
Henning Müller ◽  
Manfredo Atzori

Abstract Background One challenge in training deep convolutional neural network (CNN) models with whole slide images (WSIs) is providing the required large number of costly, manually annotated image regions. Strategies to alleviate the scarcity of annotated data include transfer learning, data augmentation, and training the models with less expensive image-level annotations (weakly supervised learning). However, it is not clear how to use transfer learning in a CNN model when different data sources are available for training, or how to leverage the combination of large amounts of weakly annotated images with a set of local region annotations. This paper aims to evaluate CNN training strategies based on transfer learning that leverage the combination of weak and strong annotations in heterogeneous data sources. The trade-off between classification performance and annotation effort is explored by evaluating a CNN that learns from strong labels (region annotations) and is later fine-tuned on a dataset with less expensive weak (image-level) labels. Results As expected, the model performance on strongly annotated data steadily increases as the percentage of strong annotations used increases, reaching a performance comparable to pathologists (κ = 0.691 ± 0.02). Nevertheless, the performance sharply decreases in the WSI classification scenario (κ = 0.307 ± 0.133), where it remains lower regardless of the number of annotations used. The performance increases when the model is fine-tuned for the task of Gleason scoring with the weak WSI labels (κ = 0.528 ± 0.05). Conclusion Combining weak and strong supervision improves on strong supervision alone in the classification of Gleason patterns using tissue microarrays (TMA) and WSI regions. Our results contribute effective strategies for training CNN models that combine few annotated data and heterogeneous data sources. The performance increases in the controlled TMA scenario with the number of annotations used to train the model. Nevertheless, the performance is hindered when the trained TMA model is applied directly to the more challenging WSI classification problem. This demonstrates that a good pretrained model for prostate cancer TMA image classification may lead to the best downstream model if fine-tuned on the WSI target dataset. We have made the source code repository for reproducing the experiments available at: https://github.com/ilmaro8/Digital_Pathology_Transfer_Learning
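
The two-stage strategy described above (train on strong region-level labels, then fine-tune on weak image-level labels) can be sketched as follows. This is an illustrative outline under stated assumptions, not the code from the linked repository; the network choice (DenseNet-121) and the loader names are placeholders.

```python
# Illustrative two-stage training: ImageNet transfer learning, strong patch labels
# first, then fine-tuning on patches carrying only weak (slide-level) labels.
import torch
import torch.nn as nn
from torchvision import models

def build_model(num_classes):
    model = models.densenet121(weights="IMAGENET1K_V1")  # transfer learning start
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

def train(model, loader, epochs, lr):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss_fn(model(images), labels).backward()
            optimizer.step()

# model = build_model(num_classes=4)                       # e.g. Gleason patterns
# train(model, strong_patch_loader, epochs=10, lr=1e-4)    # stage 1: strong labels
# train(model, weak_wsi_patch_loader, epochs=5, lr=1e-5)   # stage 2: weak labels, lower lr
```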


2021 ◽  
Vol 11 (16) ◽  
pp. 7188
Author(s):  
Tieming Chen ◽  
Yunpeng Chen ◽  
Mingqi Lv ◽  
Gongxun He ◽  
Tiantian Zhu ◽  
...  

Malicious HTTP traffic detection plays an important role in web application security. Most existing work applies machine learning and deep learning techniques to build malicious HTTP traffic detection models. However, these approaches still suffer from the high cost of training data collection and low cross-dataset generalization ability. To address these problems, this paper proposes DeepPTSD, a deep learning method for payload-based malicious HTTP traffic detection. First, it treats malicious HTTP traffic detection as a text classification problem, trains the initial detection model using TextCNN on a public dataset, and then adapts the initial detection model to the target dataset with a transfer learning algorithm. Second, in the transfer learning procedure, it uses a semi-supervised learning algorithm to accomplish the model adaptation task. The semi-supervised learning algorithm enhances the target dataset with an HTTP payload data augmentation mechanism to exploit both the labeled and unlabeled data. We evaluate DeepPTSD on two real HTTP traffic datasets. The results show that DeepPTSD has competitive performance under small data conditions.
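
As a rough illustration of the TextCNN backbone that DeepPTSD builds on (treating an HTTP payload as a token sequence and classifying it as benign or malicious), a minimal model might look like the sketch below; the hyperparameters are placeholders, and the transfer learning and augmentation steps are omitted.

```python
# Minimal TextCNN sketch for payload classification (illustrative, not DeepPTSD itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))         # benign vs. malicious logits
```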


Stroke ◽  
2020 ◽  
Vol 51 (Suppl_1) ◽  
Author(s):  
Vitor Mendes Pereira ◽  
Yoni Donner ◽  
Gil Levi ◽  
Nicole Cancelliere ◽  
Erez Wasserman ◽  
...  

Cerebral aneurysms (CAs) may occur in 5-10% of the population. They are often missed because their detection requires a very methodical diagnostic approach. We developed an artificial intelligence algorithm to assist in and supervise the detection of CAs. Methods: We developed an automated algorithm to detect CAs. The algorithm is based on a 3D convolutional neural network modeled as a U-net. We included all saccular CAs from 2014 to 2016 from a single center. Normal and pathological datasets were prepared and annotated in 3D using an in-house developed platform. To assess the accuracy and optimize the model, we assessed preliminary results using a validation dataset. After the algorithm was trained, a separate dataset was used to evaluate final CA detection and aneurysm measurements. The accuracy of the algorithm was derived using ROC curves and Pearson correlation tests. Results: We used 528 CTAs with 674 aneurysms at the following locations: ACA (3%), ACA/ACOM (26.1%), ICA/MCA (26.3%), MCA (29.4%), PCA/PCOM (2.3%), basilar (6.6%), vertebral (2.3%), and other (3.7%). Training datasets consisted of 189 CA scans. We plotted ROC curves and achieved an AUC of 0.85 for unruptured and 0.88 for ruptured CAs. We improved the model performance by enlarging the training dataset with various data augmentation methods to leverage the data to its fullest. The final model was tested on 528 CTAs using 5-fold cross-validation and an additional set of 2400 normal CTAs. There was a significant improvement compared with the initial assessment, with an AUC of 0.93 for unruptured and 0.94 for ruptured CAs. The algorithm detected larger aneurysms more accurately, reaching an AUC of 0.97 and a specificity of 91.5% at 90% sensitivity for aneurysms larger than 7 mm. The algorithm also accurately detected CAs in the following locations: basilar (AUC of 0.97) and MCA/ACOM (AUC of 0.94). The aneurysm volume measurements (mm3) produced by the model achieved a Pearson correlation of 99.36 with the annotated measurements. Conclusion: The Viz.ai aneurysm algorithm was able to detect and measure ruptured and unruptured CAs in consecutive CTAs. The model demonstrates that a deep learning AI algorithm can achieve clinically useful levels of accuracy for clinical decision support.
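
The evaluation reported above (AUC from ROC curves, specificity at a fixed sensitivity, and Pearson correlation of measured volumes) can be reproduced generically with standard tooling; the sketch below is illustrative and is not the Viz.ai pipeline.

```python
# Illustrative evaluation sketch: detection AUC, specificity at a target sensitivity,
# and Pearson correlation between predicted and annotated aneurysm volumes (mm3).
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate(y_true, y_score, pred_volumes, annot_volumes, target_sensitivity=0.90):
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    idx = int(np.argmin(np.abs(tpr - target_sensitivity)))  # closest operating point
    specificity = 1.0 - fpr[idx]
    r, _ = pearsonr(pred_volumes, annot_volumes)             # volume agreement
    return auc, specificity, r
```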


2020 ◽  
Author(s):  
David Chang ◽  
Eric Lin ◽  
Cynthia Brandt ◽  
Richard Andrew Taylor

BACKGROUND While electronic health record systems have facilitated clinical documentation in healthcare, they also introduce new challenges such as the proliferation of redundant information through copy-and-paste commands or templates. One approach to trim down bloated clinical documentation and improve clinical summarization is to identify highly similar text snippets for the goal of removing such text. OBJECTIVE We develop a natural language processing system for the task of clinical semantic textual similarity that assigns scores to pairs of clinical text snippets based on their clinical semantic similarity. METHODS We leverage recent advances in natural language processing and graph representation learning to create a model that combines linguistic and domain knowledge information from the MedSTS dataset to assess clinical semantic textual similarity. We use Bidirectional Encoder Representation from Transformers (BERT)-based models as text encoders for the sentence pairs in the dataset and graph convolutional networks (GCNs) as graph encoders for corresponding concept graphs constructed based on the sentences. We also explore techniques including data augmentation, ensembling, and knowledge distillation to improve the performance as measured by Pearson correlation. RESULTS Fine-tuning BERT-base and ClinicalBERT on the MedSTS dataset provided a strong baseline (0.842 and 0.848 Pearson correlation, respectively) compared to the previous year’s submissions. Our data augmentation techniques yielded moderate gains in performance, and adding a GCN-based graph encoder to incorporate the concept graphs also boosted performance, especially when the node features were initialized with pretrained knowledge graph embeddings of the concepts (0.868). As expected, ensembling improved performance, and multi-source ensembling using different language model variants, conducting knowledge distillation on the multi-source ensemble model, and taking a final ensemble of the distilled models further improved the system’s performance (0.875, 0.878, and 0.882, respectively). CONCLUSIONS We develop a system for the MedSTS clinical semantic textual similarity benchmark task by combining BERT-based text encoders and GCN-based graph encoders in order to incorporate domain knowledge into the natural language processing pipeline. We also experiment with other techniques involving data augmentation, pretrained concept embeddings, ensembling, and knowledge distillation to further increase our performance.
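
The central architectural idea, fusing a BERT encoding of the sentence pair with a GCN encoding of the corresponding concept graph before regressing a similarity score, can be sketched as follows. This is a minimal sketch under stated assumptions (one sentence pair and one concept graph per call, a hand-rolled dense GCN layer), not the authors' implementation.

```python
# Minimal sketch: fuse a BERT [CLS] encoding of a sentence pair with a GCN encoding
# of its concept graph, then regress a similarity score. Assumes Hugging Face
# `transformers`; `enc` is the tokenizer output for one sentence pair.
import torch
import torch.nn as nn
from transformers import AutoModel

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):        # adj: normalized adjacency, (N, N)
        return torch.relu(self.lin(adj @ node_feats))

class BertGcnSTS(nn.Module):
    def __init__(self, lm_name="bert-base-uncased", concept_dim=200, hidden=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained(lm_name)
        self.gcn = GCNLayer(concept_dim, hidden)
        self.head = nn.Linear(self.bert.config.hidden_size + hidden, 1)

    def forward(self, enc, node_feats, adj):
        cls = self.bert(**enc).last_hidden_state[:, 0]               # (1, hidden_size)
        graph = self.gcn(node_feats, adj).mean(dim=0, keepdim=True)  # pooled graph
        return self.head(torch.cat([cls, graph], dim=1))             # similarity score
```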


10.2196/23357 ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. e23357
Author(s):  
Ying Xiong ◽  
Shuai Chen ◽  
Qingcai Chen ◽  
Jun Yan ◽  
Buzhou Tang

Background With the widespread adoption of electronic health records (EHRs), the quality of health care has improved. However, EHRs also cause some problems, such as the growing use of copy-and-paste and templates, resulting in EHRs with low-quality content. In order to minimize data redundancy in different documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity among clinical text snippets. Objective In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results. Methods We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules to encode clinical text snippet pairs at different levels: (1) a character-level representation module based on a convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) a sentence-level representation module that adopts a pretrained language model, bidirectional encoder representation from transformers (BERT), to encode clinical text snippet pairs; and (3) an entity-level representation module to model clinical entity information in clinical text snippets. For entity-level representation, we compare 2 methods. One encodes entities by the entity-type label sequence corresponding to the text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II). Results We conduct experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge for model performance evaluation. The model using only BERT for text snippet pair encoding achieved a Pearson correlation coefficient (PCC) of 0.848. When character-level representation and entity-level representation are individually added into our model, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation are added into our model, the PCC further increased to 0.861 (entity I) and 0.868 (entity II). Conclusions Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model.
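
To make the three-level design concrete, the sketch below fuses a BERT [CLS] sentence-pair encoding with a character-level CNN and an entity-type embedding (the entity I variant) before a regression head. It is an illustration under stated assumptions, not the authors' code; vocabulary sizes and dimensions are placeholders.

```python
# Illustrative three-level fusion: character CNN + BERT sentence encoding +
# entity-type embedding, concatenated and regressed to a similarity score.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiLevelSTS(nn.Module):
    def __init__(self, char_vocab=100, ent_vocab=50, char_dim=32, ent_dim=32,
                 filters=64, lm_name="bert-base-uncased"):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, filters, kernel_size=3)
        self.ent_emb = nn.Embedding(ent_vocab, ent_dim, padding_idx=0)
        self.bert = AutoModel.from_pretrained(lm_name)
        self.head = nn.Linear(self.bert.config.hidden_size + filters + ent_dim, 1)

    def forward(self, enc, char_ids, ent_ids):
        cls = self.bert(**enc).last_hidden_state[:, 0]                # sentence level
        ch = torch.relu(self.char_cnn(self.char_emb(char_ids).transpose(1, 2)))
        ch = ch.max(dim=2).values                                     # character level
        ent = self.ent_emb(ent_ids).mean(dim=1)                       # entity level
        return self.head(torch.cat([cls, ch, ent], dim=1))            # similarity score
```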


2014 ◽  
Vol 6 (2) ◽  
pp. 46-51
Author(s):  
Galang Amanda Dwi P. ◽  
Gregorius Edwadr ◽  
Agus Zainal Arifin

Nowadays, a large amount of information cannot be reached by readers because of the misclassification of text-based documents, and misclassified data can also lead readers to the wrong information. The method proposed in this paper aims to classify documents into the correct groups. Each document has a membership value in several different classes. Semantic similarity is used to measure the degree of similarity between two documents; in fact, no document is entirely unrelated to another, although the relationship may be close to 0. The method calculates the similarity between two documents by taking into account the similarity of words and their synonyms. After all inter-document similarity values are obtained, a matrix is created and used as a semi-supervised factor. The output of the method is the membership value of each document; the greatest membership value of a document indicates the group to which it is assigned. The classification accuracy achieved by the method is 90%. Index Terms - Fuzzy co-clustering, Heuristic, Semantic Similarity, Semi-supervised learning.
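
As a rough illustration of the similarity matrix used as the semi-supervised factor, the sketch below scores document pairs by word overlap with partial credit for WordNet synonyms; the exact weighting in the paper may differ, and the 0.8 synonym weight is an arbitrary placeholder.

```python
# Hedged sketch: build a document-to-document similarity matrix from word and
# synonym overlap, to serve as the semi-supervised factor in fuzzy co-clustering.
# Requires NLTK with the WordNet corpus (nltk.download("wordnet")).
import numpy as np
from nltk.corpus import wordnet as wn

def word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    synonyms = {l.name().lower() for s in wn.synsets(w1) for l in s.lemmas()}
    return 0.8 if w2 in synonyms else 0.0        # partial credit for synonyms

def doc_sim(doc_a, doc_b):
    words_a, words_b = doc_a.lower().split(), doc_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    best = [max(word_sim(a, b) for b in words_b) for a in words_a]
    return sum(best) / len(best)

def similarity_matrix(docs):
    n = len(docs)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            m[i, j] = doc_sim(docs[i], docs[j])
    return m    # fed into the fuzzy co-clustering objective as prior knowledge
```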


2019 ◽  
Author(s):  
Chin Lin ◽  
Yu-Sheng Lou ◽  
Chia-Cheng Lee ◽  
Chia-Jung Hsu ◽  
Ding-Chung Wu ◽  
...  

BACKGROUND An artificial intelligence-based algorithm has shown a powerful ability for coding the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) in discharge notes. However, its performance still requires improvement compared with human experts. The major disadvantage of the previous algorithm is its lack of understanding of medical terminology. OBJECTIVE We propose several methods based on the human learning process and conduct a series of experiments to validate their improvements. METHODS We compared two data sources for training the word-embedding model: English Wikipedia and PubMed journal abstracts. Moreover, fixed, changeable, and double-channel embedding tables were used to test their performance. Some additional tricks were also applied to improve accuracy. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes in a set of discharge notes. Subsequently, 94,483 labeled discharge notes from June 1, 2015 to June 30, 2017 from the Tri-Service General Hospital in Taipei, Taiwan were used. To evaluate performance, 24,762 discharge notes from July 1, 2017 to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were also tested. The F-measure is the major global measure of effectiveness. RESULTS In understanding medical terminology, the PubMed embedding model (Pearson correlation = 0.60/0.57) shows better performance than the Wikipedia embedding model (Pearson correlation = 0.35/0.31). For the accuracy of ICD-10-CM coding, the changeable model used with both the PubMed and Wikipedia embeddings achieved the highest mean testing F-measures (0.7311 and 0.6639 at the Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, a proposed hybrid sampling method, an augmentation trick to prevent the algorithm from identifying negative terms, was found to further improve model performance. CONCLUSIONS The proposed model architecture and training method, named ICD10Net, is the first expert-level model practically applied to daily work. This model can also be applied to unstructured information extraction from free-text medical writing. We have developed a web app to demonstrate our work (https://linchin.ndmctsgh.edu.tw/app/ICD10/).
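
The fixed, changeable, and double-channel embedding tables compared above can be expressed compactly in a sketch: one frozen copy and one trainable copy of the same pretrained word vectors stacked as two input channels for a text CNN. This is an illustration of the idea, not the ICD10Net code; the vector shapes are placeholders.

```python
# Illustrative double-channel embedding: a frozen ("fixed") and a trainable
# ("changeable") copy of the same pretrained word vectors, stacked as channels.
import torch
import torch.nn as nn

class DoubleChannelEmbedding(nn.Module):
    def __init__(self, pretrained_vectors):                 # (vocab, dim) float tensor
        super().__init__()
        self.fixed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.changeable = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

    def forward(self, token_ids):                           # (batch, seq_len)
        # stack as two channels: (batch, 2, seq_len, dim)
        return torch.stack([self.fixed(token_ids), self.changeable(token_ids)], dim=1)

# vectors = torch.randn(30000, 200)   # placeholder for PubMed/Wikipedia word2vec vectors
# emb = DoubleChannelEmbedding(vectors)
# emb(torch.randint(0, 30000, (4, 128))).shape   # torch.Size([4, 2, 128, 200])
```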

