A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

Author(s): Yikuan Li, Hanyin Wang, Yuan Luo
2021
Author(s): Masaya Sato, Tamaki Kobayashi, Yoko Soroida, Takashi Tanaka, Takuma Nakatsuka, ...

Abstract: Recently, multimodal representation learning for images together with other information, such as numerical or language data, has gained much attention owing to the possibility of combining latent features in a single distribution. The aim of the current study was to analyze the diagnostic performance of deep multimodal representation model-based integration of tumor images, patient background, and blood biomarkers for the differentiation of liver tumors observed using B-mode ultrasonography (US). First, we applied supervised learning with a convolutional neural network (CNN) to 972 liver nodules in the training and development sets (479 benign and 493 malignant nodules) to develop a predictive model from segmented B-mode tumor images. We also applied a deep multimodal representation model to integrate information about patient background or blood biomarkers with the B-mode images. We then investigated the performance of the models on an independent test set of 108 liver nodules, comprising 53 benign and 55 malignant tumors. Using only the segmented B-mode images, the diagnostic accuracy and area under the curve (AUC) were 68.52% and 0.721, respectively. As information about patient background (such as age and sex) and blood biomarkers was integrated, diagnostic performance increased stepwise. The diagnostic accuracy and AUC of the multimodal DL model, which integrated the B-mode tumor image with patient age, sex, AST, ALT, platelet count, and albumin data, reached 96.30% and 0.994, respectively. Integrating patient background and blood biomarkers with the US image via multimodal representation learning thus outperformed the CNN model trained on US images alone. We expect that the deep multimodal representation model could be a feasible and acceptable tool to effectively support the definitive diagnosis of liver tumors using B-mode US in daily clinical practice.
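The abstract gives no implementation details, so the following is a minimal, hypothetical PyTorch sketch of one common way to fuse a CNN image embedding with the tabular clinical features named above (age, sex, AST, ALT, platelet count, albumin): encode each modality separately and concatenate the latent vectors before a joint classifier. All layer sizes, module names, and the late-fusion design are illustrative assumptions; the authors' model reportedly combines latent features through a single distribution, which this simple concatenation does not reproduce.

```python
# Hypothetical sketch (not the authors' code): a CNN encoder for segmented
# B-mode tumor images fused with tabular clinical features by concatenation.
import torch
import torch.nn as nn

class MultimodalLiverTumorClassifier(nn.Module):
    def __init__(self, n_clinical: int = 6):  # age, sex, AST, ALT, platelets, albumin
        super().__init__()
        # Small CNN encoder for the grayscale B-mode image.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 32)
        )
        # MLP encoder for the clinical (tabular) features.
        self.clinical_encoder = nn.Sequential(
            nn.Linear(n_clinical, 16), nn.ReLU(),
        )
        # Joint head over the concatenated latent features.
        self.classifier = nn.Sequential(
            nn.Linear(32 + 16, 32), nn.ReLU(),
            nn.Linear(32, 1),  # single logit: benign vs. malignant
        )

    def forward(self, image: torch.Tensor, clinical: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.image_encoder(image),
                       self.clinical_encoder(clinical)], dim=1)
        return self.classifier(z)

# Toy usage with random data standing in for images and lab values.
model = MultimodalLiverTumorClassifier()
logits = model(torch.randn(4, 1, 128, 128), torch.randn(4, 6))
print(logits.shape)  # torch.Size([4, 1])
```

Dropping the clinical encoder and classifying on the image embedding alone recovers the image-only CNN baseline that the study compares against.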


2021 ◽  
Vol 15 ◽  
Author(s):  
Jianwei Zhang ◽  
Xubin Zhang ◽  
Lei Lv ◽  
Yining Di ◽  
Wei Chen

Background: Learning discriminative representations from large-scale data sets has achieved breakthroughs in recent decades. However, generating representative embeddings from limited examples, for example a class containing only one image, remains a thorny problem. Recently, deep learning-based Few-Shot Learning (FSL) has been proposed to tackle this problem by leveraging prior knowledge in various ways. Objective: In this work, we review recent advances in FSL from the perspective of high-dimensional representation learning. The results of the analysis can provide insights and directions for future work. Methods: We first present the definition of general FSL. We then propose a general framework for the FSL problem and give a taxonomy under that framework. We survey two FSL directions: learning policy and meta-learning. Results: We review advanced applications of FSL, including image classification, object detection, image segmentation, and other tasks, as well as the corresponding benchmarks, to provide an overview of recent progress. Conclusion: FSL needs further study in medical imaging, language models, and reinforcement learning. In addition, cross-domain FSL, successive FSL, and associated FSL are more challenging and valuable research directions.
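As a concrete illustration of the meta-learning direction surveyed here, below is a minimal sketch of the episode-level classification step in one well-known metric-based meta-learner, the prototypical network (Snell et al., 2017). The encoder is omitted and all tensor shapes are illustrative assumptions; this is not code from any specific work covered by the review.

```python
# Hypothetical sketch of one few-shot episode with class prototypes
# (metric-based meta-learning, in the style of prototypical networks).
import torch

def prototype_logits(support: torch.Tensor,  # (n_way, k_shot, dim) embeddings
                     query: torch.Tensor     # (n_query, dim) embeddings
                     ) -> torch.Tensor:
    # Each class prototype is the mean of its k support embeddings.
    prototypes = support.mean(dim=1)               # (n_way, dim)
    # Score queries by negative squared Euclidean distance to prototypes.
    dists = torch.cdist(query, prototypes) ** 2    # (n_query, n_way)
    return -dists                                  # higher = closer

# Toy 5-way 1-shot episode with random vectors standing in for embeddings.
support = torch.randn(5, 1, 64)
query = torch.randn(10, 64)
logits = prototype_logits(support, query)
print(logits.argmax(dim=1))  # predicted class index per query
```

In a full meta-learning loop, the encoder producing these embeddings is trained end-to-end across many such episodes so that prototypes generalize to novel classes.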


Author(s): Nicholas Westing, Kevin C. Gross, Brett J. Borghetti, Christine M. Schubert Kabban, Jacob Martin, ...

2019, Vol 6 (6), pp. 10675-10685
Author(s): Zhenhua Huang, Xin Xu, Juan Ni, Honghao Zhu, Cheng Wang
