Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering

2022 ◽  
Vol 40 (4) ◽  
pp. 1-27
Author(s):  
Zhongwei Xie ◽  
Ling Liu ◽  
Yanzhao Wu ◽  
Luo Zhong ◽  
Lin Li

This article introduces a two-phase deep feature engineering framework for efficient learning of a semantics-enhanced joint embedding, which clearly separates the deep feature engineering in data preprocessing from the training of the text-image joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we perform deep feature engineering by combining deep text and image features with semantic context features derived from the raw text-image input data. We leverage LSTM to identify key terms, and deep NLP models from the BERT family, TextRank, or TF-IDF to produce ranking scores for key terms, before generating the vector representation of each key term with Word2vec. We leverage Wide ResNet50 and Word2vec to extract and encode the image category semantics of food images, which helps align the learned recipe and image embeddings semantically in the joint latent space. In joint embedding learning, we perform deep feature engineering by optimizing a batch-hard triplet loss function with soft margin and double negative sampling, while also taking into account the category-based alignment loss and discriminator-based alignment loss. Extensive experiments demonstrate that our SEJE approach with deep feature engineering significantly outperforms state-of-the-art approaches.
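To make the training objective concrete, the following is a minimal PyTorch sketch of a batch-hard triplet loss with soft margin and double (bidirectional) negative sampling, assuming L2-normalized recipe and image embeddings in which row i of each batch forms a matching pair; the category-based and discriminator-based alignment losses are omitted, so this is an illustration rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def batch_hard_soft_margin_loss(recipe_emb, image_emb):
    """Soft-margin batch-hard triplet loss over a cross-modal batch.

    recipe_emb, image_emb: (B, D) L2-normalized embeddings; row i of
    each tensor is assumed to be a matching recipe-image pair.
    """
    dist = torch.cdist(recipe_emb, image_emb)  # (B, B) pairwise distances
    pos = dist.diag()                          # distances of matching pairs
    # Mask the positives, then take the hardest (closest) negative per
    # anchor in both retrieval directions ("double" negative sampling).
    mask = torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
    neg_r2i = dist.masked_fill(mask, float('inf')).min(dim=1).values
    neg_i2r = dist.masked_fill(mask, float('inf')).min(dim=0).values
    # Soft margin: log(1 + exp(pos - neg)) replaces the hinge [m + pos - neg]+.
    loss = F.softplus(pos - neg_r2i) + F.softplus(pos - neg_i2r)
    return loss.mean()
```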

Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2576
Author(s):  
Alessandro Masullo ◽  
Tilo Burghardt ◽  
Dima Damen ◽  
Toby Perrett ◽  
Majid Mirmehdi

The use of visual sensors for monitoring people in their living environments can yield more accurate health measurements, but it is undermined by privacy concerns. Silhouettes, generated from RGB video, can alleviate the privacy issue to a considerable degree. However, silhouettes make it difficult to discriminate between different subjects, preventing subject-tailored analysis of the data within a free-living, multi-occupancy home. This limitation can be overcome by a strategic fusion of sensors involving wearable accelerometer devices, which can be used in conjunction with the silhouette video data to match video clips to the specific patient being monitored. The proposed method simultaneously solves the problem of person re-identification (ReID) from silhouettes and enables home monitoring systems to employ sensor-fusion techniques for data analysis. We develop a multimodal deep-learning detection framework that maps short video clips and accelerations into a latent space where the Euclidean distance can be measured to match video and acceleration streams. We train our method on the SPHERE Calorie Dataset, for which we show an average area under the ROC curve of 76.3% and an assignment accuracy of 77.4%. In addition, we propose a novel triplet loss, which we demonstrate improves performance and convergence speed.
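The core idea of mapping both streams into a shared latent space and matching by Euclidean distance can be sketched as below; the encoder architectures, layer sizes, and input shapes are hypothetical stand-ins, not the paper's networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SilhouetteEncoder(nn.Module):
    # 3D conv stack over short silhouette clips (B, 1, T, H, W) -> (B, D)
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class AccelEncoder(nn.Module):
    # 1D conv stack over tri-axial acceleration windows (B, 3, T) -> (B, D)
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

# Matching: each video clip is assigned to the wearable whose
# acceleration embedding is nearest in Euclidean distance.
video = SilhouetteEncoder()(torch.randn(4, 1, 16, 64, 64))  # 4 clips
accel = AccelEncoder()(torch.randn(2, 3, 100))              # 2 wearers
assignment = torch.cdist(video, accel).argmin(dim=1)        # (4,)
```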


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Wentao Wei ◽  
Xuhui Hu ◽  
Hua Liu ◽  
Ming Zhou ◽  
Yan Song

As a machine-learning-driven decision-making problem, surface electromyography (sEMG)-based hand movement recognition is one of the key issues in the robust control of noninvasive neural interfaces such as myoelectric prostheses and rehabilitation robots. Despite recent success in sEMG-based hand movement recognition using end-to-end deep feature learning built on deep learning models, the performance of today's sEMG-based hand movement recognition systems is still limited by the noisy, random, and nonstationary nature of sEMG signals, and researchers have proposed a number of methods that improve sEMG-based hand movement recognition via feature engineering. Aiming at higher sEMG-based hand movement recognition accuracy while enabling a trade-off between performance and computational complexity, this study proposes a progressive fusion network (PFNet) framework, which improves sEMG-based hand movement recognition by integrating domain-knowledge-guided feature engineering and deep feature learning. In particular, it learns high-level feature representations from raw sEMG signals and from engineered time-frequency domain features via a feature learning network and a domain knowledge network, respectively, and then employs a 3-stage progressive fusion strategy to progressively fuse the two networks and obtain the final decisions. Extensive experiments were conducted on five sEMG datasets to evaluate the proposed PFNet, and the results show that it achieves average hand movement recognition accuracies of 87.8%, 85.4%, 68.3%, 71.7%, and 90.3% on the five datasets, respectively, outperforming the state of the art.
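A minimal sketch of the two-branch, stage-wise fusion idea follows; the layer sizes, the fully connected blocks, and the exact way the fused state is carried across the three stages are hypothetical simplifications of PFNet, not its published architecture.

```python
import torch
import torch.nn as nn

class PFNetSketch(nn.Module):
    """Two branches (raw-sEMG deep features vs. engineered time-frequency
    features) fused stage by stage into a shared representation."""
    def __init__(self, raw_dim=64, eng_dim=32, hidden=128, n_classes=10,
                 n_stages=3):
        super().__init__()
        dims_raw = [raw_dim] + [hidden] * n_stages
        dims_eng = [eng_dim] + [hidden] * n_stages
        self.raw_branch = nn.ModuleList(
            nn.Sequential(nn.Linear(dims_raw[i], hidden), nn.ReLU())
            for i in range(n_stages))
        self.eng_branch = nn.ModuleList(
            nn.Sequential(nn.Linear(dims_eng[i], hidden), nn.ReLU())
            for i in range(n_stages))
        # One fusion block per stage; from stage 2 onward it also takes
        # the fused state carried over from the previous stage.
        self.fusions = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * hidden + (hidden if i > 0 else 0),
                                    hidden), nn.ReLU())
            for i in range(n_stages))
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, raw, eng):
        fused = None
        for raw_fc, eng_fc, fuse in zip(self.raw_branch, self.eng_branch,
                                        self.fusions):
            raw, eng = raw_fc(raw), eng_fc(eng)
            cat = [raw, eng] if fused is None else [raw, eng, fused]
            fused = fuse(torch.cat(cat, dim=-1))
        return self.head(fused)

# logits = PFNetSketch()(torch.randn(8, 64), torch.randn(8, 32))
```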


Author(s):  
Shanshan Wang ◽  
Lei Zhang

Existing adversarial domain adaptation methods mainly align the marginal distributions, which may lead to either under-transfer or negative transfer. To address this problem, we present a self-adaptive re-weighted adversarial domain adaptation approach, which tries to enhance domain alignment from the perspective of the conditional distribution. To promote positive transfer and combat negative transfer, we reduce the weight of the adversarial loss for well-aligned features while increasing the adversarial force for poorly aligned ones, as measured by the conditional entropy. Additionally, a triplet loss leveraging source samples and pseudo-labeled target samples is employed on the confused domain. This metric loss keeps intra-class sample pairs closer than inter-class pairs, achieving class-level alignment. In this way, highly accurate pseudo-labeled target samples and semantic alignment can be captured simultaneously during co-training. Our method achieves a low joint error of the ideal source and target hypothesis, so the expected target error can be upper-bounded following Ben-David's theorem. Empirical evidence demonstrates that the proposed model outperforms the state of the art on standard domain adaptation datasets.
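The entropy-based re-weighting can be sketched as follows; the weight form 1 + H(p) and the binary domain loss are assumptions for illustration, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def conditional_entropy(cls_logits):
    """Entropy of the label classifier's prediction for each sample;
    high entropy indicates a poorly aligned (uncertain) feature."""
    p = F.softmax(cls_logits, dim=-1)
    return -(p * torch.log(p + 1e-8)).sum(dim=-1)

def reweighted_domain_loss(domain_logits, domain_labels, cls_logits):
    """Per-sample adversarial (domain classification) loss, re-weighted
    so poorly aligned samples receive a larger adversarial force."""
    w = (1.0 + conditional_entropy(cls_logits)).detach()  # hypothetical form
    per_sample = F.binary_cross_entropy_with_logits(
        domain_logits.squeeze(-1), domain_labels.float(), reduction='none')
    return (w * per_sample).mean()
```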


Author(s):  
Yao Yang ◽  
Haoran Chen ◽  
Junming Shao

Deep autoencoders are widely used for dimensionality reduction because of the expressive power of neural networks. They are therefore naturally suited to embedding tasks, which essentially compress high-dimensional information into a low-dimensional latent space. For network representation, autoencoder-based methods such as SDNE and DNGR have achieved results comparable with the state of the art. However, none of them leverages label information, so the learned embeddings lack discriminative power. In this paper, we present the Triplet Enhanced AutoEncoder (TEA), a new deep network embedding approach from the perspective of metric learning. Equipped with the triplet-loss constraint, the proposed approach not only captures the topological structure but also preserves discriminative information. Moreover, unlike existing discriminative embedding techniques, TEA is independent of any specific classifier; we call this the model-free property. Extensive empirical results on three public datasets (i.e., Cora, Citeseer, and BlogCatalog) show that TEA is stable and achieves state-of-the-art performance compared with both supervised and unsupervised network embedding approaches across various percentages of labeled data. The source code can be obtained from https://github.com/yybeta/TEA.
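A minimal sketch of combining an autoencoder's reconstruction term (topology) with a triplet term (discrimination) is given below; layer sizes, the margin, and the weighting coefficient alpha are hypothetical, and the released TEA code should be consulted for the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TEASketch(nn.Module):
    """Autoencoder whose latent code is additionally constrained by a
    triplet loss, so embeddings preserve structure and labels."""
    def __init__(self, in_dim, latent=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(),
                                 nn.Linear(512, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def tea_loss(model, anchor, pos, neg, margin=1.0, alpha=0.1):
    za, recon_a = model(anchor)
    zp, _ = model(pos)
    zn, _ = model(neg)
    recon = F.mse_loss(recon_a, anchor)                         # topology
    triplet = F.triplet_margin_loss(za, zp, zn, margin=margin)  # labels
    return recon + alpha * triplet
```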


Sensors ◽  
2020 ◽  
Vol 20 (1) ◽  
pp. 291 ◽  
Author(s):  
Pingping Liu ◽  
Guixia Gou ◽  
Xue Shan ◽  
Dan Tao ◽  
Qiuzhan Zhou

A rich line of work focuses on designing elegant loss functions under the deep metric learning (DML) paradigm to learn a discriminative embedding space for remote sensing image retrieval (RSIR). Essentially, such an embedding space should efficiently distinguish deep feature descriptors. So far, most losses used in RSIR are based on triplets, which suffer from local optimization, slow convergence, and insufficient use of the similarity structure within a mini-batch. In this paper, we present a novel DML method, named global optimal structured loss, to deal with the limitations of the triplet loss. Specifically, we use a softmax function rather than a hinge function in our loss to realize global optimization. In addition, our optimal structured loss globally learns an efficient deep embedding space with mined informative sample pairs, forcing positive pairs within a limit and pushing negative ones beyond a given boundary. We conducted extensive experiments on four public remote sensing datasets, and the results show that the proposed global optimal structured loss with a pair-mining scheme achieves state-of-the-art performance compared with the baselines.
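As a rough illustration of replacing the hinge with a smooth softmax (log-sum-exp) over all informative pairs in a batch, consider the sketch below; the limit and boundary values, and the exact aggregation, are assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def global_structured_loss(emb, labels, pos_limit=0.5, neg_boundary=1.4):
    """Smooth structured loss over all pairs in a mini-batch: positives
    are pulled inside `pos_limit`, negatives pushed past `neg_boundary`."""
    dist = torch.cdist(emb, emb)                       # (B, B) pairwise
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos_mask, neg_mask = same & ~eye, ~same
    # log-sum-exp ("softmax") replaces the per-triplet hinge, so every
    # informative pair contributes to one globally optimized objective.
    pos_term = torch.logsumexp(dist[pos_mask] - pos_limit, dim=0)
    neg_term = torch.logsumexp(neg_boundary - dist[neg_mask], dim=0)
    return F.softplus(pos_term) + F.softplus(neg_term)

# loss = global_structured_loss(
#     F.normalize(torch.randn(16, 128), dim=-1), torch.randint(0, 4, (16,)))
```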


2021 ◽  
Author(s):  
Zhongwei Xie ◽  
Ling Liu ◽  
Lin Li ◽  
Luo Zhong

2017 ◽  
Vol 25 (10) ◽  
pp. 1942-1955 ◽  
Author(s):  
Yanmin Qian ◽  
Nanxin Chen ◽  
Heinrich Dinkel ◽  
Zhizheng Wu
