Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching

Author(s): Xing Xu, Yifan Wang, Yixuan He, Yang Yang, Alan Hanjalic, et al.

Image-sentence matching is a challenging task at the intersection of language and vision, which aims at measuring the similarity between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space to calculate image-sentence similarity. However, the similarity obtained by these methods may be coarse because (1) an intermediate common space is introduced to implicitly match the heterogeneous features of images and sentences at a global level, and (2) only the inter-modality relations between images and sentences are captured, while the intra-modality relations are ignored. To overcome these limitations, we propose a novel Cross-Modal Hybrid Feature Fusion (CMHF) framework that directly learns image-sentence similarity by fusing multimodal features with both inter- and intra-modality relations incorporated. It can robustly capture the high-level interactions between visual regions in images and words in sentences, using flexible attention mechanisms to generate effective attention flows within and across the two modalities. A structured objective with a ranking loss constraint is formulated in CMHF to learn image-sentence similarity from the fused fine-grained features of the different modalities, bypassing the use of an intermediate common space. Extensive experiments and comprehensive analysis on two widely used datasets, Microsoft COCO and Flickr30K, show the effectiveness of the hybrid feature fusion framework in CMHF, which achieves state-of-the-art matching performance.
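As a rough illustration of the structured ranking objective described above, the sketch below implements a standard bidirectional triplet ranking loss over a matrix of directly predicted image-sentence similarity scores; the function name, the margin value, and the batch layout are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a bidirectional triplet ranking loss over directly
# predicted image-sentence similarities (no intermediate common space).
import torch

def ranking_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """sim: (B, B) matrix where sim[i, j] is the fused-feature similarity
    between image i and sentence j; diagonal entries are matched pairs."""
    pos = sim.diag().view(-1, 1)                     # (B, 1) matched scores
    cost_s = (margin + sim - pos).clamp(min=0)       # image -> sentence
    cost_im = (margin + sim - pos.t()).clamp(min=0)  # sentence -> image
    mask = torch.eye(sim.size(0), dtype=torch.bool)  # ignore matched pairs
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    return cost_s.sum() + cost_im.sum()

# Example: a random similarity matrix for a batch of 4 image-sentence pairs.
loss = ranking_loss(torch.randn(4, 4))
```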

2020, Vol. 10 (13), pp. 4652
Author(s): Fangxiong Chen, Guoheng Huang, Jiaying Lan, Yanhui Wu, Chi-Man Pun, et al.

The fine-grained image classification task differentiates between visually similar object subcategories, and is difficult because of large intra-class variance and small inter-class variance. For this reason, improving accuracy on the task has relied heavily on annotations of discriminative and regional parts, and this dependence on delicate annotations restricts the practicability of models. To tackle this issue, this article proposes a weakly supervised fine-grained image classification model built around a saliency module. Through the salient region localization module, the proposed model can localize essential regional parts using saliency maps while only image-level class annotations are provided. In addition, a bilinear attention module improves feature extraction by using higher- and lower-level layers of the network to fuse regional features with global features. Building on this bilinear attention architecture, we propose a different-layer feature fusion module to improve the expressive ability of model features. We tested and verified our model on public datasets released specifically for fine-grained image classification. The results show that our model achieves close to state-of-the-art classification performance on various datasets while requiring the least training annotation. This indicates that the practicality of our model is greatly improved, since fine-grained image annotations are expensive to obtain.
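To make the higher/lower-layer fusion idea concrete, here is a minimal sketch of classic bilinear pooling between a lower-level and a higher-level feature map; the dimensions and the signed-square-root normalization are common conventions for bilinear features, not necessarily the exact module the authors use.

```python
# Hedged sketch of bilinear fusion between two feature maps of the same
# spatial size, e.g. a lower and a higher convolutional layer.
import torch
import torch.nn.functional as F

def bilinear_fusion(low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
    """low:  (B, C1, H, W) lower-layer features
       high: (B, C2, H, W) higher-layer features
       Returns a (B, C1*C2) bilinear-pooled descriptor."""
    B, C1, H, W = low.shape
    C2 = high.shape[1]
    low = low.view(B, C1, H * W)
    high = high.view(B, C2, H * W)
    # Outer product pooled over spatial locations (classic bilinear pooling).
    fused = torch.bmm(low, high.transpose(1, 2)) / (H * W)  # (B, C1, C2)
    fused = fused.view(B, C1 * C2)
    # Signed square root and L2 normalization, standard for bilinear features.
    fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-12)
    return F.normalize(fused, dim=1)
```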


Algorithms, 2020, Vol. 13 (5), pp. 111
Author(s): Shaojun Wu, Ling Gao

In person re-identification, extracting image features is an important step in retrieving pedestrian images. Most current methods extract only global features or only local features of pedestrian images, so inconspicuous details are easily overlooked when learning image features, which is neither efficient nor robust for scenarios with large appearance differences. In this paper, we propose a Multi-level Feature Fusion model that combines both global and local features of images through deep learning networks to generate more discriminative pedestrian descriptors. Specifically, we extract local features from different depths of the network with the Part-based Multi-level Net to fuse low-to-high-level local features of pedestrian images, and use Global-Local Branches to extract the local and global features at the highest level. Experiments show that our deep learning model based on multi-level feature fusion works well in person re-identification: the overall results outperform the state of the art by considerable margins on three widely used datasets. For instance, we achieve 96% Rank-1 accuracy on the Market-1501 dataset and 76.1% mAP on the DukeMTMC-reID dataset, outperforming existing works by a large margin (more than 6%).
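A minimal sketch of this general pattern, fusing a global descriptor with part-level descriptors, follows, assuming a ResNet-50 backbone and a simple horizontal-stripe partition (PCB-style); the class name and partition rule are hypothetical, not the paper's implementation.

```python
# Illustrative global + local descriptor fusion for re-identification.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiLevelFusion(nn.Module):
    def __init__(self, num_parts: int = 3):
        super().__init__()
        backbone = resnet50(weights=None)
        # Drop the classifier head; keep the convolutional feature extractor.
        self.stem = nn.Sequential(*list(backbone.children())[:-2])
        self.num_parts = num_parts
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        fmap = self.stem(x)                      # (B, 2048, h, w)
        global_feat = self.gap(fmap).flatten(1)  # (B, 2048) global descriptor
        # Horizontal stripes as a simple part partition.
        stripes = fmap.chunk(self.num_parts, dim=2)
        local_feats = [self.gap(s).flatten(1) for s in stripes]
        # Concatenated global + local pedestrian descriptor.
        return torch.cat([global_feat] + local_feats, dim=1)

feat = MultiLevelFusion()(torch.randn(2, 3, 256, 128))  # (2, 2048 * 4)
```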


2019, Vol. 9 (9), pp. 1939
Author(s): Yadong Yang, Xiaofeng Wang, Quan Zhao, Tingting Sui

The focus of fine-grained image classification tasks is to ignore interfering information and grasp discriminative local features, which is exactly what visual attention mechanisms excel at. First, we construct a two-level attention convolutional network that characterizes object-level attention and pixel-level attention, and we combine the two kinds of attention through a second-order response transform algorithm. Furthermore, we propose a clustering-based grouping attention model, which implements part-level attention. The grouping attention method stretches all the semantic features in a deeper convolutional layer of the network into vectors; these vectors are clustered by dot-product similarity, and each cluster represents a particular semantic. The grouping attention algorithm implements the functions of group convolution and feature clustering, which greatly reduces the network parameters and improves the recognition rate and interpretability of the network. Finally, the low-level visual features and high-level semantic information are merged by a multi-level feature fusion method to accurately classify fine-grained images. We achieve good results without using pre-trained networks or fine-tuning techniques.
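The grouping step can be illustrated with a small sketch: each channel of a deep feature map is stretched into a vector over spatial positions, channels are assigned to groups by dot-product similarity, and group means serve as the grouped semantic maps. The anchor-selection rule below is a stand-in, not the authors' clustering procedure.

```python
# Toy channel-grouping by dot-product similarity over spatial vectors.
import torch

def group_channels(fmap: torch.Tensor, num_groups: int = 4) -> torch.Tensor:
    """fmap: (C, H, W) -> (num_groups, H, W) grouped semantic maps."""
    C, H, W = fmap.shape
    vecs = fmap.view(C, -1)                         # stretch channels to vectors
    anchors = vecs[torch.randperm(C)[:num_groups]]  # illustrative anchor choice
    sim = vecs @ anchors.t()                        # dot-product similarity (C, G)
    assign = sim.argmax(dim=1)                      # each channel joins one group
    groups = [vecs[assign == g].mean(dim=0) if (assign == g).any()
              else torch.zeros(H * W) for g in range(num_groups)]
    return torch.stack(groups).view(num_groups, H, W)

out = group_channels(torch.randn(64, 7, 7))
```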


Symmetry, 2021, Vol. 13 (10), pp. 1838
Author(s): Chih-Wei Lin, Mengxiang Lin, Jinfu Liu

Classifying fine-grained categories (e.g., bird species, car types, and aircraft types) is a crucial problem in image understanding and is difficult due to large intra-class and small inter-class variance. Most existing fine-grained approaches utilize various parts and local information of objects individually to improve classification accuracy, but neglect the mechanism of feature fusion between the object (global) and the object's parts (local) that could reinforce fine-grained features. In this paper, we present a novel framework, the object–part registration–fusion Net (OR-Net), which applies a registration and fusion mechanism between an object's global features and its parts' local features for fine-grained classification. Our model learns fine-grained features from the global and local regions of the object and fuses these features with the registration mechanism to reinforce each region's characteristics in the feature maps. Precisely, OR-Net consists of (1) a multi-stream feature extraction net, which generates features for the global and various local regions of objects, and (2) a registration–fusion feature module, which calculates the dimension and location relationships between the global (object) and local (part) regions to generate registration information, and fuses the local features into the global features using this registration information to generate the fine-grained feature. Experiments executed on symmetric GPU devices with symmetric mini-batches verify that OR-Net surpasses state-of-the-art approaches on the CUB-200-2011 (Birds), Stanford-Cars, and Stanford-Aircraft datasets.
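One plausible reading of the registration–fusion step is sketched below: a local part feature map is resized to the part's estimated location in the global feature map and added there, so the part reinforces its own region. The box coordinates and the additive fusion rule are illustrative assumptions, not OR-Net's exact formulation.

```python
# Hedged sketch: register a local (part) feature map to its location in the
# global feature map and fuse it additively to reinforce that region.
import torch
import torch.nn.functional as F

def register_and_fuse(global_fmap, local_fmap, box):
    """global_fmap: (B, C, H, W); local_fmap: (B, C, h, w);
    box: (y0, y1, x0, x1) location of the part in the global map."""
    y0, y1, x0, x1 = box
    resized = F.interpolate(local_fmap, size=(y1 - y0, x1 - x0),
                            mode="bilinear", align_corners=False)
    fused = global_fmap.clone()
    fused[:, :, y0:y1, x0:x1] += resized  # reinforce the registered region
    return fused

fused = register_and_fuse(torch.randn(1, 256, 14, 14),
                          torch.randn(1, 256, 6, 6), (2, 9, 3, 10))
```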


2021, Vol. 11 (3), pp. 1064
Author(s): Jenq-Haur Wang, Yen-Tsang Wu, Long Wang

In social networks, users can easily share information and express their opinions, but given the huge amount of data posted by many users, it is difficult to search for relevant information. Beyond individual posts, it would be useful to recommend groups of people with similar interests. Past studies on user preference learning have focused on single-modal features such as review contents or demographic information, which are usually not easy to obtain in most social media without explicit user feedback. In this paper, we propose a multimodal feature fusion approach to implicit user preference prediction that combines text and image features from user posts for recommending similar users in social media. First, we use a convolutional neural network (CNN) and a TextCNN model to extract image and text features, respectively. Then, these features are combined using early and late fusion methods as a representation of user preferences. Lastly, a list of users with the most similar preferences is recommended. Experimental results on real-world Instagram data show that the best performance is achieved with late fusion of the individual classification results for images and texts, with a best average top-k accuracy of 0.491. This validates the effectiveness of deep learning methods for fusing multimodal features to represent social user preferences. Further investigation is needed to verify the performance in different types of social media.
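A minimal sketch of the late-fusion scheme follows: the class-probability outputs of an image model and a text model are averaged into a fused preference vector, and users are ranked by cosine similarity to the query user. Both models, the equal weighting, and the similarity measure are placeholders, not the paper's exact configuration.

```python
# Late fusion of per-modality classifier outputs for user recommendation.
import torch
import torch.nn.functional as F

def late_fusion(img_logits: torch.Tensor, txt_logits: torch.Tensor,
                w: float = 0.5) -> torch.Tensor:
    """Weighted average of per-modality class probabilities."""
    return (w * F.softmax(img_logits, dim=-1)
            + (1 - w) * F.softmax(txt_logits, dim=-1))

def top_k_similar_users(prefs: torch.Tensor, query: int, k: int = 5):
    """prefs: (num_users, num_classes) fused preference vectors."""
    sims = F.cosine_similarity(prefs[query].unsqueeze(0), prefs, dim=-1)
    sims[query] = -1.0  # exclude the query user from the ranking
    return sims.topk(k).indices

prefs = late_fusion(torch.randn(100, 10), torch.randn(100, 10))
print(top_k_similar_users(prefs, query=0))
```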


Semantic Web, 2020, pp. 1-16
Author(s): Francesco Beretta

This paper addresses the interoperability of data generated by historical research and heritage institutions, in order to make them reusable for new research agendas according to the FAIR principles. After introducing the symogih.org project's ontology, it describes the essential aspects of the process of historical knowledge production. It then develops an epistemological and semantic analysis of conceptual data modelling applied to factual historical information, based on the foundational ontologies Constructive Descriptions and Situations and DOLCE, and discusses the reasons for adopting the CIDOC CRM as a core ontology for the field of historical research while extending it with some relevant, missing high-level classes. Finally, it shows how collaborative data modelling carried out in the ontology management environment OntoME makes it possible to elaborate a communal, fine-grained, and adaptive ontology of the domain, provided that an active research community engages in the process. With this in mind, the Data for History consortium was founded in 2017; it promotes the adoption of a shared conceptualization in the field of historical research.
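For readers unfamiliar with CIDOC CRM, the following sketch shows how a simple historical fact (a birth event linked to a person and a place) can be expressed in RDF with rdflib; the entity URIs are invented examples, and only the crm: class and property names come from the CIDOC CRM vocabulary itself.

```python
# Expressing a historical event with CIDOC CRM classes in RDF via rdflib.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/history/")  # invented local namespace

g = Graph()
g.bind("crm", CRM)
# An E67 Birth event brings a person (E21) into life at a place (E53).
g.add((EX.birth1, RDF.type, CRM.E67_Birth))
g.add((EX.birth1, CRM.P98_brought_into_life, EX.person1))
g.add((EX.person1, RDF.type, CRM.E21_Person))
g.add((EX.birth1, CRM.P7_took_place_at, EX.place1))
g.add((EX.place1, RDF.type, CRM.E53_Place))

print(g.serialize(format="turtle"))
```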


Author(s): Irfan Uddin

The microthreaded many-core architecture comprises multiple clusters of fine-grained multi-threaded cores. The management of concurrency is supported in the instruction set architecture of the cores, and the computational work of an application is asynchronously delegated to different clusters of cores, where clusters are allocated dynamically. Computer architects are always interested in analyzing the complex interactions among these dynamically allocated resources. Generally, a detailed, cycle-accurate simulation of the execution is used. However, the cycle-accurate simulator for the microthreaded architecture executes at a rate of about 100,000 instructions per second, divided over the number of simulated cores, which means that evaluating a complex application executing on a contemporary multi-core machine can be very slow. To perform efficient design space exploration, we present a co-simulation environment in which the detailed execution of instructions in the pipelines of microthreaded cores and the interactions among hardware components are abstracted. We evaluate this high-level simulation framework against the cycle-accurate simulation framework. The results show that the high-level simulator is faster and less complicated than the cycle-accurate simulator, at the cost of some accuracy.
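The quoted simulation rate makes the cost easy to estimate: at roughly 100,000 simulated instructions per second in total, shared by all simulated cores, wall-clock time scales directly with the instruction count. A back-of-the-envelope sketch, with a made-up workload size:

```python
# Estimate cycle-accurate simulation wall time from the quoted throughput.
def sim_wall_time_hours(total_instructions: float,
                        rate_per_sec: float = 100_000) -> float:
    # Total simulator throughput is ~rate_per_sec instructions/second,
    # divided over the simulated cores, so wall time depends only on the
    # total instruction count regardless of the core count.
    return total_instructions / rate_per_sec / 3600

# A hypothetical 10-billion-instruction application:
print(f"{sim_wall_time_hours(10e9):.1f} hours")  # ~27.8 hours
```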


2021
Author(s): Roger's Bacon, Sergey Samsonau, Dario Krpan

What is it about a good story that causes it to have life-changing effects on one person and not another? I wonder if future technologies will enable us to develop the kind of truly deep and fine-grained understanding of stories, as social, cognitive, and emotional technologies, that might allow us to answer this question with a high level of precision.

