Step-Wise Hierarchical Alignment Network for Image-Text Matching

Author(s):  
Zhong Ji ◽  
Kexin Chen ◽  
Haoran Wang

Image-text matching plays a central role in bridging the semantic gap between vision and language. The key to achieving precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover the visual-semantic interactions, which lacks the ability to exploit multi-level information to locate the hierarchical fine-grained relevance. Different from them, in this work, we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues to understand the hierarchical correlations between image and text. Experimental results on two benchmark datasets demonstrate the superiority of the proposed method.
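The abstract gives no implementation detail; as a hedged illustration, the fragment-level local-to-local step can be sketched as matching each image region to its best word by cosine similarity (the toy vectors and max-then-mean aggregation are assumptions, not the authors' SHAN code):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def local_to_local_alignment(regions, words):
    # For each image region, take its best-matching word, then
    # average those similarities into one fragment-level score.
    best = [max(cosine(r, w) for w in words) for r in regions]
    return sum(best) / len(best)

regions = [[1.0, 0.0], [0.0, 1.0]]   # toy region features
words = [[1.0, 0.0], [0.7, 0.7]]     # toy word features
score = local_to_local_alignment(regions, words)
```

Context-level global-to-local and global-to-global steps would then repeat the matching with pooled image or sentence vectors in place of the fragments.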

Author(s):  
Wenzhe Wang ◽  
Mengdan Zhang ◽  
Runnan Chen ◽  
Guanyu Cai ◽  
Penghao Zhou ◽  
...  

Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods exploit multi-modal cues by aggregating them into holistic high-level semantics for matching with text representations in a global view. In contrast to this global alignment, the local alignment of detailed semantics encoded within both multi-modal cues and distinct phrases has not yet been well explored. Thus, in this paper, we leverage hierarchical video-text alignment to fully explore the detailed, diverse characteristics in multi-modal cues for fine-grained alignment with local semantics from phrases, as well as to capture a high-level semantic correspondence. Specifically, multi-step attention is learned for progressively comprehensive local alignment, and a holistic transformer is utilized to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.


2020 ◽  
Vol 34 (07) ◽  
pp. 11109-11116 ◽  
Author(s):  
Pin Jiang ◽  
Yahong Han

The dominant video question answering methods are based on fine-grained representations or model-specific attention mechanisms. They usually process video and question separately, then feed the representations of the different modalities into subsequent late-fusion networks. Although these methods use information from one modality to boost the other, they neglect to integrate correlations of both inter- and intra-modality in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture in four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and interacted simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct an extensive ablation study on the effectiveness of the network architecture. Experiments show that the network achieves superior quality.
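As a rough, assumed sketch of how inter-modality edges of such a heterogeneous graph might be formed, one can connect shot and word nodes whose features are similar enough (the toy features, dot-product scoring, and threshold are illustrative, not the paper's actual construction):

```python
def build_cross_modal_edges(shots, words, threshold=0.5):
    # Connect every (video-shot, question-word) pair whose dot
    # product exceeds a threshold, yielding the inter-modality
    # edges of a heterogeneous graph.
    edges = []
    for i, s in enumerate(shots):
        for j, w in enumerate(words):
            sim = sum(a * b for a, b in zip(s, w))
            if sim > threshold:
                edges.append((i, j, sim))
    return edges

shots = [[0.9, 0.1], [0.2, 0.8]]   # toy shot features
words = [[1.0, 0.0], [0.0, 1.0]]   # toy word features
edges = build_cross_modal_edges(shots, words)
```

Intra-modality edges (shot-shot, word-word) would be built the same way, after which message passing over the combined graph lets both correlation types interact in one module.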


2021 ◽  
Vol 2050 (1) ◽  
pp. 012006
Author(s):  
Xili Dai ◽  
Chunmei Ma ◽  
Jingwei Sun ◽  
Tao Zhang ◽  
Haigang Gong ◽  
...  

Abstract Training deep neural networks from only a few examples is an interesting topic that has motivated few-shot learning. In this paper, we study the fine-grained image classification problem in a challenging few-shot learning setting and propose the Self-Amplificated Network (SAN), a meta-learning-based method to tackle this problem. The SAN model consists of three parts: the Encoder, Amplification, and Similarity Modules. The Encoder Module encodes a fine-grained image input into a feature vector. The Amplification Module amplifies subtle differences between fine-grained images based on a self-attention mechanism composed of multi-head attention. The Similarity Module measures how similar the query image and the support set are in order to determine the classification result. In-depth experiments on three benchmark datasets show that our network achieves superior performance over the competing baselines.
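The multi-head details are not given in the abstract; a minimal single-head scaled dot-product self-attention, the building block such an Amplification Module would stack, can be sketched in plain Python (the toy feature matrix is an assumption):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    # Single-head scaled dot-product attention with Q = K = V = X.
    # Each output row is a convex combination of the input rows,
    # weighted by similarity between feature vectors.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, X))
                    for i in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy feature rows
Y = self_attention(X)
```

A multi-head version runs several such heads on learned projections of X and concatenates the results; learned query/key/value projections are omitted here for brevity.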


2022 ◽  
Vol 22 (3) ◽  
pp. 1-21
Author(s):  
Prayag Tiwari ◽  
Amit Kumar Jaiswal ◽  
Sahil Garg ◽  
Ilsun You

Self-attention mechanisms have recently been embraced for a broad range of text-matching applications. A self-attention model takes only one sentence as input with no extra information; one can utilize the final hidden state or pooling. However, text-matching problems can be interpreted in either symmetrical or asymmetrical scopes. For instance, paraphrase detection is a symmetrical task, while textual entailment classification and question-answer matching are considered asymmetrical tasks. In this article, we leverage attractive properties of the self-attention mechanism and propose an attention-based network that incorporates three key components for inter-sequence attention: global pointwise features, preceding attentive features, and contextual features, while updating the rest of the components. We evaluate our model on two benchmark datasets covering the tasks of textual entailment and question-answer matching. The proposed efficient Self-attention-driven Network for Text Matching outperforms the state of the art on the Stanford Natural Language Inference and WikiQA datasets with much fewer parameters.


Author(s):  
Cunxiao Du ◽  
Zhaozheng Chen ◽  
Fuli Feng ◽  
Lei Zhu ◽  
Tian Gan ◽  
...  

Text classification is one of the fundamental tasks in natural language processing. Recently, deep neural networks have achieved promising performance on the text classification task compared to shallow models. Despite the significance of deep models, they ignore fine-grained classification clues (matching signals between words and classes), since their classifications mainly rely on text-level representations. To address this problem, we introduce an interaction mechanism to incorporate word-level matching signals into the text classification task. In particular, we design a novel framework, the EXplicit interAction Model (dubbed EXAM), equipped with the interaction mechanism. We justify the proposed approach on several benchmark datasets, including both multi-label and multi-class text classification tasks. Extensive experimental results demonstrate the superiority of the proposed method. As a byproduct, we have released the code and parameter settings to facilitate further research.
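A hedged sketch of word-level matching signals, the kind of interaction EXAM introduces: score every word against every class representation, then aggregate per class (the dot-product scoring, mean aggregation, and toy vectors here are illustrative assumptions, not the released code):

```python
def interaction_scores(word_vecs, class_vecs):
    # Word-level matching signals: dot product between each word
    # representation and each class representation, then per-class
    # mean aggregation over words into text-level logits.
    logits = []
    for c in class_vecs:
        sims = [sum(a * b for a, b in zip(w, c)) for w in word_vecs]
        logits.append(sum(sims) / len(sims))
    return logits

words = [[1.0, 0.0], [0.8, 0.2]]    # toy word representations
classes = [[1.0, 0.0], [0.0, 1.0]]  # toy class representations
logits = interaction_scores(words, classes)
```

The contrast with text-level methods is that classification here sees a full word-class interaction matrix before any pooling, rather than a single pooled text vector.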


Author(s):  
Yi Tay ◽  
Anh Tuan Luu ◽  
Siu Cheung Hui

Co-Attentions are highly effective attention mechanisms for text matching applications. Co-Attention enables the learning of pairwise attentions, i.e., learning to attend based on computing word-level affinity scores between two documents. However, text matching problems can exist in either symmetrical or asymmetrical domains. For example, paraphrase identification is a symmetrical task while question-answer matching and entailment classification are considered asymmetrical domains. In this paper, we argue that Co-Attention models in asymmetrical domains require different treatment as opposed to symmetrical domains, i.e., a concept of word-level directionality should be incorporated while learning word-level similarity scores. Hence, the standard inner product in real space commonly adopted in co-attention is not suitable. This paper leverages attractive properties of the complex vector space and proposes a co-attention mechanism based on the complex-valued inner product (Hermitian products). Unlike the real dot product, the dot product in complex space is asymmetric because the first item is conjugated. Aside from modeling and encoding directionality, our proposed approach also enhances the representation learning process. Extensive experiments on five text matching benchmark datasets demonstrate the effectiveness of our approach. 
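The asymmetry of the Hermitian product is easy to verify with Python's built-in complex type; the vectors below are arbitrary toy values:

```python
def hermitian_product(u, v):
    # <u, v> = sum(conj(u_i) * v_i). Conjugating the first argument
    # makes the product asymmetric: swapping the arguments yields
    # the complex conjugate, not the same value.
    return sum(a.conjugate() * b for a, b in zip(u, v))

u = [1 + 2j, 0.5 - 1j]  # toy complex word vectors
v = [2 - 1j, 1 + 1j]
ab = hermitian_product(u, v)  # attend u -> v
ba = hermitian_product(v, u)  # attend v -> u
```

This is exactly the directionality property the paper exploits: unlike the real dot product, `ab` and `ba` differ, so word-level affinity scores can depend on which document plays which role.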


Author(s):  
Xu Duan ◽  
Jingzheng Wu ◽  
Shouling Ji ◽  
Zhiqing Rui ◽  
Tianyue Luo ◽  
...  

With the explosive development of information technology, vulnerabilities have become one of the major threats to computer security. Most vulnerabilities with similar patterns can be detected effectively by static analysis methods. However, some vulnerable and non-vulnerable code is hardly distinguishable, resulting in low detection accuracy. In this paper, we define the accurate identification of vulnerabilities in similar code as a fine-grained vulnerability detection problem, and propose VulSniper, which is designed to detect fine-grained vulnerabilities more effectively. In VulSniper, an attention mechanism is used to capture the critical features of the vulnerabilities. In particular, we use bottom-up and top-down structures to learn the attention weights of different areas of the program. Moreover, to fully extract the semantic features of the program, we generate the code property graph, design a 144-dimensional vector to describe the relations between the nodes, and finally encode the program as a feature tensor. VulSniper achieves F1-scores of 80.6% and 73.3% on the two benchmark datasets, the SARD Buffer Error dataset and the SARD Resource Management Error dataset, respectively, which are significantly higher than those of the state-of-the-art methods.


2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Hongwei Zhao ◽  
Danyang Zhang ◽  
Jiaxin Wu ◽  
Pingping Liu

Fine-grained retrieval is one of the complex problems in computer vision. Compared with general content-based image retrieval, fine-grained image retrieval faces more difficult challenges. In fine-grained image retrieval tasks, all classes belong to subclasses of one meta-class, so there is small interclass variance and large intraclass variance. To solve this problem, in this paper we propose a fine-grained retrieval method that improves the loss and feature aggregation, achieving better retrieval results under a unified framework. First, we propose a novel multi-proxies adaptive distribution loss that better characterizes the intraclass variations and the degree of dispersion of each cluster center. Second, we propose a weakly supervised feature aggregation method based on channel weighting, which distinguishes the importance of different feature channels to obtain more representative image feature descriptors. We verify the performance of the proposed method on standard benchmark datasets such as CUB200-2011 and Stanford Dogs. Higher Recall@K demonstrates the advantage of our method over the state of the art.
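Recall@K, the metric reported here, can be computed per query as follows (the labels are illustrative toy data; dataset-level Recall@K averages this indicator over all queries):

```python
def recall_at_k(ranked_labels, query_label, k):
    # Recall@K as commonly reported in fine-grained retrieval:
    # 1 if at least one of the top-k retrieved items shares the
    # query image's class, else 0.
    return 1 if query_label in ranked_labels[:k] else 0

# Toy retrieval list: labels of gallery images sorted by
# descending similarity to the query.
ranked = ["sparrow", "finch", "sparrow", "crow"]
r1 = recall_at_k(ranked, "finch", 1)  # top-1 miss
r2 = recall_at_k(ranked, "finch", 2)  # top-2 hit
```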


2021 ◽  
Vol 13 (17) ◽  
pp. 3484
Author(s):  
Jie Wan ◽  
Zhong Xie ◽  
Yongyang Xu ◽  
Ziyin Zeng ◽  
Ding Yuan ◽  
...  

Feature extraction on point clouds is an essential task when analyzing and processing point clouds of 3D scenes. However, it remains a challenge to adequately exploit local fine-grained features on point cloud data due to its irregular and unordered structure in 3D space. To alleviate this problem, a Dilated Graph Attention-based Network (DGANet) with enhanced feature learning ability is proposed. Specifically, we first build a local dilated graph-like region for each input point to establish long-range spatial correlations with its corresponding neighbors, which allows the network to access a wider range of geometric information of local points along with their long-range dependencies. Moreover, by integrating the dilated graph attention module (DGAM), implemented by a novel offset-attention mechanism, the network weights each edge of the constructed local graph differently, learning the discrepancy in geometric attributes between the connected point pairs. Finally, all the learned edge attention features are aggregated by graph-attention pooling into the most significant geometric feature representation of local regions, fully extracting local detailed features for each point. Validation experiments on two challenging benchmark datasets demonstrate the effectiveness and strong generalization ability of the proposed DGANet in both 3D object classification and segmentation tasks.
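A hedged sketch of dilated neighbor selection, the idea behind the local dilated graph-like region: sort candidate neighbors by distance and keep every d-th one, widening the receptive field without growing the neighbor count (this sorting-and-striding scheme is one common reading, not necessarily DGANet's exact construction):

```python
import math

def dilated_knn(points, idx, k, dilation):
    # Dilated neighborhood: sort all other points by Euclidean
    # distance to points[idx], then keep every `dilation`-th one,
    # so the same k neighbors span a wider spatial range.
    p = points[idx]
    others = [(math.dist(p, q), j)
              for j, q in enumerate(points) if j != idx]
    others.sort()
    return [j for _, j in others[::dilation][:k]]

# Toy 2D point cloud on a line (real point clouds are 3D).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
plain = dilated_knn(points, 0, k=2, dilation=1)    # two nearest
dilated = dilated_knn(points, 0, k=2, dilation=2)  # every 2nd
```

With dilation 2, the selected neighbors reach farther from the center point, which is what lets the graph edges encode long-range dependencies.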


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Guangyao Pang ◽  
Keda Lu ◽  
Xiaoying Zhu ◽  
Jie He ◽  
Zhiyi Mo ◽  
...  

With the rapid development of Internet social platforms, buyer shows (such as comment text) have become an important basis for consumers to understand products and make purchase decisions. Early sentiment analysis methods were mainly text-level and sentence-level, assuming that a text carries only one sentiment. This assumption obscures details and makes it difficult to fully reflect people's fine-grained, comprehensive sentiments, which can lead to wrong decisions. Aspect-level sentiment analysis, by contrast, obtains a more comprehensive sentiment classification by mining the sentiment tendencies of different aspects in the comment text. However, existing aspect-level sentiment analysis methods mainly focus on attention mechanisms and recurrent neural networks; they lack sensitivity to the position of aspect words and tend to ignore long-term dependencies. To solve this problem, on the basis of Bidirectional Encoder Representations from Transformers (BERT), this paper proposes an effective aspect-level sentiment analysis approach (ALM-BERT) by constructing an aspect feature location model. Specifically, we first use the pretrained BERT model to mine more aspect-level auxiliary information from the comment context. Second, to learn the expression features of aspect words and the interactive information of the aspect words' context, we construct an aspect-based sentiment feature extraction method. Finally, we conduct evaluation experiments on three benchmark datasets. The experimental results show that the aspect-level sentiment analysis performance of the proposed ALM-BERT approach is significantly better than that of the comparison methods.
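The abstract does not specify the aspect feature location model; one common, assumed form of position awareness is a linear distance decay around the aspect word (the decay rate, tokens, and weighting scheme below are illustrative, not ALM-BERT's actual model):

```python
def position_weights(tokens, aspect_index, decay=0.1):
    # Weight each context token by its distance to the aspect word:
    # the aspect itself gets weight 1.0, and weights fall off
    # linearly with distance, floored at 0.
    return [max(0.0, 1.0 - decay * abs(i - aspect_index))
            for i in range(len(tokens))]

tokens = ["the", "battery", "life", "is", "great"]
w = position_weights(tokens, aspect_index=1)  # aspect: "battery"
```

Such weights would scale token representations before attention, making the model more sensitive to context near the aspect word than to distant words.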

