scholarly journals Multi-View Attention Network for Visual Dialog

2021 ◽  
Vol 11 (7) ◽  
pp. 3009
Author(s):  
Sungjin Park ◽  
Taesun Whang ◽  
Yeochan Yoon ◽  
Heuiseok Lim

Visual dialog is a challenging vision-language task in which a series of questions visually grounded by a given image are answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question, dialog history, and image) is required. Specifically, it is necessary for an agent to (1) determine the semantic intent of question and (2) align question-relevant textual and visual contents among heterogeneous modality inputs. In this paper, we propose Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs based on attention mechanisms. MVAN effectively captures the question-relevant information from the dialog history with two complementary modules (i.e., Topic Aggregation and Context Matching), and builds multimodal representations through sequential alignment processes (i.e., Modality Alignment). Experimental results on VisDial v1.0 dataset show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods under both single model and ensemble settings.

Electronics ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 1882
Author(s):  
Cheng Yang ◽  
Weijia Wu ◽  
Yuxing Wang ◽  
Hong Zhou

Visual question answering (VQA) requires a high-level understanding of both questions and images, along with visual reasoning to predict the correct answer. Therefore, it is important to design an effective attention model to associate key regions in an image with key words in a question. Up to now, most attention-based approaches only model the relationships between individual regions in an image and words in a question. It is not enough to predict the correct answer for VQA, as human beings always think in terms of global information, not only local information. In this paper, we propose a novel multi-modality global fusion attention network (MGFAN) consisting of stacked global fusion attention (GFA) blocks, which can capture information from global perspectives. Our proposed method computes co-attention and self-attention at the same time, rather than computing them individually. We validate our proposed method on the two most commonly used benchmarks, the VQA-v2 datasets. Experimental results show that the proposed method outperforms the previous state-of-the-art. Our best single model achieves 70.67% accuracy on the test-dev set of VQA-v2.


2020 ◽  
Vol 29 (15) ◽  
pp. 2050250
Author(s):  
Xiongfei Liu ◽  
Bengao Li ◽  
Xin Chen ◽  
Haiyan Zhang ◽  
Shu Zhan

This paper proposes a novel method for person image generation with arbitrary target pose. Given a person image and an arbitrary target pose, our proposed model can synthesize images with the same person but different poses. The Generative Adversarial Networks (GANs) are the major part of the proposed model. Different from the traditional GANs, we add attention mechanism to the generator in order to generate realistic-looking images, we also use content reconstruction with a pretrained VGG16 Net to keep the content consistency between generated images and target images. Furthermore, we test our model on DeepFashion and Market-1501 datasets. The experimental results show that the proposed network performs favorably against state-of-the-art methods.


2021 ◽  
Author(s):  
Lu Ren ◽  
Hongfei Lin ◽  
Bo Xu ◽  
Shaowu Zhang ◽  
Liang Yang ◽  
...  

BACKGROUND As a common mental disease, depression seriously affects people’s physical and mental health. According to the statistics of the World Health Organization, depression is one of the main reasons for suicide and self-harm events in the world. Therefore, strengthening depression detection can effectively reduce the occurrence of suicide or self-harm events so as to save more people and families. With the development of computer technology, some researchers are trying to apply natural language processing techniques to detect people who are depressed automatically. Many existing feature engineering methods for depression detection are based on emotional characteristics, but these methods do not consider high-level emotional semantic information. The current deep learning methods for depression detection cannot accurately extract effective emotional semantic information. OBJECTIVE In this paper, we propose an emotion-based attention network, including a semantic understanding network and an emotion understanding network, which can capture the high-level emotional semantic information effectively to improve the depression detection task. METHODS The semantic understanding network module is used to capture the contextual semantic information. The emotion understanding network module is used to capture the emotional semantic information. There are two units in the emotion understanding network module, including a positive emotion understanding unit and a negative emotion understanding unit, which are used to capture the positive emotional information and the negative emotional information, respectively. We further proposed a dynamic fusion strategy in the emotion understanding network module to fuse the positive emotional information and the negative emotional information. RESULTS We evaluated our method on the Reddit data set. The experimental results showed that the proposed emotion-based attention network model achieved an accuracy, precision, recall, and F-measure of 91.30%, 91.91%, 96.15%, and 93.98%, respectively, which are comparable results compared with state-of-the-art methods. CONCLUSIONS The experimental results showed that our model is competitive with the state-of-the-art models. The semantic understanding network module, the emotion understanding network module, and the dynamic fusion strategy are effective modules for depression detection. In addition, the experimental results verified that the emotional semantic information was effective in depression detection.


10.2196/28754 ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. e28754
Author(s):  
Lu Ren ◽  
Hongfei Lin ◽  
Bo Xu ◽  
Shaowu Zhang ◽  
Liang Yang ◽  
...  

Background As a common mental disease, depression seriously affects people’s physical and mental health. According to the statistics of the World Health Organization, depression is one of the main reasons for suicide and self-harm events in the world. Therefore, strengthening depression detection can effectively reduce the occurrence of suicide or self-harm events so as to save more people and families. With the development of computer technology, some researchers are trying to apply natural language processing techniques to detect people who are depressed automatically. Many existing feature engineering methods for depression detection are based on emotional characteristics, but these methods do not consider high-level emotional semantic information. The current deep learning methods for depression detection cannot accurately extract effective emotional semantic information. Objective In this paper, we propose an emotion-based attention network, including a semantic understanding network and an emotion understanding network, which can capture the high-level emotional semantic information effectively to improve the depression detection task. Methods The semantic understanding network module is used to capture the contextual semantic information. The emotion understanding network module is used to capture the emotional semantic information. There are two units in the emotion understanding network module, including a positive emotion understanding unit and a negative emotion understanding unit, which are used to capture the positive emotional information and the negative emotional information, respectively. We further proposed a dynamic fusion strategy in the emotion understanding network module to fuse the positive emotional information and the negative emotional information. Results We evaluated our method on the Reddit data set. The experimental results showed that the proposed emotion-based attention network model achieved an accuracy, precision, recall, and F-measure of 91.30%, 91.91%, 96.15%, and 93.98%, respectively, which are comparable results compared with state-of-the-art methods. Conclusions The experimental results showed that our model is competitive with the state-of-the-art models. The semantic understanding network module, the emotion understanding network module, and the dynamic fusion strategy are effective modules for depression detection. In addition, the experimental results verified that the emotional semantic information was effective in depression detection.


Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Deep reinforcement learning (DRL) methods traditionally struggle with tasks where environment rewards are sparse or delayed, which entails that exploration remains one of the key challenges of DRL. Instead of solely relying on extrinsic rewards, many state-of-the-art methods use intrinsic curiosity as exploration signal. While they hold promise of better local exploration, discovering global exploration strategies is beyond the reach of current methods. We propose a novel end-to-end intrinsic reward formulation that introduces high-level exploration in reinforcement learning. Our curiosity signal is driven by a fast reward that deals with local exploration and a slow reward that incentivizes long-time horizon exploration strategies. We formulate curiosity as the error in an agent’s ability to reconstruct the observations given their contexts. Experimental results show that this high-level exploration enables our agents to outperform prior work in several Atari games.


2020 ◽  
pp. 1-24
Author(s):  
Dequan Jin ◽  
Ziyan Qin ◽  
Murong Yang ◽  
Penghe Chen

We propose a novel neural model with lateral interaction for learning tasks. The model consists of two functional fields: an elementary field to extract features and a high-level field to store and recognize patterns. Each field is composed of some neurons with lateral interaction, and the neurons in different fields are connected by the rules of synaptic plasticity. The model is established on the current research of cognition and neuroscience, making it more transparent and biologically explainable. Our proposed model is applied to data classification and clustering. The corresponding algorithms share similar processes without requiring any parameter tuning and optimization processes. Numerical experiments validate that the proposed model is feasible in different learning tasks and superior to some state-of-the-art methods, especially in small sample learning, one-shot learning, and clustering.


2019 ◽  
Vol 9 (18) ◽  
pp. 3908 ◽  
Author(s):  
Jintae Kim ◽  
Shinhyeok Oh ◽  
Oh-Woog Kwon ◽  
Harksoo Kim

To generate proper responses to user queries, multi-turn chatbot models should selectively consider dialogue histories. However, previous chatbot models have simply concatenated or averaged vector representations of all previous utterances without considering contextual importance. To mitigate this problem, we propose a multi-turn chatbot model in which previous utterances participate in response generation using different weights. The proposed model calculates the contextual importance of previous utterances by using an attention mechanism. In addition, we propose a training method that uses two types of Wasserstein generative adversarial networks to improve the quality of responses. In experiments with the DailyDialog dataset, the proposed model outperformed the previous state-of-the-art models based on various performance measures.


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2038
Author(s):  
Xi Shao ◽  
Xuan Zhang ◽  
Guijin Tang ◽  
Bingkun Bao

We propose a new end-to-end scene recognition framework, called a Recurrent Memorized Attention Network (RMAN) model, which performs object-based scene classification by recurrently locating and memorizing objects in the image. Based on the proposed framework, we introduce a multi-task mechanism that contiguously attends on the different essential objects in a scene image and recurrently performs memory fusion of the features of object focused by an attention model to improve the scene recognition accuracy. The experimental results show that the RMAN model has achieved better classification performance on the constructed dataset and two public scene datasets, surpassing state-of-the-art image scene recognition approaches.


2020 ◽  
Vol 10 (7) ◽  
pp. 2421
Author(s):  
Bencheng Yan ◽  
Chaokun Wang ◽  
Gaoyang Guo

Recently, graph neural networks (GNNs) have achieved great success in dealing with graph-based data. The basic idea of GNNs is iteratively aggregating the information from neighbors, which is a special form of Laplacian smoothing. However, most of GNNs fall into the over-smoothing problem, i.e., when the model goes deeper, the learned representations become indistinguishable. This reflects the inability of the current GNNs to explore the global graph structure. In this paper, we propose a novel graph neural network to address this problem. A rejection mechanism is designed to address the over-smoothing problem, and a dilated graph convolution kernel is presented to capture the high-level graph structure. A number of experimental results demonstrate that the proposed model outperforms the state-of-the-art GNNs, and can effectively overcome the over-smoothing problem.


2020 ◽  
Vol 34 (05) ◽  
pp. 9749-9756
Author(s):  
Junnan Zhu ◽  
Yu Zhou ◽  
Jiajun Zhang ◽  
Haoran Li ◽  
Chengqing Zong ◽  
...  

Multimodal summarization with multimodal output (MSMO) is to generate a multimodal summary for a multimodal news report, which has been proven to effectively improve users' satisfaction. The existing MSMO methods are trained by the target of text modality, leading to the modality-bias problem that ignores the quality of model-selected image during training. To alleviate this problem, we propose a multimodal objective function with the guidance of multimodal reference to use the loss from the summary generation and the image selection. Due to the lack of multimodal reference data, we present two strategies, i.e., ROUGE-ranking and Order-ranking, to construct the multimodal reference by extending the text reference. Meanwhile, to better evaluate multimodal outputs, we propose a novel evaluation metric based on joint multimodal representation, projecting the model output and multimodal reference into a joint semantic space during evaluation. Experimental results have shown that our proposed model achieves the new state-of-the-art on both automatic and manual evaluation metrics. Besides, our proposed evaluation method can effectively improve the correlation with human judgments.


Sign in / Sign up

Export Citation Format

Share Document