Image Captioning with Visual-Semantic LSTM

Author(s):  
Nannan Li ◽  
Zhenzhong Chen

In this paper, a novel image captioning approach is proposed to describe the content of images. Inspired by the visual processing of the human cognitive system, we propose a visual-semantic LSTM model that locates the objects of attention with their low-level features in the visual cell and then successively extracts high-level semantic features in the semantic cell. In addition, a state perturbation term is introduced into the word sampling strategy of the REINFORCE-based method to encourage exploration of suitable vocabulary during training. Experimental results on MS COCO and Flickr30K validate the effectiveness of our approach compared with state-of-the-art methods.
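
As a rough illustration of the two-cell idea, the following PyTorch sketch chains a visual LSTM cell that attends over low-level region features with a semantic LSTM cell that consumes high-level attribute features before predicting the next word. All module names, dimensions, and the attention form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualSemanticLSTM(nn.Module):
    """Illustrative two-cell decoder: a visual cell attends over low-level
    region features, a semantic cell then consumes high-level semantic
    features. Sizes and layout are assumed, not taken from the paper."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512,
                 feat_dim=2048, sem_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.semantic_cell = nn.LSTMCell(hidden_dim + sem_dim, hidden_dim)
        self.att = nn.Linear(hidden_dim + feat_dim, 1)   # additive attention score
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, regions, sem_feats, state_v, state_s):
        # regions: (B, R, feat_dim) low-level region features
        # sem_feats: (B, sem_dim) high-level semantic (attribute) features
        h_v, c_v = state_v
        scores = self.att(torch.cat(
            [h_v.unsqueeze(1).expand(-1, regions.size(1), -1), regions],
            dim=-1)).softmax(dim=1)
        attended = (scores * regions).sum(dim=1)          # attention-weighted regions
        h_v, c_v = self.visual_cell(
            torch.cat([self.embed(word), attended], dim=-1), (h_v, c_v))
        h_s, c_s = self.semantic_cell(torch.cat([h_v, sem_feats], dim=-1), state_s)
        # During REINFORCE training, a small perturbation could be added to h_s
        # before word sampling, per the abstract; the exact form is not shown here.
        return self.fc(h_s), (h_v, c_v), (h_s, c_s)
```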

Author(s):  
Zhihao Fan ◽  
Zhongyu Wei ◽  
Siyuan Wang ◽  
Ruize Wang ◽  
Zejun Li ◽  
...  

Existing research on image captioning usually represents an image as a scene graph of low-level facts (objects and relations) and fails to capture high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN takes both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN takes both captions and theme concepts as input for text representation reconstruction. Both settings aim to generate target captions with the same Transformer-based decoder. During training, we further align the representations of theme concepts learned from images and their corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to state-of-the-art models.
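
A minimal sketch of modeling theme concepts as memory vectors inside a Transformer, assuming the learned vectors are simply prepended to the input tokens so self-attention can mix them with scene-graph (vision side) or caption (language side) features, and that alignment is a simple L2 loss. Names, sizes, and the loss form are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThemeNodeEncoder(nn.Module):
    """Illustrative Transformer with theme nodes: learned theme-concept memory
    vectors are appended to the token sequence before self-attention."""
    def __init__(self, d_model=512, n_themes=20, n_layers=3, n_heads=8):
        super().__init__()
        self.theme_memory = nn.Parameter(torch.randn(n_themes, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_themes = n_themes

    def forward(self, tokens):
        # tokens: (B, N, d_model) scene-graph node features or caption embeddings
        themes = self.theme_memory.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([themes, tokens], dim=1))
        theme_out, token_out = out[:, :self.n_themes], out[:, self.n_themes:]
        return theme_out, token_out

def theme_alignment_loss(img_themes, cap_themes):
    # Pull image-side and caption-side theme representations together
    # (assumed L2 form for cross-modality alignment).
    return F.mse_loss(img_themes, cap_themes)
```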


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-5
Author(s):  
Huafeng Chen ◽  
Maosheng Zhang ◽  
Zhengming Gao ◽  
Yunhong Zhao

Current chaos-based methods for action recognition in videos rely on hand-crafted features, which limits recognition accuracy. In this paper, we extend ChaosNet into a deep neural network and apply it to action recognition. First, we extend ChaosNet to a deep ChaosNet for extracting action features. Then, we feed the features to a low-level LSTM encoder and a high-level LSTM encoder to obtain low-level and high-level encodings, respectively. An agent serves as the behavior recognizer that produces recognition results, while a manager, implemented as a hidden layer, provides behavioral segmentation targets at the high level. Our experiments are conducted on two standard action datasets, UCF101 and HMDB51. The experimental results show that the proposed algorithm outperforms the state of the art.
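
A minimal sketch of the two-level encoding described above, assuming the low-level LSTM reads per-frame features (standing in for deep ChaosNet output) and the high-level LSTM summarizes them at a coarser temporal stride before classification; the layout and sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HierarchicalActionRecognizer(nn.Module):
    """Illustrative two-level encoder: a low-level LSTM over per-frame
    features, a high-level LSTM at a coarser stride, and a classifier
    standing in for the 'agent'."""
    def __init__(self, feat_dim=256, hidden_dim=256, n_classes=101, stride=8):
        super().__init__()
        self.low = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.high = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.agent = nn.Linear(hidden_dim, n_classes)
        self.stride = stride

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) per-frame action features
        low_out, _ = self.low(frame_feats)
        high_out, _ = self.high(low_out[:, ::self.stride])   # coarser temporal scale
        return self.agent(high_out[:, -1])                   # action class logits
```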


2011 ◽  
Vol 26 (10) ◽  
pp. 612-627 ◽  
Author(s):  
Hyun-seok Min ◽  
Jae Young Choi ◽  
Wesley De Neve ◽  
Yong Man Ro

2018 ◽  
Vol 2018 ◽  
pp. 1-11 ◽  
Author(s):  
Hai Wang ◽  
Lei Dai ◽  
Yingfeng Cai ◽  
Long Chen ◽  
Yong Zhang

Traditional salient object detection models fall into several classes based on low-level features and contrast between pixels. In this paper, we propose a model based on a multilevel deep pyramid (MLDP) that fuses multiple features at different levels. First, the MLDP feeds the original image into a VGG16 model to extract high-level features and form an initial saliency map. Next, the MLDP further extracts high-level features to form a saliency map based on a deep pyramid. Then, the MLDP extracts low-level features to obtain a saliency map fused with superpixels. After that, the MLDP applies background noise filtering to the superpixel-fused saliency map in order to filter out the interference of background noise and form a foreground-based saliency map. Lastly, the MLDP combines the superpixel-fused saliency map with the foreground-based saliency map to produce the final saliency map. Because the MLDP is not limited to low-level features but fuses multiple feature levels, it achieves good results when extracting salient targets. As shown in our experiment section, the MLDP outperforms seven other state-of-the-art models across three public saliency datasets, demonstrating its superiority and wide applicability in the extraction of salient targets.
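
The sketch below illustrates the later fusion steps in spirit: high-level saliency is averaged within superpixels, low-confidence regions are suppressed as background noise, and the result is combined with the high-level map. The threshold, weighting, and overall form are assumptions, not the paper's exact procedure.

```python
import numpy as np

def fuse_saliency_maps(high_level_map, superpixel_labels, noise_thresh=0.1):
    """Illustrative MLDP-style fusion: superpixel averaging, background
    noise filtering, and combination with the high-level saliency map."""
    sp_map = np.zeros_like(high_level_map)
    for label in np.unique(superpixel_labels):
        mask = superpixel_labels == label
        sp_map[mask] = high_level_map[mask].mean()           # superpixel-level saliency
    foreground = np.where(sp_map > noise_thresh, sp_map, 0.0)  # background noise filtering
    fused = 0.5 * high_level_map + 0.5 * foreground          # final combination (assumed weights)
    return fused / (fused.max() + 1e-8)
```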


Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Deep reinforcement learning (DRL) methods traditionally struggle with tasks where environment rewards are sparse or delayed, so exploration remains one of the key challenges of DRL. Instead of relying solely on extrinsic rewards, many state-of-the-art methods use intrinsic curiosity as an exploration signal. While they hold the promise of better local exploration, discovering global exploration strategies is beyond the reach of current methods. We propose a novel end-to-end intrinsic reward formulation that introduces high-level exploration into reinforcement learning. Our curiosity signal is driven by a fast reward that handles local exploration and a slow reward that incentivizes long-horizon exploration strategies. We formulate curiosity as the error in an agent’s ability to reconstruct observations given their contexts. Experimental results show that this high-level exploration enables our agents to outperform prior work in several Atari games.
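
A minimal sketch of combining a fast and a slow curiosity signal, assuming both are reconstruction errors of an observation given its context and differ only in how often their models are updated; the architecture, weighting, and update schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextReconstructor(nn.Module):
    """Predicts an observation from its context; reconstruction error serves
    as the curiosity signal. Sizes are illustrative."""
    def __init__(self, obs_dim=64, ctx_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def forward(self, context):
        return self.net(context)

def intrinsic_reward(obs, context, fast_model, slow_model, beta=0.5):
    # Fast model: updated frequently, rewards local novelty.
    # Slow model: updated rarely, rewards long-horizon exploration.
    with torch.no_grad():
        r_fast = F.mse_loss(fast_model(context), obs, reduction='none').mean(dim=-1)
        r_slow = F.mse_loss(slow_model(context), obs, reduction='none').mean(dim=-1)
    return r_fast + beta * r_slow
```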


Author(s):  
Monika Singh ◽  
Anand Singh Singh Jalal ◽  
Ruchira Manke ◽  
Aamir Khan

Saliency detection has long been a challenging and interesting research area. Existing methods focus on either the foreground or the background regions of an image by computing low-level features, but considering low-level features alone does not produce satisfactory results. In this paper, low-level features extracted from superpixels are combined with high-level priors. Background features serve as the low-level prior, exploiting the observation that background areas resemble the image boundary, are interconnected, and have minimal distance between them. High-level priors such as location, color, and semantic priors are incorporated with the low-level prior to highlight the salient area of the image. The experimental results illustrate that the proposed approach outperforms the state-of-the-art methods.
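
A minimal sketch of folding high-level priors into a superpixel-based low-level saliency map; the Gaussian location prior and the multiplicative combination are assumptions chosen for illustration, not necessarily the paper's formulation.

```python
import numpy as np

def center_prior(h, w, sigma=0.3):
    # Location prior: salient objects tend to lie near the image center (assumed Gaussian form).
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    d2 = ((ys - cy) / h) ** 2 + ((xs - cx) / w) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def saliency_with_priors(low_level_sal, color_prior, semantic_prior):
    """Combine the superpixel-based low-level saliency map with high-level
    priors; an element-wise product followed by normalization."""
    combined = low_level_sal * center_prior(*low_level_sal.shape) * color_prior * semantic_prior
    return combined / (combined.max() + 1e-8)
```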


2018 ◽  
Vol 6 ◽  
pp. 421-435 ◽  
Author(s):  
Yan Shao ◽  
Christian Hardmeier ◽  
Joakim Nivre

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew, when compared to previous work.
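
Word segmentation as sequence tagging is commonly cast as per-character BIES labeling; the sketch below shows this generic encoding and decoding (a standard formulation, not necessarily the authors' exact tag set or model).

```python
def words_to_tags(words):
    """Encode a segmented sentence as per-character BIES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append('S')
        else:
            tags.extend(['B'] + ['I'] * (len(w) - 2) + ['E'])
    return tags

def tags_to_words(chars, tags):
    """Decode predicted tags back into words; a new word starts at 'B' or 'S'."""
    words, current = [], ''
    for ch, tag in zip(chars, tags):
        if tag in ('B', 'S') and current:
            words.append(current)
            current = ''
        current += ch
        if tag in ('E', 'S'):
            words.append(current)
            current = ''
    if current:
        words.append(current)
    return words

# Example: words_to_tags(['中国', '人']) -> ['B', 'E', 'S']
```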


Author(s):  
Weichun Liu ◽  
Xiaoan Tang ◽  
Chenglin Zhao

Recently, deep trackers based on siamese networks have been enjoying increasing popularity in the tracking community. Generally, those trackers learn a high-level semantic embedding space for feature representation but lose low-level fine-grained details. Meanwhile, the learned high-level semantic features are not updated during online tracking, which results in tracking drift in the presence of target appearance variation and similar distractors. In this paper, we present a novel end-to-end trainable Convolutional Neural Network (CNN) based on a siamese network for distractor-aware tracking. It enhances target appearance representation in both the offline training stage and the online tracking stage. In the offline training stage, the network learns low-level fine-grained details and high-level coarse-grained semantics simultaneously in a multi-task learning framework. The low-level features, with better resolution, are complementary to the semantic features and able to distinguish the foreground target from background distractors. In the online stage, the learned low-level features are fed into a correlation filter layer and updated in an interpolated manner to adaptively encode target appearance variation. The learned high-level features are fed into a cross-correlation layer without online update. Therefore, the proposed tracker benefits from both the adaptability of the fine-grained correlation filter and the generalization capability of the semantic embedding. Extensive experiments are conducted on the public OTB100 and UAV123 benchmark datasets. Our tracker achieves state-of-the-art performance while running at a real-time frame rate.
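
A minimal sketch of the two branches, assuming the semantic branch is a standard siamese cross-correlation with a fixed template and the fine-grained branch updates its template by linear interpolation; both are simplified stand-ins for the correlation-filter formulation described above.

```python
import torch
import torch.nn.functional as F

def cross_correlation(search_feat, template_feat):
    """Fixed semantic branch: the template acts as a convolution kernel over
    the search-region features (standard siamese cross-correlation).
    search_feat: (1, C, Hs, Ws), template_feat: (1, C, Ht, Wt)."""
    return F.conv2d(search_feat, template_feat)

class InterpolatedTemplate:
    """Fine-grained branch: online template update by linear interpolation,
    a simplified stand-in for the correlation-filter update."""
    def __init__(self, init_template, lr=0.01):
        self.template = init_template.clone()
        self.lr = lr

    def update(self, new_template):
        # Blend the stored template toward the newest appearance observation.
        self.template = (1 - self.lr) * self.template + self.lr * new_template
```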


2020 ◽  
Vol 34 (4) ◽  
pp. 571-584
Author(s):  
Rajarshi Biswas ◽  
Michael Barz ◽  
Daniel Sonntag

Image captioning is a challenging multimodal task. Significant improvements have been obtained with deep learning, yet captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from object-specific salient regions of the input image, embedding the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
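
A minimal sketch of interactive re-ranking, assuming each beam candidate carries its decoder log-probability and an external grounding or feedback score is folded in linearly; the scoring function and weight are hypothetical, not the paper's exact procedure.

```python
def rerank_candidates(candidates, grounding_score, alpha=0.7):
    """Re-rank beam-search caption candidates by combining the decoder's
    log-probability with an external explanatory/grounding score.
    candidates: list of (caption, log_prob); grounding_score: caption -> float."""
    scored = [(alpha * lp + (1 - alpha) * grounding_score(cap), cap)
              for cap, lp in candidates]
    return [cap for _, cap in sorted(scored, reverse=True)]
```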

