Image Captioning with Visual-Semantic LSTM

Author(s):  
Nannan Li ◽  
Zhenzhong Chen

In this paper, a novel image captioning approach is proposed to describe the content of images. Inspired by the visual processing of the human cognitive system, we propose a visual-semantic LSTM model that locates the objects of attention with their low-level features in the visual cell and then successively extracts high-level semantic features in the semantic cell. In addition, a state perturbation term is introduced into the word sampling strategy of the REINFORCE-based method to encourage exploration of suitable vocabulary during training. Experimental results on MS COCO and Flickr30K validate the effectiveness of our approach compared with state-of-the-art methods.
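
As a rough illustration of the two-cell idea, the following PyTorch sketch chains a visual LSTM cell that attends over low-level region features with a semantic LSTM cell that consumes high-level attribute features before predicting the next word. All module names, dimensions, and the attention form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualSemanticLSTM(nn.Module):
    """Illustrative two-cell decoder: a visual cell attends over low-level
    region features, a semantic cell then consumes high-level semantic
    features. Sizes and layout are assumed, not taken from the paper."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512,
                 feat_dim=2048, sem_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.semantic_cell = nn.LSTMCell(hidden_dim + sem_dim, hidden_dim)
        self.att = nn.Linear(hidden_dim + feat_dim, 1)   # additive attention score
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, regions, sem_feats, state_v, state_s):
        # regions: (B, R, feat_dim) low-level region features
        # sem_feats: (B, sem_dim) high-level semantic (attribute) features
        h_v, c_v = state_v
        scores = self.att(torch.cat(
            [h_v.unsqueeze(1).expand(-1, regions.size(1), -1), regions],
            dim=-1)).softmax(dim=1)
        attended = (scores * regions).sum(dim=1)          # attention-weighted regions
        h_v, c_v = self.visual_cell(
            torch.cat([self.embed(word), attended], dim=-1), (h_v, c_v))
        h_s, c_s = self.semantic_cell(torch.cat([h_v, sem_feats], dim=-1), state_s)
        # During REINFORCE training, a small perturbation could be added to h_s
        # before word sampling, per the abstract; the exact form is not shown here.
        return self.fc(h_s), (h_v, c_v), (h_s, c_s)
```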

Author(s):  
Zhihao Fan ◽  
Zhongyu Wei ◽  
Siyuan Wang ◽  
Ruize Wang ◽  
Zejun Li ◽  
...  

Existing research on image captioning usually represents an image as a scene graph of low-level facts (objects and relations) and fails to capture high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN takes both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN takes both captions and theme concepts as input for text representation reconstruction. Both settings aim to generate target captions with the same Transformer-based decoder. During training, we further align the representations of theme concepts learned from images and their corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to state-of-the-art models.
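
A minimal sketch of modeling theme concepts as memory vectors inside a Transformer, assuming the learned vectors are simply prepended to the input tokens so self-attention can mix them with scene-graph (vision side) or caption (language side) features, and that alignment is a simple L2 loss. Names, sizes, and the loss form are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThemeNodeEncoder(nn.Module):
    """Illustrative Transformer with theme nodes: learned theme-concept memory
    vectors are appended to the token sequence before self-attention."""
    def __init__(self, d_model=512, n_themes=20, n_layers=3, n_heads=8):
        super().__init__()
        self.theme_memory = nn.Parameter(torch.randn(n_themes, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_themes = n_themes

    def forward(self, tokens):
        # tokens: (B, N, d_model) scene-graph node features or caption embeddings
        themes = self.theme_memory.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([themes, tokens], dim=1))
        theme_out, token_out = out[:, :self.n_themes], out[:, self.n_themes:]
        return theme_out, token_out

def theme_alignment_loss(img_themes, cap_themes):
    # Pull image-side and caption-side theme representations together
    # (assumed L2 form for cross-modality alignment).
    return F.mse_loss(img_themes, cap_themes)
```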


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-5
Author(s):  
Huafeng Chen ◽  
Maosheng Zhang ◽  
Zhengming Gao ◽  
Yunhong Zhao

Current chaos-based methods for action recognition in videos rely on hand-crafted features, which limits recognition accuracy. In this paper, we extend ChaosNet into a deep neural network and apply it to action recognition. First, we extend ChaosNet to a deep ChaosNet for extracting action features. Then, we feed the features to a low-level LSTM encoder and a high-level LSTM encoder to obtain low-level and high-level encodings, respectively. An agent serves as the behavior recognizer that produces recognition results, while a manager, implemented as a hidden layer, provides behavioral segmentation targets at the high level. Our experiments are conducted on two standard action datasets, UCF101 and HMDB51. The experimental results show that the proposed algorithm outperforms the state of the art.
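
A minimal sketch of the two-level encoding described above, assuming the low-level LSTM reads per-frame features (standing in for deep ChaosNet output) and the high-level LSTM summarizes them at a coarser temporal stride before classification; the layout and sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HierarchicalActionRecognizer(nn.Module):
    """Illustrative two-level encoder: a low-level LSTM over per-frame
    features, a high-level LSTM at a coarser stride, and a classifier
    standing in for the 'agent'."""
    def __init__(self, feat_dim=256, hidden_dim=256, n_classes=101, stride=8):
        super().__init__()
        self.low = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.high = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.agent = nn.Linear(hidden_dim, n_classes)
        self.stride = stride

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) per-frame action features
        low_out, _ = self.low(frame_feats)
        high_out, _ = self.high(low_out[:, ::self.stride])   # coarser temporal scale
        return self.agent(high_out[:, -1])                   # action class logits
```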


2011 ◽  
Vol 26 (10) ◽  
pp. 612-627 ◽  
Author(s):  
Hyun-seok Min ◽  
Jae Young Choi ◽  
Wesley De Neve ◽  
Yong Man Ro

2018 ◽  
Vol 2018 ◽  
pp. 1-11 ◽  
Author(s):  
Hai Wang ◽  
Lei Dai ◽  
Yingfeng Cai ◽  
Long Chen ◽  
Yong Zhang

Traditional salient object detection models fall into several classes based on low-level features and contrast between pixels. In this paper, we propose a model based on a multilevel deep pyramid (MLDP) that fuses multiple features at different levels. First, the MLDP feeds the original image into a VGG16 model to extract high-level features and form an initial saliency map. Next, the MLDP further extracts high-level features to form a saliency map based on a deep pyramid. Then, the MLDP extracts low-level features to obtain a saliency map fused with superpixels. After that, the MLDP applies background noise filtering to the superpixel-fused saliency map in order to filter out the interference of background noise and form a foreground-based saliency map. Lastly, the MLDP combines the superpixel-fused saliency map with the foreground-based saliency map to produce the final saliency map. Because the MLDP is not limited to low-level features but fuses multiple feature levels, it achieves good results when extracting salient targets. As shown in our experiment section, the MLDP outperforms seven other state-of-the-art models across three public saliency datasets, demonstrating its superiority and wide applicability in the extraction of salient targets.
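
The sketch below illustrates the later fusion steps in spirit: high-level saliency is averaged within superpixels, low-confidence regions are suppressed as background noise, and the result is combined with the high-level map. The threshold, weighting, and overall form are assumptions, not the paper's exact procedure.

```python
import numpy as np

def fuse_saliency_maps(high_level_map, superpixel_labels, noise_thresh=0.1):
    """Illustrative MLDP-style fusion: superpixel averaging, background
    noise filtering, and combination with the high-level saliency map."""
    sp_map = np.zeros_like(high_level_map)
    for label in np.unique(superpixel_labels):
        mask = superpixel_labels == label
        sp_map[mask] = high_level_map[mask].mean()           # superpixel-level saliency
    foreground = np.where(sp_map > noise_thresh, sp_map, 0.0)  # background noise filtering
    fused = 0.5 * high_level_map + 0.5 * foreground          # final combination (assumed weights)
    return fused / (fused.max() + 1e-8)
```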


Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Deep reinforcement learning (DRL) methods traditionally struggle with tasks where environment rewards are sparse or delayed, so exploration remains one of the key challenges of DRL. Instead of relying solely on extrinsic rewards, many state-of-the-art methods use intrinsic curiosity as an exploration signal. While they hold the promise of better local exploration, discovering global exploration strategies is beyond the reach of current methods. We propose a novel end-to-end intrinsic reward formulation that introduces high-level exploration into reinforcement learning. Our curiosity signal is driven by a fast reward that handles local exploration and a slow reward that incentivizes long-horizon exploration strategies. We formulate curiosity as the error in an agent’s ability to reconstruct observations given their contexts. Experimental results show that this high-level exploration enables our agents to outperform prior work in several Atari games.
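
A minimal sketch of combining a fast and a slow curiosity signal, assuming both are reconstruction errors of an observation given its context and differ only in how often their models are updated; the architecture, weighting, and update schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextReconstructor(nn.Module):
    """Predicts an observation from its context; reconstruction error serves
    as the curiosity signal. Sizes are illustrative."""
    def __init__(self, obs_dim=64, ctx_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def forward(self, context):
        return self.net(context)

def intrinsic_reward(obs, context, fast_model, slow_model, beta=0.5):
    # Fast model: updated frequently, rewards local novelty.
    # Slow model: updated rarely, rewards long-horizon exploration.
    with torch.no_grad():
        r_fast = F.mse_loss(fast_model(context), obs, reduction='none').mean(dim=-1)
        r_slow = F.mse_loss(slow_model(context), obs, reduction='none').mean(dim=-1)
    return r_fast + beta * r_slow
```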


Author(s):  
Monika Singh ◽  
Anand Singh Singh Jalal ◽  
Ruchira Manke ◽  
Aamir Khan

Saliency detection has long been a challenging and interesting research area. Existing methods focus on either the foreground or the background regions of an image by computing low-level features, but considering low-level features alone does not produce satisfactory results. In this paper, low-level features extracted from superpixels are combined with high-level priors. Background features serve as the low-level prior, exploiting the observation that background areas resemble the image boundary, are interconnected, and have minimal distance between them. High-level priors such as location, color, and semantic priors are incorporated with the low-level prior to highlight the salient area of the image. The experimental results illustrate that the proposed approach outperforms the state-of-the-art methods.
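
A minimal sketch of folding high-level priors into a superpixel-based low-level saliency map; the Gaussian location prior and the multiplicative combination are assumptions chosen for illustration, not necessarily the paper's formulation.

```python
import numpy as np

def center_prior(h, w, sigma=0.3):
    # Location prior: salient objects tend to lie near the image center (assumed Gaussian form).
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    d2 = ((ys - cy) / h) ** 2 + ((xs - cx) / w) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def saliency_with_priors(low_level_sal, color_prior, semantic_prior):
    """Combine the superpixel-based low-level saliency map with high-level
    priors; an element-wise product followed by normalization."""
    combined = low_level_sal * center_prior(*low_level_sal.shape) * color_prior * semantic_prior
    return combined / (combined.max() + 1e-8)
```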


2018 ◽  
Vol 6 ◽  
pp. 421-435 ◽  
Author(s):  
Yan Shao ◽  
Christian Hardmeier ◽  
Joakim Nivre

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew, when compared to previous work.
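
Word segmentation as sequence tagging is commonly cast as per-character BIES labeling; the sketch below shows this generic encoding and decoding (a standard formulation, not necessarily the authors' exact tag set or model).

```python
def words_to_tags(words):
    """Encode a segmented sentence as per-character BIES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append('S')
        else:
            tags.extend(['B'] + ['I'] * (len(w) - 2) + ['E'])
    return tags

def tags_to_words(chars, tags):
    """Decode predicted tags back into words; a new word starts at 'B' or 'S'."""
    words, current = [], ''
    for ch, tag in zip(chars, tags):
        if tag in ('B', 'S') and current:
            words.append(current)
            current = ''
        current += ch
        if tag in ('E', 'S'):
            words.append(current)
            current = ''
    if current:
        words.append(current)
    return words

# Example: words_to_tags(['中国', '人']) -> ['B', 'E', 'S']
```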


Author(s):  
Weichun Liu ◽  
Xiaoan Tang ◽  
Chenglin Zhao

Recently, deep trackers based on siamese networks have been enjoying increasing popularity in the tracking community. Generally, those trackers learn a high-level semantic embedding space for feature representation but lose low-level fine-grained details. Meanwhile, the learned high-level semantic features are not updated during online tracking, which results in tracking drift in the presence of target appearance variation and similar distractors. In this paper, we present a novel end-to-end trainable Convolutional Neural Network (CNN) based on a siamese network for distractor-aware tracking. It enhances target appearance representation in both the offline training stage and the online tracking stage. In the offline training stage, the network learns low-level fine-grained details and high-level coarse-grained semantics simultaneously in a multi-task learning framework. The low-level features, with better resolution, are complementary to the semantic features and able to distinguish the foreground target from background distractors. In the online stage, the learned low-level features are fed into a correlation filter layer and updated in an interpolated manner to adaptively encode target appearance variation. The learned high-level features are fed into a cross-correlation layer without online update. Therefore, the proposed tracker benefits from both the adaptability of the fine-grained correlation filter and the generalization capability of the semantic embedding. Extensive experiments are conducted on the public OTB100 and UAV123 benchmark datasets. Our tracker achieves state-of-the-art performance while running at a real-time frame rate.
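
A minimal sketch of the two branches, assuming the semantic branch is a standard siamese cross-correlation with a fixed template and the fine-grained branch updates its template by linear interpolation; both are simplified stand-ins for the correlation-filter formulation described above.

```python
import torch
import torch.nn.functional as F

def cross_correlation(search_feat, template_feat):
    """Fixed semantic branch: the template acts as a convolution kernel over
    the search-region features (standard siamese cross-correlation).
    search_feat: (1, C, Hs, Ws), template_feat: (1, C, Ht, Wt)."""
    return F.conv2d(search_feat, template_feat)

class InterpolatedTemplate:
    """Fine-grained branch: online template update by linear interpolation,
    a simplified stand-in for the correlation-filter update."""
    def __init__(self, init_template, lr=0.01):
        self.template = init_template.clone()
        self.lr = lr

    def update(self, new_template):
        # Blend the stored template toward the newest appearance observation.
        self.template = (1 - self.lr) * self.template + self.lr * new_template
```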


2020 ◽  
Vol 34 (4) ◽  
pp. 571-584
Author(s):  
Rajarshi Biswas ◽  
Michael Barz ◽  
Daniel Sonntag

Image captioning is a challenging multimodal task. Significant improvements have been obtained with deep learning, yet captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from object-specific salient regions of the input image, embedding the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
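
A minimal sketch of interactive re-ranking, assuming each beam candidate carries its decoder log-probability and an external grounding or feedback score is folded in linearly; the scoring function and weight are hypothetical, not the paper's exact procedure.

```python
def rerank_candidates(candidates, grounding_score, alpha=0.7):
    """Re-rank beam-search caption candidates by combining the decoder's
    log-probability with an external explanatory/grounding score.
    candidates: list of (caption, log_prob); grounding_score: caption -> float."""
    scored = [(alpha * lp + (1 - alpha) * grounding_score(cap), cap)
              for cap, lp in candidates]
    return [cap for _, cap in sorted(scored, reverse=True)]
```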

