Motion Guided Spatial Attention for Video Captioning

Sequence-to-sequence models incorporated with attention mechanism have shown promising improvements on video captioning. While there is rich information both inside and between frames, spatial attention is rarely explored and motion information is usually handled by 3D-CNNs as just another modality for fusion. On the other hand, researches about human perception suggest that apparent motion can attract attention. Motivated by this, we aim to learn spatial attention on video frames under the guidance of motion information for caption generation. We present a novel video captioning framework by utilizing Motion Guided Spatial Attention (MGSA). The proposed MGSA exploits the motion between video frames by learning spatial attention from stacked optical flow images with a custom CNN. To further relate the spatial attention maps of video frames, we designed a Gated Attention Recurrent Unit (GARU) to adaptively incorporate previous attention maps. The whole framework can be trained in an end-to-end manner. We evaluate our approach on two benchmark datasets, MSVD and MSR-VTT. The experiments show that our designed model can generate better video representation and state of the art results are obtained under popular evaluation metrics such as BLEU@4, CIDEr, and METEOR.

Download Full-text

Learning Soft Mask Based Feature Fusion with Channel and Spatial Attention for Robust Visual Object Tracking

Sensors ◽

10.3390/s20144021 ◽

2020 ◽

Vol 20 (14) ◽

pp. 4021 ◽

Cited By ~ 2

Author(s):

Mustansar Fiaz ◽

Arif Mahmood ◽

Soon Ki Jung

Keyword(s):

Object Tracking ◽

Spatial Attention ◽

Feature Fusion ◽

State Of The Art ◽

Feature Representation ◽

Visual Object ◽

Target Feature ◽

Visual Object Tracking ◽

Low Level ◽

Benchmark Datasets

We propose to improve the visual object tracking by introducing a soft mask based low-level feature fusion technique. The proposed technique is further strengthened by integrating channel and spatial attention mechanisms. The proposed approach is integrated within a Siamese framework to demonstrate its effectiveness for visual object tracking. The proposed soft mask is used to give more importance to the target regions as compared to the other regions to enable effective target feature representation and to increase discriminative power. The low-level feature fusion improves the tracker robustness against distractors. The channel attention is used to identify more discriminative channels for better target representation. The spatial attention complements the soft mask based approach to better localize the target objects in challenging tracking scenarios. We evaluated our proposed approach over five publicly available benchmark datasets and performed extensive comparisons with 39 state-of-the-art tracking algorithms. The proposed tracker demonstrates excellent performance compared to the existing state-of-the-art trackers.

Download Full-text

Bilateral Multi-Perspective Matching for Natural Language Sentences

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/579 ◽

2017 ◽

Cited By ~ 92

Author(s):

Zhiguo Wang ◽

Wael Hamza ◽

Radu Florian

Keyword(s):

Natural Language ◽

State Of The Art ◽

The State ◽

Experimental Results ◽

The Other ◽

Multiple Perspectives ◽

Time Step ◽

Benchmark Datasets ◽

Sentence Matching ◽

Fully Connected

Natural language sentence matching is a fundamental technology for a variety of tasks. Previous approaches either match sentences from a single direction or only apply single granular (word-by-word or sentence-by-sentence) matching. In this work, we propose a bilateral multi-perspective matching (BiMPM) model. Given two sentences P and Q, our model first encodes them with a BiLSTM encoder. Next, we match the two encoded sentences in two directions P against Q and P against Q. In each matching direction, each time step of one sentence is matched against all time-steps of the other sentence from multiple perspectives. Then, another BiLSTM layer is utilized to aggregate the matching results into a fix-length matching vector. Finally, based on the matching vector, a decision is made through a fully connected layer. We evaluate our model on three tasks: paraphrase identification, natural language inference and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves the state-of-the-art performance on all tasks.

Download Full-text

Aspect Term Extraction with History Attention and Selective Transformation

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/583 ◽

2018 ◽

Cited By ~ 19

Author(s):

Xin Li ◽

Lidong Bing ◽

Piji Li ◽

Wai Lam ◽

Zhimou Yang

Keyword(s):

State Of The Art ◽

The Other ◽

Coordinate Structure ◽

Selective Transformation ◽

Term Extraction ◽

History Information ◽

Online User Reviews ◽

Benchmark Datasets ◽

Input Sentence ◽

New Framework

Aspect Term Extraction (ATE), a key sub-task in Aspect-Based Sentiment Analysis, aims to extract explicit aspect expressions from online user reviews. We present a new framework for tackling ATE. It can exploit two useful clues, namely opinion summary and aspect detection history. Opinion summary is distilled from the whole input sentence, conditioned on each current token for aspect prediction, and thus the tailor-made summary can help aspect prediction on this token. On the other hand, the aspect detection history information is distilled from the previous aspect predictions, and it can leverage the coordinate structure and tagging schema constraints to upgrade the aspect prediction. Experimental results over four benchmark datasets clearly demonstrate that our framework can outperform all state-of-the-art methods.

Download Full-text

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/88 ◽

2020 ◽

Author(s):

Tao Jin ◽

Siyu Huang ◽

Ming Chen ◽

Yingming Li ◽

Zhongfei Zhang

Keyword(s):

State Of The Art ◽

Multimodal Interaction ◽

Multimodal Learning ◽

Local Correlation ◽

Learning Problem ◽

Generation Task ◽

Language Generation ◽

Video Captioning ◽

Benchmark Datasets ◽

Novel Method

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer is proposed for uni-modal language generation task such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs boundary-aware pooling operation for scores from multihead attention and selects diverse features from different scenarios. Also, SBAT includes a local correlation scheme to compensate for the local information loss brought by sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.

Download Full-text

Color-Sensitive Person Re-Identification

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/131 ◽

2019 ◽

Cited By ~ 2

Author(s):

Guan'an Wang ◽

Yang Yang ◽

Jian Cheng ◽

Jinqiao Wang ◽

Zengguang Hou

Keyword(s):

Full Advantage ◽

State Of The Art ◽

Color Difference ◽

The Other ◽

Semantic Features ◽

Color Information ◽

Benchmark Datasets ◽

Identity Consistency ◽

High Level ◽

Sensitive Person

Recent deep Re-ID models mainly focus on learning high-level semantic features, while failing to explicitly explore color information which is one of the most important cues for person Re-ID. In this paper, we propose a novel Color-Sensitive Re-ID to take full advantage of color information. On one hand, we train our model with real and fake images. By using the extra fake images, more color information can be exploited and it can avoid overfitting during training. On the other hand, we also train our model with images of the same person with different colors. By doing so, features can be forced to focus on the color difference in regions. To generate fake images with specified colors, we propose a novel Color Translation GAN (CTGAN) to learn mappings between different clothing colors and preserve identity consistency among the same clothing color. Extensive evaluations on two benchmark datasets show that our approach significantly outperforms state-of-the-art Re-ID models.

Download Full-text

Hierarchical Knowledge Squeezed Adversarial Network Compression

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6799 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11370-11377

Author(s):

Peng Li ◽

Chang Shu ◽

Yuan Xie ◽

Yan Qu ◽

Hui Kong

Keyword(s):

State Of The Art ◽

Teacher Student ◽

Adversarial Network ◽

Benchmark Datasets ◽

Knowledge Distillation ◽

Adversarial Training ◽

Rich Information ◽

Process Oriented ◽

Transfer Method ◽

Network Compression

Deep network compression has been achieved notable progress via knowledge distillation, where a teacher-student learning manner is adopted by using predetermined loss. Recently, more focuses have been transferred to employ the adversarial training to minimize the discrepancy between distributions of output from two networks. However, they always emphasize on result-oriented learning while neglecting the scheme of process-oriented learning, leading to the loss of rich information contained in the whole network pipeline. Whereas in other (non GAN-based) process-oriented methods, the knowledge have usually been transferred in a redundant manner. Observing that, the small network can not perfectly mimic a large one due to the huge gap of network scale, we propose a knowledge transfer method, involving effective intermediate supervision, under the adversarial training framework to learn the student network. Different from the other intermediate supervision methods, we design the knowledge representation in a compact form by introducing a task-driven attention mechanism. Meanwhile, to improve the representation capability of the attention-based method, a hierarchical structure is utilized so that powerful but highly squeezed knowledge is realized and the knowledge from teacher network could accommodate the size of student network. Extensive experimental results on three typical benchmark datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, demonstrate that our method achieves highly superior performances against state-of-the-art methods.

Download Full-text

Video captioning with stacked attention and semantic hard pull

PeerJ Computer Science ◽

10.7717/peerj-cs.664 ◽

2021 ◽

Vol 7 ◽

pp. e664

Author(s):

Md. Mushfiqur Rahman ◽

Thasin Abedin ◽

Khondokar S.S. Prottoy ◽

Ayana Moshruba ◽

Fazlul Hasan Siddiqui

Keyword(s):

Qualitative Analysis ◽

Language Processing ◽

State Of The Art ◽

The Other ◽

Video Sequences ◽

Automated Scoring ◽

Video Captioning ◽

Human Evaluation ◽

Novel Approaches ◽

Evaluation Metric

Video captioning, i.e., the task of generating captions from video sequences creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. The task of generating a semantically accurate description of a video is quite complex. Considering the complexity, of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers—one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC) which modifies the context generation mechanism by using two novel approaches—“stacked attention” and “spatial hard pull”. As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. SS Score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.

Download Full-text

Ensemble Soft-Margin Softmax Loss for Image Classification

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/138 ◽

2018 ◽

Cited By ~ 11

Author(s):

Xiaobo Wang ◽

Shifeng Zhang ◽

Zhen Lei ◽

Si Liu ◽

Xiaojie Guo ◽

...

Keyword(s):

Image Classification ◽

State Of The Art ◽

The Other ◽

Other Hand ◽

Soft Margin ◽

Benchmark Datasets ◽

The One ◽

Independence Criterion

Softmax loss is arguably one of the most popular losses to train CNN models for image classification. However, recent works have exposed its limitation on feature discriminability. This paper casts a new viewpoint on the weakness of softmax loss. On the one hand, the CNN features learned using the softmax loss are often inadequately discriminative. We hence introduce a soft-margin softmax function to explicitly encourage the discrmination between different classes. On the other hand, the learned classifier of softmax loss is weak. We propose to assemble multiple these weak classifiers to a strong one, inspired by the recognition that the diversity among weak classifiers is critical to a good ensemble. To achieve the diversity, we adopt the Hilbert-Schmidt Independence Criterion (HSIC). Considering these two aspects in one framework, we design a novel loss, named as Ensemble Soft-Margin Softmax (EM-Softmax). Extensive experiments on benchmark datasets are conducted to show the superiority of our design over the baseline softmax loss and several state-of-the-art alternatives.

Download Full-text

Computer Vision and robotics in postal automation

Human Systems Management ◽

10.3233/hsm-1999-183-411 ◽

1999 ◽

Vol 18 (3-4) ◽

pp. 265-273

Author(s):

Giovanni B. Garibotto

Keyword(s):

Image Processing ◽

Computer Vision ◽

Pattern Recognition ◽

Material Handling ◽

State Of The Art ◽

Short Description ◽

The Other ◽

Functional Requirements ◽

Postal Automation ◽

And Robotics

The paper is intended to provide an overview of advanced robotic technologies within the context of Postal Automation services. The main functional requirements of the application are briefly referred, as well as the state of the art and new emerging solutions. Image Processing and Pattern Recognition have always played a fundamental role in Address Interpretation and Mail sorting and the new challenging objective is now off-line handwritten cursive recognition, in order to be able to handle all kind of addresses in a uniform way. On the other hand, advanced electromechanical and robotic solutions are extremely important to solve the problems of mail storage, transportation and distribution, as well as for material handling and logistics. Finally a short description of new services of Postal Automation is referred, by considering new emerging services of hybrid mail and paper to electronic conversion.

Download Full-text

BiLabel-Specific Features for Multi-Label Classification

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3458283 ◽

2021 ◽

Vol 16 (1) ◽

pp. 1-23

Author(s):

Min-Ling Zhang ◽

Jun-Peng Fang ◽

Yi-Bo Wang

Keyword(s):

Predictive Models ◽

Comparative Studies ◽

State Of The Art ◽

Classification Model ◽

Generation Process ◽

Prototype Selection ◽

Class Label ◽

Benchmark Datasets ◽

Label Correlations ◽

Class Labels

In multi-label classification, the task is to induce predictive models which can assign a set of relevant labels for the unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced based on its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, where label correlations are not fully considered in the label-specific features generation process. In this article, we extend existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.

Download Full-text