SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Author(s):  
Tao Jin ◽  
Siyu Huang ◽  
Ming Chen ◽  
Yingming Li ◽  
Zhongfei Zhang

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features contain much redundancy across different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT applies a boundary-aware pooling operation to the multi-head attention scores and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss brought by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods on most metrics.
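As a rough illustration of the idea (not the authors' implementation), boundary-aware sparse selection over attention scores could look like the sketch below, where the boundary signal is assumed to be the change in attention mass between adjacent key frames and `topk_ratio` is an illustrative parameter:

```python
# Hedged sketch: keep only key frames that follow large changes in attention mass.
import torch

def boundary_select(scores: torch.Tensor, topk_ratio: float = 0.5) -> torch.Tensor:
    """scores: (batch, heads, T_query, T_key) multi-head attention logits."""
    # Change in attention mass between adjacent key frames ~ a scene boundary (assumption).
    diff = (scores[..., 1:] - scores[..., :-1]).abs().sum(dim=2)       # (B, H, T_key - 1)
    k = max(1, int(topk_ratio * diff.size(-1)))
    keep = diff.topk(k, dim=-1).indices + 1                            # frames right after a boundary
    mask = scores.new_full(scores.shape[:2] + scores.shape[-1:], float("-inf"))  # (B, H, T_key)
    mask.scatter_(-1, keep, 0.0)                                       # un-mask the selected key frames
    return torch.softmax(scores + mask.unsqueeze(2), dim=-1)           # sparse attention weights
```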

Author(s):  
Xinyi Xu ◽  
Huanhuan Cao ◽  
Yanhua Yang ◽  
Erkun Yang ◽  
Cheng Deng

In this work, we tackle the zero-shot metric learning problem and propose a novel method abbreviated as ZSML, with the purpose to learn a distance metric that measures the similarity of unseen categories (even unseen datasets). ZSML achieves strong transferability by capturing multi-nonlinear yet continuous relation among data. It is motivated by two facts: 1) relations can be essentially described from various perspectives; and 2) traditional binary supervision is insufficient to represent continuous visual similarity. Specifically, we first reformulate a collection of specific-shaped convolutional kernels to combine data pairs and generate multiple relation vectors. Furthermore, we design a new cross-update regression loss to discover continuous similarity. Extensive experiments including intra-dataset transfer and inter-dataset transfer on four benchmark datasets demonstrate that ZSML can achieve state-of-the-art performance.
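One possible reading of the relation-vector construction, sketched under assumptions (layer shapes, the 2x1 kernel, and the regression head are illustrative, not from the paper):

```python
# Hedged sketch: a bank of convolutional kernels turns a data pair into a
# multi-dimensional relation vector, then a small head regresses a continuous similarity.
import torch
import torch.nn as nn

class RelationVector(nn.Module):
    def __init__(self, feat_dim: int = 512, n_relations: int = 8):
        super().__init__()
        # Each 2x1 kernel mixes the two paired features from a different "perspective".
        self.pair_conv = nn.Conv2d(1, n_relations, kernel_size=(2, 1))
        self.head = nn.Sequential(nn.Linear(n_relations * feat_dim, 128),
                                  nn.ReLU(),
                                  nn.Linear(128, 1))            # continuous similarity score

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        pair = torch.stack([x, y], dim=1).unsqueeze(1)           # (B, 1, 2, D)
        rel = self.pair_conv(pair).flatten(1)                    # (B, n_relations * D)
        return self.head(rel).squeeze(-1)                        # predicted similarity
```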


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Huaping Guo ◽  
Weimei Zhi ◽  
Hongbing Liu ◽  
Mingliang Xu

In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry; the problem concerns the performance of learning algorithms in the presence of data with severe class-distribution skews. In this paper, we apply the well-known statistical model of logistic discrimination to this problem and propose a novel method to improve its performance. To fully account for the class imbalance, we design a new cost function which takes into account the accuracies of both the positive and negative classes as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing the proposed cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed one performs significantly better on measures of recall, g-mean, f-measure, AUC, and accuracy.
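A minimal sketch of such a cost, assuming soft (differentiable) confusion-matrix counts and equal weighting of the three terms; the exact functional form in the paper may differ:

```python
# Hedged sketch: maximize recall + specificity + precision instead of plain log-likelihood.
import torch

def imbalance_aware_cost(logits: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    p = torch.sigmoid(logits)                 # predicted P(y = 1 | x) from the logistic model
    tp = (p * y).sum()                        # soft true positives
    fn = ((1 - p) * y).sum()                  # soft false negatives
    fp = (p * (1 - y)).sum()                  # soft false positives
    tn = ((1 - p) * (1 - y)).sum()            # soft true negatives
    recall    = tp / (tp + fn + eps)          # positive-class accuracy
    spec      = tn / (tn + fp + eps)          # negative-class accuracy
    precision = tp / (tp + fp + eps)
    return recall + spec + precision          # to be maximized (minimize its negative with GD)
```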


2020 ◽  
Vol 34 (07) ◽  
pp. 13041-13049 ◽  
Author(s):  
Luowei Zhou ◽  
Hamid Palangi ◽  
Lei Zhang ◽  
Houdong Hu ◽  
Jason Corso ◽  
...  

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented as separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on; this is controlled by task-specific self-attention masks in the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
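The following is a minimal sketch (an assumption, not the released VLP code) of how one self-attention mask can switch a shared transformer between the two objectives:

```python
# Hedged sketch: bidirectional mask allows all attention; seq2seq mask makes text
# tokens attend to all image tokens but only to earlier text tokens.
import torch

def build_attention_mask(n_img: int, n_txt: int, seq2seq: bool) -> torch.Tensor:
    n = n_img + n_txt
    mask = torch.ones(n, n, dtype=torch.bool)              # True = attention allowed
    if seq2seq:
        # Text positions see earlier text only (causal), plus every image position.
        mask[n_img:, n_img:] = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))
        # Image positions do not peek at the text being generated.
        mask[:n_img, n_img:] = False
    return mask                                             # (n, n), applied in every layer
```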


Author(s):  
Yuqing Ma ◽  
Shihao Bai ◽  
Shan An ◽  
Wei Liu ◽  
Aishan Liu ◽  
...  

Few-shot learning, which aims to learn novel concepts from a few labeled examples, is an interesting and very challenging problem with many practical applications. To accomplish this task, one should concentrate on revealing the accurate relations between support-query pairs. We propose a transductive relation-propagation graph neural network (TRPN) to explicitly model and propagate such relations across support-query pairs. TRPN treats the relation of each support-query pair as a graph node, named a relational node, and resorts to the known relations between support samples, including both intra-class commonality and inter-class uniqueness, to guide the relation propagation in the graph, generating discriminative relation embeddings for support-query pairs. A pseudo relational node is further introduced to propagate the query characteristics, and a fast yet effective transductive learning strategy is devised to fully exploit the relation information among different queries. To the best of our knowledge, this is the first work that explicitly takes the relations of support-query pairs into consideration in few-shot learning, which might offer a new way to solve the few-shot learning problem. Extensive experiments conducted on several benchmark datasets demonstrate that our method significantly outperforms a variety of state-of-the-art few-shot learning methods.
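A rough sketch of the relational-node idea, under strong assumptions: every support-query pair becomes a node, and edges connect pairs that share a query and whose support samples belong to the same class. The adjacency construction, `rel_head`, and the single propagation round are illustrative only:

```python
# Hedged sketch of one propagation round over "relational nodes".
import torch

def propagate_relations(support_feat, support_lab, query_feat, rel_head):
    """support_feat: (S, D); support_lab: (S,); query_feat: (Q, D);
    rel_head: any module mapping a concatenated pair (2D,) to an embedding, e.g. nn.Linear(2*D, H)."""
    S, Q = support_feat.size(0), query_feat.size(0)
    pairs = torch.cat([support_feat.unsqueeze(1).expand(-1, Q, -1),
                       query_feat.unsqueeze(0).expand(S, -1, -1)], dim=-1)        # (S, Q, 2D)
    nodes = rel_head(pairs.reshape(S * Q, -1))                                    # (S*Q, H) relational nodes
    same_class = (support_lab.unsqueeze(0) == support_lab.unsqueeze(1)).float()   # (S, S) support relations
    adj = torch.kron(same_class, torch.eye(Q, device=support_feat.device))        # connect nodes sharing a query
    adj = adj / adj.sum(dim=-1, keepdim=True)                                     # row-normalize
    return nodes + adj @ nodes                                                    # propagated relation embeddings
```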


Author(s):  
Shaoxiang Chen ◽  
Yu-Gang Jiang

Sequence-to-sequence models incorporating attention mechanisms have shown promising improvements on video captioning. While there is rich information both within and between frames, spatial attention is rarely explored and motion information is usually handled by 3D-CNNs as just another modality for fusion. On the other hand, research on human perception suggests that apparent motion can attract attention. Motivated by this, we aim to learn spatial attention on video frames under the guidance of motion information for caption generation. We present a novel video captioning framework utilizing Motion Guided Spatial Attention (MGSA). The proposed MGSA exploits the motion between video frames by learning spatial attention from stacked optical-flow images with a custom CNN. To further relate the spatial attention maps across video frames, we design a Gated Attention Recurrent Unit (GARU) to adaptively incorporate previous attention maps. The whole framework can be trained end to end. We evaluate our approach on two benchmark datasets, MSVD and MSR-VTT. The experiments show that our model generates better video representations and obtains state-of-the-art results under popular evaluation metrics such as BLEU@4, CIDEr, and METEOR.
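A hedged sketch of a GARU-style update that adaptively mixes the current motion-derived attention map with the previous frame's map; the 1x1-conv gate below is an assumption about how such a unit could look, not the paper's exact design:

```python
# Hedged sketch: a learned gate decides how much of the previous attention map to carry over.
import torch
import torch.nn as nn

class GatedAttentionUpdate(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = nn.Conv2d(2, 1, kernel_size=1)      # gate computed from [current, previous]

    def forward(self, att_t: torch.Tensor, att_prev: torch.Tensor) -> torch.Tensor:
        # att_t, att_prev: (B, 1, H, W) spatial attention maps
        z = torch.sigmoid(self.gate(torch.cat([att_t, att_prev], dim=1)))
        return z * att_t + (1 - z) * att_prev           # adaptively carried attention map
```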


Computation ◽  
2019 ◽  
Vol 7 (2) ◽  
pp. 20 ◽  
Author(s):  
Yunfei Teng ◽  
Anna Choromanska

Unsupervised image-to-image translation aims at finding a mapping between the source (A) and target (B) image domains, where in many applications aligned image pairs are not available at training time. This is an ill-posed learning problem since it requires inferring the joint probability distribution from marginals. Joint learning of the coupled mappings F_AB: A → B and F_BA: B → A is commonly used by state-of-the-art methods such as CycleGAN, which learn this translation by introducing a cycle-consistency requirement into the learning problem, i.e., F_AB(F_BA(B)) ≈ B and F_BA(F_AB(A)) ≈ A. Cycle consistency enforces the preservation of mutual information between input and translated images. However, it does not explicitly enforce F_BA to be the inverse operation of F_AB. We propose a new deep architecture, called the invertible autoencoder (InvAuto), to explicitly enforce this relation. This is done by forcing the encoder to be an inverted version of the decoder, where corresponding layers perform opposite mappings and share parameters. The mappings are constrained to be orthonormal. The resulting architecture reduces the number of trainable parameters (by up to a factor of 2). We present image translation results on benchmark datasets and demonstrate state-of-the-art performance of our approach. Finally, we test the proposed domain adaptation method on the task of road video conversion. We demonstrate that the videos converted with InvAuto have high quality, and show that PilotNet, the NVIDIA neural-network-based end-to-end learning system for autonomous driving, trained on real road videos performs well when tested on the converted ones.
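The weight-tying idea can be sketched as follows: an encoder layer and its matching decoder layer share one weight matrix and apply W and W^T, so an orthonormality penalty makes decode(encode(x)) ≈ x. The penalty form below is an illustrative assumption:

```python
# Hedged sketch of a tied, approximately invertible encoder/decoder layer pair.
import torch
import torch.nn as nn

class TiedInvertiblePair(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.orthogonal_(self.weight)

    def encode(self, x):                       # forward map  y = W x
        return x @ self.weight.t()

    def decode(self, y):                       # shared-parameter inverse map  x ≈ W^T y
        return y @ self.weight

    def orthonormal_penalty(self):             # drives W^T W toward the identity
        wtw = self.weight.t() @ self.weight
        return ((wtw - torch.eye(wtw.size(0), device=wtw.device)) ** 2).sum()
```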


2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Aysen Degerli ◽  
Mete Ahishali ◽  
Mehmet Yamac ◽  
Serkan Kiranyaz ◽  
Muhammad E. H. Chowdhury ◽  
...  

Computer-aided diagnosis has become a necessity for accurate and immediate coronavirus disease 2019 (COVID-19) detection to aid treatment and prevent the spread of the virus. Numerous studies have proposed to use deep learning techniques for COVID-19 diagnosis. However, they have used very limited chest X-ray (CXR) image repositories for evaluation, with a small number (a few hundred) of COVID-19 samples. Moreover, these methods can neither localize nor grade the severity of COVID-19 infection. For this purpose, recent studies proposed to explore the activation maps of deep networks. However, they remain inaccurate at localizing the actual infection, making them unreliable for clinical use. This study proposes a novel method for the joint localization, severity grading, and detection of COVID-19 from CXR images by generating so-called infection maps. To accomplish this, we have compiled the largest dataset to date with 119,316 CXR images, including 2951 COVID-19 samples, where the ground-truth segmentation masks are annotated on CXRs by a novel collaborative human-machine approach. Furthermore, we publicly release the first CXR dataset with ground-truth segmentation masks of the COVID-19 infected regions. A detailed set of experiments shows that state-of-the-art segmentation networks can learn to localize COVID-19 infection with an F1-score of 83.20%, which is significantly superior to the activation maps created by previous methods. Finally, the proposed approach achieves a COVID-19 detection performance of 94.96% sensitivity and 99.88% specificity.


2021 ◽  
Vol 16 (1) ◽  
pp. 1-23
Author(s):  
Min-Ling Zhang ◽  
Jun-Peng Fang ◽  
Yi-Bo Wang

In multi-label classification, the task is to induce predictive models that can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced from its tailored features rather than the original features. Existing approaches generate a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, the predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for the unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific feature techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.
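The ensembling step could be sketched as below, under assumptions: one scorer per label pair, trained on that pair's tailored features, and each label's relevancy taken as the average vote over all pairs containing it. The tailoring itself (prototype selection and embedding) is abstracted behind the `pair_scorers` callables, which are hypothetical:

```python
# Hedged sketch: average pairwise predictions to obtain per-label relevancy scores.
from itertools import combinations
import numpy as np

def ensemble_bilabel_scores(pair_scorers, x, n_labels):
    """pair_scorers[(i, j)](x) -> (score_i, score_j): relevancy scores from the
    classifier induced on the tailored features of label pair (i, j)."""
    scores, votes = np.zeros(n_labels), np.zeros(n_labels)
    for i, j in combinations(range(n_labels), 2):
        s_i, s_j = pair_scorers[(i, j)](x)
        scores[i] += s_i; votes[i] += 1
        scores[j] += s_j; votes[j] += 1
    return scores / votes                      # averaged relevancy per class label
```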


2021 ◽  
Vol 11 (4) ◽  
pp. 1728
Author(s):  
Hua Zhong ◽  
Li Xu

The prediction interval (PI) is an important research topic in reliability analysis and decision support systems. Data size and computation cost are two of the issues that may hamper the construction of PIs. This paper proposes an all-batch (AB) loss function for constructing high-quality PIs. Taking full advantage of the likelihood principle, the proposed loss makes it possible to train PI generation models using the gradient descent (GD) method for both small and large batches of samples. With a structure of dual feedforward neural networks (FNNs), a high-quality PI generation framework is introduced which can be adapted to a variety of problems, including regression analysis. Numerical experiments were conducted on benchmark datasets; the results show that higher-quality PIs were achieved using the proposed scheme. Its reliability and stability were also verified in comparison with various state-of-the-art PI construction methods.
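The abstract does not give the AB loss in closed form, so the following is only a generic stand-in: a dual-FNN setup producing lower and upper bounds, trained with a common coverage-plus-width criterion rather than the paper's loss. All names and constants are illustrative:

```python
# Hedged sketch: two feedforward networks output the PI bounds; a soft
# coverage-vs-width objective is minimized with gradient descent.
import torch
import torch.nn as nn

class DualPINet(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.lower = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.upper = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.lower(x).squeeze(-1), self.upper(x).squeeze(-1)

def pi_loss(lo, hi, y, alpha: float = 0.1, lam: float = 1.0):
    covered = torch.sigmoid(50 * (y - lo)) * torch.sigmoid(50 * (hi - y))   # soft coverage indicator
    coverage_gap = torch.relu((1 - alpha) - covered.mean())                 # penalize under-coverage
    width = (hi - lo).abs().mean()                                          # prefer narrow intervals
    return width + lam * coverage_gap
```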


Author(s):  
Mingliang Xu ◽  
Qingfeng Li ◽  
Jianwei Niu ◽  
Hao Su ◽  
Xiting Liu ◽  
...  

Quick response (QR) codes are usually scanned in diverse environments, so they must be robust to variations in illumination, scale, coverage, and camera angle. Aesthetic QR codes improve the visual quality, but subtle changes in their appearance may cause scanning failure. In this article, a new method to generate scanning-robust aesthetic QR codes is proposed, based on a module-based scanning probability estimation model that can effectively balance the trade-off between visual quality and scanning robustness. Our method locally adjusts the luminance of each module by estimating the probability of successful sampling. The approach adopts a hierarchical, coarse-to-fine strategy to enhance the visual quality of aesthetic QR codes, sequentially generating three codes: a binary aesthetic QR code, a grayscale aesthetic QR code, and the final color aesthetic QR code. Our approach can also be used to create QR codes with different visual styles by adjusting some initialization parameters. User surveys and decoding experiments were used to evaluate our method against state-of-the-art algorithms, indicating that the proposed approach offers excellent performance in terms of both visual quality and scanning robustness.
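A hedged sketch of module-wise luminance adjustment: when the estimated probability of sampling a module to its intended binary value drops below a threshold, the module is pulled toward black or white. The center-pixel probability model and the step size are assumptions, not the paper's estimator:

```python
# Hedged sketch: per-module luminance correction driven by an estimated sampling probability.
import numpy as np

def adjust_modules(img, qr_bits, module_px, p_min=0.9, step=0.15):
    """img: grayscale array in [0, 1]; qr_bits: (M, M) target bits (1 = dark module)."""
    out = img.copy()
    M = qr_bits.shape[0]
    for r in range(M):
        for c in range(M):
            y0, x0 = r * module_px, c * module_px
            block = out[y0:y0 + module_px, x0:x0 + module_px]
            center = block[module_px // 2, module_px // 2]
            # Crude estimate of the probability the decoder samples the intended value
            # (decoder assumed to threshold luminance at 0.5).
            p_ok = 1 - center if qr_bits[r, c] else center
            if p_ok < p_min:
                delta = -step if qr_bits[r, c] else step   # darken or lighten the whole module
                block += delta
                np.clip(block, 0.0, 1.0, out=block)
    return out
```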

