IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-video Generation

Author(s):  
Kangle Deng ◽  
Tianyi Fei ◽  
Xin Huang ◽  
Yuxin Peng

Automatically generating videos from given text is a highly challenging task, in which visual quality and semantic consistency with the captions are two critical issues. Existing methods, when generating a specific frame, do not fully exploit the information in previously generated frames, and an effective way to measure the semantic accordance between videos and captions remains to be established. To address these issues, we present a novel Introspective Recurrent Convolutional GAN (IRC-GAN) approach. First, we propose a recurrent transconvolutional generator in which LSTM cells are integrated with 2D transconvolutional layers. Since 2D transconvolutional layers put more emphasis on the details of each frame than 3D ones, our generator takes both the sharpness of each video frame and temporal coherence across the whole video into consideration, and thus can generate videos with better visual quality. Second, we propose mutual information introspection to semantically align the generated videos to the text. Unlike other methods that simply judge whether the video and the text match, we use mutual information to concretely measure the semantic consistency. In this way, our model is able to introspect the semantic distance between the generated video and the corresponding text, and minimize it to boost semantic consistency. We conduct experiments on 3 datasets and compare with state-of-the-art methods. Experimental results demonstrate the effectiveness of our IRC-GAN in generating plausible videos from given text.
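
A minimal sketch of the recurrent transconvolutional idea described above (layer sizes and resolutions are illustrative assumptions, not the authors' exact architecture; the mutual information introspection is omitted): an LSTM cell unrolls over time, and a stack of 2D transposed convolutions decodes each hidden state into one frame, so frame t conditions on all frames before it.

import torch
import torch.nn as nn

class RecurrentTransconvGenerator(nn.Module):
    def __init__(self, text_dim=256, noise_dim=100, hidden_dim=512, n_frames=16):
        super().__init__()
        self.n_frames, self.hidden_dim = n_frames, hidden_dim
        self.lstm = nn.LSTMCell(text_dim + noise_dim, hidden_dim)
        # 2D transconv decoder: one hidden state -> one 64x64 RGB frame
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),  # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),          # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),            # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),             # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                                      # 64x64
        )

    def forward(self, text_emb, noise):
        b = text_emb.size(0)
        h = text_emb.new_zeros(b, self.hidden_dim)
        c = text_emb.new_zeros(b, self.hidden_dim)
        x = torch.cat([text_emb, noise], dim=1)
        frames = []
        for _ in range(self.n_frames):
            h, c = self.lstm(x, (h, c))                     # carries information from earlier frames
            frames.append(self.decode(h[:, :, None, None])) # decode hidden state as a 1x1 feature map
        return torch.stack(frames, dim=2)                   # (B, 3, T, 64, 64)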

2012 ◽  
Vol 4 (3) ◽  
pp. 20-32 ◽  
Author(s):  
Yongjian Hu ◽  
Chang-Tsun Li ◽  
Yufei Wang ◽  
Bei-bei Liu

Frame duplication is a common form of digital video forgery. State-of-the-art duplication detection approaches usually suffer from heavy computational load. In this paper, the authors propose a new algorithm to detect duplicated frames based on video sub-sequence fingerprints. The fingerprints are extracted from the DCT coefficients of the temporally informative representative images (TIRIs) of the sub-sequences. Compared with similar algorithms, this study focuses on improving the fingerprints that represent video sub-sequences and on introducing a simple metric for matching them. Experimental results show that the proposed algorithm overall outperforms three related duplication forgery detection algorithms in terms of computational efficiency, detection accuracy, and robustness against common video operations such as compression and brightness change.
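
A rough sketch of the general TIRI-DCT fingerprinting recipe the abstract builds on (weights, sizes, and thresholds are illustrative, not the authors' exact settings): each sub-sequence is collapsed into a TIRI by weighted temporal averaging, low-frequency DCT coefficients are binarized into a compact fingerprint, and duplicated sub-sequences are flagged by small Hamming distances between fingerprints.

import numpy as np
from scipy.fft import dctn

def tiri(frames, gamma=0.65):
    # frames: (T, H, W) grayscale; exponential weights emphasize recent frames
    w = gamma ** np.arange(len(frames))[::-1]
    return np.tensordot(w / w.sum(), frames, axes=1)   # weighted average -> (H, W)

def fingerprint(frames, size=64, keep=8):
    img = tiri(frames)
    h, w = img.shape                                   # crude block-average resize
    img = img.reshape(size, h // size, size, w // size).mean(axis=(1, 3))  # assumes divisibility
    coeffs = dctn(img, norm="ortho")[:keep, :keep].ravel()[1:]  # low frequencies, DC dropped
    return (coeffs > np.median(coeffs)).astype(np.uint8)        # binarize to a fingerprint

def hamming(a, b):
    # small distance -> candidate duplicated sub-sequence
    return int(np.count_nonzero(a != b))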


Author(s):  
Sunghwan Joo ◽  
Sungmin Cha ◽  
Taesup Moon

We propose DoPAMINE, a new neural-network-based multiplicative noise despeckling algorithm. Our algorithm is inspired by Neural AIDE (N-AIDE), a recently proposed neural adaptive image denoiser. While the original N-AIDE was designed for the additive noise case, we show that the same framework, i.e., adaptively learning a network for pixel-wise affine denoisers by minimizing an unbiased estimate of the MSE, can be applied to the multiplicative noise case as well. Moreover, we derive a double-sided masked CNN architecture that can control the variance of the activation values in each layer and converge quickly to high denoising performance during supervised training. In our experiments, we show that DoPAMINE is highly adaptive, fine-tuning its network parameters to the given noisy image, and achieves significantly better despeckling results than SAR-DRN, a state-of-the-art CNN-based algorithm.
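
A toy illustration of the N-AIDE-style framework the abstract starts from, for the additive-noise case only (DoPAMINE's multiplicative-noise estimator is different and derived in the paper): the network predicts per-pixel affine coefficients (a, b) from the masked context, the estimate is a*z + b, and training minimizes an unbiased proxy of the true MSE computable from the noisy image alone.

import torch

def unbiased_mse_additive(a, b, z, sigma2):
    # For z = x + n, n ~ N(0, sigma2), and (a, b) independent of z (masked context),
    # E[(a*z + b - x)^2] = E[(a*z + b - z)^2] + 2*sigma2*a - sigma2,
    # so the right-hand side is a clean-target-free training loss.
    x_hat = a * z + b
    return ((x_hat - z) ** 2 + 2.0 * sigma2 * a - sigma2).mean()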


Symmetry ◽  
2019 ◽  
Vol 11 (5) ◽  
pp. 619 ◽  
Author(s):  
Ha-Eun Ahn ◽  
Jinwoo Jeong ◽  
Je Woo Kim

Visual quality and algorithmic efficiency are the two main concerns in video frame interpolation. We propose a hybrid task-based convolutional neural network for fast and accurate frame interpolation of 4K videos. The proposed method synthesizes low-resolution frames and then reconstructs high-resolution frames in a coarse-to-fine fashion. We also propose an edge loss that preserves high-frequency information and makes the synthesized frames look sharper. Experimental results show that the proposed method achieves state-of-the-art performance and runs 2.69x faster than existing methods applicable to 4K video, while maintaining comparable visual and quantitative quality.
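
One plausible form of the edge loss described above (a sketch; the paper's exact edge operator and weighting are assumptions here): penalize the L1 difference between Sobel gradient-magnitude maps of the synthesized and ground-truth frames, so blurring of high-frequency detail is directly punished.

import torch
import torch.nn.functional as F

def edge_loss(pred, target):
    # pred, target: (B, C, H, W)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=pred.device, dtype=pred.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = pred.size(1)
    kx, ky = kx.repeat(c, 1, 1, 1), ky.repeat(c, 1, 1, 1)

    def grad_mag(x):
        gx = F.conv2d(x, kx, padding=1, groups=c)   # per-channel Sobel, horizontal
        gy = F.conv2d(x, ky, padding=1, groups=c)   # per-channel Sobel, vertical
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

    return F.l1_loss(grad_mag(pred), grad_mag(target))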


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 416
Author(s):  
Changmeng Peng ◽  
Luting Cai ◽  
Xiaoyang Huang ◽  
Zhizhong Fu ◽  
Jin Xu ◽  
...  

Transmitting and storing the massive visual data generated in the Visual Internet of Things (VIoT) is challenging, so compressing this data is of great significance to VIoT. Compressing the bit-depth of images is a very cost-effective way to reduce the large volume of visual data. However, bit-depth compression introduces false contours and color distortion in the reconstructed image, so suppressing them is a critical issue for bit-depth enhancement in VIoT. To solve these problems, a Bit-depth Enhancement method with an AUTO-encoder-like structure (BE-AUTO) is proposed in this paper. Based on its convolution-combined-with-deconvolution codec and global skip connection, BE-AUTO can effectively suppress false contours and color distortion, achieving state-of-the-art objective metrics and visual quality in the reconstructed images and making it more suitable for bit-depth enhancement in VIoT.
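
A minimal sketch of the overall shape the abstract describes, a convolutional encoder with a deconvolutional decoder and a global skip (channel counts and depth are assumptions, not BE-AUTO's exact configuration): the naively expanded low-bit-depth input passes around the codec via the skip, and the network only has to predict the residual correction.

import torch.nn as nn

class BitDepthAutoencoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.encode = nn.Sequential(                       # convolution half of the codec
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(True),
        )
        self.decode = nn.Sequential(                       # deconvolution half of the codec
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )

    def forward(self, x_low):
        # x_low: low-bit-depth image already expanded/scaled to [0, 1];
        # H and W assumed divisible by 4; global skip adds the predicted residual
        return x_low + self.decode(self.encode(x_low))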


Author(s):  
Fei Shang ◽  
Huaxiang Zhang ◽  
Jiande Sun ◽  
Li Liu ◽  
...  

Unlike traditional methods that directly map different modalities into an isomorphic subspace for cross-media retrieval, this paper proposes a cross-media retrieval algorithm based on the consistency of collaborative representation (CR-CMR). To measure the similarity between data from different modalities, CR-CMR first takes advantage of dictionary learning techniques to obtain homogeneous collaborative representations for texts and images; it then considers the semantic consistency of the different modalities simultaneously and maps the collaborative representation coefficients into an isomorphic semantic subspace to conduct cross-media retrieval. Experimental results on three benchmark datasets show that the algorithm is effective.
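
Collaborative representation is commonly implemented as ridge regression over a dictionary, which gives a closed-form code; the sketch below shows that building block only (not CR-CMR's full pipeline, and the dictionaries here are stand-ins for the learned ones): each sample is coded jointly over all atoms, and the coefficients serve as the homogeneous representation for retrieval.

import numpy as np

def collaborative_code(D, x, lam=0.1):
    # D: (d, k) dictionary, x: (d,) sample; ridge solution
    # alpha = (D^T D + lam*I)^(-1) D^T x
    k = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)

def cosine(a, b):
    # cross-media similarity between, e.g., an image code and a text code
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))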


2020 ◽  
Vol 34 (05) ◽  
pp. 8368-8375
Author(s):  
Zibo Lin ◽  
Ziran Li ◽  
Ning Ding ◽  
Hai-Tao Zheng ◽  
Ying Shen ◽  
...  

Paraphrase generation aims to rewrite a text with different words while keeping the same meaning. Previous work performs the task based solely on the given dataset, ignoring the availability of external linguistic knowledge. Intuitively, however, a model can generate more expressive and diverse paraphrases with the help of such knowledge. To fill this gap, we propose the Knowledge-Enhanced Paraphrase Network (KEPN), a Transformer-based framework that leverages external linguistic knowledge to facilitate paraphrase generation. (1) The model integrates synonym information from external linguistic knowledge into the paraphrase generator, which guides the decision on whether to generate a new word or replace it with a synonym. (2) To locate synonym pairs more accurately, we adopt an incremental encoding scheme that incorporates the position information of each synonym. In addition, a multi-task architecture helps the framework jointly learn the selection of synonym pairs and the generation of expressive paraphrases. Experimental results on both English and Chinese datasets show that our method significantly outperforms state-of-the-art approaches in both automatic and human evaluation.
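
A schematic of the generate-vs-replace decision the abstract mentions (a sketch with assumed shapes, not KEPN's actual architecture): a gate computed from the decoder state mixes the ordinary vocabulary distribution with a distribution over synonym candidates retrieved from the external linguistic knowledge.

import torch
import torch.nn as nn

class SynonymGate(nn.Module):
    def __init__(self, hidden, vocab):
        super().__init__()
        self.gate = nn.Linear(hidden, 1)        # scores P(replace with synonym)
        self.vocab_proj = nn.Linear(hidden, vocab)

    def forward(self, dec_state, synonym_dist):
        # dec_state: (B, hidden); synonym_dist: (B, vocab), nonzero only on
        # synonym candidates of the current source word
        g = torch.sigmoid(self.gate(dec_state))             # replace probability
        gen = torch.softmax(self.vocab_proj(dec_state), -1) # generate distribution
        return g * synonym_dist + (1 - g) * gen             # mixed next-word distribution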


2020 ◽  
Vol 34 (05) ◽  
pp. 9122-9129
Author(s):  
Hai Wan ◽  
Yufei Yang ◽  
Jianfeng Du ◽  
Yanan Liu ◽  
Kunxun Qi ◽  
...  

Aspect-based sentiment analysis (ABSA) aims to detect the targets (which are composed of contiguous words), aspects, and sentiment polarities in text. Published datasets from SemEval-2015 and SemEval-2016 reveal that a sentiment polarity depends on both the target and the aspect. However, most existing methods predict sentiment polarities from either targets or aspects but not from both, and thus easily make wrong predictions. In particular, when the target is implicit, i.e., it does not appear in the given text, methods that predict sentiment polarities from targets do not work. To tackle these limitations of ABSA, this paper proposes a novel method for joint target-aspect-sentiment detection. It relies on a pre-trained language model and can capture the dependence on both targets and aspects for sentiment prediction. Experimental results on the SemEval-2015 and SemEval-2016 restaurant datasets show that the proposed method detects target-aspect-sentiment triples with high performance even in the implicit-target case; moreover, it outperforms the state-of-the-art methods even on those subtasks of target-aspect-sentiment detection that they are able to handle.
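
One common way to put a pre-trained language model to work on target-aspect-sentiment prediction (a sketch of the general recipe only, not this paper's exact model; the auxiliary-sentence format, the NULL convention for implicit targets, and the label order are all assumptions): feed the review and an auxiliary sentence naming the target/aspect pair jointly, and classify the pair with the sequence-classification head.

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

text = "The fish was great but the place is cramped."
aux = "target: place ; aspect: AMBIENCE#GENERAL"   # e.g. "target: NULL ; ..." for an implicit target
inputs = tokenizer(text, aux, return_tensors="pt") # sentence-pair input: [CLS] text [SEP] aux [SEP]
logits = model(**inputs).logits                    # (1, 3); assumed order: negative / neutral / positive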


2013 ◽  
Vol 21 (2) ◽  
pp. 201-226 ◽  
Author(s):  
DEYI XIONG ◽  
MIN ZHANG

The language model is one of the most important knowledge sources for statistical machine translation. In this article, we present two extensions to standard n-gram language models in statistical machine translation: a backward language model that augments the conventional forward language model, and a mutual information trigger model that captures long-distance dependencies beyond the scope of standard n-gram language models. We introduce algorithms to integrate the two proposed models into two kinds of state-of-the-art phrase-based decoders. Our experimental results on Chinese/Spanish/Vietnamese-to-English show that both models significantly improve translation quality in terms of BLEU and METEOR over a competitive baseline.
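
A toy computation of the pointwise mutual information that trigger models of this kind typically score pairs with (counts and normalizations here are illustrative, not the paper's exact estimation): a trigger pair (s, t) scores highly when the two words co-occur across a sentence more often than chance, rewarding long-distance dependencies an n-gram model cannot see.

import math

def pmi(count_st, count_s, count_t, n_pairs, n_words):
    # PMI(s, t) = log( p(s, t) / (p(s) * p(t)) ), estimated from corpus counts
    p_st = count_st / n_pairs
    p_s, p_t = count_s / n_words, count_t / n_words
    return math.log(p_st / (p_s * p_t))

# e.g. "doctor" triggering "hospital" several positions downstream (toy counts):
print(pmi(count_st=120, count_s=900, count_t=1500, n_pairs=1_000_000, n_words=1_000_000))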


Author(s):  
Yu-Lun Liu ◽  
Yi-Tung Liao ◽  
Yen-Yu Lin ◽  
Yung-Yu Chuang

Video frame interpolation algorithms predict intermediate frames to produce videos with higher frame rates and smooth view transitions, given two consecutive frames as input. We propose that synthesized frames are more reliable if they can be used to reconstruct the input frames with high quality. Based on this idea, we introduce a new loss term, the cycle consistency loss, which can better utilize the training data, not only enhancing the interpolation results but also maintaining performance with less training data. It can be integrated into any frame interpolation network and trained in an end-to-end manner. In addition to the cycle consistency loss, we propose two extensions: a motion linearity loss and edge-guided training. The motion linearity loss approximates the motion between the two input frames as linear and regularizes the training. Edge-guided training further improves the results by integrating edge information into training. Both qualitative and quantitative experiments demonstrate that our model outperforms state-of-the-art methods. The source code of the proposed method and more experimental results will be available at https://github.com/alex04072000/CyclicGen.
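
A sketch of one way to realize the cycle consistency idea on triplets of consecutive frames (interp stands in for any frame interpolation network; the exact cycle used in the paper may differ): interpolate midpoints from the two consecutive pairs, re-interpolate between those synthesized frames, and require the result to land back on the real middle frame.

import torch.nn.functional as F

def cycle_consistency_loss(interp, f0, f1, f2):
    # f0, f1, f2: three consecutive real frames, each (B, C, H, W)
    mid01 = interp(f0, f1)        # synthesized frame at t = 0.5
    mid12 = interp(f1, f2)        # synthesized frame at t = 1.5
    recon = interp(mid01, mid12)  # interpolating the synthesized frames should recover f1
    return F.l1_loss(recon, f1)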


Author(s):  
Feng Zheng ◽  
Xin Miao ◽  
Heng Huang

Identifying vehicles across cameras in traffic surveillance is fundamentally important for public safety. However, despite some preliminary work, rapid vehicle search in large-scale datasets has not been investigated. Moreover, modelling a view-invariant similarity between vehicle images from different views remains highly challenging. To address these problems, we propose a Ranked Semantic Sampling (RSS) guided binary embedding method for fast cross-view vehicle Re-IDentification (Re-ID). Search can be conducted by efficiently computing similarities in the projected space. Unlike previous methods that use random sampling, we design tree-structured attributes to guide mini-batch sampling, and the ranked pairs of hard samples in the mini-batch improve the convergence of optimization. By minimizing a novel ranked semantic distance loss defined according to this structure, the learned Hamming distance is view-invariant, which enables cross-view Re-ID. Experimental results demonstrate that RSS outperforms state-of-the-art approaches and that an embedding learned on one dataset can be transferred to perform vehicle Re-ID on another dataset.
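
A sketch of the retrieval step once a binary embedding has been learned (the random projection below is only a stand-in for the trained network; feature and code sizes are assumptions): images are hashed to sign bits, and cross-view search reduces to Hamming distances, which is what makes large-scale search fast.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 128))             # stand-in for the learned embedding

def embed(features):
    # features: (N, 2048) image descriptors -> (N, 128) binary codes
    return (features @ W > 0).astype(np.uint8)

def search(query_code, gallery_codes, k=5):
    # Hamming distance to every gallery code, return indices of the top-k matches
    dists = np.count_nonzero(gallery_codes != query_code, axis=1)
    return np.argsort(dists)[:k]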

