FASTER Recurrent Networks for Efficient Video Classification

Typical video classification methods often divide a video into short clips, do inference on each clip independently, then aggregate the clip-level predictions to generate the video-level results. However, processing visually similar clips independently ignores the temporal structure of the video sequence, and increases the computational cost at inference time. In this paper, we propose a novel framework named FASTER, i.e., Feature Aggregation for Spatio-TEmporal Redundancy. FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities. The FASTER framework can integrate high quality representations from expensive models to capture subtle motion information and lightweight representations from cheap models to cover scene changes in the video. A new recurrent network (i.e., FAST-GRU) is designed to aggregate the mixture of different representations. Compared with existing approaches, FASTER can reduce the FLOPs by over 10× while maintaining the state-of-the-art accuracy across popular datasets, such as Kinetics, UCF-101 and HMDB-51.

Download Full-text

Watching a Small Portion could be as Good as Watching All: Towards Efficient Video Classification

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/98 ◽

2018 ◽

Cited By ~ 10

Author(s):

Hehe Fan ◽

Zhongwen Xu ◽

Linchao Zhu ◽

Chenggang Yan ◽

Jianjun Ge ◽

...

Keyword(s):

Large Scale ◽

Sampling Rate ◽

Computational Cost ◽

Confidence Score ◽

Video Content ◽

Video Classification ◽

Video Frames ◽

Similar Accuracy ◽

Efficient Video

We aim to significantly reduce the computational cost for classification of temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames with a predefined frequency over entire video. Differently, we propose an end-to-end deep reinforcement approach which enables an agent to classify videos by watching a very small portion of frames like what we do. We make two main contributions. First, information is not equally distributed in video frames along time. An agent needs to watch more carefully when a clip is informative and skip the frames if they are redundant or irrelevant. The proposed approach enables the agent to adapt sampling rate to video content and skip most of the frames without the loss of information. Second, in order to have a confident decision, the number of frames that should be watched by an agent varies greatly from one video to another. We incorporate an adaptive stop network to measure confidence score and generate timely trigger to stop the agent watching videos, which improves efficiency without loss of accuracy. Our approach reduces the computational cost significantly for the large-scale YouTube-8M dataset, while the accuracy remains the same.

Download Full-text

JAZZ MELODY GENERATION USING RECURRENT NETWORKS AND REINFORCEMENT LEARNING

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213006002849 ◽

2006 ◽

Vol 15 (04) ◽

pp. 623-650

Author(s):

JUDY A. FRANKLIN

Keyword(s):

Reinforcement Learning ◽

Dynamic Systems ◽

Recurrent Neural Networks ◽

Short Term Memory ◽

State Of The Art ◽

Recurrent Network ◽

Recurrent Networks ◽

Short Term ◽

Long Short Term Memory ◽

Lstm Network

Recurrent (neural) networks have been deployed as models for learning musical processes, by computational scientists who study processes such as dynamic systems. Over time, more intricate music has been learned as the state of the art in recurrent networks improves. One particular recurrent network, the Long Short-Term Memory (LSTM) network shows promise for learning long songs, and generating new songs. We are experimenting with a module containing two inter-recurrent LSTM networks to cooperatively learn several human melodies, based on the songs' harmonic structures, and on the feedback inherent in the network. We show that these networks can learn to reproduce four human melodies. We then present as input new harmonizations, so as to generate new songs. We describe the reharmonizations, and show the new melodies that result. We also present a hierarchical structure for using reinforcement learning to choose LSTM modules during the course of melody generation.

Download Full-text

Periodic-CRN: A Convolutional Recurrent Model for Crowd Density Prediction with Recurring Periodic Patterns

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/519 ◽

2018 ◽

Cited By ~ 17

Author(s):

Ali Zonoozi ◽

Jung-jae Kim ◽

Xiao-Li Li ◽

Gao Cong

Keyword(s):

Traffic Management ◽

State Of The Art ◽

Recurrent Network ◽

Temporal Data ◽

Periodic Patterns ◽

Temporal Correlations ◽

Density Prediction ◽

Crowd Density ◽

Spatial Domains ◽

Spatio Temporal

Time-series forecasting in geo-spatial domains has important applications, including urban planning, traffic management and behavioral analysis. We observed recurring periodic patterns in some spatio-temporal data, which were not considered explicitly by previous non-linear works. To address this lack, we propose novel `Periodic-CRN' (PCRN) method, which adapts convolutional recurrent network (CRN) to accurately capture spatial and temporal correlations, learns and incorporates explicit periodic representations, and can be optimized with multi-step ahead prediction. We show that PCRN consistently outperforms the state-of-the-art methods for crowd density prediction across two taxi datasets from Beijing and Singapore.

Download Full-text

FUTUREGAN: ANTICIPATING THE FUTURE FRAMES OF VIDEO SEQUENCES USING SPATIO-TEMPORAL 3D CONVOLUTIONS IN PROGRESSIVELY GROWING GANS

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-2-w16-3-2019 ◽

2019 ◽

Vol XLII-2/W16 ◽

pp. 3-11 ◽

Cited By ~ 6

Author(s):

S. Aigner ◽

M. Körner

Keyword(s):

Video Sequence ◽

State Of The Art ◽

Input Sequence ◽

Video Sequences ◽

Complex Task ◽

High Quality ◽

Spatio Temporal ◽

Resolution Single ◽

Additional Constraints ◽

Video Prediction

Abstract. We introduce a new encoder-decoder GAN model, FutureGAN, that predicts future frames of a video sequence conditioned on a sequence of past frames. During training, the networks solely receive the raw pixel values as an input, without relying on additional constraints or dataset specific conditions. To capture both the spatial and temporal components of a video sequence, spatio-temporal 3d convolutions are used in all encoder and decoder modules. Further, we utilize concepts of the existing progressively growing GAN (PGGAN) that achieves high-quality results on generating high-resolution single images. The FutureGAN model extends this concept to the complex task of video prediction. We conducted experiments on three different datasets, MovingMNIST, KTH Action, and Cityscapes. Our results show that the model learned representations to transform the information of an input sequence into a plausible future sequence effectively for all three datasets. The main advantage of the FutureGAN framework is that it is applicable to various different datasets without additional changes, whilst achieving stable results that are competitive to the state-of-the-art in video prediction. The code to reproduce the results of this paper is publicly available at https://github.com/TUM-LMF/FutureGAN.

Download Full-text

Contiguous Loss for Motion-Based, Non-Aligned Image Deblurring

Symmetry ◽

10.3390/sym13040630 ◽

2021 ◽

Vol 13 (4) ◽

pp. 630

Author(s):

Wenjia Niu ◽

Kewen Xia ◽

Yongke Pan

Keyword(s):

Video Sequence ◽

State Of The Art ◽

Video Sequences ◽

Dynamic Scenes ◽

Multiple Objects ◽

Proposed Model ◽

Ill Posed ◽

Spatio Temporal ◽

Better Than ◽

General Dynamic

In general dynamic scenes, blurring is the result of the motion of multiple objects, camera shaking or scene depth variations. As an inverse process, deblurring extracts a sharp video sequence from the information contained in one single blurry image—it is itself an ill-posed computer vision problem. To reconstruct these sharp frames, traditional methods aim to build several convolutional neural networks (CNN) to generate different frames, resulting in expensive computation. To vanquish this problem, an innovative framework which can generate several sharp frames based on one CNN model is proposed. The motion-based image is put into our framework and the spatio-temporal information is encoded via several convolutional and pooling layers, and the output of our model is several sharp frames. Moreover, a blurry image does not have one-to-one correspondence with any sharp video sequence, since different video sequences can create similar blurry images, so neither the traditional pixel2pixel nor perceptual loss is suitable for focusing on non-aligned data. To alleviate this problem and model the blurring process, a novel contiguous blurry loss function is proposed which focuses on measuring the loss of non-aligned data. Experimental results show that the proposed model combined with the contiguous blurry loss can generate sharp video sequences efficiently and perform better than state-of-the-art methods.

Download Full-text

Localizing Natural Language in Videos

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018175 ◽

2019 ◽

Vol 33 ◽

pp. 8175-8182 ◽

Cited By ~ 13

Author(s):

Jingyuan Chen ◽

Lin Ma ◽

Xinpeng Chen ◽

Zequn Jie ◽

Jiebo Luo

Keyword(s):

Natural Language ◽

Video Sequence ◽

State Of The Art ◽

Recurrent Networks ◽

The Public ◽

Fine Grained ◽

Proposed Model ◽

Boundary Model ◽

The Given ◽

Language Description

In this paper, we consider the task of natural language video localization (NLVL): given an untrimmed video and a natural language description, the goal is to localize a segment in the video which semantically corresponds to the given natural language description. We propose a localizing network (LNet), working in an end-to-end fashion, to tackle the NLVL task. We first match the natural sentence and video sequence by cross-gated attended recurrent networks to exploit their fine-grained interactions and generate a sentence-aware video representation. A self interactor is proposed to perform crossframe matching, which dynamically encodes and aggregates the matching evidences. Finally, a boundary model is proposed to locate the positions of video segments corresponding to the natural sentence description by predicting the starting and ending points of the segment. Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently against the state-of-the-art approaches.

Download Full-text

Documentary data and the study of past droughts: a global state of the art

Climate of the Past ◽

10.5194/cp-14-1915-2018 ◽

2018 ◽

Vol 14 (12) ◽

pp. 1915-1960 ◽

Cited By ~ 34

Author(s):

Rudolf Brázdil ◽

Andrea Kiss ◽

Jürg Luterbacher ◽

David J. Nash ◽

Ladislava Řezníčková

Keyword(s):

Large Scale ◽

State Of The Art ◽

Drought Indices ◽

Documentary Evidence ◽

Climatic Trends ◽

Instrumental Observations ◽

Spatio Temporal ◽

Epigraphic Evidence ◽

Administrative Evidence

Abstract. The use of documentary evidence to investigate past climatic trends and events has become a recognised approach in recent decades. This contribution presents the state of the art in its application to droughts. The range of documentary evidence is very wide, including general annals, chronicles, memoirs and diaries kept by missionaries, travellers and those specifically interested in the weather; records kept by administrators tasked with keeping accounts and other financial and economic records; legal-administrative evidence; religious sources; letters; songs; newspapers and journals; pictographic evidence; chronograms; epigraphic evidence; early instrumental observations; society commentaries; and compilations and books. These are available from many parts of the world. This variety of documentary information is evaluated with respect to the reconstruction of hydroclimatic conditions (precipitation, drought frequency and drought indices). Documentary-based drought reconstructions are then addressed in terms of long-term spatio-temporal fluctuations, major drought events, relationships with external forcing and large-scale climate drivers, socio-economic impacts and human responses. Documentary-based drought series are also considered from the viewpoint of spatio-temporal variability for certain continents, and their employment together with hydroclimate reconstructions from other proxies (in particular tree rings) is discussed. Finally, conclusions are drawn, and challenges for the future use of documentary evidence in the study of droughts are presented.

Download Full-text

A Fast Estimator for Binary Choice Models with Spatial, Temporal, and Spatio-Temporal Interdependence

Political Analysis ◽

10.1017/pan.2020.54 ◽

2021 ◽

pp. 1-7

Author(s):

Julian Wucherpfennig ◽

Aya Kachi ◽

Nils-Christian Bormann ◽

Philipp Hunziker

Keyword(s):

Computational Cost ◽

Choice Models ◽

Reduced Form ◽

Binary Choice ◽

Monte Carlo Experiments ◽

Binary Choice Models ◽

Spatio Temporal ◽

Computationally Intensive ◽

Pseudo Maximum Likelihood ◽

Parameter Values

Abstract Binary outcome models are frequently used in the social sciences and economics. However, such models are difficult to estimate with interdependent data structures, including spatial, temporal, and spatio-temporal autocorrelation because jointly determined error terms in the reduced-form specification are generally analytically intractable. To deal with this problem, simulation-based approaches have been proposed. However, these approaches (i) are computationally intensive and impractical for sizable datasets commonly used in contemporary research, and (ii) rarely address temporal interdependence. As a way forward, we demonstrate how to reduce the computational burden significantly by (i) introducing analytically-tractable pseudo maximum likelihood estimators for latent binary choice models that exhibit interdependence across space and time and by (ii) proposing an implementation strategy that increases computational efficiency considerably. Monte Carlo experiments show that our estimators recover the parameter values as good as commonly used estimation alternatives and require only a fraction of the computational cost.

Download Full-text

App2Vec: Context-Aware Application Usage Prediction

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3451396 ◽

2021 ◽

Vol 15 (6) ◽

pp. 1-21

Author(s):

Huandong Wang ◽

Yong Li ◽

Mu Du ◽

Zhenhui Li ◽

Depeng Jin

Keyword(s):

Dirichlet Process ◽

Service Providers ◽

State Of The Art ◽

Representation Learning ◽

Context Aware ◽

Challenging Problem ◽

Performance Gap ◽

Bayesian Mixture Model ◽

Bayesian Mixture ◽

Spatio Temporal

Both app developers and service providers have strong motivations to understand when and where certain apps are used by users. However, it has been a challenging problem due to the highly skewed and noisy app usage data. Moreover, apps are regarded as independent items in existing studies, which fail to capture the hidden semantics in app usage traces. In this article, we propose App2Vec, a powerful representation learning model to learn the semantic embedding of apps with the consideration of spatio-temporal context. Based on the obtained semantic embeddings, we develop a probabilistic model based on the Bayesian mixture model and Dirichlet process to capture when , where , and what semantics of apps are used to predict the future usage. We evaluate our model using two different app usage datasets, which involve over 1.7 million users and 2,000+ apps. Evaluation results show that our proposed App2Vec algorithm outperforms the state-of-the-art algorithms in app usage prediction with a performance gap of over 17.0%.

Download Full-text