Att-BiL-SL: Attention-Based Bi-LSTM and Sequential LSTM for Describing Video in the Textual Formation

2021, Vol 12 (1), pp. 317
Author(s): Shakil Ahmed, A F M Saifuddin Saif, Md Imtiaz Hanif, Md Mostofa Nurannabi Shakil, Md Mostofa Jaman, ...

With the advancement of technology, people around the world are gaining ever easier access to internet-enabled devices, and as a result video data is growing rapidly. The proliferation of portable devices such as action cameras, mobile cameras, and motion cameras further accelerates this growth. Data from these many sources require substantial processing before they can serve different needs, and the resulting volume of video is far too large for end-users to navigate in full. In recent years, many research works have addressed this issue by generating descriptions from images or visual scene recordings. This description generation, also known as video captioning, is more complex than single-image captioning, and various advanced neural networks have been applied to it. In this paper, we propose an attention-based Bi-LSTM and sequential LSTM (Att-BiL-SL) encoder-decoder model for describing video in textual format. The model consists of a two-layer attention-based bi-LSTM and a one-layer sequential LSTM for video captioning. It also extracts the universal and native temporal features from the video frames for smooth sentence generation from the optical frames. The paper combines word embedding with a soft attention mechanism and a beam search optimization algorithm to generate high-quality results. The proposed architecture is found to perform better than various existing state-of-the-art models.
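
As a rough illustration of the described encoder-decoder, the following is a minimal PyTorch sketch of a two-layer bidirectional LSTM encoder over per-frame CNN features, a soft-attention mechanism, and a one-layer sequential LSTM decoder. All dimensions, the feature source, and the vocabulary size are illustrative assumptions; the authors' exact configuration (and the beam search decoding) is not reproduced here.

```python
import torch
import torch.nn as nn

class AttBiLSL(nn.Module):
    # Hypothetical sizes: 2048-d frame features (e.g. from a CNN), 512-d states.
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000):
        super().__init__()
        self.hidden = hidden
        # Two-layer bidirectional LSTM encoder over the frame features.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        # One-layer sequential LSTM decoder, conditioned on the attended context.
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.score = nn.Linear(2 * hidden + hidden, 1)  # soft-attention scorer
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frames, captions):
        enc, _ = self.encoder(frames)                    # (B, T, 2*hidden)
        h = enc.new_zeros(enc.size(0), self.hidden)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            w = self.embed(captions[:, t])               # previous word
            # Soft attention: score every encoded frame against the decoder state.
            q = h.unsqueeze(1).expand(-1, enc.size(1), -1)
            alpha = torch.softmax(self.score(torch.cat([enc, q], -1)), dim=1)
            ctx = (alpha * enc).sum(1)                   # attended video context
            h, c = self.decoder(torch.cat([w, ctx], -1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, 1)                    # per-step word scores
```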

Entropy, 2019, Vol 21 (4), pp. 329
Author(s): Yunqi Tang, Zhuorong Li, Huawei Tian, Jianwei Ding, Bingxian Lin

Accurately detecting gait events from video data is a challenging problem, and most current detection methods rely on wearable sensors, which demand high cooperation from users and are constrained by power consumption. This study presents a novel algorithm for accurate detection of toe-off events using a single 2D vision camera, without requiring the cooperation of participants. First, a set of novel features, namely consecutive silhouettes difference maps (CSD-maps), is proposed to represent gait patterns. A CSD-map encodes several consecutive pedestrian silhouettes, extracted from video frames, into a single map, and different numbers of consecutive silhouettes yield different types of CSD-maps, providing significant features for toe-off event detection. A convolutional neural network is then employed to reduce the feature dimensions and classify toe-off events. Experiments on a public database demonstrate that the proposed method achieves good detection accuracy.
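
As a sketch of how such a feature might be computed, the snippet below encodes a run of consecutive binary silhouettes into one map by accumulating frame-to-frame differences. The exact CSD-map encoding is not specified in the abstract, so this is one plausible reading, with NumPy as the only dependency.

```python
import numpy as np

def csd_map(silhouettes):
    """silhouettes: list of binary (H, W) uint8 arrays from consecutive frames."""
    stack = np.stack(silhouettes).astype(np.int16)
    # Absolute difference between each pair of consecutive silhouettes.
    diffs = np.abs(np.diff(stack, axis=0))
    # Accumulate the differences into a single map and rescale to [0, 255].
    acc = diffs.sum(axis=0)
    return (255 * acc / max(acc.max(), 1)).astype(np.uint8)

# Different numbers of input silhouettes yield different CSD-map types,
# which can then be fed to a CNN for toe-off event classification.
```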


2014, Vol 543-547, pp. 2873-2878
Author(s): Hui Yong Li, Hong Xu Jiang, Ping Zhang, Han Qing Li, Qian Cao

Modern embedded portable devices usually have to deal with large amounts of video data. Because of the massive number of floating-point multiplications involved, color space conversion is inefficient on embedded processors. Considering the characteristics of RGB-to-YCbCr color space conversion, this paper proposes a truncation-based LUT multiplier (T-LUT multiplier). On this basis, an original approach for converting RGB to YCbCr is presented that employs the T-LUT multiplier and a pipeline-based adder. Experimental results demonstrate that the proposed method reaches a maximum operating frequency of 358 MHz, 3.5 times faster than the direct method, while its power consumption is approximately 15%-27% lower than that of the general method.
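
The core idea, replacing each fixed-coefficient multiplication with a table lookup, can be sketched in software as below using the standard BT.601 coefficients. The hardware T-LUT multiplier additionally truncates partial products to save area; that refinement is omitted in this illustrative version.

```python
import numpy as np

# ITU-R BT.601 conversion coefficients and offsets for Y, Cb, Cr.
COEF = {
    "Y":  ((0.299, 0.587, 0.114), 0),
    "Cb": ((-0.168736, -0.331264, 0.5), 128),
    "Cr": ((0.5, -0.418688, -0.081312), 128),
}

# One 256-entry product table per (component, input channel) pair, so the
# per-pixel conversion needs only table lookups and additions.
LUT = {k: [c * np.arange(256) for c in coefs]
       for k, (coefs, _) in COEF.items()}

def rgb_to_ycbcr(rgb):
    """rgb: (H, W, 3) uint8 image -> (H, W, 3) uint8 YCbCr (BT.601)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    out = np.empty_like(rgb)
    for i, k in enumerate(("Y", "Cb", "Cr")):
        lr, lg, lb = LUT[k]
        out[..., i] = np.clip(lr[r] + lg[g] + lb[b] + COEF[k][1], 0, 255)
    return out
```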


Action recognition (AR) plays a fundamental role in computer vision and video analysis. The amount of video data on the web is increasing at an astronomical rate, and recognizing actions in video is difficult because of varying camera viewpoints. AR in a video sequence depends on the appearance within individual frames and on the optical flow between frames; the spatial and temporal components of video-frame features therefore play an integral role in classifying actions. In the proposed system, RGB frames and optical flow frames are used for AR, with features extracted from the fc7 layer of the pre-trained AlexNet convolutional neural network (CNN). A support vector machine (SVM) classifier is then used to classify the actions. For evaluation, the HMDB51 dataset, which comprises 51 classes of human action, has been used. Using the SVM classifier on the extracted features, the system achieves 95.6% accuracy, the best result compared with other state-of-the-art techniques.
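
The pipeline is straightforward to reproduce in outline: fc7 activations from a pre-trained AlexNet feed a linear SVM. The sketch below uses torchvision's AlexNet and scikit-learn; handling of the optical-flow frames and the HMDB51 train/test splits is assumed to happen upstream.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
# fc7 is the second fully connected layer; drop the final classifier layer.
fc7_net = torch.nn.Sequential(alexnet.features, alexnet.avgpool,
                              torch.nn.Flatten(),
                              *list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406],
                                    [0.229, 0.224, 0.225])])

@torch.no_grad()
def fc7_features(frames):          # frames: list of PIL RGB images
    batch = torch.stack([preprocess(f) for f in frames])
    return fc7_net(batch).numpy()  # (N, 4096) fc7 activations

# Fit an SVM on the extracted features (X: (N, 4096), y: action labels):
# svm = SVC(kernel="linear").fit(X_train, y_train)
# accuracy = svm.score(X_test, y_test)
```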


Author(s): C. Platias, M. Vakalopoulou, K. Karantzalos

In this paper we propose a deformable registration framework for high-resolution satellite video data that can automatically and accurately co-register satellite video frames and/or register them to a reference map/image. The proposed approach performs non-rigid registration by formulating a Markov Random Field (MRF) model, and efficient linear programming is employed to reach the lowest potential of the cost function. The developed approach has been applied and validated on satellite video sequences from Skybox Imaging and compared with a rigid, descriptor-based registration method. Regarding computational performance, both the MRF-based and descriptor-based methods were quite efficient, the first converging in minutes and the second in seconds. Regarding registration accuracy, the proposed MRF-based method significantly outperformed the descriptor-based one in all the experiments performed.
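
In outline, the MRF assigns each control point of a deformation grid a discrete displacement label, balancing a unary matching cost against a pairwise smoothness cost. The toy sketch below illustrates that energy with a naive iterated-conditional-modes solver over interior grid points; the paper itself minimises the energy with efficient linear programming, which this sketch does not reproduce.

```python
import numpy as np

def unary(src, dst, p, d, r=8):
    """SAD patch dissimilarity for control point p displaced by d."""
    (y, x), (dy, dx) = p, d
    a = src[y - r:y + r, x - r:x + r].astype(float)
    b = dst[y + dy - r:y + dy + r, x + dx - r:x + dx + r].astype(float)
    return np.abs(a - b).mean()

def register(src, dst, points, labels, lam=0.5, sweeps=5):
    """points: interior (y, x) grid nodes; labels: candidate (dy, dx) shifts."""
    disp = {p: (0, 0) for p in points}
    for _ in range(sweeps):
        for p in points:
            # Pairwise term: keep p's displacement close to the others'.
            # (A real MRF couples only grid neighbours; all pairs for brevity.)
            def energy(d):
                smooth = np.mean([np.hypot(d[0] - disp[q][0], d[1] - disp[q][1])
                                  for q in points if q != p])
                return unary(src, dst, p, d) + lam * smooth
            disp[p] = min(labels, key=energy)
    return disp   # sparse displacement field, to be densely interpolated
```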


Author(s): Jiping Zheng, Ganfeng Lu

With the explosive growth of video data, video summarization, which converts long videos into key frame sequences, has become an important task in information retrieval and machine learning. Determinantal point processes (DPPs), which are elegant probabilistic models, have been applied successfully to video summarization. However, existing DPP-based video summarization methods either output a summary of a specified size inefficiently or neglect the inherently sequential nature of video. In this paper, we propose a new model in the DPP lineage, named k-SDPP, in the vein of sequential determinantal point processes but with a fixed, user-specified size k. Our k-SDPP partitions the sampled frames of a video into segments, each containing a constant number of frames. Moreover, an efficient branch-and-bound (BB) method that respects the sequential nature of the frames is provided to optimally select the k frames constituting the summary from the divided segments. Experimental results show that our proposed BB method outperforms not only k-DPP and sequential DPP (seqDPP) but also methods based on partitioning and Markovian assumptions.
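
The selection step can be sketched as follows: partition the sampled frames into k segments, then branch-and-bound over one frame per segment to maximise the DPP score det(K_S). The pruning bound below follows from Fischer's inequality for positive semi-definite kernels; the paper's exact kernel, partitioning, and bounding scheme may differ.

```python
import numpy as np

def k_summary(K, k):
    """K: (n, n) PSD similarity kernel over sampled frames; returns k indices."""
    n = K.shape[0]
    segments = np.array_split(np.arange(n), k)
    # Optimistic per-segment factor: det(K_S) <= det(K_A) * prod of diagonals.
    diag_max = [K.diagonal()[s].max() for s in segments]
    best = {"score": -1.0, "sel": None}

    def bnb(depth, sel, score):
        if depth == k:
            if score > best["score"]:
                best["score"], best["sel"] = score, list(sel)
            return
        # Prune: even an optimistic completion cannot beat the incumbent.
        if score * np.prod(diag_max[depth:]) <= best["score"]:
            return
        for i in segments[depth]:       # one candidate frame per segment
            s = sel + [i]
            bnb(depth + 1, s, np.linalg.det(K[np.ix_(s, s)]))

    bnb(0, [], 1.0)
    return best["sel"]
```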


Author(s): Xiaobin Zhu, Zhuangzi Li, Xiao-Yu Zhang, Changsheng Li, Yaqi Liu, ...

Video super-resolution is a challenging task that has attracted great attention in the research and industry communities. In this paper, we propose a novel end-to-end architecture, called the Residual Invertible Spatio-Temporal Network (RISTN), for video super-resolution. The RISTN sufficiently exploits spatial information from low resolution to high resolution and effectively models the temporal consistency of consecutive video frames. Compared with existing approaches based on recurrent convolutional networks, RISTN is much deeper yet more efficient. It consists of three major components: in the spatial component, a lightweight residual invertible block is designed to reduce information loss during feature transformation and provide robust feature representations; in the temporal component, a novel recurrent convolutional model with residual dense connections is proposed to build a deeper network and avoid feature degradation; and in the reconstruction component, a new fusion method based on a sparse strategy is proposed to integrate the spatial and temporal features. Experiments on public benchmark datasets demonstrate that RISTN outperforms the state-of-the-art methods.
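
One plausible realisation of a residual invertible block, sketched below, is an additive coupling layer: half the channels pass through unchanged while the other half receives a learned update, so the mapping is exactly invertible and loses no information during feature transformation. The abstract does not spell out the actual block design, so the channel split and transform network here are assumptions.

```python
import torch
import torch.nn as nn

class InvertibleResidualCoupling(nn.Module):
    """Additive coupling: half the channels pass through unchanged, the other
    half receives a learned residual update; exactly invertible by design."""
    def __init__(self, channels=64):
        super().__init__()
        half = channels // 2
        self.f = nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                               nn.ReLU(inplace=True),
                               nn.Conv2d(half, half, 3, padding=1))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        # y1 = x1, y2 = x2 + f(x1): since f is only ever *added*,
        # the step can be undone exactly (see inverse below).
        return torch.cat([x1, x2 + self.f(x1)], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        return torch.cat([y1, y2 - self.f(y1)], dim=1)
```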


Algorithms, 2019, Vol 12 (5), pp. 92
Author(s): Song Wang, Zengfu Wang

Dense optical flow estimation under occlusion is a challenging task: occlusion introduces ambiguity into optical flow estimation, while accurate occlusion detection can reduce the error. In this paper, we propose a robust optical flow estimation algorithm with reliable occlusion detection. First, the occluded areas in successive video frames are detected by integrating information from multiple sources, including feature matching, motion edges, warped images, and occlusion consistency. Then, an optimization function with an occlusion coefficient and selective region smoothing are used to estimate the optical flow of the non-occluded and occluded areas, respectively. Experimental results show that the proposed algorithm is effective for dense optical flow estimation.
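
One standard occlusion cue that such a detector can integrate is forward-backward flow consistency; a minimal sketch using OpenCV's Farneback estimator is shown below. The paper combines this kind of cue with feature matching, motion edges, and warped-image evidence, which the sketch omits.

```python
import cv2
import numpy as np

def occlusion_mask(frame0, frame1, tol=1.5):
    g0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    fwd = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    bwd = cv2.calcOpticalFlowFarneback(g1, g0, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # Look up the backward flow at each pixel's forward-warped position.
    map_x, map_y = xs + fwd[..., 0], ys + fwd[..., 1]
    bwd_at_fwd = cv2.remap(bwd, map_x, map_y, cv2.INTER_LINEAR)
    # In non-occluded areas, forward and warped backward flow should cancel.
    err = np.linalg.norm(fwd + bwd_at_fwd, axis=-1)
    return err > tol          # True where the pixel is likely occluded
```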


2021, Vol 2021, pp. 1-13
Author(s): Bing Liu, Yu Tang, Yuxiong Ji, Yu Shen, Yuchuan Du

Ramp metering, which uses traffic signals to regulate vehicle flows from on-ramps, has been widely implemented to improve vehicle mobility on freeways. Previous studies generally update signal timings in real time based on predefined traffic measurements collected by point detectors, such as traffic volumes and occupancies. Compared with point detectors, traffic cameras, which have been increasingly deployed on road networks, can cover larger areas and provide more detailed traffic information. In this work, we propose a deep reinforcement learning (DRL) method to explore the potential of traffic video data for improving the efficiency of ramp metering. Vehicle locations are extracted from the traffic video frames and reformed into position matrices. The proposed method takes the preprocessed video data as input and learns optimal control strategies directly from these high-dimensional inputs. A series of simulation experiments based on real-world traffic data is conducted to evaluate the proposed approach. The results demonstrate that, in comparison with a state-of-the-practice method, the proposed DRL method yields (1) lower travel times on the mainline, (2) shorter vehicle queues at the on-ramp, and (3) higher traffic flows downstream of the merging area. These results suggest that the proposed method can extract useful information from video data for better ramp metering control.
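
A minimal sketch of the state encoding and a value network is given below: detected vehicle locations are rasterised into a binary position matrix, and a small convolutional Q-network maps it to per-action values. Grid extent, network shape, and the two-action signal space are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
import torch
import torch.nn as nn

def position_matrix(vehicle_xy, extent=(400.0, 40.0), shape=(80, 8)):
    """Rasterise vehicle (x, y) road coordinates into a binary grid."""
    grid = np.zeros(shape, dtype=np.float32)
    for x, y in vehicle_xy:
        i = min(int(x / extent[0] * shape[0]), shape[0] - 1)
        j = min(int(y / extent[1] * shape[1]), shape[1] - 1)
        grid[i, j] = 1.0
    return grid

class QNet(nn.Module):
    def __init__(self, n_actions=2):   # e.g. hold red / switch to green
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 80 * 8, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, grid):           # grid: (B, 1, 80, 8)
        return self.net(grid)          # Q-value per signal action
```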


2017, Vol 1 (1), pp. 57
Author(s): Pietro Camarda, Cataldo Guaragnella, Domenico Striccoli

Compressed variable bit rate (VBR) video transmission is acquiring growing importance in the telecommunications world. The high variability of compressed video data rates over multiple time scales makes efficient bandwidth utilization difficult to achieve. One approach developed to face this problem is smoothing: various smoothing algorithms that exploit client buffers have been proposed, reducing the peak rate and rate variability by efficiently scheduling the video data to be transmitted over the network. The novel smoothing algorithm proposed in this paper, which represents a significant improvement over existing methods, performs data scheduling both for a single stream and for stream aggregations while taking available bandwidth constraints into account. It modifies the smoothing schedule, whenever possible, so as to eliminate frame losses due to available bandwidth limitations. The technique can be applied to any smoothing algorithm already present in the literature and can be usefully exploited to minimize losses in multiplexed stream scenarios, such as Terrestrial Digital Video Broadcasting (DVB-T), where a specific known available bandwidth must be shared by several multimedia flows. The developed algorithm has been applied to smoothing stored video, although it can also quite easily be adapted to real-time smoothing. The numerical results obtained, compared with MVBA, a smoothing algorithm already presented and discussed in the literature, show the effectiveness of the proposed algorithm in terms of lost video frames for different multiplexed scenarios.
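
The feasibility constraints any such schedule must respect can be sketched briefly: the cumulative data sent by each frame slot must stay above the cumulative playout curve (no buffer underflow) and below playout plus client buffer (no overflow). The toy scheduler below sends at the average remaining rate and clamps into that band; MVBA-style algorithms choose piecewise-constant rates within the same band to minimise rate variability, and the paper's loss-minimising modification is not reproduced here.

```python
import numpy as np

def smooth_schedule(frame_sizes, buffer_size):
    playout = np.cumsum(frame_sizes)   # lower bound: data consumed so far
    upper = playout + buffer_size      # upper bound: consumed + client buffer
    sent, plan = 0.0, []
    for t in range(len(frame_sizes)):
        remaining = playout[-1] - sent
        rate = remaining / (len(frame_sizes) - t)      # average-rate target
        # Clamp into the feasible band: no underflow, no overflow.
        sent = float(np.clip(sent + rate, playout[t], upper[t]))
        plan.append(sent)
    return np.array(plan)  # cumulative transmission schedule per frame slot
```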


Author(s): Hong Lu, Xiangyang Xue

With the amount of video data increasing rapidly, automatic methods are needed to deal with large-scale video data sets in various applications. In content-based video analysis, a common and fundamental preprocessing step for these applications is video segmentation. Based on the segmentation results, video has a hierarchical representation structure of frames, shots, and scenes, from the low level to the high level. Because of the huge number of video frames, it is not practical to represent video content at the frame level. Within this structure, a shot is defined as an unbroken sequence of frames from one camera; however, the content of a shot is trivial and can hardly convey valuable semantic information. A scene, on the other hand, is a group of consecutive shots that focuses on an object or objects of interest, and it can serve as a semantic unit for further processing such as story extraction and video summarization. In this chapter, we survey methods for video scene segmentation. Specifically, scenes can be formed in two ways: one considers only the visual similarity of video shots and applies clustering methods for scene clustering; the other considers both the visual similarity and the temporal constraints of video shots, i.e., it groups shots with similar content that do not lie too far apart in temporal order. We also present our proposed methods for scene clustering and scene segmentation using Gaussian mixture models, graph theory, sequential change detection, and spectral methods.
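
As a small illustration of the second kind of scene grouping, the sketch below merges shots into a scene only when they are both visually similar (histogram intersection) and temporally close. The features and thresholds are illustrative stand-ins for the richer models surveyed in the chapter.

```python
import numpy as np

def shot_histogram(frames, bins=32):
    """Average colour histogram over a shot's frames (each frame HxWx3 uint8)."""
    hists = [np.histogram(f, bins=bins, range=(0, 255))[0] for f in frames]
    h = np.mean(hists, axis=0)
    return h / h.sum()

def group_scenes(shot_hists, sim_thresh=0.8, max_gap=3):
    scenes, current = [], [0]
    for i in range(1, len(shot_hists)):
        # Histogram intersection against recent shots in the current scene:
        # the temporal constraint only looks back a bounded number of shots.
        recent = shot_hists[max(i - max_gap, current[0]):i]
        sim = max(np.minimum(shot_hists[i], h).sum() for h in recent)
        if sim >= sim_thresh:
            current.append(i)        # similar and temporally close: same scene
        else:
            scenes.append(current)   # dissimilar: start a new scene
            current = [i]
    scenes.append(current)
    return scenes
```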

