Video-Based Person Re-Identification by an End-To-End Learning Architecture with Hybrid Deep Appearance-Temporal Feature

Sensors ◽  
2018 ◽  
Vol 18 (11) ◽  
pp. 3669 ◽  
Author(s):  
Rui Sun ◽  
Qiheng Huang ◽  
Miaomiao Xia ◽  
Jun Zhang

Video-based person re-identification is an important task in multi-camera visual sensor networks, where it faces the challenges of lighting variation, low-resolution images, background clutter, occlusion, and similarity in human appearance. In this paper, we propose a video-based person re-identification method called the end-to-end learning architecture with hybrid deep appearance-temporal feature. It learns the appearance features of pivotal frames, the temporal features, and an independent distance metric for each kind of feature. The architecture consists of a two-stream deep feature structure and two Siamese networks. For the first stream, we propose the Two-branch Appearance Feature (TAF) sub-structure to obtain the appearance information of persons, and use one of the two Siamese networks to learn the similarity of the appearance features of a pair of persons. To utilize the temporal information, we design the second stream, consisting of the Optical flow Temporal Feature (OTF) sub-structure and the other Siamese network, to learn each person’s temporal features and the distances between pairwise features. In addition, we select the pivotal frames of a video as inputs to the Inception-V3 network in the TAF sub-structure, and employ a salience-learning fusion layer to fuse the learned global and local appearance features. Extensive experiments on the PRID2011, iLIDS-VID, and Motion Analysis and Re-identification Set (MARS) datasets show that the proposed architecture reaches Rank-1 accuracies of 79%, 59%, and 72%, respectively, and has advantages over state-of-the-art algorithms. It also improves the feature representation ability for persons.
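The Rank-1 figures quoted in this abstract count a query as correct when its closest gallery entry belongs to the same person. The following is a minimal sketch of that metric only, not of the authors' architecture: it assumes plain Euclidean distance in place of the learned Siamese distance, and the function names and toy feature values are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose nearest gallery entry has the same identity."""
    hits = 0
    for feat, pid in zip(query_feats, query_ids):
        # Index of the gallery feature closest to this query.
        best = min(range(len(gallery_feats)),
                   key=lambda j: euclidean(feat, gallery_feats[j]))
        hits += gallery_ids[best] == pid
    return hits / len(query_feats)

# Toy data: three identities with 2-D features (hypothetical values).
gallery_feats = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
gallery_ids = ["a", "b", "c"]
query_feats = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.9]]
query_ids = ["a", "b", "c"]

# The third query lands next to identity "b", so two of three queries match.
score = rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids)
```

In a real evaluation the feature vectors would come from the trained network and the distance from its learned metric; only the ranking step shown here would stay the same.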

2018 ◽  
pp. 2083-2101
Author(s):  
Masaki Takahashi ◽  
Masahide Naemura ◽  
Mahito Fujii ◽  
James J. Little

A feature-representation method for recognizing actions in sports videos on the basis of the relationship between human actions and camera motions is proposed. The method involves the following steps: First, keypoint trajectories are extracted as motion features in spatio-temporal sub-regions called “spatio-temporal multiscale bags” (STMBs). Global representations and local representations from one sub-region in the STMBs are then combined to create a “glocal pairwise representation” (GPR). The GPR considers the co-occurrence of camera motions and human actions. Finally, two-stage SVM classifiers are trained with STMB-based GPRs, and specified human actions in video sequences are identified. An experimental evaluation of the recognition accuracy of the proposed method (by using the public OSUPEL basketball video dataset and broadcast videos) demonstrated that the method can robustly detect specific human actions in both public and broadcast basketball video sequences.


Sensors ◽  
2020 ◽  
Vol 20 (6) ◽  
pp. 1556 ◽  
Author(s):  
Zhenyu Li ◽  
Aiguo Zhou ◽  
Yong Shen

Scene recognition is an essential part of vision-based robot navigation. The successful application of deep learning technology has triggered more extensive preliminary studies on scene recognition, all of which use features extracted from networks trained for recognition tasks. In this paper, we interpret scene recognition as a region-based image retrieval problem and present a novel approach to scene recognition with an end-to-end trainable multi-column convolutional neural network (MCNN) architecture. The proposed MCNN uses filters with receptive fields of different sizes to achieve multi-level and multi-layer image perception, and consists of three components: a front-end, a middle-end, and a back-end. The first seven layers of VGG16 are taken as the front-end for two-dimensional feature extraction, Inception-A is taken as the middle-end for learning deeper feature representations, and Large-Margin Softmax Loss (L-Softmax) is taken as the back-end for enhancing intra-class compactness and inter-class separability. Extensive experiments have been conducted to compare the proposed network with existing state-of-the-art methods. Experimental results on three popular datasets demonstrate the robustness and accuracy of our approach. To the best of our knowledge, the presented approach has not previously been applied to scene recognition in the literature.


Symmetry ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 52 ◽  
Author(s):  
Xianzhang Pan ◽  
Wenping Guo ◽  
Xiaoying Guo ◽  
Wenshu Li ◽  
Junjie Xu ◽  
...  

The proposed method has 30 streams, i.e., 15 spatial streams and 15 temporal streams. Each spatial stream corresponds to one temporal stream, so this work relates to the concept of symmetry. Classifying video-based facial expressions is difficult owing to the gap between visual descriptors and emotions. To bridge this gap, a new video descriptor for facial expression recognition is presented that aggregates spatial and temporal convolutional features across the entire extent of a video. The designed framework integrates a state-of-the-art 30-stream structure and has a trainable spatial–temporal feature aggregation layer. The framework is end-to-end trainable for video-based facial expression recognition. It can therefore effectively avoid overfitting to the limited emotional video datasets, and the trainable strategy can learn to better represent an entire video. Different schemes for pooling spatial–temporal features are investigated, and the spatial and temporal streams are best aggregated by the proposed method. Extensive experiments on two public databases, BAUM-1s and eNTERFACE05, show that this framework has promising performance and outperforms state-of-the-art strategies.
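To illustrate what pooling spatial and temporal features over an entire video can look like, the sketch below applies two fixed pooling schemes (mean and max) to per-frame feature vectors and concatenates the results. It is an assumption-laden baseline, not the trainable aggregation layer described in the abstract; the function names and toy values are hypothetical.

```python
def aggregate(frame_feats, mode="mean"):
    """Pool a list of per-frame feature vectors into one video-level vector."""
    columns = list(zip(*frame_feats))  # one tuple per feature dimension
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")

def video_descriptor(spatial_feats, temporal_feats, mode="mean"):
    """Concatenate the pooled spatial and pooled temporal features."""
    return aggregate(spatial_feats, mode) + aggregate(temporal_feats, mode)

# Two frames, 2-D features per stream (hypothetical values).
spatial = [[1.0, 2.0], [3.0, 4.0]]
temporal = [[0.0, 1.0], [2.0, 5.0]]

mean_desc = video_descriptor(spatial, temporal, "mean")
max_desc = video_descriptor(spatial, temporal, "max")
```

A trainable aggregation layer would replace the fixed `mean` or `max` with learned weights, which is the part this sketch deliberately leaves out.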


Sensors ◽  
2018 ◽  
Vol 19 (1) ◽  
pp. 56 ◽  
Author(s):  
Jianhai Zhang ◽  
Zhiyong Feng ◽  
Yong Su ◽  
Meng Xing ◽  
Wanli Xue

Individual recognition based on skeletal sequences is a challenging computer vision task with many important applications, such as public security, human–computer interaction, and surveillance. However, much of the existing work fails to provide explicit quantitative differences between individuals. In this paper, we propose a novel 3D spatio-temporal geometric feature representation of locomotion on a Riemannian manifold, which explicitly reveals the intrinsic differences between individuals. To this end, we construct a mean sequence by aligning related motion sequences on the Riemannian manifold. The differences with respect to this mean sequence are modeled as spatial state descriptors. Subsequently, a temporal hierarchy of covariance is imposed on the state descriptors, yielding a higher-order statistical spatio-temporal feature representation that shows unique biometric characteristics of individuals. Finally, we introduce a kernel metric learning method to improve the classification accuracy. We evaluated our method on two public databases: the CMU Mocap database and the UPCV Gait database. We also constructed a new database for evaluating running and for analyzing two major influence factors of walking. The proposed approach achieves promising results in all experiments.
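One way to picture a temporal hierarchy of covariance is to compute covariance matrices of the per-frame descriptors over the whole sequence, then over its halves, then its quarters, and so on. The sketch below does this in ordinary Euclidean space. It assumes equal-length dyadic segments and plain sample covariance, and does not reproduce the Riemannian-manifold construction of the paper; all names and values are hypothetical.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def temporal_covariance_hierarchy(descriptors, levels=2):
    """Covariance matrices over 1, 2, 4, ... equal temporal segments.

    Each segment needs at least two frames; trailing frames that do not
    fill a segment are ignored at that level.
    """
    hierarchy = []
    for level in range(levels):
        parts = 2 ** level
        size = len(descriptors) // parts
        hierarchy.append([covariance(descriptors[p * size:(p + 1) * size])
                          for p in range(parts)])
    return hierarchy

# Four frames of 2-D state descriptors (hypothetical values).
descriptors = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
hierarchy = temporal_covariance_hierarchy(descriptors, levels=2)
```

Here `hierarchy[0]` holds one matrix for the full sequence and `hierarchy[1]` holds one matrix per half; flattening and concatenating them would give a single feature vector.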


Author(s):  
Chenyang Li ◽  
Xin Zhang ◽  
Lufan Liao ◽  
Lianwen Jin ◽  
Weixin Yang

Skeleton-based gesture recognition is gaining popularity owing to its wide range of possible applications. The key issues are how to extract discriminative features and how to design the classification model. In this paper, we first leverage a robust feature descriptor, the path signature (PS), and propose three PS features to explicitly represent the spatial and temporal motion characteristics, i.e., spatial PS (S PS), temporal PS (T PS) and temporal spatial PS (T S PS). Considering the significance of fine hand movements in gestures, we propose an “attention on hand” (AOH) principle to define joint pairs for the S PS and to select a single joint for the T PS. In addition, the dyadic method is employed to extract the T PS and T S PS features, which encode global and local temporal dynamics in the motion. Secondly, without a recurrent strategy, the classification model still faces the challenge of temporal variation among different sequences. We propose a new temporal transformer module (TTM) that can match the sequence key frames by learning a temporal shifting parameter for each input. This is a learning-based module that can be included in standard neural network architectures. Finally, we design a multi-stream network based on fully connected layers that treats spatial and temporal features separately and fuses them for the final result. We have tested our method on three benchmark gesture datasets, i.e., ChaLearn 2016, ChaLearn 2013 and MSRC-12. Experimental results demonstrate that we achieve state-of-the-art performance on skeleton-based gesture recognition with high computational efficiency.
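As background for the path signature descriptor mentioned above, the sketch below computes the first two signature levels of a piecewise-linear path, such as the trajectory of one joint. It truncates at level 2 and uses Chen's identity to append one straight segment at a time; the truncation level, function name, and toy path are assumptions for illustration and are not taken from the paper.

```python
def signature_level2(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    `path` is a list of points (lists of floats). Returns (s1, s2), where
    s1[i] is the total increment in coordinate i and s2[i][j] is the
    iterated integral of dx_i followed by dx_j along the path.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        for i in range(d):
            for j in range(d):
                # Chen's identity for appending one straight segment.
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2

# An L-shaped 2-D path: one step right, then one step up.
s1, s2 = signature_level2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

# The antisymmetric part of level 2 is the signed (Levy) area of the path.
levy_area = 0.5 * (s2[0][1] - s2[1][0])
```

Level 1 captures net displacement, while the off-diagonal level-2 terms record the order in which the coordinates moved, which is what makes the signature sensitive to the shape of a motion and not just its endpoints.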


2019 ◽  
Vol 11 (11) ◽  
pp. 1382 ◽  
Author(s):  
Daifeng Peng ◽  
Yongjun Zhang ◽  
Haiyan Guan

Change detection (CD) is essential to the accurate understanding of land surface changes using available Earth observation data. Owing to its great advantages in deep feature representation and nonlinear problem modeling, deep learning is becoming increasingly popular for solving CD tasks in the remote-sensing community. However, most existing deep learning-based CD methods are implemented either by generating difference images from deep features or by learning change relations between pixel patches, which leads to error accumulation because many intermediate processing steps are needed to obtain the final change maps. To address these issues, a novel end-to-end CD method is proposed based on UNet++, an effective encoder-decoder architecture for semantic segmentation, in which change maps can be learned from scratch using available annotated datasets. First, co-registered image pairs are concatenated as the input to the improved UNet++ network, where both global and fine-grained information can be used to generate feature maps with high spatial accuracy. Then, a fusion strategy over multiple side outputs is adopted to combine change maps from different semantic levels, thereby generating a final change map with high accuracy. The effectiveness and reliability of the proposed CD method are verified on very-high-resolution (VHR) satellite image datasets. Extensive experimental results show that the proposed approach outperforms other state-of-the-art CD methods.


Author(s):  
C. Indhumathi ◽  
V. Murugan ◽  
G. Muthulakshmii

Action recognition has recently gained increasing attention from the computer vision community. Recognizing human actions normally involves extracting spatial and temporal features, and two-stream convolutional neural networks are commonly used for human action recognition in videos. In this paper, the Adaptive motion Attentive Correlated Temporal Feature (ACTF) is used as the temporal feature extractor. Inter-frame temporal average pooling is used to extract the inter-frame regional correlation feature and the mean feature. The proposed method achieves accuracies of 96.9% on UCF101 and 74.6% on HMDB51, which are higher than those of other state-of-the-art methods.

