Feature Fusion of Deep Spatial Features and Handcrafted Spatiotemporal Features for Human Action Recognition

Sensors ◽  
2019 ◽  
Vol 19 (7) ◽  
pp. 1599 ◽  
Author(s):  
Md Uddin ◽  
Young-Koo Lee

Human action recognition plays a significant part in the research community due to its emerging applications. A variety of approaches have been proposed to resolve this problem; however, several issues still need to be addressed. In action recognition, effectively extracting and aggregating spatiotemporal information plays a vital role in describing a video. In this research, we propose a novel approach to recognizing human actions that considers both deep spatial features and handcrafted spatiotemporal features. First, we extract deep spatial features by employing a state-of-the-art deep convolutional network, namely Inception-ResNet-v2. Second, we introduce a novel handcrafted feature descriptor, namely the Weber's-law-based Volume Local Gradient Ternary Pattern (WVLGTP), which brings out the spatiotemporal features. It also captures shape information through a gradient operation. Furthermore, a Weber's-law-based threshold value and a ternary pattern based on an adaptive local threshold are presented to effectively handle noisy center-pixel values. In addition, a multi-resolution approach for WVLGTP based on an averaging scheme is presented. Afterward, both extracted features are concatenated and fed to a support vector machine to perform the classification. Lastly, extensive experimental analysis shows that our proposed method outperforms state-of-the-art approaches in terms of accuracy.
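The Weber's-law thresholding behind a ternary pattern can be illustrated on a single 3×3 spatial patch (the full WVLGTP descriptor works on spatiotemporal volumes and gradients; the `alpha` value and neighbourhood layout below are illustrative assumptions, not the paper's exact parameters):

```python
import numpy as np

def weber_ternary_pattern(patch, alpha=0.1):
    """Ternary code for a 3x3 patch: the Weber-style threshold scales
    with the centre intensity, so brighter regions tolerate more noise."""
    c = patch[1, 1]
    t = alpha * c                                # relative (Weber's-law) threshold
    neighbours = np.delete(patch.flatten(), 4)   # 8 neighbours, centre removed
    code = np.zeros(8, dtype=int)
    code[neighbours > c + t] = 1                 # clearly brighter than centre
    code[neighbours < c - t] = -1                # clearly darker than centre
    return code                                  # near-centre values stay 0
```

The zero band around the centre value is what makes the ternary pattern robust to small noisy fluctuations that would flip a binary pattern bit.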

Sensors ◽  
2020 ◽  
Vol 20 (24) ◽  
pp. 7299
Author(s):  
Chirag I. Patel ◽  
Dileep Labana ◽  
Sharnil Pandya ◽  
Kirit Modi ◽  
Hemant Ghayvat ◽  
...  

Human Action Recognition (HAR) is the classification of an action performed by a human. The goal of this study was to recognize human actions in action video sequences. We present a novel feature descriptor for HAR that involves multiple features and combines them using a fusion technique. The major focus of the feature descriptor is to exploit the dissimilarities between actions. The key contribution of the proposed approach is to build a robust feature descriptor that works across the underlying video sequences and various classification models. To achieve this objective, HAR is performed in the following manner. First, the moving object is detected and segmented from the background. Features are then calculated using the histogram of oriented gradients (HOG) on the segmented moving object. To reduce the feature descriptor size, we average the HOG features across non-overlapping video frames. For frequency-domain information, we calculate regional features from the Fourier HOG. Moreover, we also include the velocity and displacement of the moving object. Finally, we use a fusion technique to combine these features. After the feature descriptor is prepared, it is provided to the classifier. Here, we use well-known classifiers such as artificial neural networks (ANNs), the support vector machine (SVM), multiple kernel learning (MKL), the meta-cognitive neural network (McNN), and late fusion methods. The main objective of the proposed approach is to prepare a robust feature descriptor and to show its diversity. Though we use five different classifiers, our feature descriptor performs relatively well across all of them.
The proposed approach is evaluated and compared with state-of-the-art methods for action recognition on two publicly available benchmark datasets (KTH and Weizmann) and via cross-validation on the UCF11, HMDB51, and UCF101 datasets. Results of control experiments, such as a change in the SVM classifier and the effect of a second hidden layer in the ANN, are also reported. The results demonstrate that the proposed method performs reasonably well compared with the majority of existing state-of-the-art methods, including convolutional neural network-based feature extractors.
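The descriptor-shrinking step, averaging per-frame HOG vectors over non-overlapping groups of frames, can be sketched as follows (the group size is a free parameter; how the abstract's method handles a ragged tail of leftover frames is not stated, so this sketch simply drops it):

```python
import numpy as np

def average_hog(frame_hogs, group_size):
    """Average per-frame HOG vectors over non-overlapping groups of frames,
    shrinking the descriptor by a factor of group_size along the time axis."""
    n = (len(frame_hogs) // group_size) * group_size   # drop the ragged tail
    grouped = np.asarray(frame_hogs[:n], dtype=float)
    grouped = grouped.reshape(-1, group_size, grouped.shape[1])
    return grouped.mean(axis=1).flatten()              # one averaged HOG per group
```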


2018 ◽  
Vol 7 (2.20) ◽  
pp. 207 ◽  
Author(s):  
K Rajendra Prasad ◽  
P Srinivasa Rao

Human action recognition from 2D videos is a demanding area due to its broad applications. Many methods have been proposed for recognizing human actions, and improved accuracy in identifying them is desirable. This paper presents an improved method of human action recognition using a support vector machine (SVM) classifier. It proposes a novel feature descriptor constructed by fusing several investigated features. Handcrafted features such as scale-invariant feature transform (SIFT) features, speeded-up robust features (SURF), histogram of oriented gradients (HOG) features, and local binary pattern (LBP) features are obtained from online 2D action videos. The proposed method is tested on action datasets with both static and dynamically varying backgrounds and achieves the best recognition rates on both. The datasets considered for experimentation are KTH, Weizmann, UCF101, UCF Sports Action, MSR Action, and HMDB51. The performance of the proposed feature-fusion model with the SVM classifier is compared with that of the individual features with SVM, and the fusion method shows the best results. The efficiency of the classifier is also tested against other state-of-the-art classifiers such as k-nearest neighbors (KNN), artificial neural network (ANN), and AdaBoost. The method achieves an average recognition rate of 94.41%.
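A minimal sketch of fusing heterogeneous descriptors such as SIFT, SURF, HOG, and LBP is simple concatenation. The abstract does not specify a normalization scheme; L2-normalizing each descriptor separately before concatenation is a common choice assumed here, so that no single feature type dominates the fused vector by scale alone:

```python
import numpy as np

def fuse_features(feature_sets):
    """Concatenate several descriptors into one fused vector,
    L2-normalising each descriptor independently first."""
    parts = []
    for f in feature_sets:
        f = np.asarray(f, dtype=float)
        norm = np.linalg.norm(f)
        parts.append(f / norm if norm > 0 else f)
    return np.concatenate(parts)
```

The fused vector would then be handed to the SVM (or KNN, ANN, AdaBoost) exactly as any single-descriptor vector would be.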


Author(s):  
L. Nirmala Devi ◽  
A. Nageswar Rao

Human action recognition (HAR) is one of the most significant research topics and has attracted the attention of many researchers. Automatic HAR systems are applied in several fields, such as visual surveillance, data retrieval, and healthcare. Based on this motivation, in this chapter the authors propose a new HAR model that takes an image as input, analyses it, and exposes the action present in it. In the analysis phase, they implement two different feature extraction methods using a rotation-invariant Gabor filter and an edge-adaptive wavelet filter. For every action image, a new vector, called the composite feature vector, is formulated and then subjected to dimensionality reduction through principal component analysis (PCA). Finally, the authors employ the most popular supervised machine learning algorithm, the support vector machine (SVM), for classification. Simulation is performed over two standard datasets, KTH and Weizmann, and the performance is measured through an accuracy metric.
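The PCA step that compresses the composite feature vector can be sketched in a few lines via the singular value decomposition (the number of retained components `k` is a tuning choice not specified in the abstract):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the row vectors of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                        # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # scores in the reduced space
```

The reduced scores (one row per composite feature vector) are what would be passed to the SVM classifier.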


Author(s):  
Xueping Liu ◽  
Xingzuo Yue

Kernel functions have been successfully utilized in the extreme learning machine (ELM), providing stabilized and generalized performance while greatly reducing computational complexity. However, selecting and optimizing the parameters of the most common kernel functions is tedious and time-consuming. In this study, a set of new Hermite kernel functions derived from the generalized Hermite polynomials is proposed. A significant advantage of the proposed kernel is that it has only one parameter, selected from a small set of natural numbers; thus, parameter optimization is greatly facilitated and extensive structural information of the sample data is retained. Consequently, the new kernel functions can serve as optimal alternatives to other common kernel functions for ELM at a rapid learning speed. Experimental results show that the proposed kernel ELM method tends to have similar or better robustness and generalization performance at a faster learning speed than other common kernel ELM and support vector machine methods. When applied to human action recognition from depth video sequences, the method also achieves excellent performance, demonstrating its speed advantage on video data.
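The abstract does not reproduce the authors' exact kernel definition. As an illustrative stand-in only, one way to build a valid kernel from Hermite polynomials with a single natural-number parameter `n` is through an explicit feature map that stacks He_0 … He_n applied elementwise; a dot product of two such maps is positive semi-definite by construction:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite_feature_map(x, n):
    """Stack the probabilists' Hermite polynomials He_0..He_n,
    each evaluated elementwise on the input vector x."""
    coeffs = np.eye(n + 1)                       # row i selects He_i
    return np.concatenate([hermeval(x, c) for c in coeffs])

def hermite_kernel(x, y, n=2):
    """Illustrative Hermite-polynomial kernel with one integer parameter n."""
    return float(hermite_feature_map(np.asarray(x, float), n)
                 @ hermite_feature_map(np.asarray(y, float), n))
```

Because `n` ranges over a small set of natural numbers, model selection reduces to trying a handful of integer values, which is the practical benefit the abstract highlights.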


2019 ◽  
Vol 9 (10) ◽  
pp. 2126 ◽  
Author(s):  
Suge Dong ◽  
Daidi Hu ◽  
Ruijun Li ◽  
Mingtao Ge

To address the high trajectory redundancy and susceptibility to background interference of traditional dense-trajectory behavior recognition methods, a human action recognition method based on foreground trajectories and motion difference descriptors is proposed. First, the motion magnitude of each frame is estimated by optical flow, and the foreground region is determined from the motion magnitudes of the pixels; trajectories are extracted only from behavior-related foreground regions. Second, to better describe the relative temporal information between different actions, a motion difference descriptor is introduced to describe the foreground trajectory, and a direction histogram of the motion difference is constructed by calculating the direction of the motion difference per unit time at each trajectory point. Finally, a Fisher vector (FV) is used to encode the histogram features into video-level action features, and a support vector machine (SVM) classifies the action category. Experimental results show that this method better extracts action-related trajectories and improves recognition accuracy by 7% compared with the traditional dense trajectory method.
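The direction histogram of motion differences can be sketched as a second-order difference along a trajectory: the motion difference per unit time is the change between consecutive displacement vectors, and its direction is binned into a histogram (the bin count here is an illustrative choice):

```python
import numpy as np

def motion_difference_histogram(trajectory, bins=8):
    """Histogram of the directions of motion differences along a trajectory
    (a motion difference is the change between consecutive displacements)."""
    pts = np.asarray(trajectory, dtype=float)
    disp = np.diff(pts, axis=0)          # per-step displacement vectors
    mdiff = np.diff(disp, axis=0)        # motion difference per unit time
    angles = np.arctan2(mdiff[:, 1], mdiff[:, 0]) % (2 * np.pi)
    hist, _ = np.histogram(angles, bins=bins, range=(0, 2 * np.pi))
    return hist
```

Per-trajectory histograms like this one are what the Fisher vector step would then encode into a single video-level feature.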


Author(s):  
Jiajia Luo ◽  
Wei Wang ◽  
Hairong Qi

Multi-view human action recognition has gained a lot of attention in recent years for its superior performance as compared to single-view recognition. In this paper, we propose a new framework for the real-time realization of human action recognition in distributed camera networks (DCNs). We first present a new feature descriptor (Mltp-hist) that is tolerant to illumination change, robust in homogeneous regions, and computationally efficient. Taking advantage of the proposed Mltp-hist, the noninformative 3-D patches generated from the background can be removed automatically, which effectively highlights the foreground patches. Next, a new feature representation method based on sparse coding is presented to generate the histogram representation of local videos to be transmitted to the base station for classification. Due to the sparse representation of extracted features, the approximation error is reduced. Finally, at the base station, a probability model fuses the information from the various views and assigns a class label accordingly. Compared to existing algorithms, the proposed framework has three advantages while requiring less memory and bandwidth: 1) no preprocessing is required; 2) communication among cameras is unnecessary; and 3) positions and orientations of cameras do not need to be fixed. We further evaluate the proposed framework on the most popular multi-view action dataset, IXMAS. Experimental results indicate that our proposed framework consistently achieves state-of-the-art results when various numbers of views are tested. In addition, our approach is tolerant to various combinations of views and benefits from introducing more views at the testing stage. Notably, our results remain satisfactory even when large misalignment exists between the training and testing samples.
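The abstract says a probability model at the base station fuses the per-view information; it does not state which combination rule is used. A common choice, assumed here purely for illustration, is sum-rule fusion, averaging the per-view class posteriors and taking the arg-max:

```python
import numpy as np

def fuse_views(view_probs):
    """Sum-rule fusion: average per-view class posteriors,
    then pick the arg-max class label."""
    p = np.mean(np.asarray(view_probs, dtype=float), axis=0)
    return int(np.argmax(p)), p
```

Sum-rule fusion degrades gracefully when one view is uninformative, which fits the framework's tolerance to varying view combinations.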


2016 ◽  
Vol 2016 ◽  
pp. 1-11 ◽  
Author(s):  
Qingwu Li ◽  
Haisu Cheng ◽  
Yan Zhou ◽  
Guanying Huo

Human action recognition in videos is a topic of active research in computer vision. Dense trajectory (DT) features have been shown to be efficient for representing videos in state-of-the-art approaches. In this paper, we present a more effective approach to video representation using improved salient dense trajectories: first, detecting the motion-salient region and extracting dense trajectories by tracking interest points in each spatial scale separately, then refining the dense trajectories via analysis of the motion saliency. Then, we compute several descriptors (i.e., trajectory displacement, HOG, HOF, and MBH) in the spatiotemporal volume aligned with the trajectories. Finally, to represent the videos better, we optimize the bag-of-words framework according to the motion-salient intensity distribution and the idea of sparse coefficient reconstruction. Our architecture is trained and evaluated on four standard video action datasets, KTH, UCF Sports, HMDB51, and UCF50, and the experimental results show that our approach performs competitively compared with state-of-the-art results.
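The baseline bag-of-words encoding that the paper optimizes can be sketched with hard assignment: each trajectory descriptor votes for its nearest codeword, and the normalized vote counts represent the video (the codebook itself would come from clustering training descriptors, e.g. with k-means):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Hard-assignment bag-of-words: each descriptor votes for its nearest
    codeword; the normalised vote counts represent the video."""
    D = np.asarray(descriptors, dtype=float)
    C = np.asarray(codebook, dtype=float)
    dists = np.linalg.norm(D[:, None, :] - C[None, :, :], axis=2)
    words = dists.argmin(axis=1)                     # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(C)).astype(float)
    return hist / hist.sum()
```

The paper's contribution replaces this plain hard assignment with weighting driven by motion-salience and sparse coefficient reconstruction; the sketch above is only the unoptimized starting point.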

