ASNet: Auto-Augmented Siamese Neural Network for Action Recognition

Sensors ◽  
2021 ◽  
Vol 21 (14) ◽  
pp. 4720
Author(s):  
Yujia Zhang ◽  
Lai-Man Po ◽  
Jingjing Xiong ◽  
Yasar Abbas Ur Rehman ◽  
Kwok-Wai Cheung

Video-based human action recognition methods built on deep convolutional neural networks usually use random cropping or its variants for data augmentation. However, this traditional augmentation approach may generate many non-informative samples (video patches covering only a small part of the foreground, or only the background) that are unrelated to the target action. Such samples can be regarded as noisy samples with incorrect labels, which degrades overall action recognition performance. In this paper, we attempt to mitigate the impact of noisy samples by proposing an Auto-augmented Siamese Neural Network (ASNet). In this framework, salient patches and randomly cropped samples are backpropagated in the same iteration, performing gradient compensation that alleviates the adverse gradient effects of non-informative samples. Salient patches are samples containing the critical information for recognizing an action. Their generation is formulated as a Markov decision process, and a reinforcement learning agent, the Salient Patch Agent (SPA), is introduced to extract patches in a weakly supervised manner without extra labels. Extensive experiments were conducted on two well-known datasets, UCF-101 and HMDB-51, to verify the effectiveness of the proposed SPA and ASNet.
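
A minimal sketch of the gradient-compensation idea may help: the randomly cropped clip and the salient patch (as produced by the SPA agent) pass through the same shared-weight network in one iteration, and their losses are summed before a single backward pass. The backbone, tensor shapes, and equal loss weighting below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in backbone; the actual ASNet backbone is not specified here.
backbone = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 101),                      # 101 classes, e.g., UCF-101
)
optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3)

def training_step(random_crop, salient_patch, labels):
    """Backpropagate the salient patch and the random crop in the same
    iteration, so gradients from informative patches compensate for
    those of potentially non-informative random crops."""
    logits_rand = backbone(random_crop)     # shared weights: the two passes
    logits_sal = backbone(salient_patch)    # form the Siamese branches
    loss = F.cross_entropy(logits_rand, labels) \
         + F.cross_entropy(logits_sal, labels)
    optimizer.zero_grad()
    loss.backward()                         # both views' gradients accumulate
    optimizer.step()
    return loss.item()

# Usage with dummy clips of shape (batch, channels, frames, height, width):
loss = training_step(torch.randn(2, 3, 16, 112, 112),
                     torch.randn(2, 3, 16, 112, 112),
                     torch.randint(0, 101, (2,)))
```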

Author(s):  
S. Karthickkumar ◽  
K. Kumar

In recent years, deep learning for human action recognition has become one of the most popular research topics. It has a variety of applications such as surveillance, health care, consumer behavior analysis, and robotics. In this paper, we propose a Two-Dimensional (2D) Convolutional Neural Network for recognizing human activities. The WISDM dataset is used to train and test the model; it contains activities such as sitting, standing, walking downstairs, walking upstairs, and running. Our 2D-CNN based method achieves a human activity recognition accuracy of 93.17%.
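
Below is a minimal sketch of a 2D-CNN over windowed tri-axial accelerometer data of the kind found in WISDM. The window length (128), channel sizes, and pooling scheme are illustrative assumptions, not values from the paper; each window is treated as a 1 × 128 × 3 "image" (height = time steps, width = x/y/z axes).

```python
import torch
import torch.nn as nn

class HAR2DCNN(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, 3), padding=(2, 1)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),        # pool over time only
            nn.Conv2d(16, 32, kernel_size=(5, 3), padding=(2, 1)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        self.classifier = nn.Linear(32 * 32 * 3, num_classes)

    def forward(self, x):                # x: (batch, 1, 128, 3)
        x = self.features(x)             # -> (batch, 32, 32, 3)
        return self.classifier(x.flatten(1))

# Five classes: sitting, standing, downstairs, upstairs, running.
model = HAR2DCNN(num_classes=5)
logits = model(torch.randn(8, 1, 128, 3))
```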


2019 ◽  
Author(s):  
Hernandez Vincent ◽  
Suzuki Tomoya ◽  
Venture Gentiane

Human Action Recognition (HAR) is an important and difficult topic because of the considerable variability between tasks repeated several times by a subject, and between subjects. This work is motivated by providing time-series signal classification with robust validation and test approaches. This study proposes to classify 60 American Sign Language signs from data provided by the LeapMotion sensor, using a combined approach of a Convolutional Neural Network (ConvNet) and a Recurrent Neural Network with Long Short-Term Memory cells (LSTM), called ConvNet-LSTM. Moreover, a complete kinematic model of the right and left forearm/hand/fingers/thumb is proposed, along with a simple data augmentation technique to improve the generalization of the neural networks. Results showed an accuracy of 89.3% on a user-independent test set with data augmentation when using the ConvNet-LSTM, while the LSTM alone achieved 85.0% on the same test set. Without data augmentation, these results dropped to 85.9% and 81.4%, respectively.
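
A sketch of the ConvNet-LSTM pattern described above: 1D convolutions extract local features from the kinematic time series, and an LSTM models their temporal dependencies before classification over the 60 signs. The per-frame feature dimension and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvNetLSTM(nn.Module):
    def __init__(self, n_features=40, n_classes=60):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.fc = nn.Linear(128, n_classes)    # 60 ASL signs

    def forward(self, x):                      # x: (batch, time, features)
        x = self.conv(x.transpose(1, 2))       # Conv1d wants (batch, C, time)
        out, _ = self.lstm(x.transpose(1, 2))  # back to (batch, time, C)
        return self.fc(out[:, -1])             # classify from last time step

# Usage with 8 dummy sequences of 100 frames and 40 kinematic features:
logits = ConvNetLSTM()(torch.randn(8, 100, 40))
```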


Author(s):  
Mohammad Farhad Bulbul ◽  
Yunsheng Jiang ◽  
Jinwen Ma

The emerging cost-effective depth sensors have facilitated the action recognition task significantly. In this paper, the authors address the action recognition problem using depth video sequences combining three discriminative features. More specifically, the authors generate three Depth Motion Maps (DMMs) over the entire video sequence, corresponding to the front, side, and top projection views. Contourlet-based Histogram of Oriented Gradients (CT-HOG), Local Binary Patterns (LBP), and Edge Oriented Histograms (EOH) features are then computed from the DMMs. To merge these features, the authors adopt decision-level fusion, where a soft decision-fusion rule, the Logarithmic Opinion Pool (LOGP), combines the classification outcomes from multiple classifiers, each trained on an individual feature set. Experimental results on two datasets reveal that the fusion scheme achieves superior action recognition performance compared with using each feature individually.
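
The LOGP rule fuses classifiers by taking a normalized, weighted geometric mean of their class posteriors. A minimal sketch follows; the equal weights are an assumption, and in practice they could be tuned per feature (CT-HOG, LBP, EOH).

```python
import numpy as np

def logp_fusion(posteriors, weights=None):
    """posteriors: (n_classifiers, n_classes); returns fused (n_classes,)."""
    posteriors = np.asarray(posteriors)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    log_fused = weights @ np.log(posteriors + 1e-12)  # weighted sum of logs
    fused = np.exp(log_fused)
    return fused / fused.sum()                        # renormalize

# Example: three per-feature classifiers, four action classes.
p = [[0.70, 0.10, 0.10, 0.10],
     [0.40, 0.30, 0.20, 0.10],
     [0.60, 0.20, 0.10, 0.10]]
print(logp_fusion(p).argmax())   # index of the fused decision (0 here)
```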


Algorithms ◽  
2020 ◽  
Vol 13 (11) ◽  
pp. 301
Author(s):  
Guocheng Liu ◽  
Caixia Zhang ◽  
Qingyang Xu ◽  
Ruoshi Cheng ◽  
Yong Song ◽  
...  

Optical-flow-based human action recognition is difficult to apply in practice because of its heavy computational load. To address this, a human action recognition algorithm, the I3D-shufflenet model, is proposed, combining the advantages of the I3D neural network and the lightweight ShuffleNet model. The 5 × 5 convolution kernels of I3D are replaced by two stacked 3 × 3 convolution kernels, which reduces the amount of computation, and a shuffle layer is adopted to achieve feature exchange across channels. Recognition and classification of human actions are performed with the trained I3D-shufflenet model. The experimental results show that the shuffle layer improves the composition of features in each channel, which promotes the utilization of useful information. Histogram of Oriented Gradients (HOG) spatial-temporal features are extracted for training, which significantly improves the expression of human actions and reduces the cost of feature extraction. I3D-shufflenet was evaluated on the UCF101 dataset and compared with other models; the final results show that I3D-shufflenet achieves 96.4% accuracy, higher than the original I3D.
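
A sketch of the two building blocks named above, shown in 2D for brevity (I3D itself uses 3D convolutions). Two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field with fewer weights (2 × 9 × C² vs. 25 × C²), and the shuffle layer permutes channels so grouped convolutions can exchange information.

```python
import torch
import torch.nn as nn

def double_3x3(channels):
    """Two stacked 3x3 convolutions replacing a single 5x5 kernel."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.ReLU(),
        nn.Conv2d(channels, channels, 3, padding=1),  # 5x5 receptive field
    )

def channel_shuffle(x, groups):
    """ShuffleNet-style channel permutation for feature exchange."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)

x = torch.randn(1, 8, 32, 32)
y = channel_shuffle(double_3x3(8)(x), groups=2)
```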


Data ◽  
2020 ◽  
Vol 5 (4) ◽  
pp. 104
Author(s):  
Ashok Sarabu ◽  
Ajit Kumar Santra

The two-stream convolutional neural network (CNN) has proven a great success in action recognition in videos. The main idea is to train two CNNs to learn spatial and temporal features separately, and to combine the two scores to obtain the final score. In the literature, we observed that most methods use similar CNNs for the two streams. In this paper, we design a two-stream CNN architecture with different CNNs for the two streams to learn spatial and temporal features. Temporal Segment Networks (TSN) are applied to retrieve long-range temporal features and to differentiate similar types of sub-actions in videos. Data augmentation techniques are employed to prevent over-fitting. Advanced cross-modal pre-training is discussed and introduced into the proposed architecture to enhance the accuracy of action recognition. The proposed two-stream model is evaluated on two challenging action recognition datasets: HMDB-51 and UCF-101. The findings show a significant performance increase, and the proposed architecture outperforms existing methods.
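
A minimal sketch of the two-stream score combination: the spatial stream scores RGB frames, the temporal stream scores optical flow, and their softmax outputs are averaged with stream weights. The 1:1.5 spatial-to-temporal weighting is a common convention in two-stream work, not a value from this paper.

```python
import torch
import torch.nn.functional as F

def fuse_scores(spatial_logits, temporal_logits,
                w_spatial=1.0, w_temporal=1.5):
    """Weighted late fusion of per-stream softmax scores."""
    spatial = F.softmax(spatial_logits, dim=1)
    temporal = F.softmax(temporal_logits, dim=1)
    fused = w_spatial * spatial + w_temporal * temporal
    return fused.argmax(dim=1)           # predicted class per video

# Example with random scores for 4 videos over 101 classes (UCF-101).
preds = fuse_scores(torch.randn(4, 101), torch.randn(4, 101))
```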

