Action recognition based on 2D skeletons extracted from RGB videos

2019 ◽  
Vol 277 ◽  
pp. 02034
Author(s):  
Sophie Aubry ◽  
Sohaib Laraba ◽  
Joëlle Tilmanne ◽  
Thierry Dutoit

In this paper, a methodology to recognize actions from RGB videos is proposed which takes advantage of recent breakthroughs in deep learning. Following the development of Convolutional Neural Networks (CNNs), research was conducted on the transformation of skeletal motion data into 2D images. In this work, a solution is proposed that requires only RGB videos instead of RGB-D videos. It builds on multiple works studying the conversion of RGB-D data into 2D images. From a video stream (RGB images), a two-dimensional skeleton of 18 joints is extracted for each detected body with a DNN-based human pose estimator called OpenPose. The skeleton data are encoded into the Red, Green and Blue channels of images. Different ways of encoding motion data into images were studied. We successfully use state-of-the-art deep neural networks designed for image classification to recognize actions. Based on a study of the related works, we chose the image classification models SqueezeNet, AlexNet, DenseNet, ResNet, Inception and VGG, and retrained them to perform action recognition. All tests use the NTU RGB+D database. The highest accuracy is obtained with ResNet: 83.317% cross-subject and 88.780% cross-view, which outperforms most state-of-the-art results.
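
As a concrete illustration of this kind of encoding, the sketch below maps a sequence of OpenPose skeletons to an image whose Red and Green channels carry the normalized x and y joint coordinates. The paper studies several encodings; this particular mapping, including the use of the Blue channel for joint confidence, is only one plausible assumption, not the authors' exact scheme.

```python
import numpy as np

def encode_skeleton_sequence(joints, confidence=None):
    """Encode a 2D skeleton sequence as an RGB image for a CNN.

    joints:     array of shape (T, 18, 2) holding the (x, y) pixel
                coordinates of 18 OpenPose joints over T frames.
    confidence: optional (T, 18) per-joint detection scores.

    Returns a uint8 image of shape (18, T, 3): one row per joint and
    one column per frame, with x in Red, y in Green and (optionally)
    confidence in Blue.
    """
    T, J, _ = joints.shape
    img = np.zeros((J, T, 3), dtype=np.float32)
    # Normalize coordinates to [0, 1] over the whole sequence so the
    # encoding is invariant to the absolute position in the frame.
    mins = joints.reshape(-1, 2).min(axis=0)
    span = np.maximum(joints.reshape(-1, 2).max(axis=0) - mins, 1e-6)
    norm = (joints - mins) / span                # (T, J, 2)
    img[..., 0] = norm[..., 0].T * 255.0         # x -> Red
    img[..., 1] = norm[..., 1].T * 255.0         # y -> Green
    if confidence is not None:
        img[..., 2] = np.clip(confidence.T, 0.0, 1.0) * 255.0  # -> Blue
    return img.astype(np.uint8)
```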

Mathematics ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 624
Author(s):  
Stefan Rohrmanstorfer ◽  
Mikhail Komarov ◽  
Felix Mödritscher

With the ever-increasing amount of image data, it has become a necessity to automatically look for and process information in these images. As fashion is captured in images, the fashion sector provides the perfect foundation for a service or application built on an image classification model. In this article, the state of the art for image classification is analyzed and discussed. Based on the elaborated knowledge, four different approaches are implemented to extract features from fashion data. For this purpose, a human-worn fashion dataset with 2567 images was created and then significantly enlarged through image operations. The results show that convolutional neural networks are the undisputed standard for classifying images, and that TensorFlow is the best library to build them. Moreover, through the introduction of dropout layers, data augmentation and transfer learning, model overfitting was successfully prevented, and the validation accuracy on the created dataset was incrementally improved from an initial 69% to a final 84%. More distinctive apparel such as trousers, shoes and hats was classified better than other upper-body clothes.
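
A minimal TensorFlow/Keras sketch of the three overfitting countermeasures the article names (dropout, data augmentation, transfer learning). The backbone (MobileNetV2), image size and ten-class output are placeholders, since the article does not specify them here.

```python
import tensorflow as tf

# Pre-trained backbone with frozen weights (transfer learning).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
    # Data augmentation layers are active only during training.
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),                     # combats overfitting
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g., 10 garment classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```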


Author(s):  
Séamus Lankford ◽  
Diarmuid Grimes

The training and optimization of neural networks using pre-trained, super learner and ensemble approaches is explored. Neural networks, and in particular Convolutional Neural Networks (CNNs), are often optimized using default parameters. Neural Architecture Search (NAS) enables multiple architectures to be evaluated prior to selection of the optimal architecture. Our contribution is to develop, and make available to the community, a system that integrates open-source tools for the neural architecture search of image classification models (OpenNAS). OpenNAS takes any dataset of grayscale or RGB images and generates the optimal CNN architecture. Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO) and pre-trained models serve as base learners for ensembles. Meta-learner algorithms are subsequently applied to these base learners, and the ensemble performance on image classification problems is evaluated. Our results show that a stacked generalization ensemble of heterogeneous models is the most effective approach to image classification within OpenNAS.
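
A minimal sketch of stacked generalization under stated assumptions: the base learners are already-fitted Keras-style models whose predict returns class probabilities, and logistic regression stands in for the unspecified meta-learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_ensemble_predict(base_models, X_val, y_val, X_test):
    """Stacked generalization: a level-1 meta-learner is trained on the
    class-probability outputs of heterogeneous level-0 base models."""
    # Level-0: concatenate each base model's class probabilities.
    meta_train = np.hstack([m.predict(X_val) for m in base_models])
    meta_test = np.hstack([m.predict(X_test) for m in base_models])
    # Level-1: the meta-learner combines the base predictions.
    meta = LogisticRegression(max_iter=1000)
    meta.fit(meta_train, y_val)
    return meta.predict(meta_test)
```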


2021 ◽  
Author(s):  
Fakhrul Aniq Hakimi Nasrul ’Alam ◽  
Mohd. Ibrahim Shapiai ◽  
Uzma Batool ◽  
Ahmad Kamal Ramli ◽  
Khairil Ashraf Elias

Recognition of human behavior is critical in video monitoring, human-computer interaction, video comprehension and virtual reality. The key problem with behavior recognition in video surveillance is the high degree of variation between and within subjects. Numerous studies have suggested skeleton-based techniques, which are insensitive to the background, as a proven detection approach. The present state-of-the-art approaches to skeleton-based action recognition rely primarily on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Both methods take the dynamic human skeleton as the input to the network. We chose to handle skeleton data differently, relying solely on the skeleton joint coordinates as the input. The skeleton joints' positions are defined in (x, y) coordinates. In this paper, we investigated the incorporation of the Neural Oblivious Decision Ensemble (NODE) into our proposed action classifier network. The skeleton is extracted using a pose estimation technique based on the Residual Network (ResNet), which extracts the 2D skeleton of 18 joints for each detected body. The joint coordinates of the skeleton are stored in a table of rows and columns, where each row represents the positions of the joints. The structured data are fed into NODE for label prediction. With the proposed network, we obtain 97.5% accuracy on the RealWorld (HAR) dataset. Experimental results show that the proposed network outperforms one of the state-of-the-art approaches by 1.3%. In conclusion, NODE is a promising deep learning technique for structured data analysis compared to its machine learning counterparts such as the GBDT packages CatBoost and XGBoost.
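
The paper stores joint coordinates as rows and columns before feeding them to NODE. The sketch below shows one plausible flattening, assuming one row per frame with 36 coordinate columns (18 joints times x and y); the column names and per-frame layout are illustrative assumptions, not the paper's exact table.

```python
import numpy as np
import pandas as pd

JOINTS = 18  # 2D joints per detected body, as in the paper

def skeletons_to_table(sequences, labels):
    """Flatten skeleton sequences into a tabular dataset.

    sequences: list of arrays of shape (T_i, 18, 2) with (x, y) joints
    labels:    one action label per sequence

    Each row holds one frame's joint coordinates: columns
    x0, y0, ..., x17, y17 plus the action label.
    """
    cols = [f"{axis}{j}" for j in range(JOINTS) for axis in ("x", "y")]
    rows, ys = [], []
    for seq, label in zip(sequences, labels):
        for frame in seq:                   # frame: (18, 2)
            rows.append(frame.reshape(-1))  # -> 36 coordinate values
            ys.append(label)
    table = pd.DataFrame(rows, columns=cols)
    table["action"] = ys
    return table  # ready for a tabular learner such as NODE
```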


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Wanheng Liu ◽  
Ling Yin ◽  
Cong Wang ◽  
Fulin Liu ◽  
Zhiyu Ni

In this paper, a novel approach for constructing Chinese medical knowledge graphs, applied in smart healthcare based on IoT and WoT, is presented. It uses deep neural networks combined with self-attention to generate the medical knowledge graph, making it more convenient to perform disease diagnosis and provide treatment advice. Although great success has been achieved on medical knowledge graphs in recent studies, the issue of a comprehensive Chinese medical knowledge graph appropriate for telemedicine or mobile devices has been ignored. Our study is based on semantic mobile computing and deep learning. Several experiments demonstrate that the approach performs better at generating various types of Chinese medical knowledge graphs, with results similar to the state of the art. It also performs well in accuracy and comprehensiveness, which are much higher and highly consistent with the predictions of the theoretical model. Our work on Chinese medical knowledge graphs proves to be inspiring and encouraging, and can stimulate the development of smart healthcare.
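
For readers unfamiliar with the attention component, here is a minimal NumPy sketch of scaled dot-product self-attention, the generic building block the paper combines with deep networks. The projection matrices are assumed inputs; the paper's actual graph-generation architecture is not specified here.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over token embeddings.

    X:          (n, d) matrix of token embeddings
    Wq, Wk, Wv: (d, d_k) projection matrices (assumed given)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V  # contextualized token representations
```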


Author(s):  
Bogdan Alexandru Radulescu ◽  
Victorita Radulescu

Action recognition is a domain that gains interest along with the development of specific motion capture equipment, hardware and processing power. Its many applications in domains such as national security and behavior analysis make it even more popular among the scientific community, especially considering the ascending trend of machine learning methods. Approaches to solving real-life problems through human action recognition have become more interesting. There are mainly two approaches when attempting to build a classifier: using RGB images or sensor data, or, where possible, a combination of the two. Both methods have advantages, disadvantages and domains of use in real-life problems solvable through action recognition. Using RGB input makes it possible to deploy a classifier on almost any infrastructure without specialized equipment, whereas combining video with sensor data provides higher accuracy, albeit at a higher cost. Neural networks, and especially convolutional neural networks, are the starting point for human action recognition. By their nature, they can recognize spatial and temporal features very well, making them ideal for RGB images or sequences of RGB images. In the present paper, a convolutional neural network architecture based on 2D kernels is proposed. Its structure, along with metrics measuring its performance, advantages and disadvantages, is illustrated here. This solution based on 2D convolutions is fast, but has lower performance compared to other known solutions. The main problem when dealing with videos is extracting context from a sequence of frames. Video classification using 2D convolutional layers is realized either from the most significant frame or frame by frame, applying a probability distribution over the partial classes to obtain the final prediction. Classifying actions is difficult, especially when the differences between them are subtle and occupy only a small part of the overall image. When classifying via the key frames, the total accuracy obtained is around 10%. The other approach, classifying each frame individually, proved too computationally expensive for negligible gains.
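
A minimal sketch of the frame-by-frame strategy described above: a 2D CNN classifies each frame and the per-frame probability distributions are averaged into the video-level prediction. The frame_classifier callable is an assumed stand-in for the paper's unspecified model.

```python
import numpy as np

def classify_video(frames, frame_classifier, num_classes):
    """Frame-by-frame video classification with a 2D CNN.

    frames:           iterable of preprocessed RGB frames
    frame_classifier: callable mapping one frame to a probability
                      distribution over the action classes
    """
    probs = np.zeros(num_classes)
    n = 0
    for frame in frames:
        probs += frame_classifier(frame)  # per-frame class distribution
        n += 1
    probs /= max(n, 1)                    # average over all frames
    return int(np.argmax(probs)), probs
```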


Author(s):  
Tuan Hoang ◽  
Thanh-Toan Do ◽  
Tam V. Nguyen ◽  
Ngai-Man Cheung

This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations. First, to obtain low bit-width weights, most existing methods obtain the quantized weights by performing quantization on the full-precision network weights. However, this approach results in a mismatch: gradient descent updates the full-precision weights, but it does not update the quantized weights. To address this issue, we propose a novel method that enables direct updating of the quantized weights, with learnable quantization levels, to minimize the cost function using gradient descent. Second, to obtain low bit-width activations, existing works consider all channels equally. However, the activation quantizers could be biased toward a few channels with high variance. To address this issue, we propose a method that takes into account the quantization errors of individual channels. With this approach, we can learn activation quantizers that minimize the quantization errors in the majority of channels. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the image classification task, using AlexNet, ResNet and MobileNetV2 architectures on the CIFAR-100 and ImageNet datasets.
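
A minimal PyTorch sketch of the first idea under stated assumptions: a uniform quantizer with a learnable step size (a simplified stand-in for general learnable quantization levels), trained with the straight-through estimator so gradient descent updates flow through the quantization to both the weights and the step parameter. This illustrates the mechanism, not the authors' exact formulation.

```python
import torch

class LearnableQuantizer(torch.nn.Module):
    """Uniform weight quantizer with a learnable step size, trained
    with the straight-through estimator (STE)."""

    def __init__(self, bits=2, init_step=0.05):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1          # symmetric signed range
        self.step = torch.nn.Parameter(torch.tensor(init_step))

    def forward(self, w):
        q = torch.clamp(w / self.step, -self.qmax, self.qmax)
        # STE: round in the forward pass, identity in the backward pass.
        q = q + (torch.round(q) - q).detach()
        return q * self.step                     # quantized weights
```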


2020 ◽  
Vol 34 (10) ◽  
pp. 13943-13944
Author(s):  
Kira Vinogradova ◽  
Alexandr Dibrov ◽  
Gene Myers

Convolutional neural networks have become state-of-the-art in a wide range of image recognition tasks. The interpretation of their predictions, however, is an active area of research. Whereas various interpretation methods have been suggested for image classification, the interpretation of image segmentation still remains largely unexplored. To that end, we propose SEG-GRAD-CAM, a gradient-based method for interpreting semantic segmentation. Our method is an extension of the widely used Grad-CAM method, applied locally to produce heatmaps showing the relevance of individual pixels for semantic segmentation.
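
A minimal PyTorch sketch of the idea, under the assumption of a segmentation model that returns per-pixel class logits: the Grad-CAM recipe is applied to the summed logits of the target class over a pixel region of interest. This is an illustration of the approach, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def seg_grad_cam(model, image, target_class, conv_layer, mask=None):
    """Grad-CAM adapted to segmentation: backpropagate the summed
    logits of `target_class` over a region of interest and weight the
    chosen conv layer's activations by the pooled gradients.

    image: (1, 3, H, W) tensor; model output: (1, C, H, W) logits.
    """
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(
        lambda m, i, o: acts.setdefault("a", o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: grads.setdefault("g", go[0]))
    score = model(image)[0, target_class]        # (H, W) logits
    if mask is not None:
        score = score[mask]                      # restrict to a region
    score.sum().backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # pooled grads
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=image.shape[2:],
                         mode="bilinear", align_corners=False)[0, 0]
```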


Author(s):  
Chongwen Liu ◽  
Zhaowei Shang ◽  
Bo Lin ◽  
Yuan Yan Tang

Multi-task learning (MTL) methods consider learning a problem together with other related problems simultaneously. The major challenge of MTL is how to selectively screen the shared information: the information of each task must be related to the others, because sharing information between two unrelated tasks degrades the performance of both. Ensuring that the auxiliary problems are related to the main task is the most important point in MTL. In this paper, we design a novel algorithm that calculates the degrees of relationship among tasks using a semantic space of the features in each task, and then builds a semantic tree to achieve better learning performance. We propose an MTL method based on this algorithm which achieves good experimental performance. Our experiments cover both image classification and video action recognition, compared with state-of-the-art MTL methods. Our method achieves good performance on the four public datasets.
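
The paper's exact relatedness algorithm is not given here; the sketch below shows one plausible reading, assuming each task is summarized by a mean feature prototype, relatedness is cosine similarity, and the semantic tree comes from agglomerative clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def task_relatedness_tree(task_features):
    """Build a tree over tasks from the similarity of their features.

    task_features: dict mapping task name -> (n_i, d) feature matrix
    Returns the task names, the pairwise cosine similarities and an
    agglomerative-clustering tree that groups related tasks first.
    """
    names = list(task_features)
    protos = np.stack([task_features[t].mean(axis=0) for t in names])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    sim = protos @ protos.T                        # cosine similarity
    # Condensed distance vector (upper triangle of 1 - similarity).
    dists = 1.0 - sim[np.triu_indices(len(names), k=1)]
    tree = linkage(dists, method="average")
    return names, sim, tree
```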


2021 ◽  
Vol 11 (24) ◽  
pp. 11957
Author(s):  
Andrea Agiollo ◽  
Andrea Omicini

The application of Artificial Intelligence to the industrial world and its appliances has recently grown in popularity. Indeed, AI techniques are now becoming the de-facto technology for the resolution of complex tasks concerning computer vision, natural language processing and many other areas. In recent years, most of the research community's efforts have focused on increasing the performance of the most common AI techniques, e.g., Neural Networks, at the expense of their complexity. Indeed, many works in the AI field identify and propose hyper-efficient techniques targeting high-end devices. However, the application of such AI techniques to devices and appliances characterised by limited computational capabilities remains an open research issue. In the industrial world, this problem heavily affects low-end appliances, which are developed with a focus on saving costs and rely on computationally constrained components. While some efforts have been made in this area through the proposal of AI-simplification and AI-compression techniques, it is still relevant to study which available AI techniques can be used in modern constrained devices. Therefore, in this paper we propose a load classification task as a case study to analyse which state-of-the-art NN solutions can be embedded successfully into constrained industrial devices. The presented case study is tested on a simple microcontroller, characterised by very poor computational performance in terms of FLOPS, to mirror faithfully the design process of low-end appliances. A handful of NN models are tested, showing positive outcomes and possible limitations, and highlighting the complexity of AI embedding.
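
The paper does not name its embedding toolchain. As one plausible route from a trained network to such a constrained device, the sketch below converts a small Keras model to a fully int8-quantized TensorFlow Lite model; the model and calibration_samples names are placeholder assumptions.

```python
import tensorflow as tf

# `model` is a small trained Keras classifier and `calibration_samples`
# a few hundred representative inputs; both names are placeholders.
def representative_data():
    for sample in calibration_samples:
        yield [sample[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# The resulting flatbuffer can be compiled into firmware and run with
# the TensorFlow Lite Micro interpreter on the microcontroller.
with open("load_classifier.tflite", "wb") as f:
    f.write(converter.convert())
```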


Entropy ◽  
2021 ◽  
Vol 23 (12) ◽  
pp. 1692
Author(s):  
Lei Zhao ◽  
Mingcheng Zhang ◽  
Hongwei Ding ◽  
Xiaohui Cui

Significant progress has been made in generating counterfeit images and videos. Forged videos generated by deepfake techniques have spread widely and caused severe societal impacts, stirring up public concern about automatic deepfake detection technology. Recently, many deepfake detection methods based on forged features have been proposed. Among the popular forged features, textural features are widely used. However, most current texture-based detection methods extract textures directly from RGB images, ignoring the mature spectral analysis methods. Therefore, this research proposes a deepfake detection network fusing RGB features and textural information extracted by neural networks and signal processing methods, namely MFF-Net. Specifically, it consists of four key components: (1) a feature extraction module to further extract textural and frequency information using the Gabor convolution and residual attention blocks; (2) a texture enhancement module to zoom into the subtle textural features in shallow layers; (3) an attention module to force the classifier to focus on the forged part; (4) two instances of feature fusion, first fusing textural features from the shallow RGB branch and the feature extraction module, and then fusing the textural features and semantic information. Moreover, we introduce a new diversity loss to force the feature extraction module to learn features of different scales and directions. The experimental results show that MFF-Net has excellent generalization and has achieved state-of-the-art performance on various deepfake datasets.
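
As a sketch of the signal-processing side of the texture extraction, the snippet below builds a small Gabor filter bank with OpenCV and collects the filtered responses over several scales and orientations. The scales and orientations are illustrative assumptions, not the paper's MFF-Net parameters.

```python
import cv2
import numpy as np

def gabor_texture_maps(gray, scales=(7, 11), n_orient=4):
    """Extract textural feature maps with a bank of Gabor filters.

    gray: single-channel float32 image
    Returns one filtered response per (scale, orientation) pair.
    """
    responses = []
    for ksize in scales:
        for k in range(n_orient):
            theta = k * np.pi / n_orient        # filter orientation
            kernel = cv2.getGaborKernel(
                (ksize, ksize), sigma=ksize / 3.0, theta=theta,
                lambd=ksize / 2.0, gamma=0.5)
            responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses)                  # (scales * orients, H, W)
```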

