Action recognition based on 2D skeletons extracted from RGB videos

2019 ◽  
Vol 277 ◽  
pp. 02034
Author(s):  
Sophie Aubry ◽  
Sohaib Laraba ◽  
Joëlle Tilmanne ◽  
Thierry Dutoit

In this paper, a methodology to recognize actions from RGB videos is proposed which takes advantage of recent breakthroughs in deep learning. Following the development of Convolutional Neural Networks (CNNs), research was conducted on the transformation of skeletal motion data into 2D images. In this work, a solution is proposed that requires only RGB videos instead of RGB-D videos. It builds on multiple works studying the conversion of RGB-D data into 2D images. From a video stream (RGB images), a two-dimensional skeleton of 18 joints is extracted for each detected body with a DNN-based human pose estimator called OpenPose. The skeleton data are encoded into the Red, Green and Blue channels of images. Different ways of encoding motion data into images were studied. We successfully use state-of-the-art deep neural networks designed for image classification to recognize actions. Based on a study of the related works, we chose the image classification models SqueezeNet, AlexNet, DenseNet, ResNet, Inception and VGG, and retrained them to perform action recognition. All tests use the NTU RGB+D database. The highest accuracy is obtained with ResNet: 83.317% cross-subject and 88.780% cross-view, which outperforms most state-of-the-art results.
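
As a concrete illustration of this kind of encoding, the sketch below maps a sequence of OpenPose skeletons to an image whose Red and Green channels carry the normalized x and y joint coordinates. The paper studies several encodings; this particular mapping, including the use of the Blue channel for joint confidence, is only one plausible assumption, not the authors' exact scheme.

```python
import numpy as np

def encode_skeleton_sequence(joints, confidence=None):
    """Encode a 2D skeleton sequence as an RGB image for a CNN.

    joints:     array of shape (T, 18, 2) holding the (x, y) pixel
                coordinates of 18 OpenPose joints over T frames.
    confidence: optional (T, 18) per-joint detection scores.

    Returns a uint8 image of shape (18, T, 3): one row per joint and
    one column per frame, with x in Red, y in Green and (optionally)
    confidence in Blue.
    """
    T, J, _ = joints.shape
    img = np.zeros((J, T, 3), dtype=np.float32)
    # Normalize coordinates to [0, 1] over the whole sequence so the
    # encoding is invariant to the absolute position in the frame.
    mins = joints.reshape(-1, 2).min(axis=0)
    span = np.maximum(joints.reshape(-1, 2).max(axis=0) - mins, 1e-6)
    norm = (joints - mins) / span                # (T, J, 2)
    img[..., 0] = norm[..., 0].T * 255.0         # x -> Red
    img[..., 1] = norm[..., 1].T * 255.0         # y -> Green
    if confidence is not None:
        img[..., 2] = np.clip(confidence.T, 0.0, 1.0) * 255.0  # -> Blue
    return img.astype(np.uint8)
```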

Mathematics ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 624
Author(s):  
Stefan Rohrmanstorfer ◽  
Mikhail Komarov ◽  
Felix Mödritscher

With the ever-increasing amount of image data, it has become a necessity to automatically look for and process information in these images. As fashion is captured in images, the fashion sector provides the perfect foundation for a service or application built on an image classification model. In this article, the state of the art for image classification is analyzed and discussed. Based on the elaborated knowledge, four different approaches are implemented to extract features from fashion data. For this purpose, a human-worn fashion dataset with 2567 images was created and then significantly enlarged through image operations. The results show that convolutional neural networks are the undisputed standard for classifying images, and that TensorFlow is the best library to build them. Moreover, through the introduction of dropout layers, data augmentation and transfer learning, model overfitting was successfully prevented, and the validation accuracy on the created dataset was incrementally improved from an initial 69% to a final 84%. More distinctive apparel such as trousers, shoes and hats was classified better than other upper-body clothes.
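
A minimal TensorFlow/Keras sketch of the three overfitting countermeasures the article names (dropout, data augmentation, transfer learning). The backbone (MobileNetV2), image size and ten-class output are placeholders, since the article does not specify them here.

```python
import tensorflow as tf

# Pre-trained backbone with frozen weights (transfer learning).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
    # Data augmentation layers are active only during training.
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),                     # combats overfitting
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g., 10 garment classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```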


Author(s):  
Séamus Lankford ◽  
Diarmuid Grimes

The training and optimization of neural networks using pre-trained, super learner and ensemble approaches is explored. Neural networks, and in particular Convolutional Neural Networks (CNNs), are often optimized using default parameters. Neural Architecture Search (NAS) enables multiple architectures to be evaluated prior to selection of the optimal architecture. Our contribution is to develop, and make available to the community, a system that integrates open-source tools for the neural architecture search of image classification models (OpenNAS). OpenNAS takes any dataset of grayscale or RGB images and generates the optimal CNN architecture. Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO) and pre-trained models serve as base learners for ensembles. Meta-learner algorithms are subsequently applied to these base learners, and the ensemble performance on image classification problems is evaluated. Our results show that a stacked generalization ensemble of heterogeneous models is the most effective approach to image classification within OpenNAS.
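
A minimal sketch of stacked generalization under stated assumptions: the base learners are already-fitted Keras-style models whose predict returns class probabilities, and logistic regression stands in for the unspecified meta-learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_ensemble_predict(base_models, X_val, y_val, X_test):
    """Stacked generalization: a level-1 meta-learner is trained on the
    class-probability outputs of heterogeneous level-0 base models."""
    # Level-0: concatenate each base model's class probabilities.
    meta_train = np.hstack([m.predict(X_val) for m in base_models])
    meta_test = np.hstack([m.predict(X_test) for m in base_models])
    # Level-1: the meta-learner combines the base predictions.
    meta = LogisticRegression(max_iter=1000)
    meta.fit(meta_train, y_val)
    return meta.predict(meta_test)
```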


2021 ◽  
Author(s):  
Fakhrul Aniq Hakimi Nasrul ’Alam ◽  
Mohd. Ibrahim Shapiai ◽  
Uzma Batool ◽  
Ahmad Kamal Ramli ◽  
Khairil Ashraf Elias

Recognition of human behavior is critical in video monitoring, human-computer interaction, video comprehension and virtual reality. The key problem with behavior recognition in video surveillance is the high degree of variation between and within subjects. Numerous studies have suggested skeleton-based techniques, which are insensitive to the background, as a proven detection approach. The present state-of-the-art approaches to skeleton-based action recognition rely primarily on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Both methods take the dynamic human skeleton as the input to the network. We chose to handle skeleton data differently, relying solely on the skeleton joint coordinates as the input. The skeleton joints' positions are defined in (x, y) coordinates. In this paper, we investigated the incorporation of the Neural Oblivious Decision Ensemble (NODE) into our proposed action classifier network. The skeleton is extracted using a pose estimation technique based on the Residual Network (ResNet), which extracts the 2D skeleton of 18 joints for each detected body. The joint coordinates of the skeleton are stored in a table of rows and columns, where each row represents the positions of the joints. The structured data are fed into NODE for label prediction. With the proposed network, we obtain 97.5% accuracy on the RealWorld (HAR) dataset. Experimental results show that the proposed network outperforms one of the state-of-the-art approaches by 1.3%. In conclusion, NODE is a promising deep learning technique for structured data analysis compared to its machine learning counterparts such as the GBDT packages CatBoost and XGBoost.
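
The paper stores joint coordinates as rows and columns before feeding them to NODE. The sketch below shows one plausible flattening, assuming one row per frame with 36 coordinate columns (18 joints times x and y); the column names and per-frame layout are illustrative assumptions, not the paper's exact table.

```python
import numpy as np
import pandas as pd

JOINTS = 18  # 2D joints per detected body, as in the paper

def skeletons_to_table(sequences, labels):
    """Flatten skeleton sequences into a tabular dataset.

    sequences: list of arrays of shape (T_i, 18, 2) with (x, y) joints
    labels:    one action label per sequence

    Each row holds one frame's joint coordinates: columns
    x0, y0, ..., x17, y17 plus the action label.
    """
    cols = [f"{axis}{j}" for j in range(JOINTS) for axis in ("x", "y")]
    rows, ys = [], []
    for seq, label in zip(sequences, labels):
        for frame in seq:                   # frame: (18, 2)
            rows.append(frame.reshape(-1))  # -> 36 coordinate values
            ys.append(label)
    table = pd.DataFrame(rows, columns=cols)
    table["action"] = ys
    return table  # ready for a tabular learner such as NODE
```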


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Wanheng Liu ◽  
Ling Yin ◽  
Cong Wang ◽  
Fulin Liu ◽  
Zhiyu Ni

In this paper, a novel approach for constructing Chinese medical knowledge graphs, applied in smart healthcare based on IoT and WoT, is presented. It uses deep neural networks combined with self-attention to generate the medical knowledge graph, making it more convenient to perform disease diagnosis and provide treatment advice. Although great success has been achieved on medical knowledge graphs in recent studies, the issue of a comprehensive Chinese medical knowledge graph appropriate for telemedicine or mobile devices has been ignored. Our study is based on semantic mobile computing and deep learning. Several experiments demonstrate that the approach performs better at generating various types of Chinese medical knowledge graphs, with results similar to the state of the art. It also performs well in accuracy and comprehensiveness, which are much higher and highly consistent with the predictions of the theoretical model. Our work on Chinese medical knowledge graphs proves to be inspiring and encouraging, and can stimulate the development of smart healthcare.
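
For readers unfamiliar with the attention component, here is a minimal NumPy sketch of scaled dot-product self-attention, the generic building block the paper combines with deep networks. The projection matrices are assumed inputs; the paper's actual graph-generation architecture is not specified here.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over token embeddings.

    X:          (n, d) matrix of token embeddings
    Wq, Wk, Wv: (d, d_k) projection matrices (assumed given)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V  # contextualized token representations
```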


Author(s):  
Bogdan Alexandru Radulescu ◽  
Victorita Radulescu

Action recognition is a domain that gains interest along with the development of specific motion capture equipment, hardware and processing power. Its many applications in domains such as national security and behavior analysis make it even more popular among the scientific community, especially considering the ascending trend of machine learning methods. Approaches to solving real-life problems through human action recognition have become more interesting. There are mainly two approaches when attempting to build a classifier: using RGB images or sensor data, or, where possible, a combination of the two. Both methods have advantages, disadvantages and domains of use in real-life problems solvable through action recognition. Using RGB input makes it possible to deploy a classifier on almost any infrastructure without specialized equipment, whereas combining video with sensor data provides higher accuracy, albeit at a higher cost. Neural networks, and especially convolutional neural networks, are the starting point for human action recognition. By their nature, they can recognize spatial and temporal features very well, making them ideal for RGB images or sequences of RGB images. In the present paper, a convolutional neural network architecture based on 2D kernels is proposed. Its structure, along with metrics measuring its performance, advantages and disadvantages, is illustrated here. This solution based on 2D convolutions is fast, but has lower performance compared to other known solutions. The main problem when dealing with videos is extracting context from a sequence of frames. Video classification using 2D convolutional layers is realized either from the most significant frame or frame by frame, applying a probability distribution over the partial classes to obtain the final prediction. Classifying actions is difficult, especially when the differences between them are subtle and occupy only a small part of the overall image. When classifying via the key frames, the total accuracy obtained is around 10%. The other approach, classifying each frame individually, proved too computationally expensive for negligible gains.
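
A minimal sketch of the frame-by-frame strategy described above: a 2D CNN classifies each frame and the per-frame probability distributions are averaged into the video-level prediction. The frame_classifier callable is an assumed stand-in for the paper's unspecified model.

```python
import numpy as np

def classify_video(frames, frame_classifier, num_classes):
    """Frame-by-frame video classification with a 2D CNN.

    frames:           iterable of preprocessed RGB frames
    frame_classifier: callable mapping one frame to a probability
                      distribution over the action classes
    """
    probs = np.zeros(num_classes)
    n = 0
    for frame in frames:
        probs += frame_classifier(frame)  # per-frame class distribution
        n += 1
    probs /= max(n, 1)                    # average over all frames
    return int(np.argmax(probs)), probs
```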


Author(s):  
Tuan Hoang ◽  
Thanh-Toan Do ◽  
Tam V. Nguyen ◽  
Ngai-Man Cheung

This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations. First, to obtain low bit-width weights, most existing methods obtain the quantized weights by performing quantization on the full-precision network weights. However, this approach results in a mismatch: gradient descent updates the full-precision weights, but it does not update the quantized weights. To address this issue, we propose a novel method that enables direct updating of the quantized weights, with learnable quantization levels, to minimize the cost function using gradient descent. Second, to obtain low bit-width activations, existing works consider all channels equally. However, the activation quantizers could be biased toward a few channels with high variance. To address this issue, we propose a method that takes into account the quantization errors of individual channels. With this approach, we can learn activation quantizers that minimize the quantization errors in the majority of channels. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the image classification task, using AlexNet, ResNet and MobileNetV2 architectures on the CIFAR-100 and ImageNet datasets.
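
A minimal PyTorch sketch of the first idea under stated assumptions: a uniform quantizer with a learnable step size (a simplified stand-in for general learnable quantization levels), trained with the straight-through estimator so gradient descent updates flow through the quantization to both the weights and the step parameter. This illustrates the mechanism, not the authors' exact formulation.

```python
import torch

class LearnableQuantizer(torch.nn.Module):
    """Uniform weight quantizer with a learnable step size, trained
    with the straight-through estimator (STE)."""

    def __init__(self, bits=2, init_step=0.05):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1          # symmetric signed range
        self.step = torch.nn.Parameter(torch.tensor(init_step))

    def forward(self, w):
        q = torch.clamp(w / self.step, -self.qmax, self.qmax)
        # STE: round in the forward pass, identity in the backward pass.
        q = q + (torch.round(q) - q).detach()
        return q * self.step                     # quantized weights
```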


2020 ◽  
Vol 34 (10) ◽  
pp. 13943-13944
Author(s):  
Kira Vinogradova ◽  
Alexandr Dibrov ◽  
Gene Myers

Convolutional neural networks have become state-of-the-art in a wide range of image recognition tasks. The interpretation of their predictions, however, is an active area of research. Whereas various interpretation methods have been suggested for image classification, the interpretation of image segmentation still remains largely unexplored. To that end, we propose SEG-GRAD-CAM, a gradient-based method for interpreting semantic segmentation. Our method is an extension of the widely used Grad-CAM method, applied locally to produce heatmaps showing the relevance of individual pixels for semantic segmentation.
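
A minimal PyTorch sketch of the idea, under the assumption of a segmentation model that returns per-pixel class logits: the Grad-CAM recipe is applied to the summed logits of the target class over a pixel region of interest. This is an illustration of the approach, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def seg_grad_cam(model, image, target_class, conv_layer, mask=None):
    """Grad-CAM adapted to segmentation: backpropagate the summed
    logits of `target_class` over a region of interest and weight the
    chosen conv layer's activations by the pooled gradients.

    image: (1, 3, H, W) tensor; model output: (1, C, H, W) logits.
    """
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(
        lambda m, i, o: acts.setdefault("a", o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: grads.setdefault("g", go[0]))
    score = model(image)[0, target_class]        # (H, W) logits
    if mask is not None:
        score = score[mask]                      # restrict to a region
    score.sum().backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # pooled grads
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=image.shape[2:],
                         mode="bilinear", align_corners=False)[0, 0]
```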


Author(s):  
Chongwen Liu ◽  
Zhaowei Shang ◽  
Bo Lin ◽  
Yuan Yan Tang

Multi-task learning (MTL) methods consider learning a problem together with other related problems simultaneously. The major challenge of MTL is how to selectively screen the shared information: the information of each task must be related to the others, because sharing information between two unrelated tasks degrades the performance of both. Ensuring that the auxiliary problems are related to the main task is the most important point in MTL. In this paper, we design a novel algorithm that calculates the degrees of relationship among tasks using a semantic space of the features in each task, and then builds a semantic tree to achieve better learning performance. We propose an MTL method based on this algorithm which achieves good experimental performance. Our experiments cover both image classification and video action recognition, compared with state-of-the-art MTL methods. Our method achieves good performance on the four public datasets.
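
The paper's exact relatedness algorithm is not given here; the sketch below shows one plausible reading, assuming each task is summarized by a mean feature prototype, relatedness is cosine similarity, and the semantic tree comes from agglomerative clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def task_relatedness_tree(task_features):
    """Build a tree over tasks from the similarity of their features.

    task_features: dict mapping task name -> (n_i, d) feature matrix
    Returns the task names, the pairwise cosine similarities and an
    agglomerative-clustering tree that groups related tasks first.
    """
    names = list(task_features)
    protos = np.stack([task_features[t].mean(axis=0) for t in names])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    sim = protos @ protos.T                        # cosine similarity
    # Condensed distance vector (upper triangle of 1 - similarity).
    dists = 1.0 - sim[np.triu_indices(len(names), k=1)]
    tree = linkage(dists, method="average")
    return names, sim, tree
```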


2021 ◽  
Vol 11 (24) ◽  
pp. 11957
Author(s):  
Andrea Agiollo ◽  
Andrea Omicini

The application of Artificial Intelligence to the industrial world and its appliances has recently grown in popularity. Indeed, AI techniques are now becoming the de-facto technology for the resolution of complex tasks concerning computer vision, natural language processing and many other areas. In recent years, most of the research community's efforts have focused on increasing the performance of the most common AI techniques, e.g., Neural Networks, at the expense of their complexity. Indeed, many works in the AI field identify and propose hyper-efficient techniques targeting high-end devices. However, the application of such AI techniques to devices and appliances characterised by limited computational capabilities remains an open research issue. In the industrial world, this problem heavily affects low-end appliances, which are developed with a focus on saving costs and rely on computationally constrained components. While some efforts have been made in this area through the proposal of AI-simplification and AI-compression techniques, it is still relevant to study which available AI techniques can be used in modern constrained devices. Therefore, in this paper we propose a load classification task as a case study to analyse which state-of-the-art NN solutions can be embedded successfully into constrained industrial devices. The presented case study is tested on a simple microcontroller, characterised by very poor computational performance in terms of FLOPS, to mirror faithfully the design process of low-end appliances. A handful of NN models are tested, showing positive outcomes and possible limitations, and highlighting the complexity of AI embedding.
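
The paper does not name its embedding toolchain. As one plausible route from a trained network to such a constrained device, the sketch below converts a small Keras model to a fully int8-quantized TensorFlow Lite model; the model and calibration_samples names are placeholder assumptions.

```python
import tensorflow as tf

# `model` is a small trained Keras classifier and `calibration_samples`
# a few hundred representative inputs; both names are placeholders.
def representative_data():
    for sample in calibration_samples:
        yield [sample[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# The resulting flatbuffer can be compiled into firmware and run with
# the TensorFlow Lite Micro interpreter on the microcontroller.
with open("load_classifier.tflite", "wb") as f:
    f.write(converter.convert())
```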


Entropy ◽  
2021 ◽  
Vol 23 (12) ◽  
pp. 1692
Author(s):  
Lei Zhao ◽  
Mingcheng Zhang ◽  
Hongwei Ding ◽  
Xiaohui Cui

Significant progress has been made in generating counterfeit images and videos. Forged videos generated by deepfake techniques have spread widely and caused severe societal impacts, stirring up public concern about automatic deepfake detection technology. Recently, many deepfake detection methods based on forged features have been proposed. Among the popular forged features, textural features are widely used. However, most current texture-based detection methods extract textures directly from RGB images, ignoring the mature spectral analysis methods. Therefore, this research proposes a deepfake detection network fusing RGB features and textural information extracted by neural networks and signal processing methods, namely MFF-Net. Specifically, it consists of four key components: (1) a feature extraction module to further extract textural and frequency information using the Gabor convolution and residual attention blocks; (2) a texture enhancement module to zoom into the subtle textural features in shallow layers; (3) an attention module to force the classifier to focus on the forged part; (4) two instances of feature fusion, first fusing textural features from the shallow RGB branch and the feature extraction module, and then fusing the textural features and semantic information. Moreover, we introduce a new diversity loss to force the feature extraction module to learn features of different scales and directions. The experimental results show that MFF-Net has excellent generalization and has achieved state-of-the-art performance on various deepfake datasets.
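
As a sketch of the signal-processing side of the texture extraction, the snippet below builds a small Gabor filter bank with OpenCV and collects the filtered responses over several scales and orientations. The scales and orientations are illustrative assumptions, not the paper's MFF-Net parameters.

```python
import cv2
import numpy as np

def gabor_texture_maps(gray, scales=(7, 11), n_orient=4):
    """Extract textural feature maps with a bank of Gabor filters.

    gray: single-channel float32 image
    Returns one filtered response per (scale, orientation) pair.
    """
    responses = []
    for ksize in scales:
        for k in range(n_orient):
            theta = k * np.pi / n_orient        # filter orientation
            kernel = cv2.getGaborKernel(
                (ksize, ksize), sigma=ksize / 3.0, theta=theta,
                lambd=ksize / 2.0, gamma=0.5)
            responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses)                  # (scales * orients, H, W)
```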

