Skeleton-Based Gesture Recognition Using Several Fully Connected Layers with Path Signature Features and Temporal Transformer Module

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018585 ◽

2019 ◽

Vol 33 ◽

pp. 8585-8593 ◽

Cited By ~ 3

Author(s):

Chenyang Li ◽

Xin Zhang ◽

Lufan Liao ◽

Lianwen Jin ◽

Weixin Yang

Keyword(s):

Gesture Recognition ◽

Network Architecture ◽

Temporal Dynamics ◽

Classification Model ◽

Feature Descriptor ◽

Temporal Features ◽

Key Issues ◽

Motion Characteristics ◽

Fully Connected ◽

Global And Local

The skeleton based gesture recognition is gaining more popularity due to its wide possible applications. The key issues are how to extract discriminative features and how to design the classification model. In this paper, we first leverage a robust feature descriptor, path signature (PS), and propose three PS features to explicitly represent the spatial and temporal motion characteristics, i.e., spatial PS (S PS), temporal PS (T PS) and temporal spatial PS (T S PS). Considering the significance of fine hand movements in the gesture, we propose an ”attention on hand” (AOH) principle to define joint pairs for the S PS and select single joint for the T PS. In addition, the dyadic method is employed to extract the T PS and T S PS features that encode global and local temporal dynamics in the motion. Secondly, without the recurrent strategy, the classification model still faces challenges on temporal variation among different sequences. We propose a new temporal transformer module (TTM) that can match the sequence key frames by learning the temporal shifting parameter for each input. This is a learning-based module that can be included into standard neural network architecture. Finally, we design a multi-stream fully connected layer based network to treat spatial and temporal features separately and fused them together for the final result. We have tested our method on three benchmark gesture datasets, i.e., ChaLearn 2016, ChaLearn 2013 and MSRC-12. Experimental results demonstrate that we achieve the state-of-the-art performance on skeleton-based gesture recognition with high computational efficiency.

Download Full-text

Connectionist Temporal Classification Model for Dynamic Hand Gesture Recognition using RGB and Optical flow Data

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/4/8 ◽

2020 ◽

Vol 17 (4) ◽

pp. 497-506

Author(s):

Sunil Patel ◽

Ramji Makwana

Keyword(s):

Neural Network ◽

Optical Flow ◽

Gesture Recognition ◽

Hand Gesture Recognition ◽

Classification Model ◽

Hand Gesture ◽

Flow Data ◽

Dynamic Hand Gesture Recognition ◽

Connectionist Temporal Classification

Automatic classification of dynamic hand gesture is challenging due to the large diversity in a different class of gesture, Low resolution, and it is performed by finger. Due to a number of challenges many researchers focus on this area. Recently deep neural network can be used for implicit feature extraction and Soft Max layer is used for classification. In this paper, we propose a method based on a two-dimensional convolutional neural network that performs detection and classification of hand gesture simultaneously from multimodal Red, Green, Blue, Depth (RGBD) and Optical flow Data and passes this feature to Long-Short Term Memory (LSTM) recurrent network for frame-to-frame probability generation with Connectionist Temporal Classification (CTC) network for loss calculation. We have calculated an optical flow from Red, Green, Blue (RGB) data for getting proper motion information present in the video. CTC model is used to efficiently evaluate all possible alignment of hand gesture via dynamic programming and check consistency via frame-to-frame for the visual similarity of hand gesture in the unsegmented input stream. CTC network finds the most probable sequence of a frame for a class of gesture. The frame with the highest probability value is selected from the CTC network by max decoding. This entire CTC network is trained end-to-end with calculating CTC loss for recognition of the gesture. We have used challenging Vision for Intelligent Vehicles and Applications (VIVA) dataset for dynamic hand gesture recognition captured with RGB and Depth data. On this VIVA dataset, our proposed hand gesture recognition technique outperforms competing state-of-the-art algorithms and gets an accuracy of 86%

Download Full-text

Implementation of MQTT protocol based network architecture for smart factory

Proceedings of the Institution of Mechanical Engineers Part B Journal of Engineering Manufacture ◽

10.1177/09544054211014488 ◽

2021 ◽

pp. 095440542110144

Author(s):

Chia-Shin Yeh ◽

Shang-Liang Chen ◽

I-Ching Li

Keyword(s):

Machine Tool ◽

Gesture Recognition ◽

Manufacturing Process ◽

Network Architecture ◽

Information Management System ◽

Edge Computing ◽

Smart Factory ◽

Smart Manufacturing ◽

Industrial Iot ◽

Computing Module

The core concept of smart manufacturing is based on digitization to construct intelligent production and management in the manufacturing process. By digitizing the production process and connecting all levels from product design to service, the purpose of improving manufacturing efficiency, reducing production cost, enhancing product quality, and optimizing user experience can be achieved. To digitize the manufacturing process, IoT technology will have to be introduced into the manufacturing process to collect and analyze process information. However, one of the most important problems in building the industrial IoT (IIoT) environment is that different industrial network protocols are used for different equipment in factories. Therefore, the information in the manufacturing process may not be easily exchanged and obtained. To solve the above problem, a smart factory network architecture based on MQTT (MQ Telemetry Transport), IoT communication protocol, is proposed in this study, to construct a heterogeneous interface communication bridge between the machine tool, embedded device Raspberry Pi, and website. Finally, the system architecture is implemented and imported into the factory, and a smart manufacturing information management system is developed. The edge computing module is set up beside a three-axis machine tool, and a human-machine interface is built for the user controlling and monitoring. Users can also monitor the system through the dynamically updating website at any time and any place. The function of real-time gesture recognition based on image technology is developed and built on the edge computing module. The gesture recognition results can be transmitted to the machine controller through MQTT, and the machine will execute the corresponding action according to different gestures to achieve human-robot collaboration. The MQTT transmission architecture developed here is validated by the given edge computing application. It can serve as the basis for the construction of the IIoT environment, assist the traditional manufacturing industry to prepare for digitization, and accelerate the practice of smart manufacturing.

Download Full-text

Enhanced Convolutional-Neural-Network Architecture for Crop Classification

Applied Sciences ◽

10.3390/app11094292 ◽

2021 ◽

Vol 11 (9) ◽

pp. 4292

Author(s):

Mónica Y. Moreno-Revelo ◽

Lorena Guachi-Guachi ◽

Juan Bernardo Gómez-Mendoza ◽

Javier Revelo-Fuelagán ◽

Diego H. Peluffo-Ordóñez

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Network Architecture ◽

Classification Model ◽

Classification Error ◽

Small Scale ◽

Post Processing ◽

Average Accuracy ◽

Processing Step ◽

Crop Classification

Automatic crop identification and monitoring is a key element in enhancing food production processes as well as diminishing the related environmental impact. Although several efficient deep learning techniques have emerged in the field of multispectral imagery analysis, the crop classification problem still needs more accurate solutions. This work introduces a competitive methodology for crop classification from multispectral satellite imagery mainly using an enhanced 2D convolutional neural network (2D-CNN) designed at a smaller-scale architecture, as well as a novel post-processing step. The proposed methodology contains four steps: image stacking, patch extraction, classification model design (based on a 2D-CNN architecture), and post-processing. First, the images are stacked to increase the number of features. Second, the input images are split into patches and fed into the 2D-CNN model. Then, the 2D-CNN model is constructed within a small-scale framework, and properly trained to recognize 10 different types of crops. Finally, a post-processing step is performed in order to reduce the classification error caused by lower-spatial-resolution images. Experiments were carried over the so-named Campo Verde database, which consists of a set of satellite images captured by Landsat and Sentinel satellites from the municipality of Campo Verde, Brazil. In contrast to the maximum accuracy values reached by remarkable works reported in the literature (amounting to an overall accuracy of about 81%, a f1 score of 75.89%, and average accuracy of 73.35%), the proposed methodology achieves a competitive overall accuracy of 81.20%, a f1 score of 75.89%, and an average accuracy of 88.72% when classifying 10 different crops, while ensuring an adequate trade-off between the number of multiply-accumulate operations (MACs) and accuracy. Furthermore, given its ability to effectively classify patches from two image sequences, this methodology may result appealing for other real-world applications, such as the classification of urban materials.

Download Full-text

Video-Based Person Re-Identification by an End-To-End Learning Architecture with Hybrid Deep Appearance-Temporal Feature

Sensors ◽

10.3390/s18113669 ◽

2018 ◽

Vol 18 (11) ◽

pp. 3669 ◽

Cited By ~ 3

Author(s):

Rui Sun ◽

Qiheng Huang ◽

Miaomiao Xia ◽

Jun Zhang

Keyword(s):

Feature Representation ◽

Visual Sensor ◽

Stream Structure ◽

Temporal Features ◽

Feature Structure ◽

End To End ◽

Global And Local ◽

Siamese Networks ◽

Appearance Feature ◽

Temporal Feature

Video-based person re-identification is an important task with the challenges of lighting variation, low-resolution images, background clutter, occlusion, and human appearance similarity in the multi-camera visual sensor networks. In this paper, we propose a video-based person re-identification method called the end-to-end learning architecture with hybrid deep appearance-temporal feature. It can learn the appearance features of pivotal frames, the temporal features, and the independent distance metric of different features. This architecture consists of two-stream deep feature structure and two Siamese networks. For the first-stream structure, we propose the Two-branch Appearance Feature (TAF) sub-structure to obtain the appearance information of persons, and used one of the two Siamese networks to learn the similarity of appearance features of a pairwise person. To utilize the temporal information, we designed the second-stream structure that consisting of the Optical flow Temporal Feature (OTF) sub-structure and another Siamese network, to learn the person’s temporal features and the distances of pairwise features. In addition, we select the pivotal frames of video as inputs to the Inception-V3 network on the Two-branch Appearance Feature sub-structure, and employ the salience-learning fusion layer to fuse the learned global and local appearance features. Extensive experimental results on the PRID2011, iLIDS-VID, and Motion Analysis and Re-identification Set (MARS) datasets showed that the respective proposed architectures reached 79%, 59% and 72% at Rank-1 and had advantages over state-of-the-art algorithms. Meanwhile, it also improved the feature representation ability of persons.

Download Full-text

A novel Krawtchouk moment zonal feature descriptor for user-independent static hand gesture recognition

2016 IEEE Region 10 Conference (TENCON) ◽

10.1109/tencon.2016.7848027 ◽

2016 ◽

Cited By ~ 1

Author(s):

Subhamoy Chatterjee ◽

Piyush Bhandari ◽

MaheshKumar H. Kolekar

Keyword(s):

Gesture Recognition ◽

Hand Gesture Recognition ◽

Hand Gesture ◽

Feature Descriptor

Download Full-text

Enhancing Intelligent Anemia Detection via Unifying Global and Local Views of Conjunctiva Image with Two-Branch Neural Networks

10.21203/rs.3.rs-1170958/v1 ◽

2022 ◽

Author(s):

Lijuan Zheng ◽

Shaopeng Liu ◽

Senping Tian ◽

Jianhua Guo ◽

Xinpeng Wang ◽

...

Keyword(s):

Network Architecture ◽

Clinical Symptoms ◽

Detection Methods ◽

Testing Methods ◽

Neural Network Architecture ◽

Real World Application ◽

Algorithmic Framework ◽

Comparable Accuracy ◽

Global And Local ◽

Detection Effect

Abstract Anemia is one of the most widespread clinical symptoms all over the world, which could bring adverse effects on people's daily life and work. Considering the universality of anemia detection and the inconvenience of traditional blood testing methods, many deep learning detection methods based on image recognition have been developed in recent years, including the methods of anemia detection with individuals’ images of conjunctiva. However, existing methods using one single conjunctiva image could not reach comparable accuracy in anemia detection in many real-world application scenarios. To enhance intelligent anemia detection using conjunctiva images, we proposed a new algorithmic framework which could make full use of the data information contained in the image. To be concrete, we proposed to fully explore the global and local information in the image, and adopted a two-branch neural network architecture to unify the information of these two aspects. Compared with the existing methods, our method can fully explore the information contained in a single conjunctiva image and achieve more reliable anemia detection effect. Compared with other existing methods, the experimental results verified the effectiveness of the new algorithm.

Download Full-text

A High-robustness and Low Resource-consumption Crowd Counting Model

International Journal of Circuits, Systems and Signal Processing ◽

10.46300/9106.2021.15.6 ◽

2021 ◽

Vol 15 ◽

pp. 46-54

Author(s):

Han Jia ◽

Xuecheng Zou

Keyword(s):

Network Architecture ◽

Resource Consumption ◽

Portable Devices ◽

Estimation Errors ◽

Crowd Counting ◽

Density Map ◽

Almost All ◽

Fully Connected ◽

Crowded Scenes ◽

Counting Model

A major problem of counting high-density crowded scenes is the lack of flexibility and robustness exhibited by existing methods, and almost all recent state-of-the-art methods only show good performance in estimation errors and density map quality for select datasets. The biggest challenge faced by these methods is the analysis of similar features between the crowd and background, as well as overlaps between individuals. Hence, we propose a light and easy-to-train network for congestion cognition based on dilated convolution, which can exponentially enlarge the receptive field, preserve original resolution, and generate a high-quality density map. With the dilated convolutional layers, the counting accuracy can be enhanced as the feature map keeps its original resolution. By removing fully-connected layers, the network architecture becomes more concise, thereby reducing resource consumption significantly. The flexibility and robustness improvements of the proposed network compared to previous methods were validated using the variance of data size and different overlap levels of existing open source datasets. Experimental results showed that the proposed network is suitable for transfer learning on different datasets and enhances crowd counting in highly congested scenes. Therefore, the network is expected to have broader applications, for example in Internet of Things and portable devices.

Download Full-text

Diagnosis of hearing deficiency using EEG based AEP signals: CWT and improved-VGG16 pipeline

PeerJ Computer Science ◽

10.7717/peerj-cs.638 ◽

2021 ◽

Vol 7 ◽

pp. e638

Author(s):

Md Nahidul Islam ◽

Norizam Sulaiman ◽

Fahmid Al Farid ◽

Jia Uddin ◽

Salem A. Alyami ◽

...

Keyword(s):

Hearing Loss ◽

Network Architecture ◽

Auditory Evoked Potential ◽

Human Communication ◽

Time Frequency ◽

The Neural Network ◽

Cortex Area ◽

Wide Range ◽

Functional Reliability ◽

Fully Connected

Hearing deficiency is the world’s most common sensation of impairment and impedes human communication and learning. Early and precise hearing diagnosis using electroencephalogram (EEG) is referred to as the optimum strategy to deal with this issue. Among a wide range of EEG control signals, the most relevant modality for hearing loss diagnosis is auditory evoked potential (AEP) which is produced in the brain’s cortex area through an auditory stimulus. This study aims to develop a robust intelligent auditory sensation system utilizing a pre-train deep learning framework by analyzing and evaluating the functional reliability of the hearing based on the AEP response. First, the raw AEP data is transformed into time-frequency images through the wavelet transformation. Then, lower-level functionality is eliminated using a pre-trained network. Here, an improved-VGG16 architecture has been designed based on removing some convolutional layers and adding new layers in the fully connected block. Subsequently, the higher levels of the neural network architecture are fine-tuned using the labelled time-frequency images. Finally, the proposed method’s performance has been validated by a reputed publicly available AEP dataset, recorded from sixteen subjects when they have heard specific auditory stimuli in the left or right ear. The proposed method outperforms the state-of-art studies by improving the classification accuracy to 96.87% (from 57.375%), which indicates that the proposed improved-VGG16 architecture can significantly deal with AEP response in early hearing loss diagnosis.

Download Full-text

Research on key issues of gesture recognition for artificial intelligence

Soft Computing ◽

10.1007/s00500-019-04342-3 ◽

2019 ◽

Vol 24 (8) ◽

pp. 5795-5803 ◽

Cited By ~ 1

Author(s):

Taiping Mo ◽

Peng Sun

Keyword(s):

Artificial Intelligence ◽

Gesture Recognition ◽

Key Issues

Download Full-text

Efficient Semantic Segmentation Using Multi-Path Decoder

Applied Sciences ◽

10.3390/app10186386 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6386

Author(s):

Xing Bai ◽

Jun Zhou

Keyword(s):

Neural Network ◽

Real Time ◽

Network Architecture ◽

Resource Constraints ◽

Cost Effective ◽

Semantic Segmentation ◽

Classification Model ◽

Neural Network Architecture ◽

Great Progress ◽

Different Types

Benefiting from the booming of deep learning, the state-of-the-art models achieved great progress. But they are huge in terms of parameters and floating point operations, which makes it hard to apply them to real-time applications. In this paper, we propose a novel deep neural network architecture, named MPDNet, for fast and efficient semantic segmentation under resource constraints. First, we use a light-weight classification model pretrained on ImageNet as the encoder. Second, we use a cost-effective upsampling datapath to restore prediction resolution and convert features for classification into features for segmentation. Finally, we propose to use a multi-path decoder to extract different types of features, which are not ideal to process inside only one convolutional neural network. The experimental results of our model outperform other models aiming at real-time semantic segmentation on Cityscapes. Based on our proposed MPDNet, we achieve 76.7% mean IoU on Cityscapes test set with only 118.84GFLOPs and achieves 37.6 Hz on 768 × 1536 images on a standard GPU.

Download Full-text