High Performance Gesture Recognition via Effective and Efficient Temporal Modeling

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/141 ◽

2019 ◽

Author(s):

Yang Yi ◽

Feng Ni ◽

Yuexin Ma ◽

Xinge Zhu ◽

Yuankai Qi ◽

...

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Gesture Recognition ◽

High Performance ◽

Short Term Memory ◽

State Of The Art ◽

Computational Cost ◽

Temporal Modeling ◽

Spatiotemporal Features ◽

Public Datasets

State-of-the-art hand gesture recognition methods have investigated spatiotemporal features based on 3D convolutional neural networks (3DCNNs) or convolutional long short-term memory (ConvLSTM). However, they often suffer from inefficiency due to the high computational complexity of their network structures. In this paper, we focus instead on 1D convolutional neural networks and propose a simple and efficient architectural unit, the Multi-Kernel Temporal Block (MKTB), that models multi-scale temporal responses by explicitly applying different temporal kernels. Then, we present a Global Refinement Block (GRB), which is an attention module for shaping the global temporal features based on cross-channel similarity. By incorporating the MKTB and GRB, our architecture can effectively explore spatiotemporal features at a tolerable computational cost. Extensive experiments conducted on public datasets demonstrate that our proposed model achieves state-of-the-art results with higher efficiency. Moreover, the proposed MKTB and GRB are plug-and-play modules, and experiments on other tasks, such as video understanding and video-based person re-identification, also demonstrate their efficiency and capability of generalization.
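The idea of applying several temporal kernels of different sizes to the same sequence can be illustrated with a minimal pure-Python sketch. It is not the authors' MKTB: the kernel sizes (1, 3, 5), the fixed averaging weights, and the summation of branches are all assumptions made for illustration, whereas a real block would use learned weights.

```python
def temporal_conv(seq, kernel):
    """1D 'same' convolution over the time axis with zero padding."""
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out


def multi_kernel_block(seq, kernel_sizes=(1, 3, 5)):
    """Hypothetical multi-scale block: one averaging kernel per size, branches summed."""
    branches = [temporal_conv(seq, [1.0 / k] * k) for k in kernel_sizes]
    return [sum(vals) for vals in zip(*branches)]


# A single impulse in time: each kernel size spreads it over a different range.
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
response = multi_kernel_block(signal)
```

The short kernel keeps the impulse sharp while the longer kernels respond to it from further away in time, which is the sense in which the branches capture different temporal scales.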

Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks

Applied Sciences ◽

10.3390/app112411738 ◽

2021 ◽

Vol 11 (24) ◽

pp. 11738

Author(s):

Thomas Teixeira ◽

Éric Granger ◽

Alessandro Lameiras Koerich

Keyword(s):

Neural Networks ◽

Facial Expression ◽

Emotion Recognition ◽

Facial Expressions ◽

Convolutional Neural Networks ◽

Affective Computing ◽

Spatial Information ◽

Short Term Memory ◽

State Of The Art ◽

Fine Tuning

Facial expressions are one of the most powerful ways to depict specific patterns in human behavior and describe the human emotional state. However, despite the impressive advances of affective computing over the last decade, automatic video-based systems for facial expression recognition still cannot correctly handle variations in facial expression among individuals as well as cross-cultural and demographic aspects. Nevertheless, recognizing facial expressions is a difficult task, even for humans. This paper investigates the suitability of state-of-the-art deep learning architectures based on convolutional neural networks (CNNs) to deal with long video sequences captured in the wild for continuous emotion recognition. To this end, several 2D CNN models that were designed to model spatial information are extended to allow spatiotemporal representation learning from videos, considering a complex and multi-dimensional emotion space, where continuous values of valence and arousal must be predicted. We have developed and evaluated convolutional recurrent neural networks, which combine 2D CNNs and long short-term memory units, and inflated 3D CNN models, which are built by inflating the weights of a pre-trained 2D CNN model during fine-tuning, using application-specific videos. Experimental results on the challenging SEWA-DB dataset have shown that these architectures can effectively be fine-tuned to encode spatiotemporal information from successive raw pixel images and achieve state-of-the-art results on such a dataset.
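Weight inflation can be sketched in a few lines. The version below assumes the common recipe of repeating a 2D kernel along the time axis and dividing by the temporal depth, so that a clip of identical frames produces the same response as the original 2D filter on one frame. The abstract does not state which inflation scheme the authors use, and all names and values here are illustrative.

```python
def inflate_kernel(kernel_2d, depth):
    """Repeat a 2D kernel `depth` times along time, scaling each copy by 1/depth."""
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def apply_2d(kernel, patch):
    """Response of a 2D kernel on a same-sized image patch."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))


def apply_3d(kernel_3d, clip):
    """Response of a 3D kernel on a same-sized clip (list of frames)."""
    return sum(apply_2d(k2d, frame) for k2d, frame in zip(kernel_3d, clip))


kernel = [[1.0, -1.0], [0.5, 2.0]]   # stand-in for a pre-trained 2D filter
frame = [[3.0, 1.0], [4.0, 2.0]]
static_clip = [frame] * 3            # a "video" made of three identical frames

resp_2d = apply_2d(kernel, frame)
resp_3d = apply_3d(inflate_kernel(kernel, 3), static_clip)
```

Under this assumption the inflated network starts fine-tuning from a point where it behaves like the 2D model on static input, and only then learns temporal structure.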

A comparison of convolutional neural networks for Kazakh sign language recognition

Eastern-European Journal of Enterprise Technologies ◽

10.15587/1729-4061.2021.241535 ◽

2021 ◽

Vol 5 (2 (113)) ◽

pp. 44-54

Author(s):

Chingiz Kenshimov ◽

Samat Mukhanov ◽

Timur Merembayev ◽

Didar Yedilkhan

Keyword(s):

Neural Networks ◽

Sign Language ◽

Convolutional Neural Networks ◽

Gesture Recognition ◽

State Of The Art ◽

Hand Gesture ◽

Language Recognition ◽

Sign Language Recognition ◽

Important Means ◽

Complex Relationships

For people with disabilities, sign language is the most important means of communication. Therefore, a growing number of researchers around the world are proposing solutions based on intelligent hand gesture recognition systems. Such a system is aimed not only at those who wish to understand a sign language, but also at those who wish to speak using gesture recognition software. In this paper, a new benchmark dataset for Kazakh fingerspelling, suitable for training deep neural networks, is introduced. The dataset contains more than 10122 gesture samples for the 42 letters of the alphabet. The alphabet has its own peculiarities, as some characters are shown in motion, which may influence sign recognition. Research and analysis of convolutional neural networks, comparison, testing, results and analysis of LeNet, AlexNet, ResNet and EffectiveNet – EfficientNetB7 methods are described in the paper. The EffectiveNet architecture is state-of-the-art (SOTA) and is the newest of the architectures under consideration. On this dataset, we showed that the LeNet and EffectiveNet networks outperform other competing algorithms. Moreover, EffectiveNet can achieve state-of-the-art performance on other hand gesture datasets. The architecture and operation principle of these algorithms reflect the effectiveness of their application in sign language recognition. The CNN models are evaluated using accuracy and the penalty matrix. During training epochs, LeNet and EffectiveNet showed better results: accuracy and loss function had similar and close trends. The results of EffectiveNet were explained with the tools of the SHapley Additive exPlanations (SHAP) framework. SHAP explored the model to detect complex relationships between features in the images. Focusing on the SHAP tool may help to further improve the accuracy of the model.

Using 3D Convolutional Neural Networks to Learn Spatiotemporal Features for Automatic Surgical Gesture Recognition in Video

Lecture Notes in Computer Science - Medical Image Computing and Computer Assisted Intervention – MICCAI 2019 ◽

10.1007/978-3-030-32254-0_52 ◽

2019 ◽

pp. 467-475 ◽

Cited By ~ 4

Author(s):

Isabel Funke ◽

Sebastian Bodenstedt ◽

Florian Oehme ◽

Felix von Bechtolsheim ◽

Jürgen Weitz ◽

...

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Gesture Recognition ◽

Spatiotemporal Features

Direct micro-seismic event location and characterization from passive seismic data using convolutional neural networks

Geophysics ◽

10.1190/geo2020-0636.1 ◽

2021 ◽

pp. 1-77

Author(s):

Hanchen Wang ◽

Tariq Alkhalifah

Keyword(s):

Neural Network ◽

Neural Networks ◽

Convolutional Neural Networks ◽

Event Detection ◽

Seismic Data ◽

High Performance ◽

Input Data ◽

Waveform Inversion ◽

Computational Cost ◽

Seismic Events

The ample size of time-lapse data often requires significant event detection and source location efforts, especially in areas like shale gas exploration regions where a large number of micro-seismic events are often recorded. In many cases, the real-time monitoring and locating of these events are essential to production decisions. Conventional methods face considerable drawbacks. For example, traveltime-based methods require traveltime picking of often noisy data, while migration and waveform inversion methods require expensive wavefield solutions and event detection. Both tasks require some human intervention, and this becomes a big problem when too many sources need to be located, which is common in micro-seismic monitoring. Machine learning has recently been used to identify micro-seismic events or locate their sources once they are identified and picked. We propose to use a novel artificial neural network framework to directly map seismic data, without any event picking or detection, to their potential source locations. We train two convolutional neural networks on labeled synthetic acoustic data containing simulated micro-seismic events to fulfill such requirements. One convolutional neural network, which has a global average pooling layer to reduce the computational cost while maintaining high-performance levels, aims to classify the number of events in the data. The other network predicts the source locations and other source features such as the source peak frequencies and amplitudes. To reduce the size of the input data to the network, we correlate the recorded traces with a central reference trace to allow the network to focus on the curvature of the input data near the zero-lag region. We train the networks to handle single, multi, and no event segments extracted from the data. Tests on a simple vertical varying model and a more realistic Otway field model demonstrate the approach's versatility and potential.
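The preprocessing step of correlating each recorded trace with a reference trace can be sketched as follows. This is a minimal illustration on toy data, not the authors' pipeline: the function name, the `max_lag` window, and the sample values are assumptions. The point it shows is that the correlation peaks at the lag equal to the arrival-time difference between the two traces, so the informative part of the result sits near zero lag.

```python
def cross_correlate(trace, reference, max_lag):
    """Return {lag: sum_t trace[t + lag] * reference[t]} for |lag| <= max_lag."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for t in range(len(reference)):
            s = t + lag
            if 0 <= s < len(trace):
                total += trace[s] * reference[t]
        result[lag] = total
    return result


reference = [0, 0, 1, 2, 1, 0, 0, 0]   # toy central reference trace
delayed = [0, 0, 0, 0, 1, 2, 1, 0]     # same wavelet arriving 2 samples later

xcorr = cross_correlate(delayed, reference, max_lag=3)
best_lag = max(xcorr, key=xcorr.get)   # lag with the strongest correlation
```

Keeping only a short window of lags per trace is what shrinks the network input; across many receivers, the peak lags trace out the moveout curvature that the abstract refers to.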

Groundwater level forecasting with artificial neural networks: a comparison of long short-term memory (LSTM), convolutional neural networks (CNNs), and non-linear autoregressive networks with exogenous input (NARX)

Hydrology and Earth System Sciences ◽

10.5194/hess-25-1671-2021 ◽

2021 ◽

Vol 25 (3) ◽

pp. 1671-1687

Author(s):

Andreas Wunsch ◽

Tanja Liesch ◽

Stefan Broda

Keyword(s):

Neural Networks ◽

Artificial Neural Networks ◽

Convolutional Neural Networks ◽

Groundwater Level ◽

Short Term Memory ◽

State Of The Art ◽

Short Term ◽

Term Memory ◽

Non Linear ◽

Long Short Term Memory

It is now well established to use shallow artificial neural networks (ANNs) to obtain accurate and reliable groundwater level forecasts, which are an important tool for sustainable groundwater management. However, we observe an increasing shift from conventional shallow ANNs to state-of-the-art deep-learning (DL) techniques, but a direct comparison of the performance is often lacking. Although they have already clearly proven their suitability, shallow recurrent networks frequently seem to be excluded from the study design due to the euphoria about new DL techniques and its successes in various disciplines. Therefore, we aim to provide an overview on the predictive ability in terms of groundwater levels of shallow conventional recurrent ANNs, namely non-linear autoregressive networks with exogenous input (NARX) and popular state-of-the-art DL techniques such as long short-term memory (LSTM) and convolutional neural networks (CNNs). We compare the performance on both sequence-to-value (seq2val) and sequence-to-sequence (seq2seq) forecasting on a 4-year period while using only few, widely available and easy to measure meteorological input parameters, which makes our approach widely applicable. Further, we also investigate the data dependency in terms of time series length of the different ANN architectures. For seq2val forecasts, NARX models on average perform best; however, CNNs are much faster and only slightly worse in terms of accuracy. For seq2seq forecasts, mostly NARX outperform both DL models and even almost reach the speed of CNNs. However, NARX are the least robust against initialization effects, which nevertheless can be handled easily using ensemble forecasting. We showed that shallow neural networks, such as NARX, should not be neglected in comparison to DL techniques especially when only small amounts of training data are available, where they can clearly outperform LSTMs and CNNs; however, LSTMs and CNNs might perform substantially better with a larger dataset, where DL really can demonstrate its strengths, which is rarely available in the groundwater domain though.
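The difference between sequence-to-value and sequence-to-sequence forecasting comes down to how training samples are cut from a time series. The sketch below is a simplified illustration: it windows a single toy series, whereas the study predicts groundwater levels from meteorological inputs, and the function name and window lengths are assumptions.

```python
def make_windows(series, n_in, n_out):
    """Slice a series into (input window, target window) pairs."""
    pairs = []
    for start in range(len(series) - n_in - n_out + 1):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        pairs.append((x, y))
    return pairs


levels = [10, 11, 12, 13, 14, 15]

# seq2val: each input window predicts a single next value.
seq2val = make_windows(levels, n_in=3, n_out=1)

# seq2seq: each input window predicts several future values at once.
seq2seq = make_windows(levels, n_in=3, n_out=2)
```

Note that longer targets yield fewer samples from the same series, which matters when training data are scarce.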

Hybrid pooling with wavelets for convolutional neural networks

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219223 ◽

2022 ◽

pp. 1-10

Author(s):

Daniel Trevino-Sanchez ◽

Vicente Alarcon-Aquino

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Computational Cost ◽

Relevant Information ◽

Accuracy Improvement ◽

Proposed Model ◽

Benchmark Datasets ◽

Augmentation Techniques ◽

High Computational Cost

Detecting and classifying objects correctly is a constant challenge; recognizing them at different scales and in different scenarios, sometimes cropped or badly lit, is not an easy task. Convolutional neural networks (CNNs) have become a widely applied technique since they are completely trainable and suitable for extracting features. However, the growing number of CNN applications constantly pushes for improvements in their accuracy. Initially, those improvements involved the use of large datasets, augmentation techniques, and complex algorithms. These methods may have a high computational cost. Nevertheless, feature extraction is known to be the heart of the problem. As a result, other approaches combine different technologies to extract better features and improve accuracy without the need for more powerful hardware resources. In this paper, we propose a hybrid pooling method that incorporates multiresolution analysis within the CNN layers to reduce the feature map size without losing details. To prevent relevant information from being lost during the downsampling process, an existing pooling method is combined with the wavelet transform technique, keeping those details "alive" and enriching other stages of the CNN. Achieving better quality characteristics improves CNN accuracy. To validate this study, ten pooling methods, including the proposed model, are tested using four benchmark datasets. The results are compared with four of the evaluated methods, which are also considered state-of-the-art.

LdsConv: Learned Depthwise Separable Convolutions by Group Pruning

Sensors ◽

10.3390/s20154349 ◽

2020 ◽

Vol 20 (15) ◽

pp. 4349

Author(s):

Wenxiang Lin ◽

Yan Ding ◽

Hua-Liang Wei ◽

Xinglin Pan ◽

Yutong Zhang

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Computational Cost ◽

The State ◽

Direct Replacement ◽

Improved Accuracy ◽

Pruning Technique ◽

Strong Capacity

Standard convolutional filters usually capture unnecessary overlap of features resulting in a waste of computational cost. In this paper, we aim to solve this problem by proposing a novel Learned Depthwise Separable Convolution (LdsConv) operation that is smart but has a strong capacity for learning. It integrates the pruning technique into the design of convolutional filters, formulated as a generic convolutional unit that can be used as a direct replacement of convolutions without any adjustments of the architecture. To show the effectiveness of the proposed method, experiments are carried out using the state-of-the-art convolutional neural networks (CNNs), including ResNet, DenseNet, SE-ResNet and MobileNet, respectively. The results show that by simply replacing the original convolution with LdsConv in these CNNs, it can achieve a significantly improved accuracy while reducing computational cost. For the case of ResNet50, the FLOPs can be reduced by 40.9%, meanwhile the accuracy on the associated ImageNet increases.
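For context on why depthwise separable convolutions are cheaper, the multiply-accumulate counts of a standard convolution and a generic depthwise separable one can be compared directly. This is the textbook cost model, not the LdsConv operation itself (which is learned by group pruning). The layer shape is hypothetical, and stride 1 with an output of the same spatial size is assumed.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a k x k standard convolution on an h x w output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 output, 64 -> 128 channels, 3 x 3 kernel.
std = standard_conv_macs(56, 56, 64, 128, 3)
sep = depthwise_separable_macs(56, 56, 64, 128, 3)
ratio = sep / std   # equals 1/c_out + 1/k**2
```

For this shape the separable form needs roughly 12% of the operations of the standard convolution; the ratio depends only on the output channel count and the kernel size.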

Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition

Pattern Recognition ◽

10.1016/j.patcog.2017.10.033 ◽

2018 ◽

Vol 76 ◽

pp. 80-94 ◽

Cited By ~ 93

Author(s):

Juan C. Núñez ◽

Raúl Cabido ◽

Juan J. Pantrigo ◽

Antonio S. Montemayor ◽

José F. Vélez

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Human Activity ◽

Gesture Recognition ◽

Short Term Memory ◽

Hand Gesture Recognition ◽

Hand Gesture ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory

Gesture Recognition using Wearable Sensors with Bi-Long Short-Term Memory Convolutional Neural Networks

IEEE Sensors Journal ◽

10.1109/jsen.2021.3074642 ◽

2021 ◽

pp. 1-1

Author(s):

Khanh Nguyen-Trong ◽

Hoai Nam Vu ◽

Ngon Nguyen Trung ◽

Cuong Pham

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Gesture Recognition ◽

Short Term Memory ◽

Wearable Sensors ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory

Optimizing 3D Convolution Kernels on Stereo Matching for Resource Efficient Computations

Sensors ◽

10.3390/s21206808 ◽

2021 ◽

Vol 21 (20) ◽

pp. 6808

Author(s):

Jianqiang Xiao ◽

Dianbo Ma ◽

Satoshi Yamane

Keyword(s):

Neural Networks ◽

Computational Complexity ◽

Convolutional Neural Networks ◽

Stereo Matching ◽

State Of The Art ◽

Computational Cost ◽

The State ◽

Matching Network ◽

Convolution Kernels ◽

Low Computational Cost

Although recent stereo matching algorithms achieve significant results on public benchmarks, the problem of heavy computation remains unsolved. Most works focus on designing an architecture to reduce the computational complexity, whereas we aim to solve the problem by optimizing the 3D convolution kernels of the Pyramid Stereo Matching Network (PSMNet). In this paper, we design a series of comparative experiments exploring the performance of well-known convolution kernels on PSMNet. Our model reduces the computational complexity from 256.66G MAdd (Multiply-Add operations) to 69.03G MAdd (from 198.47G MAdd to 10.84G MAdd when only the 3D convolutional neural networks are considered) without losing accuracy. On the Scene Flow and KITTI 2015 datasets, our model achieves results comparable to the state-of-the-art with a low computational cost.
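The relative savings implied by the reported MAdd figures can be worked out with a short calculation. The helper name is illustrative; the four numbers are taken from the abstract.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


total = reduction_percent(256.66, 69.03)        # whole network, G MAdd
conv3d_only = reduction_percent(198.47, 10.84)  # 3D convolutions only, G MAdd
```

This gives a reduction of about 73.1% for the whole network and about 94.5% for the 3D convolutions. The absolute saving is 187.63G MAdd in both cases, consistent with all of the reduction coming from the 3D convolution kernels.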
