Extraction of multimodal features from depth and RGB images for recognising hand gestures

2021 ◽  
pp. 1-21
Author(s):  
S.S. Suni ◽  
K. Gopakumar

In this study, we propose a multimodal feature-based framework for recognising hand gestures from RGB and depth images. In addition to the features from the RGB image, depth image features are exploited to construct discriminative feature labels for the various gestures. Depth maps, being a powerful source of information, increase the performance of many computer vision tasks. A newly refined Gradient-Local Binary Pattern (G-LBP) is applied to extract features from the depth images, and histogram of oriented gradients (HOG) features are extracted from the RGB images. The components from the RGB and depth channels are concatenated to form a multimodal feature vector. Finally, classification is performed using K-Nearest Neighbour and multi-class Support Vector Machine classifiers. The designed system is invariant to scale, rotation and illumination, and the newly developed feature combination method achieves superior recognition rates that can support future innovations.
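A minimal sketch of this kind of RGB-plus-depth feature fusion is given below, assuming scikit-image and scikit-learn, images of a fixed common size, and a plain uniform LBP as a stand-in for the paper's refined G-LBP; all parameter values are illustrative.

```python
# Sketch of RGB+depth feature fusion for gesture recognition (scikit-image and
# scikit-learn assumed; plain uniform LBP stands in for the refined G-LBP).
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.svm import SVC

def multimodal_features(rgb_gray, depth):
    # HOG descriptor from the grayscale version of the RGB image.
    hog_feat = hog(rgb_gray, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    # LBP histogram from the depth image (stand-in for G-LBP).
    lbp = local_binary_pattern(depth, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    # Concatenate both modalities into one feature vector.
    return np.concatenate([hog_feat, lbp_hist])

def train_classifier(pairs, labels):
    # pairs: list of (rgb_gray, depth) image pairs of identical size.
    X = np.stack([multimodal_features(rgb, d) for rgb, d in pairs])
    clf = SVC(kernel="rbf", decision_function_shape="ovr")  # multi-class SVM
    return clf.fit(X, labels)
```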

Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1356
Author(s):  
Linda Christin Büker ◽  
Finnja Zuber ◽  
Andreas Hein ◽  
Sebastian Fudickar

With approaches for the detection of joint positions in color images, such as HRNet and OpenPose, being available, corresponding approaches for depth images have received limited consideration, even though depth images have several advantages over color images, such as robustness to light variation and invariance to color and texture. Accordingly, we introduce High-Resolution Depth Net (HRDepthNet), a machine learning driven approach to detect human joints (body, head, and upper and lower extremities) purely in depth images. HRDepthNet retrains the original HRNet for depth images. For this purpose, a dataset was created holding depth (and RGB) images recorded of subjects conducting the timed up and go test, an established geriatric assessment. The RGB images were manually annotated, and training and evaluation were conducted with this dataset. Body joint detection was evaluated via COCO's evaluation metrics, indicating that the resulting depth image-based model achieved better results than the HRNet trained and applied on the corresponding RGB images. An additional evaluation of the position errors showed a median deviation of 1.619 cm (x-axis), 2.342 cm (y-axis) and 2.4 cm (z-axis).
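As an illustration of the reported position-error metric, the short sketch below computes the per-axis median deviation between predicted and annotated 3D joint positions; the array shapes and units are assumptions, not the paper's code.

```python
# Sketch: per-axis median deviation between predicted and annotated 3D joints
# (hypothetical arrays; the paper reports roughly 1.6/2.3/2.4 cm on x/y/z).
import numpy as np

def median_axis_error(pred_xyz: np.ndarray, gt_xyz: np.ndarray) -> np.ndarray:
    """pred_xyz, gt_xyz: (num_joints, 3) arrays in centimetres."""
    return np.median(np.abs(pred_xyz - gt_xyz), axis=0)  # -> [err_x, err_y, err_z]
```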


2020 ◽  
Vol 2020 ◽  
pp. 1-18
Author(s):  
Chao Tang ◽  
Huosheng Hu ◽  
Wenjian Wang ◽  
Wei Li ◽  
Hua Peng ◽  
...  

The representation and selection of action features directly affect the recognition performance of human action recognition methods. A single feature is often affected by human appearance, environment, camera settings, and other factors. Aiming at the problem that existing multimodal feature fusion methods cannot effectively measure the contribution of different features, this paper proposes a human action recognition method based on RGB-D image features, which makes full use of the multimodal information provided by RGB-D sensors to extract effective human action features. Three kinds of human action features with different modal information are proposed: an RGB-HOG feature based on RGB image information, which has good geometric scale invariance; a D-STIP feature based on the depth image, which maintains the dynamic characteristics of human motion and has local invariance; and an S-JRPF feature based on skeleton information, which describes the spatial structure of motion well. At the same time, multiple K-nearest neighbor classifiers with good generalization ability are fused at the decision level for classification. The experimental results show that the algorithm achieves promising recognition results on the public G3D and CAD60 datasets.
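A minimal sketch of decision-level fusion across per-modality classifiers is shown below, assuming scikit-learn and integer class labels; the simple majority vote is only a stand-in for the paper's multi-classifier integration scheme.

```python
# Sketch: decision-level fusion of per-modality KNN classifiers by majority vote
# (hypothetical feature arrays; one classifier per modality, e.g. RGB-HOG,
# D-STIP and S-JRPF; integer class labels assumed).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_modalities(features_per_modality, labels, k=5):
    # features_per_modality: list of (n_samples, d_m) arrays, one per modality.
    return [KNeighborsClassifier(n_neighbors=k).fit(X, labels)
            for X in features_per_modality]

def predict_fused(classifiers, test_features_per_modality):
    votes = np.stack([clf.predict(X) for clf, X in
                      zip(classifiers, test_features_per_modality)])  # (n_mod, n_test)
    # Majority vote per test sample.
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])
```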


2018 ◽  
Vol 15 (4) ◽  
pp. 172988141878774 ◽  
Author(s):  
Shahram Mohammadi ◽  
Omid Gervei

To use low-cost depth sensors such as Kinect for three-dimensional face recognition with an acceptable recognition rate, the challenges of filling in non-measured pixels and smoothing noisy data need to be addressed. The main goal of this article is to present solutions to the aforementioned challenges, as well as to offer feature extraction methods that reach the highest level of accuracy in the presence of different facial expressions and occlusions. For this purpose, an in-house database was created. First, the invalid pixels of the depth image, called holes, are removed by solving multiple linear equations derived from the values of the pixels surrounding the holes. Then, bilateral filtering and block-matching 3-D filtering, as representatives of local and nonlocal filtering approaches, are used for depth image smoothing. The curvelet transform, a well-known nonlocal feature extraction technique, is applied to both the RGB and depth images. Two unsupervised dimension reduction techniques, namely principal component analysis and independent component analysis, are used to reduce the dimension of the extracted features. Finally, a support vector machine is used for classification. Experimental results show a recognition rate of 90% for depth images alone and 100% when combining RGB and depth data from a Kinect sensor, which is much higher than that of other recently proposed algorithms.
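The depth preprocessing stage could look roughly like the sketch below (OpenCV and NumPy assumed); the simple neighbour-averaging hole fill is an illustrative stand-in for the paper's linear-equation formulation, and the bilateral filter represents the local smoothing branch.

```python
# Sketch: depth-image hole filling and edge-preserving smoothing (OpenCV assumed;
# neighbour averaging stands in for the linear-equation hole filling).
import cv2
import numpy as np

def fill_holes(depth: np.ndarray, hole_value: float = 0.0) -> np.ndarray:
    filled = depth.astype(np.float32).copy()
    for y, x in np.argwhere(depth == hole_value):
        window = filled[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
        valid = window[window != hole_value]
        if valid.size:                       # fill from valid neighbours
            filled[y, x] = valid.mean()
    return filled

def smooth(depth: np.ndarray) -> np.ndarray:
    # Edge-preserving local smoothing; BM3D would be the nonlocal alternative.
    return cv2.bilateralFilter(depth.astype(np.float32), d=5,
                               sigmaColor=50, sigmaSpace=5)
```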


2020 ◽  
Vol 12 (7) ◽  
pp. 1142
Author(s):  
Jeonghoon Kwak ◽  
Yunsick Sung

To provide a realistic environment for remote sensing applications, point clouds are used to realize a three-dimensional (3D) digital world for the user. Motion recognition of objects, e.g., humans, is required to provide realistic experiences in the 3D digital world. To recognize a user's motions, 3D landmarks are obtained by analyzing a 3D point cloud collected through a light detection and ranging (LiDAR) system or a red green blue (RGB) image collected visually. However, manual supervision is required to extract 3D landmarks, whether they originate from the RGB image or the 3D point cloud. Thus, there is a need for a method of extracting 3D landmarks without manual supervision. Herein, an RGB image and a 3D point cloud are used to extract 3D landmarks. The 3D point cloud provides the relative distance between the LiDAR and the user. Because it does not contain complete information about the user's entire body due to disparities, it cannot by itself yield a dense depth image that captures the boundary of the user's body. Therefore, up-sampling is performed to increase the density of the depth image generated from the 3D point cloud, whose density depends on the point cloud. This paper proposes a system for extracting 3D landmarks using 3D point clouds and RGB images without manual supervision. A depth image that captures the boundary of the user's motion is generated using the 3D point cloud and the RGB image collected by a LiDAR and an RGB camera, respectively. To extract 3D landmarks automatically, an encoder-decoder model is trained with the generated depth images and the RGB images, and 3D landmarks are extracted from these images with the trained encoder model. The method of extracting 3D landmarks using RGB-depth (RGBD) images was verified experimentally, and 3D landmarks were extracted to evaluate the user's motions with the RGBD images. In this manner, landmarks could be extracted according to the user's motions, rather than being extracted from the RGB images alone. The depth images generated by the proposed method were 1.832 times denser than the up-sampling-based depth images generated with bilateral filtering.
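A rough sketch of the projection-and-densification step is given below, assuming NumPy and OpenCV and hypothetical camera intrinsics fx, fy, cx, cy; the dilation is only a crude stand-in for the bilateral-filter or learned up-sampling discussed in the paper.

```python
# Sketch: projecting a LiDAR point cloud into a sparse depth image and densifying
# it (hypothetical pinhole intrinsics; dilation as a crude up-sampler).
import numpy as np
import cv2

def point_cloud_to_depth(points_xyz, fx, fy, cx, cy, h, w):
    depth = np.zeros((h, w), dtype=np.float32)
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    valid = z > 0
    u = np.round(fx * x[valid] / z[valid] + cx).astype(int)
    v = np.round(fy * y[valid] / z[valid] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[inside], u[inside]] = z[valid][inside]
    return depth

def densify(depth, ksize=5):
    # Fill gaps between projected points; a bilateral or learned up-sampler
    # would be the more faithful choice.
    return cv2.dilate(depth, np.ones((ksize, ksize), np.uint8))
```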


Author(s):  
Md Kamal Uddin ◽  
Amran Bhuiyan ◽  
Mahmudul Hasan ◽  
...  

Person re-identification (Re-id) is one of the important tools of video surveillance systems, aiming to recognize an individual across the multiple disjoint sensors of a camera network. Despite recent advances in RGB camera-based person re-identification under normal lighting conditions, Re-id researchers have failed to take advantage of the additional information provided by modern RGB-D sensors (e.g. depth and skeleton information). When traditional RGB cameras fail to capture video under poor illumination, this additional RGB-D information can help tackle such constraints. This work takes depth images and skeleton joint points as additional information alongside RGB appearance cues and proposes a person re-identification method. We combine 4-channel RGB-D image features with skeleton information using a score-level fusion strategy in dissimilarity space to increase re-identification accuracy. Moreover, our proposed method overcomes the illumination problem because it uses illumination-invariant depth and skeleton information. We carried out rigorous experiments on two publicly available RGBD-ID re-identification datasets and show that the combined features of 4-channel RGB-D images and skeleton information boost the rank-1 recognition accuracy.
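A minimal sketch of score-level fusion in dissimilarity space is shown below, assuming NumPy; the min-max normalization and the fusion weights are illustrative choices, not the paper's exact settings.

```python
# Sketch: score-level fusion of appearance and skeleton dissimilarities for Re-id
# (min-max normalization and a weighted sum; weights here are illustrative).
import numpy as np

def normalize(scores):
    s = np.asarray(scores, dtype=np.float64)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fused_dissimilarity(rgbd_dist, skel_dist, w_rgbd=0.6, w_skel=0.4):
    # rgbd_dist, skel_dist: distances between a probe and each gallery subject.
    return w_rgbd * normalize(rgbd_dist) + w_skel * normalize(skel_dist)

def rank1_match(rgbd_dist, skel_dist):
    # Index of the gallery subject with the smallest fused dissimilarity.
    return int(np.argmin(fused_dissimilarity(rgbd_dist, skel_dist)))
```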


2020 ◽  
Author(s):  
Rui Fan ◽  
Hengli Wang ◽  
Junaid Bocus ◽  
Ming Liu

Manual visual inspection, typically performed by certified inspectors, is still the main form of road pothole detection. This process is, however, not only tedious, time-consuming and costly, but also dangerous for the inspectors. Furthermore, the road pothole detection results are always subjective, because they depend entirely on the inspector's experience. In this paper, we first introduce a disparity (or inverse depth) image processing module, named quasi inverse perspective transformation (QIPT), which makes damaged road areas highly distinguishable. Then, we propose a novel attention aggregation (AA) framework, which improves semantic segmentation networks for better road pothole detection by taking advantage of different types of attention modules. Moreover, we develop a novel training set augmentation technique based on adversarial domain adaptation, where synthetic road RGB images and transformed road disparity (or inverse depth) images are generated to enhance the training of semantic segmentation networks.

The experimental results illustrate that, firstly, the disparity (or inverse depth) images transformed by our QIPT module become more informative; secondly, the adversarial domain adaptation can not only significantly improve the performance of state-of-the-art semantic segmentation networks, but also accelerate their convergence. In addition, AA-UNet and AA-RTFNet, our best-performing implementations, respectively outperform all other state-of-the-art single-modal and data-fusion networks for road pothole detection.
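To illustrate the general idea of making potholes stand out in a disparity image, the sketch below fits and subtracts a planar road-disparity model with NumPy; this is a simplified stand-in for the intuition, not the paper's QIPT module.

```python
# Sketch of disparity-image preprocessing for pothole detection: fit a plane to
# the road disparity and subtract it so damaged areas stand out.
# Illustrative stand-in only, not the paper's QIPT module.
import numpy as np

def flatten_road_disparity(disparity: np.ndarray) -> np.ndarray:
    h, w = disparity.shape
    vv, uu = np.mgrid[0:h, 0:w]
    A = np.column_stack([uu.ravel(), vv.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, disparity.ravel(), rcond=None)
    road_model = (A @ coeffs).reshape(h, w)   # fitted road-plane disparity
    return disparity - road_model             # potholes become strong residuals
```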


2021 ◽  
Vol 9 (1) ◽  
pp. 240-246
Author(s):  
Sivanagi Reddy Kalli ◽  
K. Mohanram ◽  
S. Jagadeesh

The advent of depth sensors has brought new opportunities to human action recognition research by providing depth image data. Compared to conventional RGB image data, depth image data has additional benefits such as color and illumination invariance, and it provides clues about the shape of the body. Inspired by these benefits, we present a new human action recognition model based on depth images. For a given action video, considering all of the frames provides less detailed information about the shape and movements of the body. Hence, we propose a new method called Frame Sampling that reduces the frame count and chooses only key frames. After key-frame extraction, the frames are processed into a Depth Motion Map for action representation, followed by a Support Vector Machine for classification. The developed model is evaluated on a standard public dataset captured by depth cameras. The experimental results demonstrate superior performance compared with state-of-the-art methods.
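A minimal sketch of the representation-and-classification stage is given below, assuming NumPy and scikit-learn; the accumulated frame-difference is a simplified Depth Motion Map, and the key-frame selection step itself is omitted.

```python
# Sketch: a Depth Motion Map (DMM) as the accumulated absolute difference between
# consecutive key frames, followed by a linear SVM (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

def depth_motion_map(key_frames: np.ndarray) -> np.ndarray:
    """key_frames: (num_frames, h, w) depth frames selected by frame sampling."""
    diffs = np.abs(np.diff(key_frames.astype(np.float32), axis=0))
    return diffs.sum(axis=0)          # (h, w) motion-energy image

def train_action_classifier(videos, labels):
    # videos: list of key-frame stacks sharing the same frame size.
    X = np.stack([depth_motion_map(v).ravel() for v in videos])
    return SVC(kernel="linear").fit(X, labels)
```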


Author(s):  
Sukhendra Singh ◽  
G. N. Rathna ◽  
Vivek Singhal

Introduction: Sign language is the only way for speech-impaired people to communicate. However, sign language is not known to most hearing people, which creates a communication barrier for the speech impaired. In this paper, we present our solution, which captures hand gestures with a Kinect camera and classifies each gesture into its correct symbol. Method: We used a Kinect camera rather than an ordinary web camera because an ordinary camera does not capture the 3D orientation or depth of the scene, whereas the Kinect captures 3D images, which makes classification more accurate. Result: The Kinect camera produces different images for the hand gestures '2' and 'V', and similarly for '1' and 'I', whereas a normal web camera cannot distinguish between them. We used hand gestures from Indian Sign Language, and our dataset had 46,339 RGB images and 46,339 depth images. 80% of the total images were used for training and the remaining 20% for testing. In total, 36 hand gestures were considered: 26 for the alphabets A-Z and 10 for the digits 0-9. Conclusion: Along with a real-time implementation, we also compare the performance of various machine learning models and find that a CNN on depth images gives the most accurate performance. All these results were obtained on a PYNQ Z2 board.
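A small CNN of the kind that could be used for depth-image gesture classification is sketched below, assuming PyTorch; the layer sizes are illustrative and the 36-class output merely mirrors the 36 gestures mentioned above, not the paper's exact network.

```python
# Sketch: a small CNN for classifying single-channel depth images into 36 gesture
# classes (PyTorch assumed; layer sizes are illustrative).
import torch
import torch.nn as nn

class DepthGestureCNN(nn.Module):
    def __init__(self, num_classes: int = 36):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x):               # x: (batch, 1, H, W) depth images
        return self.classifier(self.features(x))
```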

