Promising Depth Map Prediction Method from a Single Image Based on Conditional Generative Adversarial Network

Pose estimation is typically performed on 3D images, whereas estimating pose from a single RGB image remains difficult. An RGB image represents not only an object’s shape but also intensity, which depends on viewpoint, texture, and lighting conditions. A depth image, by contrast, represents only the object’s shape, so 3D pose estimation from depth images is considered a promising approach. It is therefore necessary to find an appropriate method for predicting a depth image from a 2D RGB image, which can then be used for 3D pose estimation. In this paper, we propose an approach based on a deep learning model for depth estimation in order to improve 3D pose estimation. The proposed model consists of two successive networks. The first is an autoencoder network that maps from the RGB domain to the depth domain. The second is a discriminator network that compares a real depth image with a generated depth image, helping the first network to generate an accurate depth image. In this work, we do not use real depth images corresponding to the input color images. Our contribution is to use 3D CAD models corresponding to the objects appearing in the color images to render depth images from different viewpoints. These rendered images are then used as ground truth to guide the autoencoder network in learning the mapping from the image domain to the depth domain. The proposed model outperforms state-of-the-art models on the publicly available PASCAL 3D+ dataset.
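
The abstract does not state the loss functions, so the following is only a minimal sketch of how a generator/discriminator pair of this kind is commonly trained. It assumes the standard binary cross-entropy GAN objective, with the discriminator output treated as the probability that a depth image is a rendered (ground-truth) one. The function names `bce`, `discriminator_loss`, and `generator_loss` are hypothetical and do not come from the paper.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def discriminator_loss(d_rendered, d_generated):
    """Assumed objective: score rendered depth as 1 and generated depth as 0."""
    return bce(d_rendered, 1) + bce(d_generated, 0)


def generator_loss(d_generated):
    """Assumed objective: the generator gains when its output is scored as real."""
    return bce(d_generated, 1)
```

Under this assumed objective, the generator loss falls as the discriminator assigns a higher probability to the generated depth image, which is the sense in which the second network supports the first.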

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v34i07.6781 ◽ 2020 ◽ Vol 34 (07) ◽ pp. 11221-11228

Author(s): Yueying Kao ◽ Weiming Li ◽ Qiang Wang ◽ Zhouchen Lin ◽ Wooshik Kim ◽ ...

Keyword(s): Pose Estimation ◽ Large Scale ◽ Synthetic Data ◽ Real Data ◽ Depth Image ◽ Depth Images ◽ In The Wild ◽ Object Pose Estimation ◽ Image Pairs ◽ RGB Image

Monocular object pose estimation is an important yet challenging computer vision problem. Depth features can provide useful information for pose estimation. However, existing methods rely on real depth images to extract depth features, which makes them difficult to apply in many settings. In this paper, we aim to extract RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation. Specifically, we propose a deep convolutional neural network with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module. The embedding module is trained with synthetic pair data to learn a depth-oriented embedding space between RGB and depth images that is optimized for object pose estimation. The adaptation module further aligns the distributions of synthetic and real data. Compared to existing methods, our method does not need any real depth images and can be trained easily with large-scale synthetic data. Extensive experiments and comparisons show that our method achieves the best performance on the challenging public PASCAL 3D+ dataset on all metrics, which substantiates the superiority of our method and the above modules.

RobotP: A Benchmark Dataset for 6D Object Pose Estimation

Sensors ◽ 10.3390/s21041299 ◽ 2021 ◽ Vol 21 (4) ◽ pp. 1299

Author(s): Honglin Yuan ◽ Tim Hoogenkamp ◽ Remco C. Veltkamp

Keyword(s): Pose Estimation ◽ Ground Truth ◽ 3D Models ◽ Depth Image ◽ Great Success ◽ Estimation Algorithms ◽ Depth Images ◽ Object Pose Estimation ◽ Image Pairs ◽ Bounding Boxes

Deep learning has achieved great success on robotic vision tasks. However, compared with other vision-based tasks, it is difficult to collect a representative and sufficiently large training set for six-dimensional (6D) object pose estimation, due to the inherent difficulty of data collection. In this paper, we propose the RobotP dataset, which consists of commonly used objects for benchmarking 6D object pose estimation. To create the dataset, we apply a 3D reconstruction pipeline to produce high-quality depth images, ground truth poses, and 3D models for well-selected objects. Subsequently, based on the generated data, we produce object segmentation masks and two-dimensional (2D) bounding boxes automatically. To further enrich the data, we synthesize a large number of photo-realistic color-and-depth image pairs with ground truth 6D poses. Our dataset is freely distributed to research groups through the Shape Retrieval Challenge benchmark on 6D pose estimation. Based on our benchmark, different learning-based approaches are trained and tested on the unified dataset. The evaluation results indicate that there is considerable room for improvement in 6D object pose estimation, particularly for objects with dark colors, and that photo-realistic images help to increase the performance of pose estimation algorithms.
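
The abstract says that 2D bounding boxes are produced automatically from the generated data but does not describe how. One common way to do this, shown here as a sketch only, is to take the tight axis-aligned box around the nonzero pixels of a segmentation mask. The function name `mask_to_bbox` and the nested-list mask format are assumptions made for illustration.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if empty.

    `mask` is a list of rows, each a list of 0/1 values (an assumed format).
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no object pixels in this mask
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


# A 4x4 mask with an object covering three pixels.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(mask_to_bbox(example_mask))  # (1, 1, 2, 2)
```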

HRDepthNet: Depth Image-Based Marker-Less Tracking of Body Joints

Sensors ◽ 10.3390/s21041356 ◽ 2021 ◽ Vol 21 (4) ◽ pp. 1356

Author(s): Linda Christin Büker ◽ Finnja Zuber ◽ Andreas Hein ◽ Sebastian Fudickar

Keyword(s): Color Images ◽ Depth Image ◽ Accuracy Evaluation ◽ Timed Up And Go ◽ Position Errors ◽ Depth Images ◽ Upper And Lower Extremities ◽ RGB Images ◽ Human Joints ◽ Body Joints

Approaches for detecting joint positions in color images, such as HRNet and OpenPose, are available, but corresponding approaches for depth images have received limited attention, even though depth images have several advantages over color images, such as robustness to light variation and invariance to color and texture. Correspondingly, we introduce High-Resolution Depth Net (HRDepthNet), a machine-learning-driven approach to detect human joints (body, head, and upper and lower extremities) in pure depth images. HRDepthNet retrains the original HRNet for depth images. For this purpose, a dataset was created holding depth (and RGB) images recorded with subjects conducting the timed up and go test, an established geriatric assessment. Annotation was done manually on the RGB images. Training and evaluation were conducted with this dataset. For accuracy evaluation, the detection of body joints was evaluated via COCO’s evaluation metrics, which indicated that the resulting depth-image-based model achieved better results than HRNet trained and applied on the corresponding RGB images. An additional evaluation of the position errors showed a median deviation of 1.619 cm (x-axis), 2.342 cm (y-axis) and 2.4 cm (z-axis).
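
The per-axis median deviation reported above can be illustrated with a short sketch. It assumes that "deviation" means the absolute difference between a predicted and a reference joint coordinate, taken separately along x, y, and z. The abstract does not spell out this definition, and the function name `median_axis_errors` and the sample values are hypothetical.

```python
from statistics import median


def median_axis_errors(predicted, reference):
    """Median absolute deviation per axis between paired 3D joint positions."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return tuple(
        median(abs(p[axis] - r[axis]) for p, r in zip(predicted, reference))
        for axis in range(3)
    )


# Hypothetical joint positions in centimetres, not data from the paper.
predicted = [(1.0, 2.0, 3.0), (2.0, 2.0, 5.0), (4.0, 0.0, 3.0)]
reference = [(0.0, 2.0, 3.0), (0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(median_axis_errors(predicted, reference))  # (2.0, 0.0, 0.0)
```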

DGGAN: Depth-image Guided Generative Adversarial Networks for Disentangling RGB and Depth Images in 3D Hand Pose Estimation

2020 IEEE Winter Conference on Applications of Computer Vision (WACV) ◽ 10.1109/wacv45572.2020.9093380 ◽ 2020

Author(s): Liangjian Chen ◽ Shih-Yao Lin ◽ Yusheng Xie ◽ Yen-Yu Lin ◽ Wei Fan ◽ ...

Keyword(s): Pose Estimation ◽ Depth Image ◽ Generative Adversarial Networks ◽ Hand Pose Estimation ◽ Image Guided ◽ Depth Images ◽ Adversarial Networks ◽ Hand Pose

Multiple Classifiers-Based Feature Fusion for RGB-D Object Recognition

International Journal of Pattern Recognition and Artificial Intelligence ◽ 10.1142/s0218001417500148 ◽ 2017 ◽ Vol 31 (05) ◽ pp. 1750014 ◽ Cited By ~ 3

Author(s): Yan Wu ◽ Jiqian Li ◽ Jing Bai

Keyword(s): Object Recognition ◽ Feature Fusion ◽ Classification Performance ◽ Depth Image ◽ Depth Information ◽ The Past ◽ Depth Images ◽ Comparable Performance ◽ Accuracy Difference ◽ RGB Image

RGB-D-based object recognition has been actively investigated in the past few years. RGB and depth images provide useful and complementary information, and fusing RGB and depth features can significantly increase the accuracy of object recognition. However, previous works simply take the depth image as the fourth channel of the RGB image and concatenate the RGB and depth features, ignoring the fact that RGB and depth information differ in discriminative power for different objects. In this paper, a new method containing three different classifiers is proposed to fuse features extracted from the RGB image and the depth image for RGB-D-based object recognition. First, an RGB classifier and a depth classifier are trained with cross-validation to obtain the accuracy difference between RGB and depth features for each object. Then a variant RGB-D classifier is trained with different initialization parameters for each class according to the accuracy difference. The variant RGB-D classifier yields more robust classification performance. The proposed method is evaluated on two benchmark RGB-D datasets, where it achieves performance comparable to that of the state-of-the-art method.
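
The first step described above, measuring the accuracy difference between the RGB and depth classifiers for each object class, can be sketched as follows. This only illustrates the per-class bookkeeping on held-out predictions. It does not reproduce the paper's cross-validation setup or the way the difference is turned into initialization parameters, and the function names and sample labels are hypothetical.

```python
def per_class_accuracy(labels, predictions):
    """Fraction of correct predictions for each class present in `labels`."""
    correct, total = {}, {}
    for truth, guess in zip(labels, predictions):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == guess else 0)
    return {cls: correct[cls] / total[cls] for cls in total}


def accuracy_difference(labels, rgb_predictions, depth_predictions):
    """RGB accuracy minus depth accuracy per class (positive: RGB is stronger)."""
    rgb = per_class_accuracy(labels, rgb_predictions)
    depth = per_class_accuracy(labels, depth_predictions)
    return {cls: rgb[cls] - depth[cls] for cls in rgb}


# Hypothetical validation labels and predictions.
labels = ["cup", "cup", "ball", "ball"]
rgb_pred = ["cup", "cup", "cup", "ball"]
depth_pred = ["cup", "ball", "ball", "ball"]
print(accuracy_difference(labels, rgb_pred, depth_pred))  # {'cup': 0.5, 'ball': -0.5}
```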

A Novel Monocular Visual Odometer Method Based on Kinect and Improved SURF Algorithm

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.556-562.4081 ◽ 2014 ◽ Vol 556-562 ◽ pp. 4081-4084

Author(s): Li Jun Zhang ◽ Fei Chen

Keyword(s): Dynamic Environment ◽ Mean Value ◽ Color Images ◽ Least Square ◽ Depth Image ◽ Mean Value Theorem ◽ Feature Points ◽ Kinect Sensor ◽ Depth Images ◽ Visual Odometer

The paper proposes a novel monocular visual odometer method based on Microsoft's Kinect sensor and an improved SURF algorithm. First, the Kinect sensor captures color images and depth images of the surrounding environment, and the improved SURF algorithm is used to extract feature points from the color images and match them. Then the matched points are mapped to the depth image, and the path of the robot is estimated by 3D reconstruction and the least square mean value theorem. Experimental results show that with this new method the average matching accuracy reaches 92.6%. The method also shows good robustness in a dynamic environment, so we conclude that combining the Kinect sensor with the improved SURF algorithm is a simple and effective approach to visual odometry.
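
Mapping a matched feature point onto the depth image yields a 3D point once the camera geometry is known. The abstract does not give the camera model, so the sketch below assumes an ideal pinhole camera with the depth image already registered to the color image. The function name `back_project` and the intrinsic values `fx`, `fy`, `cx`, `cy` are hypothetical, not Kinect calibration data.

```python
def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes a pinhole model: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point.
    """
    if depth_m <= 0:
        return None  # treat non-positive depth as a missing measurement
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 image.
point = back_project(u=420, v=240, depth_m=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(point)
```

Pairs of such 3D points from consecutive frames are the kind of input a least-squares motion estimate would work on.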

Full Resolution Dense Depth Recovery by Fusing RGB Images and Sparse Depth

10.36227/techrxiv.11687193.v1 ◽ 2020

Author(s): Guoliang Liu

Keyword(s): State Of The Art ◽ Depth Estimation ◽ Depth Image ◽ Estimation Accuracy ◽ Estimation Result ◽ Recovery Method ◽ Depth Recovery ◽ Full Resolution ◽ RGB Images ◽ RGB Image

Full-resolution depth is required in many real-world engineering applications. However, existing depth sensors, e.g., LiDARs, only offer sparse depth sample points with limited resolution and noise. We propose a deep learning-based full-resolution depth recovery method that uses monocular images and the corresponding sparse depth measurements of the target environment. The novelty of our idea is that the structural similarity between the RGB image and the depth image is used to refine the dense depth estimation result. This similar-structure information can be found using a correlation layer in the regression neural network. We show that the proposed method can achieve higher estimation accuracy than state-of-the-art methods. The experiments conducted on the NYU Depth V2 dataset support our idea.

Pose Estimation of Primitive-Shaped Objects from a Depth Image Using Superquadric Representation

Applied Sciences ◽ 10.3390/app10165442 ◽ 2020 ◽ Vol 10 (16) ◽ pp. 5442

Author(s): Ryo Hachiuma ◽ Hideo Saito

Keyword(s): Pose Estimation ◽ Degrees Of Freedom ◽ Shape Representation ◽ Estimation Method ◽ Depth Image ◽ Six Degrees Of Freedom ◽ Depth Images ◽ Object Pose Estimation ◽ Primitive Shape ◽ Conventional Methods

This paper presents a method for estimating the six Degrees of Freedom (6DoF) pose of texture-less primitive-shaped objects from depth images. Because conventional methods for object pose estimation require rich texture or geometric features on the target objects, they are not suitable for texture-less objects with geometrically simple shapes. To estimate the pose of a primitive-shaped object, existing methods estimate the parameters that represent primitive shapes. However, these methods explicitly limit the number of types of primitive shapes that can be estimated. We employ superquadrics as a primitive shape representation that can represent various types of primitive shapes with only a few parameters. To estimate the superquadric parameters of primitive-shaped objects, the point cloud of the object must be segmented from a depth image. The parameter estimation is known to be sensitive to outliers, which are caused by mis-segmentation of the depth image. Therefore, we propose a novel estimation method for superquadric parameters that is robust to outliers. In the experiment, we constructed a dataset in which a person grasps and moves the primitive-shaped objects. The experimental results show that our estimation method outperformed three conventional methods and the baseline method.
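
To show why a superquadric can represent several primitive shapes with few parameters, the sketch below evaluates the standard superquadric inside-outside function for a point given in the object's own coordinate frame. The value is below 1 inside the shape, equal to 1 on the surface, and above 1 outside. This is the textbook formulation and not necessarily the exact cost used in the paper, and the function name and sample values are hypothetical.

```python
def superquadric_inside_outside(point, scale, e1, e2):
    """Standard superquadric inside-outside function F(x, y, z).

    scale = (a1, a2, a3) are the half-extents along each axis; e1 and e2
    are positive shape exponents. Absolute values keep the powers real.
    """
    x, y, z = point
    a1, a2, a3 = scale
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + abs(z / a3) ** (2.0 / e1)


corner = (0.9, 0.9, 0.9)
unit = (1.0, 1.0, 1.0)
# e1 = e2 = 1 gives an ellipsoid (here a unit sphere): the corner lies outside.
print(superquadric_inside_outside(corner, unit, 1.0, 1.0) > 1.0)  # True
# Small exponents give a box-like shape: the same corner lies inside.
print(superquadric_inside_outside(corner, unit, 0.1, 0.1) < 1.0)  # True
```

Changing only the two exponents moves the same model between sphere-like and box-like shapes, which is what makes the representation compact.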

Model-Based 3D Pose Estimation of a Single RGB Image Using a Deep Viewpoint Classification Neural Network

Applied Sciences ◽ 10.3390/app9122478 ◽ 2019 ◽ Vol 9 (12) ◽ pp. 2478 ◽ Cited By ~ 2

Author(s): Jui-Yuan Su ◽ Shyi-Chyi Cheng ◽ Chin-Chun Chang ◽ Jing-Ming Chen

Keyword(s): Neural Network ◽ Pose Estimation ◽ Low Cost ◽ Estimation Algorithm ◽ Training Dataset ◽ 3D Pose Estimation ◽ Estimation Algorithms ◽ Model Based ◽ 3D Scene ◽ RGB Image

This paper presents a model-based approach for 3D pose estimation from a single RGB image, with the aim of keeping a 3D scene model up to date using a low-cost camera. A prelearned image model of the target scene is first reconstructed from a training RGB-D video. Next, the model is analyzed using the proposed multiple principal analysis to label the viewpoint class of each training RGB image and to construct a training dataset for a deep learning viewpoint classification neural network (DVCNN). For all training images in a viewpoint class, the DVCNN estimates their membership probabilities and defines the template of the class as the image with the highest probability. To reconstruct the scene in 3D space using a camera, a pose estimation algorithm then uses the template information to estimate the pose parameters and depth map of a single RGB image captured by navigating the camera to a specific viewpoint. The pose estimation algorithm is therefore the key to updating the status of the 3D scene. Compared with conventional pose estimation algorithms, which use sparse features, our approach improves the quality of the reconstructed 3D scene point cloud by using template-to-frame registration. Finally, we verify the established reconstruction system on publicly available benchmark datasets and compare it with state-of-the-art pose estimation algorithms. The results indicate that our approach outperforms the compared methods in terms of pose estimation accuracy.

Automatic 3D Landmark Extraction System Based on an Encoder–Decoder Using Fusion of Vision and LiDAR

Remote Sensing ◽ 10.3390/rs12071142 ◽ 2020 ◽ Vol 12 (7) ◽ pp. 1142

Author(s): Jeonghoon Kwak ◽ Yunsick Sung

Keyword(s): Point Cloud ◽ Point Clouds ◽ Depth Image ◽ 3D Point Cloud ◽ Digital World ◽ Depth Images ◽ 3D Point Clouds ◽ RGB Images ◽ RGB Image ◽ 3D Landmarks

To provide a realistic environment for remote sensing applications, point clouds are used to realize a three-dimensional (3D) digital world for the user. Motion recognition of objects, e.g., humans, is required to provide realistic experiences in the 3D digital world. To recognize a user’s motions, 3D landmarks are provided by analyzing a 3D point cloud collected through a light detection and ranging (LiDAR) system or a red green blue (RGB) image collected visually. However, manual supervision is required to extract 3D landmarks, whether they originate from the RGB image or the 3D point cloud. Thus, there is a need for a method for extracting 3D landmarks without manual supervision. Herein, an RGB image and a 3D point cloud are used to extract 3D landmarks. The 3D point cloud provides the relative distance between the LiDAR and the user. Because it cannot contain all information about the user’s entire body due to disparities, it cannot generate a dense depth image that provides the boundary of the user’s body. Therefore, up-sampling is performed to increase the density of the depth image generated from the 3D point cloud; the density depends on the 3D point cloud. This paper proposes a system for extracting 3D landmarks using 3D point clouds and RGB images without manual supervision. A depth image provides the boundary of a user’s motion and is generated from the 3D point cloud and the RGB image collected by a LiDAR and an RGB camera, respectively. To extract 3D landmarks automatically, an encoder–decoder model is trained with the generated depth images, and the RGB images and 3D landmarks are extracted from these images with the trained encoder model. The method of extracting 3D landmarks using RGB depth (RGBD) images was verified experimentally, and 3D landmarks were extracted to evaluate the user’s motions with RGBD images. In this manner, landmarks could be extracted according to the user’s motions, rather than by extracting them using the RGB images. The depth images generated by the proposed method were 1.832 times denser than the up-sampling-based depth images generated with bilateral filtering.
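
The abstract does not define how density is measured. As an illustration only, the sketch below assumes that the density of a depth image is the fraction of pixels holding a valid (positive) depth value, and compares two images by the ratio of their densities. The function names and the tiny sample images are hypothetical.

```python
def depth_density(depth_image):
    """Fraction of pixels with a valid (positive) depth value."""
    total = sum(len(row) for row in depth_image)
    if total == 0:
        raise ValueError("empty depth image")
    valid = sum(1 for row in depth_image for depth in row if depth > 0)
    return valid / total


def density_ratio(proposed, baseline):
    """How many times denser `proposed` is than `baseline`."""
    baseline_density = depth_density(baseline)
    if baseline_density == 0:
        raise ValueError("baseline has no valid depth pixels")
    return depth_density(proposed) / baseline_density


# Hypothetical 2x2 depth images (0 marks a missing measurement).
proposed = [[1.0, 2.0], [0.0, 3.0]]
baseline = [[1.0, 0.0], [0.0, 0.0]]
print(density_ratio(proposed, baseline))  # 3.0
```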
