Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild

Yueying Kao; Weiming Li; Qiang Wang; Zhouchen Lin; Wooshik Kim; Sunghoon Hong

doi:10.1609/aaai.v34i07.6781

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6781 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11221-11228

Author(s):

Yueying Kao ◽

Weiming Li ◽

Qiang Wang ◽

Zhouchen Lin ◽

Wooshik Kim ◽

...

Keyword(s):

Pose Estimation ◽

Large Scale ◽

Synthetic Data ◽

Real Data ◽

Depth Image ◽

Depth Images ◽

In The Wild ◽

Object Pose Estimation ◽

Image Pairs ◽

Rgb Image

Monocular object pose estimation is an important yet challenging computer vision problem. Depth features can provide useful information for pose estimation. However, existing methods rely on real depth images to extract depth features, leading to its difficulty on various applications. In this paper, we aim at extracting RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation. Specifically, a deep convolutional neural network is proposed with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module. The embedding module is trained with synthetic pair data to learn a depth-oriented embedding space between RGB and depth images optimized for object pose estimation. The adaptation module is to further align distributions from synthetic to real data. Compared to existing methods, our method does not need any real depth images and can be trained easily with large-scale synthetic data. Extensive experiments and comparisons show that our method achieves best performance on a challenging public PASCAL 3D+ dataset in all the metrics, which substantiates the superiority of our method and the above modules.

Download Full-text

RobotP: A Benchmark Dataset for 6D Object Pose Estimation

Sensors ◽

10.3390/s21041299 ◽

2021 ◽

Vol 21 (4) ◽

pp. 1299

Author(s):

Honglin Yuan ◽

Tim Hoogenkamp ◽

Remco C. Veltkamp

Keyword(s):

Pose Estimation ◽

Ground Truth ◽

3D Models ◽

Depth Image ◽

Great Success ◽

Estimation Algorithms ◽

Depth Images ◽

Object Pose Estimation ◽

Image Pairs ◽

Bounding Boxes

Deep learning has achieved great success on robotic vision tasks. However, when compared with other vision-based tasks, it is difficult to collect a representative and sufficiently large training set for six-dimensional (6D) object pose estimation, due to the inherent difficulty of data collection. In this paper, we propose the RobotP dataset consisting of commonly used objects for benchmarking in 6D object pose estimation. To create the dataset, we apply a 3D reconstruction pipeline to produce high-quality depth images, ground truth poses, and 3D models for well-selected objects. Subsequently, based on the generated data, we produce object segmentation masks and two-dimensional (2D) bounding boxes automatically. To further enrich the data, we synthesize a large number of photo-realistic color-and-depth image pairs with ground truth 6D poses. Our dataset is freely distributed to research groups by the Shape Retrieval Challenge benchmark on 6D pose estimation. Based on our benchmark, different learning-based approaches are trained and tested by the unified dataset. The evaluation results indicate that there is considerable room for improvement in 6D object pose estimation, particularly for objects with dark colors, and photo-realistic images are helpful in increasing the performance of pose estimation algorithms.

Download Full-text

WHSP-Net: A Weakly-Supervised Approach for 3D Hand Shape and Pose Recovery from a Single Depth Image

Sensors ◽

10.3390/s19173784 ◽

2019 ◽

Vol 19 (17) ◽

pp. 3784 ◽

Cited By ~ 7

Author(s):

Jameel Malik ◽

Ahmed Elhayek ◽

Didier Stricker

Keyword(s):

Pose Estimation ◽

Synthetic Data ◽

Ground Truth ◽

Real Data ◽

Depth Image ◽

Estimation Methods ◽

Hand Shape ◽

Pose Recovery ◽

Real World Datasets ◽

Weakly Supervised

Hand shape and pose recovery is essential for many computer vision applications such as animation of a personalized hand mesh in a virtual environment. Although there are many hand pose estimation methods, only a few deep learning based algorithms target 3D hand shape and pose from a single RGB or depth image. Jointly estimating hand shape and pose is very challenging because none of the existing real benchmarks provides ground truth hand shape. For this reason, we propose a novel weakly-supervised approach for 3D hand shape and pose recovery (named WHSP-Net) from a single depth image by learning shapes from unlabeled real data and labeled synthetic data. To this end, we propose a novel framework which consists of three novel components. The first is the Convolutional Neural Network (CNN) based deep network which produces 3D joints positions from learned 3D bone vectors using a new layer. The second is a novel shape decoder that recovers dense 3D hand mesh from sparse joints. The third is a novel depth synthesizer which reconstructs 2D depth image from 3D hand mesh. The whole pipeline is fine-tuned in an end-to-end manner. We demonstrate that our approach recovers reasonable hand shapes from real world datasets as well as from live stream of depth camera in real-time. Our algorithm outperforms state-of-the-art methods that output more than the joint positions and shows competitive performance on 3D pose estimation task.

Download Full-text

Pose Estimation of Primitive-Shaped Objects from a Depth Image Using Superquadric Representation

Applied Sciences ◽

10.3390/app10165442 ◽

2020 ◽

Vol 10 (16) ◽

pp. 5442

Author(s):

Ryo Hachiuma ◽

Hideo Saito

Keyword(s):

Pose Estimation ◽

Degrees Of Freedom ◽

Shape Representation ◽

Estimation Method ◽

Depth Image ◽

Six Degrees Of Freedom ◽

Depth Images ◽

Object Pose Estimation ◽

Primitive Shape ◽

Conventional Methods

This paper presents a method for estimating the six Degrees of Freedom (6DoF) pose of texture-less primitive-shaped objects from depth images. As the conventional methods for object pose estimation require rich texture or geometric features to the target objects, these methods are not suitable for texture-less and geometrically simple shaped objects. In order to estimate the pose of the primitive-shaped object, the parameters that represent primitive shapes are estimated. However, these methods explicitly limit the number of types of primitive shapes that can be estimated. We employ superquadrics as a primitive shape representation that can represent various types of primitive shapes with only a few parameters. In order to estimate the superquadric parameters of primitive-shaped objects, the point cloud of the object must be segmented from a depth image. It is known that the parameter estimation is sensitive to outliers, which are caused by the miss-segmentation of the depth image. Therefore, we propose a novel estimation method for superquadric parameters that are robust to outliers. In the experiment, we constructed a dataset in which the person grasps and moves the primitive-shaped objects. The experimental results show that our estimation method outperformed three conventional methods and the baseline method.

Download Full-text

Promising Depth Map Prediction Method from a Single Image Based on Conditional Generative Adversarial Network

10.3233/faia210159 ◽

2021 ◽

Author(s):

Saddam Abdulwahab ◽

Hatem A. Rashwan ◽

Armin Masoumian ◽

Najwa Sharaf ◽

Domenec Puig

Keyword(s):

Pose Estimation ◽

Prediction Method ◽

Depth Estimation ◽

Color Images ◽

Depth Image ◽

3D Pose Estimation ◽

3D Cad ◽

Depth Images ◽

Proposed Model ◽

Rgb Image

Pose estimation is typically performed through 3D images. In contrast, estimating the pose from a single RGB image is still a difficult task. RGB images do not only represent objects’ shape, but also represent the intensity that is relative to the viewpoint, texture, and lighting condition. While the 3D pose estimation from depth images is considered a promising approach since the depth image only represents objects’ shape. Thus, it is necessary to know what is the appropriate method that can be used for predicting the depth image from a 2D RGB image and then to use for getting the 3D pose estimation. In this paper, we propose a promising approach based on a deep learning model for depth estimation in order to improve the 3D pose estimation. The proposed model consists of two successive networks. The first network is an autoencoder network that maps from the RGB domain to the depth domain. The second network is a discriminator network that compares a real depth image to a generated depth image to support the first network to generate an accurate depth image. In this work, we do not use real depth images corresponding to the input color images. Our contribution is to use 3D CAD models corresponding to objects appearing in color images to render depth images from different viewpoints. These rendered images are then used as ground truth and to guide the autoencoder network to learn the mapping from the image domain to the depth domain. The proposed model outperforms state-of-the-art models on the publicly PASCAL 3D+ dataset.

Download Full-text

Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6792 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11312-11319 ◽

Cited By ~ 1

Author(s):

Jogendra Nath Kundu ◽

Siddharth Seth ◽

Rahul M V ◽

Mugalodi Rakesh ◽

Venkatesh Babu Radhakrishnan ◽

...

Keyword(s):

Pose Estimation ◽

Large Scale ◽

3D Structure ◽

Human Pose Estimation ◽

Kinematic Structure ◽

In The Wild ◽

Monocular Image ◽

Human Pose ◽

Weakly Supervised ◽

Image Pairs

Estimation of 3D human pose from monocular image has gained considerable attention, as a key step to several human-centric applications. However, generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable, as these models often perform unsatisfactorily on unseen in-the-wild environments. Though weakly-supervised models have been proposed to address this shortcoming, performance of such models relies on availability of paired supervision on some related task, such as 2D pose or multi-view image pairs. In contrast, we propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions. Our pose estimation framework relies on a minimal set of prior knowledge that defines the underlying kinematic 3D structure, such as skeletal joint connectivity information with bone-length ratios in a fixed canonical scale. The proposed model employs three consecutive differentiable transformations namely forward-kinematics, camera-projection and spatial-map transformation. This design not only acts as a suitable bottleneck stimulating effective pose disentanglement, but also yields interpretable latent pose representations avoiding training of an explicit latent embedding to pose mapper. Furthermore, devoid of unstable adversarial setup, we re-utilize the decoder to formalize an energy-based loss, which enables us to learn from in-the-wild videos, beyond laboratory settings. Comprehensive experiments demonstrate our state-of-the-art unsupervised and weakly-supervised pose estimation performance on both Human3.6M and MPI-INF-3DHP datasets. Qualitative results on unseen environments further establish our superior generalization ability.

Download Full-text

Single Depth View Based Real-Time Reconstruction of Hand-Object Interactions

ACM Transactions on Graphics ◽

10.1145/3451341 ◽

2021 ◽

Vol 40 (3) ◽

pp. 1-12

Author(s):

Hao Zhang ◽

Yuxiao Zhou ◽

Yifei Tian ◽

Jun-Hai Yong ◽

Feng Xu

Keyword(s):

Real Time ◽

Synthetic Data ◽

Real Data ◽

Depth Image ◽

Real Time System ◽

The Real ◽

Time Performance ◽

Contact Constraint ◽

Object Shapes ◽

Object Interactions

Reconstructing hand-object interactions is a challenging task due to strong occlusions and complex motions. This article proposes a real-time system that uses a single depth stream to simultaneously reconstruct hand poses, object shape, and rigid/non-rigid motions. To achieve this, we first train a joint learning network to segment the hand and object in a depth image, and to predict the 3D keypoints of the hand. With most layers shared by the two tasks, computation cost is saved for the real-time performance. A hybrid dataset is constructed here to train the network with real data (to learn real-world distributions) and synthetic data (to cover variations of objects, motions, and viewpoints). Next, the depth of the two targets and the keypoints are used in a uniform optimization to reconstruct the interacting motions. Benefitting from a novel tangential contact constraint, the system not only solves the remaining ambiguities but also keeps the real-time performance. Experiments show that our system handles different hand and object shapes, various interactive motions, and moving cameras.

Download Full-text

CSA6D: Channel-Spatial Attention Networks for 6D Object Pose Estimation

Cognitive Computation ◽

10.1007/s12559-021-09966-y ◽

2021 ◽

Author(s):

Tao Chen ◽

Dongbing Gu

Keyword(s):

Spatial Attention ◽

Pose Estimation ◽

Estimation Accuracy ◽

Pose Prediction ◽

Attention Networks ◽

Depth Images ◽

Segmented Object ◽

Benchmark Datasets ◽

Object Pose Estimation ◽

Network Channel

Abstract6D object pose estimation plays a crucial role in robotic manipulation and grasping tasks. The aim to estimate the 6D object pose from RGB or RGB-D images is to detect objects and estimate their orientations and translations relative to the given canonical models. RGB-D cameras provide two sensory modalities: RGB and depth images, which could benefit the estimation accuracy. But the exploitation of two different modality sources remains a challenging issue. In this paper, inspired by recent works on attention networks that could focus on important regions and ignore unnecessary information, we propose a novel network: Channel-Spatial Attention Network (CSA6D) to estimate the 6D object pose from RGB-D camera. The proposed CSA6D includes a pre-trained 2D network to segment the interested objects from RGB image. Then it uses two separate networks to extract appearance and geometrical features from RGB and depth images for each segmented object. Two feature vectors for each pixel are stacked together as a fusion vector which is refined by an attention module to generate a aggregated feature vector. The attention module includes a channel attention block and a spatial attention block which can effectively leverage the concatenated embeddings into accurate 6D pose prediction on known objects. We evaluate proposed network on two benchmark datasets YCB-Video dataset and LineMod dataset and the results show it can outperform previous state-of-the-art methods under ADD and ADD-S metrics. Also, the attention map demonstrates our proposed network searches for the unique geometry information as the most likely features for pose estimation. From experiments, we conclude that the proposed network can accurately estimate the object pose by effectively leveraging multi-modality features.

Download Full-text

Large‐scale three‐dimensional seismic models and their interpretive significance

Geophysics ◽

10.1190/1.1442933 ◽

1990 ◽

Vol 55 (9) ◽

pp. 1166-1182 ◽

Cited By ~ 21

Author(s):

Irshad R. Mufti

Keyword(s):

Finite Difference ◽

Wave Field ◽

Large Scale ◽

Three Dimensional ◽

Synthetic Data ◽

Real Data ◽

The Real ◽

Need To Evaluate ◽

Zero Offset ◽

Seismic Models

Finite‐difference seismic models are commonly set up in 2-D space. Such models must be excited by a line source which leads to different amplitudes than those in the real data commonly generated from a point source. Moreover, there is no provision for any out‐of‐plane events. These problems can be eliminated by using 3-D finite‐difference models. The fundamental strategy in designing efficient 3-D models is to minimize computational work without sacrificing accuracy. This was accomplished by using a (4,2) differencing operator which ensures the accuracy of much larger operators but requires many fewer numerical operations as well as significantly reduced manipulation of data in the computer memory. Such a choice also simplifies the problem of evaluating the wave field near the subsurface boundaries of the model where large operators cannot be used. We also exploited the fact that, unlike the real data, the synthetic data are free from ambient noise; consequently, one can retain sufficient resolution in the results by optimizing the frequency content of the source signal. Further computational efficiency was achieved by using the concept of the exploding reflector which yields zero‐offset seismic sections without the need to evaluate the wave field for individual shot locations. These considerations opened up the possibility of carrying out a complete synthetic 3-D survey on a supercomputer to investigate the seismic response of a large‐scale structure located in Oklahoma. The analysis of results done on a geophysical workstation provides new insight regarding the role of interference and diffraction in the interpretation of seismic data.

Download Full-text

DGGAN: Depth-image Guided Generative Adversarial Networks for Disentangling RGB and Depth Images in 3D Hand Pose Estimation

2020 IEEE Winter Conference on Applications of Computer Vision (WACV) ◽

10.1109/wacv45572.2020.9093380 ◽

2020 ◽

Author(s):

Liangjian Chen ◽

Shih-Yao Lin ◽

Yusheng Xie ◽

Yen-Yu Lin ◽

Wei Fan ◽

...

Keyword(s):

Pose Estimation ◽

Depth Image ◽

Generative Adversarial Networks ◽

Hand Pose Estimation ◽

Image Guided ◽

Depth Images ◽

Adversarial Networks ◽

Hand Pose

Download Full-text

Multiple Classifiers-Based Feature Fusion for RGB-D Object Recognition

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001417500148 ◽

2017 ◽

Vol 31 (05) ◽

pp. 1750014 ◽

Cited By ~ 3

Author(s):

Yan Wu ◽

Jiqian Li ◽

Jing Bai

Keyword(s):

Object Recognition ◽

Feature Fusion ◽

Classification Performance ◽

Depth Image ◽

Depth Information ◽

The Past ◽

Depth Images ◽

Comparable Performance ◽

Accuracy Difference ◽

Rgb Image

RGB-D-based object recognition has been enthusiastically investigated in the past few years. RGB and depth images provide useful and complementary information. Fusing RGB and depth features can significantly increase the accuracy of object recognition. However, previous works just simply take the depth image as the fourth channel of the RGB image and concatenate the RGB and depth features, ignoring the different power of RGB and depth information for different objects. In this paper, a new method which contains three different classifiers is proposed to fuse features extracted from RGB image and depth image for RGB-D-based object recognition. Firstly, a RGB classifier and a depth classifier are trained by cross-validation to get the accuracy difference between RGB and depth features for each object. Then a variant RGB-D classifier is trained with different initialization parameters for each class according to the accuracy difference. The variant RGB-D-classifier can result in a more robust classification performance. The proposed method is evaluated on two benchmark RGB-D datasets. Compared with previous methods, ours achieves comparable performance with the state-of-the-art method.

Download Full-text