WHSP-Net: A Weakly-Supervised Approach for 3D Hand Shape and Pose Recovery from a Single Depth Image

Sensors ◽  
2019 ◽  
Vol 19 (17) ◽  
pp. 3784 ◽  
Author(s):  
Jameel Malik ◽  
Ahmed Elhayek ◽  
Didier Stricker

Hand shape and pose recovery is essential for many computer vision applications, such as animating a personalized hand mesh in a virtual environment. Although there are many hand pose estimation methods, only a few deep learning based algorithms target 3D hand shape and pose from a single RGB or depth image. Jointly estimating hand shape and pose is very challenging because none of the existing real benchmarks provides ground truth hand shape. For this reason, we propose a novel weakly-supervised approach for 3D hand shape and pose recovery (named WHSP-Net) from a single depth image by learning shapes from unlabeled real data and labeled synthetic data. To this end, we propose a framework consisting of three novel components. The first is a Convolutional Neural Network (CNN) based deep network that produces 3D joint positions from learned 3D bone vectors using a new layer. The second is a shape decoder that recovers a dense 3D hand mesh from sparse joints. The third is a depth synthesizer that reconstructs a 2D depth image from the 3D hand mesh. The whole pipeline is fine-tuned in an end-to-end manner. We demonstrate that our approach recovers reasonable hand shapes from real-world datasets as well as from a live depth camera stream in real time. Our algorithm outperforms state-of-the-art methods that output more than the joint positions and shows competitive performance on the 3D pose estimation task.
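
The abstract does not specify how the new layer turns bone vectors into joint positions. As a rough illustration only, one plausible reading is that each joint is its parent joint plus the bone vector connecting them, accumulated along the kinematic tree. The function name `joints_from_bones`, the `parents` encoding, and the toy skeleton below are assumptions for this sketch, not details from the paper.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def joints_from_bones(root: Vec3,
                      bones: Sequence[Vec3],
                      parents: Sequence[int]) -> List[Vec3]:
    """Accumulate bone vectors along a kinematic tree (hypothetical layout).

    Joint 0 is the root. Bone i connects joint parents[i] to joint i + 1,
    so parents[i] must refer to a joint that has already been computed.
    """
    joints: List[Vec3] = [root]
    for bone, parent in zip(bones, parents):
        px, py, pz = joints[parent]
        joints.append((px + bone[0], py + bone[1], pz + bone[2]))
    return joints


# Toy skeleton (assumed): the root has two children, and joint 1 has one child.
joints = joints_from_bones(
    root=(0.0, 0.0, 0.0),
    bones=[(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)],
    parents=[0, 1, 0],
)
print(joints)
```

Because the accumulation is a sequence of additions, a layer of this kind would be differentiable, which is consistent with the end-to-end fine-tuning the abstract mentions.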

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract Background: Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric makes it possible to combine both intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric’s potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
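
To make the idea of planting a tricluster concrete, here is a minimal sketch of a generator that fills an observations × features × contexts array with uniform background noise and overwrites one subspace with a constant value, returning the planted indices as ground truth. This is not G-Tric's API: the function name, the constant pattern, and the uniform background are all assumptions chosen for brevity.

```python
import random


def generate_with_planted_tricluster(shape, rows, cols, ctxs,
                                     value=5.0, seed=0):
    """Return (data, truth) for a toy three-way dataset (hypothetical helper).

    data[i][j][k] holds observation i, feature j, context k. Background
    values are drawn uniformly from [0, 1]; the planted tricluster is the
    subspace rows x cols x ctxs, set to a constant value.
    """
    rng = random.Random(seed)
    n_obs, n_feat, n_ctx = shape
    data = [[[rng.uniform(0.0, 1.0) for _ in range(n_ctx)]
             for _ in range(n_feat)]
            for _ in range(n_obs)]
    for i in rows:
        for j in cols:
            for k in ctxs:
                data[i][j][k] = value
    truth = {
        "observations": sorted(rows),
        "features": sorted(cols),
        "contexts": sorted(ctxs),
    }
    return data, truth


data, truth = generate_with_planted_tricluster(
    shape=(6, 5, 4), rows=[3, 1], cols=[0, 2], ctxs=[1])
print(truth)
```

Because the planted indices are returned alongside the data, the output of a triclustering algorithm can be scored against `truth` with extrinsic metrics, which is the kind of evaluation the abstract describes.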


2020 ◽  
Vol 34 (07) ◽  
pp. 11221-11228
Author(s):  
Yueying Kao ◽  
Weiming Li ◽  
Qiang Wang ◽  
Zhouchen Lin ◽  
Wooshik Kim ◽  
...  

Monocular object pose estimation is an important yet challenging computer vision problem. Depth features can provide useful information for pose estimation. However, existing methods rely on real depth images to extract depth features, which makes them difficult to use in many applications. In this paper, we aim to extract RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation. Specifically, a deep convolutional neural network is proposed with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module. The embedding module is trained with synthetic pair data to learn a depth-oriented embedding space between RGB and depth images optimized for object pose estimation. The adaptation module further aligns the distributions of synthetic and real data. Compared to existing methods, our method does not need any real depth images and can be trained easily with large-scale synthetic data. Extensive experiments and comparisons show that our method achieves the best performance on the challenging public PASCAL 3D+ dataset in all metrics, which substantiates the superiority of our method and the above modules.


Author(s):  
Jonas Hein ◽  
Matthias Seibold ◽  
Federica Bogo ◽  
Mazda Farshad ◽  
Marc Pollefeys ◽  
...  

Abstract Purpose: Tracking of tools and surgical activity is becoming increasingly important in the context of computer-assisted surgery. In this work, we present a data generation framework, dataset and baseline methods to facilitate further research in the direction of markerless hand and instrument pose estimation in realistic surgical scenarios. Methods: We developed a rendering pipeline to create inexpensive and realistic synthetic data for model pretraining. Subsequently, we propose a pipeline to capture and label real data with hand and object pose ground truth in an experimental setup to gather high-quality real data. We furthermore present three state-of-the-art RGB-based pose estimation baselines. Results: We evaluate three baseline models on the proposed datasets. The best-performing baseline achieves an average tool 3D vertex error of 16.7 mm on synthetic data and 13.8 mm on real data, which is comparable to the state of the art in RGB-based hand/object pose estimation. Conclusion: To the best of our knowledge, we propose the first synthetic and real data generation pipelines to generate hand and object pose labels for open surgery. We present three baseline models for RGB-based object and object/hand pose estimation. Our realistic synthetic data generation pipeline may help overcome the data bottleneck in the surgical domain and can easily be transferred to other medical applications.
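
The abstract reports an average 3D vertex error but does not define it. A common reading, assumed here, is the mean Euclidean distance between corresponding predicted and ground-truth mesh vertices. The helper name `mean_vertex_error` and the toy coordinates are illustrative and are not taken from the paper.

```python
import math


def mean_vertex_error(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D vertices.

    Both arguments are equally long sequences of (x, y, z) tuples in the
    same unit (for example millimetres); the result is in that unit.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("vertex lists must be non-empty and equally long")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Toy example (assumed data): one exact vertex, one vertex 5 units away.
error = mean_vertex_error(
    predicted=[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    ground_truth=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
)
print(error)
```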


2021 ◽  
Vol 40 (3) ◽  
pp. 1-12
Author(s):  
Hao Zhang ◽  
Yuxiao Zhou ◽  
Yifei Tian ◽  
Jun-Hai Yong ◽  
Feng Xu

Reconstructing hand-object interactions is a challenging task due to strong occlusions and complex motions. This article proposes a real-time system that uses a single depth stream to simultaneously reconstruct hand poses, object shape, and rigid/non-rigid motions. To achieve this, we first train a joint learning network to segment the hand and object in a depth image, and to predict the 3D keypoints of the hand. With most layers shared by the two tasks, computation cost is saved for the real-time performance. A hybrid dataset is constructed here to train the network with real data (to learn real-world distributions) and synthetic data (to cover variations of objects, motions, and viewpoints). Next, the depth of the two targets and the keypoints are used in a uniform optimization to reconstruct the interacting motions. Benefitting from a novel tangential contact constraint, the system not only solves the remaining ambiguities but also keeps the real-time performance. Experiments show that our system handles different hand and object shapes, various interactive motions, and moving cameras.


Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1299
Author(s):  
Honglin Yuan ◽  
Tim Hoogenkamp ◽  
Remco C. Veltkamp

Deep learning has achieved great success on robotic vision tasks. However, when compared with other vision-based tasks, it is difficult to collect a representative and sufficiently large training set for six-dimensional (6D) object pose estimation, due to the inherent difficulty of data collection. In this paper, we propose the RobotP dataset consisting of commonly used objects for benchmarking in 6D object pose estimation. To create the dataset, we apply a 3D reconstruction pipeline to produce high-quality depth images, ground truth poses, and 3D models for well-selected objects. Subsequently, based on the generated data, we produce object segmentation masks and two-dimensional (2D) bounding boxes automatically. To further enrich the data, we synthesize a large number of photo-realistic color-and-depth image pairs with ground truth 6D poses. Our dataset is freely distributed to research groups by the Shape Retrieval Challenge benchmark on 6D pose estimation. Based on our benchmark, different learning-based approaches are trained and tested on the unified dataset. The evaluation results indicate that there is considerable room for improvement in 6D object pose estimation, particularly for objects with dark colors, and that photo-realistic images help increase the performance of pose estimation algorithms.


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Francesca Pizzorni Ferrarese ◽  
Flavio Simonetti ◽  
Roberto Israel Foroni ◽  
Gloria Menegaz

Validation and accuracy assessment are the main bottlenecks preventing the adoption of image processing algorithms in clinical practice. In the classical approach, a posteriori analysis is performed through objective metrics. In this work, a different approach based on Petri nets is proposed. The basic idea is to predict the accuracy of a given pipeline by identifying and characterizing the sources of inaccuracy. The concept is demonstrated on a case study: intrasubject rigid and affine registration of magnetic resonance images. Both synthetic and real data are considered. While synthetic data allow performance to be benchmarked against the ground truth, real data make it possible to assess the robustness of the methodology in real contexts and to determine the suitability of synthetic data for the training phase. Results revealed a higher correlation and a lower dispersion among the metrics for simulated data, while the opposite trend was observed for pathologic ones. Results show that the proposed model not only provides good prediction performance but also leads to the optimization of the end-to-end chain in terms of accuracy and robustness, laying the groundwork for its generalization to different and more complex scenarios.


2021 ◽  
Author(s):  
Zhimin Zhang ◽  
Jianzhong Qiao ◽  
Shukuan Lin ◽  
...  

Depth and pose estimation are basic problems in robotics, autonomous driving, and virtual reality, and are also a focus and a difficulty of computer vision research. Supervised monocular depth and pose estimation is not feasible in environments where labeled data is not abundant. Self-supervised monocular video methods learn only from photometric constraints, without expensive ground-truth depth labels, which results in an inefficient training process and suboptimal estimation accuracy. To solve these problems, this paper proposes a monocular weakly supervised depth and pose estimation method based on multi-information fusion. First, we design a high-precision stereo matching method to generate depth and pose data as "Ground Truth" labels, addressing the difficulty of obtaining ground truth labels. Then, we construct a multi-information fusion network model based on the "Ground Truth" labels, the video sequence, and IMU information to improve the estimation accuracy. Finally, we design a loss function that combines supervised cues based on the "Ground Truth" labels with self-supervised cues to optimize our model. In the testing phase, the network model can separately output high-precision depth and pose data from a monocular video sequence. The resulting model outperforms mainstream monocular depth and pose estimation methods, as well as some stereo matching methods, on the challenging KITTI dataset while using only a small amount of real training data (200 pairs).


2021 ◽  
Vol 11 (17) ◽  
pp. 8047
Author(s):  
Dongkyu Lee ◽  
Wee Peng Tay ◽  
Seok-Cheol Kee

In this work, a study was carried out to estimate a look-up table (LUT) that converts a camera image plane to a bird's-eye view (BEV) plane using a single camera. Traditional camera pose estimation approaches incur high costs in researching and manufacturing future autonomous vehicles and may require pre-configured infrastructure. This paper proposes a camera calibration system for autonomous driving that is low cost and requires little infrastructure. We studied a network that estimates the camera pose under urban road driving conditions using a single camera and outputs an LUT that converts the image into a BEV. We propose a network that predicts human-like poses from a single image. We collected synthetic data using a simulator, generated BEV images and LUTs as ground truth, and used them to train the proposed network's pose estimation function. In the process, the network predicts the pose by decoding semantic segmentation features, and its performance is increased by attaching a layer that handles the overall direction of the network. The network also outputs the camera angles (roll/pitch/yaw) in the 3D coordinate system so that the user can monitor learning. Since the network's output is an LUT, no additional calculation is needed, and real-time performance is improved.
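
The claim that an LUT output needs no additional calculation can be illustrated with a small sketch: once every BEV pixel stores the source pixel it should copy from, building the BEV is a plain lookup. The LUT layout below (one `(row, col)` pair per BEV pixel, or `None` where no source pixel exists) and the function name `apply_lut` are assumptions for illustration; the paper's actual LUT format is not given in the abstract.

```python
def apply_lut(image, lut, fill=0):
    """Build a bird's-eye view by copying pixels through a look-up table.

    image is a 2D list of pixel values. lut has the shape of the BEV
    output; each entry is the (row, col) of the source pixel in image,
    or None when the BEV pixel has no source (it then receives fill).
    """
    bev = []
    for lut_row in lut:
        out_row = []
        for src in lut_row:
            if src is None:
                out_row.append(fill)
            else:
                r, c = src
                out_row.append(image[r][c])
        bev.append(out_row)
    return bev


# Toy 2x2 image and LUT (assumed values).
image = [[10, 20],
         [30, 40]]
lut = [[(1, 0), (0, 1)],
       [None, (0, 0)]]
bev = apply_lut(image, lut)
print(bev)
```

No projection matrices or trigonometry are evaluated at conversion time, which is why a predicted LUT can help real-time performance.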


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8477
Author(s):  
Roozbeh Mohammadi ◽  
Claudio Roncoli

Connected vehicles (CVs) have the potential to collect and share information that, if appropriately processed, can be employed for advanced traffic control strategies, rendering infrastructure-based sensing obsolete. However, before we reach a fully connected environment, where all vehicles are CVs, we have to deal with the challenge of incomplete data. In this paper, we develop data-driven methods for the estimation of vehicles approaching a signalised intersection, based on the availability of partial information stemming from an unknown penetration rate of CVs. In particular, we build machine learning models with the aim of capturing the nonlinear relations between the inputs (CV data) and the output (number of non-connected vehicles), which are characterised by highly complex interactions and may be affected by a large number of factors. We show that, in order to train these models, we may use data that can be easily collected with modern technologies. Moreover, we demonstrate that, if the available real data is not deemed sufficient, training can be performed using synthetic data, produced via microscopic simulations calibrated with real data, without a significant loss of performance. Numerical experiments, where the estimation methods are tested using real vehicle data simulating the presence of various penetration rates of CVs, show very good performance of the estimators, making them promising candidates for applications in the near future.


Author(s):  
Z. Li ◽  
J. Wang

Least squares image matching (LSM) has been extensively applied and researched for its high matching accuracy. However, it still suffers from some problems. Firstly, it needs an appropriate estimate of the initial values. In practical applications, however, initial values may contain biases from inaccurate keypoint positions. Such biases, if large enough, may lead to a divergent solution. If all the matching biases have exactly the same magnitude and direction, they can be regarded as systematic errors. Secondly, an imaging sensor may malfunction, producing dead or stuck pixels in the image. Statistically, these can be regarded as outliers. Because least squares estimation is well known for its inability to resist outliers, all of these deviations from the model assumed by LSM can cause a matching failure. To address these problems, a series of experiments considering systematic errors and outliers is designed using simulated and real data, and a variety of robust estimation methods, including a RANSAC-based method and the M, S, and MM estimators, are applied and compared in LSM. In addition, an evaluation criterion directly related to the ground truth is proposed for comparing the performance of these robust estimators. It is found that the robust estimators are more robust to these deviations than LSM. Among the robust estimators, the M and MM estimators perform best.
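
The sensitivity of least squares to outliers, and the way an M estimator reduces it, can be shown on a deliberately simplified problem. The sketch below is not LSM itself: it estimates a single location parameter (think of one shift value) from noisy measurements, using Huber weights in an iteratively reweighted least squares loop. The fixed threshold `k`, the median starting point, and the toy data are assumptions made for illustration.

```python
import statistics


def huber_location(values, k=0.5, iterations=50):
    """Huber M-estimate of location via iteratively reweighted least squares.

    Residuals within k get weight 1 (ordinary least squares behaviour);
    larger residuals get weight k / |r|, which limits their influence.
    k is a fixed threshold in data units (an assumption; in practice it
    is usually scaled by a robust estimate of the noise level).
    """
    mu = statistics.median(values)  # robust starting point
    for _ in range(iterations):
        weights = []
        for x in values:
            r = abs(x - mu)
            weights.append(1.0 if r <= k else k / r)
        mu = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return mu


# Four consistent measurements and one gross outlier (assumed data).
shifts = [1.0, 1.1, 0.9, 1.0, 10.0]
least_squares = sum(shifts) / len(shifts)  # the plain mean
robust = huber_location(shifts)
print(least_squares, robust)
```

On this data the plain mean is pulled far toward the outlier, while the Huber estimate stays close to the consistent measurements, which mirrors the qualitative finding reported in the abstract.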

