Monocular weakly supervised depth and pose estimation method based on multi-information fusion

2021 ◽  
Author(s):  
Zhimin Zhang ◽  
Jianzhong Qiao ◽  
Shukuan Lin ◽  
...

Depth and pose information are fundamental to robotics, autonomous driving, and virtual reality, and remain focal and difficult problems in computer vision research. Supervised monocular depth and pose estimation is not feasible in environments where labeled data are scarce. Self-supervised methods trained on monocular video learn from photometric constraints alone, avoiding expensive ground truth depth labels, but this results in an inefficient training process and suboptimal estimation accuracy. To solve these problems, a monocular weakly supervised depth and pose estimation method based on multi-information fusion is proposed in this paper. First, we design a high-precision stereo matching method to generate depth and pose data as "Ground Truth" labels, addressing the difficulty of obtaining real ground truth labels. Then, we construct a multi-information fusion network model based on the "Ground Truth" labels, the video sequence, and IMU information to improve estimation accuracy. Finally, we design a loss function that combines supervised cues based on the "Ground Truth" labels with self-supervised cues to optimize our model. In the testing phase, the network model outputs high-precision depth and pose data separately from a monocular video sequence. The resulting model outperforms mainstream monocular depth and pose estimation methods, as well as some stereo matching methods, on the challenging KITTI dataset while using only a small amount of real training data (200 pairs).
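As a rough sketch of how supervised and self-supervised cues can be combined in one objective, consider the following PyTorch-style loss; the function names, the lambda_sup weight, and the use of plain L1 terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def photometric_loss(target, reconstructed):
    """Self-supervised cue: discrepancy between the target frame and
    the frame reconstructed by warping with predicted depth and pose.
    A full implementation would add SSIM; plain L1 keeps this short."""
    return (target - reconstructed).abs().mean()

def supervised_loss(pred_depth, pseudo_gt_depth, valid_mask):
    """Weakly supervised cue: L1 against the stereo-matching
    "Ground Truth" labels, restricted to pixels with valid labels."""
    diff = (pred_depth - pseudo_gt_depth).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1)

def total_loss(target, reconstructed, pred_depth, pseudo_gt, mask,
               lambda_sup=1.0):
    # Weighted sum of the self-supervised and weakly supervised cues.
    return (photometric_loss(target, reconstructed)
            + lambda_sup * supervised_loss(pred_depth, pseudo_gt, mask))
```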

Symmetry ◽  
2019 ◽  
Vol 11 (5) ◽  
pp. 690
Author(s):  
Zhimin Zhang ◽  
Jianzhong Qiao ◽  
Shukuan Lin

Learning-based supervised monocular depth estimation methods have shown promising results compared with traditional methods. However, they require large amounts of high-quality ground truth depth data as supervision labels, and due to the limitations of acquisition equipment, recording ground truth depth for diverse scenes is expensive and impractical. Compared to supervised methods, self-supervised monocular depth estimation without ground truth depth is a promising research direction, but self-supervised depth estimation from a single image is geometrically ambiguous and suboptimal. In this paper, we propose a novel semi-supervised monocular stereo matching method, built on existing approaches, to improve the accuracy of depth estimation. The idea is inspired by the experimental observation that, within the same self-supervised network model, a stereo pair as input yields better depth estimation accuracy than a monocular view as input. We therefore decompose the monocular depth estimation problem into two sub-problems: a right-view synthesis process followed by a semi-supervised stereo matching process. To improve the accuracy of the synthesized right view, we extend the existing view synthesis method Deep3D with a left-right consistency constraint and a smoothness constraint. To reduce the error caused by the reconstructed right view, we propose a semi-supervised stereo matching model that uses disparity maps generated by a self-supervised stereo matching model as supervision cues, jointly with self-supervised cues, to optimize the stereo matching network. At test time, the two networks, connected in a pipeline, predict the depth map directly from a single image. Both procedures obey geometric principles and improve estimation accuracy. Test results on the KITTI dataset show that this method is superior to current mainstream monocular self-supervised depth estimation methods under the same conditions.
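To make the two added constraints concrete, here is a minimal PyTorch sketch of an edge-aware smoothness term and a left-right consistency term on disparity maps; NCHW tensors are assumed, and disp_right_warped would come from resampling the right disparity into the left view, a step omitted here.

```python
import torch

def lr_consistency_loss(disp_left, disp_right_warped):
    """Left-right consistency: the left disparity should agree with
    the right disparity resampled into the left view."""
    return (disp_left - disp_right_warped).abs().mean()

def smoothness_loss(disp, image):
    """Edge-aware smoothness: penalize disparity gradients, with the
    penalty weighted down where the image itself has strong edges."""
    dx = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    ix = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```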


2021 ◽  
Author(s):  
Dengqing Tang ◽  
Lincheng Shen ◽  
Xiaojiao Xiang ◽  
Han Zhou ◽  
Tianjiang Hu

We propose a learning-based, anchors-driven real-time pose estimation method for autolanding fixed-wing unmanned aerial vehicles (UAVs). The proposed method enables online tracking of both position and attitude by a ground stereo vision system in Global Navigation Satellite System-denied environments. A pipeline of convolutional neural network (CNN)-based UAV anchor detection followed by anchor-driven UAV pose estimation is employed. To realize robust and accurate anchor detection, we design and implement a Block-CNN architecture that reduces the impact of outliers. Based on the detected anchors, monocular and stereo vision-based filters are established to update the UAV position and attitude. To expand the training dataset without extra outdoor experiments, we develop a parallel system containing outdoor and simulated systems with the same configuration. Simulated and outdoor experiments demonstrate a remarkable pose estimation accuracy improvement over the conventional Perspective-n-Point solution. In addition, the experiments validate the feasibility of the proposed architecture and algorithm with respect to the accuracy and real-time requirements of fixed-wing autolanding UAVs.
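For reference, the conventional Perspective-n-Point baseline that the paper compares against can be sketched with OpenCV as below; the anchor coordinates and camera intrinsics are illustrative placeholders, not values from the paper.

```python
import cv2
import numpy as np

# Known 3D anchor positions on the airframe (object frame, metres),
# chosen coplanar here for simplicity; values are illustrative only.
object_pts = np.array([[0.0, 0.0, 0.0],
                       [1.2, 0.0, 0.0],
                       [1.2, 0.8, 0.0],
                       [0.0, 0.8, 0.0]], dtype=np.float64)
# Matching 2D anchor detections from the CNN (pixels), illustrative.
image_pts = np.array([[310.0, 250.0],
                      [420.0, 248.0],
                      [424.0, 175.0],
                      [312.0, 178.0]], dtype=np.float64)
# Pinhole intrinsics of the ground camera (illustrative).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix: attitude estimate
    print("translation (object frame -> camera):", tvec.ravel())
```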


Sensors ◽  
2019 ◽  
Vol 19 (17) ◽  
pp. 3784 ◽  
Author(s):  
Jameel Malik ◽  
Ahmed Elhayek ◽  
Didier Stricker

Hand shape and pose recovery is essential for many computer vision applications, such as animating a personalized hand mesh in a virtual environment. Although there are many hand pose estimation methods, only a few deep learning based algorithms target 3D hand shape and pose from a single RGB or depth image. Jointly estimating hand shape and pose is very challenging because none of the existing real benchmarks provides ground truth hand shape. For this reason, we propose a novel weakly-supervised approach for 3D hand shape and pose recovery (named WHSP-Net) from a single depth image that learns shapes from unlabeled real data and labeled synthetic data. To this end, we propose a framework consisting of three novel components. The first is a convolutional neural network (CNN) based deep network that produces 3D joint positions from learned 3D bone vectors using a new layer. The second is a novel shape decoder that recovers a dense 3D hand mesh from sparse joints. The third is a novel depth synthesizer that reconstructs a 2D depth image from the 3D hand mesh. The whole pipeline is fine-tuned in an end-to-end manner. We demonstrate that our approach recovers reasonable hand shapes from real-world datasets as well as from a live depth camera stream in real time. Our algorithm outperforms state-of-the-art methods that output more than joint positions, and shows competitive performance on the 3D pose estimation task.
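The bone-vector layer can be pictured as accumulating learned bone vectors along a kinematic tree; the sketch below assumes a generic 21-joint hand topology (the parent table is illustrative, and the paper's exact layer may differ).

```python
import numpy as np

# Parent index of each joint in a generic 21-joint hand skeleton;
# -1 marks the wrist (root). Illustrative topology, not the paper's.
PARENT = [-1, 0, 1, 2, 3,    # thumb chain
          0, 5, 6, 7,        # index finger
          0, 9, 10, 11,      # middle finger
          0, 13, 14, 15,     # ring finger
          0, 17, 18, 19]     # pinky

def bones_to_joints(bone_vecs, root=np.zeros(3)):
    """bone_vecs[j] is the learned 3D vector from joint PARENT[j] to
    joint j (bone_vecs[0] is ignored for the root). Joint positions
    follow by summing bone vectors along each kinematic chain."""
    joints = np.zeros((len(PARENT), 3))
    joints[0] = root
    for j in range(1, len(PARENT)):
        joints[j] = joints[PARENT[j]] + bone_vecs[j]
    return joints
```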


2020 ◽  
Vol 10 (24) ◽  
pp. 8866
Author(s):  
Sangyoon Lee ◽  
Hyunki Hong ◽  
Changkyoung Eem

Deep learning has been utilized for end-to-end camera pose estimation. To improve performance, we introduce a camera pose estimation method based on a 2D-3D matching scheme with two convolutional neural networks (CNNs). The scene is divided into voxels whose size and number are computed from the scene volume and the number of 3D points. We extract inlier points from the 3D point set in each voxel using random sample consensus (RANSAC)-based plane fitting, obtaining a set of interest points lying on a major plane. These points are then reprojected onto the image using the ground truth camera pose, after which a polygonal region is identified in each voxel using the convex hull. We designed a training dataset for 2D-3D matching consisting of inlier 3D points, correspondences across image pairs, and the voxel regions in the image. We trained a hierarchical learning structure with two CNNs on this dataset to detect the voxel regions and obtain the locations and descriptions of the interest points. After successful 2D-3D matching, the camera pose is estimated using an n-point pose solver within RANSAC. Experimental results show that our method estimates the camera pose more precisely than previous end-to-end estimators.
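A minimal sketch of the RANSAC plane-fitting step, assuming the 3D points of one voxel are given as an Nx3 NumPy array; the iteration count and inlier threshold are illustrative parameters, not the paper's settings.

```python
import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.02, rng=None):
    """Fit the major plane in a voxel's point set with RANSAC and
    return the boolean inlier mask (threshold in the cloud's units)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # Hypothesize a plane from 3 random points.
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        normal /= norm
        # Point-to-plane distances; count inliers.
        dist = np.abs((points - sample[0]) @ normal)
        mask = dist < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask
```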


2019 ◽  
Vol 6 ◽  
pp. 205566831881345 ◽  
Author(s):  
Rezvan Kianifar ◽  
Vladimir Joukov ◽  
Alexander Lee ◽  
Sachin Raina ◽  
Dana Kulić

Introduction: Inertial measurement units (IMUs) have been proposed for automated pose estimation and exercise monitoring in clinical settings. However, many existing methods assume an extensive calibration procedure, which may not be realizable in clinical practice. In this study, an IMU-based pose estimation method using an extended Kalman filter and kinematic chain modeling is adapted for lower body pose estimation during clinical mobility tests such as the single leg squat, and its sensitivity to parameter calibration is investigated. Methods: The sensitivity of pose estimation accuracy to each of the kinematic model and sensor placement parameters was analyzed. The results suggested that accurate extraction of IMU orientation on the body is a key factor in improving accuracy. Hence, a simple calibration protocol was proposed to better approximate IMU orientation. Results: After applying the protocol, the ankle, knee, and hip joint angle errors improved to [Formula: see text], and [Formula: see text], without the need for any other calibration. Conclusions: Only a small subset of kinematic and sensor parameters contributes significantly to pose estimation accuracy when using body-worn inertial sensors. A simple calibration procedure identifying IMU orientation on the body can provide good pose estimation performance.
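As a rough illustration of the filtering machinery, a single (linearized) Kalman predict/update step might look as follows in Python; the kinematic-chain process and measurement models from the paper are abstracted into placeholder matrices F and H, so this is a generic skeleton rather than the authors' filter.

```python
import numpy as np

def ekf_step(x, P, z, F, H, Q, R):
    """One predict/update cycle: x is the state (e.g., joint angles
    and angular velocities), P its covariance, z the IMU measurement,
    F/H the linearized process/measurement models, Q/R their noise
    covariances. All matrices here are caller-supplied placeholders."""
    # Predict through the process model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the measurement through the measurement model.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```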


Author(s):  
Yapeng Gao

For table tennis robots, it is a significant challenge to understand the opponent's movements and return the ball with high performance, since one has to cope with the various ball speeds and spins produced by different stroke types. In this paper, we propose a real-time 6D racket pose detection method and classify racket movements into five stroke categories with a neural network. Using two monocular cameras, we extract the racket's contour and choose special points on it as feature points in image coordinates. With the 3D geometric information of the racket, a wide-baseline stereo matching method is proposed to find the corresponding feature points and compute the 3D position and orientation of the racket by triangulation and plane fitting. A Kalman filter is then adopted to track the racket pose, and a multilayer perceptron (MLP) neural network is used to classify the pose movements. We conduct two experiments to evaluate the accuracy of racket pose detection and classification: the average position and orientation errors are around 7.8 mm and 7.2° when compared with ground truth from a KUKA robot, and the classification accuracy is 98%, the same as a human pose estimation method based on Convolutional Pose Machines (CPMs).
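The triangulate-then-fit step can be sketched as follows, assuming the two cameras' 3x4 projection matrices and matched 2xN contour points are already available (all inputs here are hypothetical, and the plane normal is only one part of the full 6D pose).

```python
import cv2
import numpy as np

def racket_pose(P1, P2, pts1, pts2):
    """Triangulate matched racket feature points from two calibrated
    views, then fit a plane by SVD; the centroid gives the position
    and the plane normal constrains the orientation."""
    # pts1/pts2 are 2xN arrays of matched pixel coordinates.
    X = cv2.triangulatePoints(P1, P2, pts1, pts2)
    X = (X[:3] / X[3]).T                 # Nx3 Euclidean 3D points
    centroid = X.mean(axis=0)            # racket position estimate
    # The smallest singular vector of the centered points is the
    # normal of the best-fit plane.
    _, _, vt = np.linalg.svd(X - centroid)
    normal = vt[-1]
    return centroid, normal
```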


Author(s):  
Chunbo Cheng ◽  
Hong Li ◽  
Liming Zhang

Supervised stereo matching costs must learn model parameters from public datasets with ground truth disparity maps. However, ground truth disparity maps are not easy to obtain, which makes supervised stereo matching costs difficult to apply in practice. This paper proposes an unsupervised stereo matching cost based on sparse representation (USMCSR). The method does not rely on ground truth disparity maps and also reduces the effects of illumination and exposure changes, making it suitable for measuring similarity between pixels in stereo matching. To achieve higher computational efficiency, we further propose an efficient parallel method for solving the sparse representation coefficients. Extensive experimental results on three commonly used datasets demonstrate the effectiveness of the proposed method. Finally, verification on a monocular video clip shows that USMCSR also works well without ground truth disparity maps.
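To picture what a sparse-representation matching cost computes, here is a minimal sketch: each patch is encoded over a shared dictionary and the resulting codes are compared. The dictionary, the lasso solver, and the L1 code distance are illustrative choices, not the paper's construction or its parallel solver.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(patch, D, lam=0.1):
    """Encode a flattened patch over dictionary D (patch_dim x atoms)
    by solving min ||D a - patch||^2 + lam * |a|_1."""
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=1000)
    lasso.fit(D, patch)
    return lasso.coef_

def matching_cost(patch_l, patch_r, D):
    """Matching cost between a left and right patch: distance between
    their sparse codes, which is less sensitive to illumination and
    exposure changes than raw intensity differences."""
    a_l = sparse_code(patch_l, D)
    a_r = sparse_code(patch_r, D)
    return np.linalg.norm(a_l - a_r, 1)
```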

