Real-time 3D Perception of Scene with Monocular Camera

2020 ◽  
Vol 7 (2) ◽  
pp. 4-7
Author(s):  
Shadi Saleh ◽  
Shanmugapriyan Manoharan ◽  
Wolfram Hardt

Depth is a vital prerequisite for tasks such as perception, navigation, and planning. Estimating depth from a single image is challenging because there is no analytic mapping between an intensity image and its depth, and the contextual cues that would constrain that mapping are usually absent from a single image. Furthermore, most current research relies on supervised learning for depth estimation and therefore requires recorded ground-truth depth at training time, which is difficult and costly to obtain. This study presents two approaches (unsupervised learning and semi-supervised learning) that learn depth information from a single RGB image. The main objective of depth estimation is to extract a representation of the spatial structure of the environment and to restore the 3D shape and visual appearance of objects in imagery.

Author(s):  
L. Madhuanand ◽  
F. Nex ◽  
M. Y. Yang

Abstract. Depth is an essential component of various scene understanding tasks and of reconstructing the 3D geometry of a scene. Estimating depth from stereo images requires multiple views of the same scene, which often cannot be captured when exploring new environments with a UAV. To overcome this, monocular depth estimation has become a topic of interest, helped by recent advances in computer vision and deep learning. This research has largely focused on indoor scenes or on outdoor scenes captured at ground level. Single-image depth estimation from aerial images has been limited by additional complexities arising from the increased camera distance and the wider area coverage with many occlusions. A new aerial image dataset is prepared specifically for this purpose, combining Unmanned Aerial Vehicle (UAV) images that cover different regions, features, and points of view. The single-image depth estimation is based on image reconstruction techniques, which use stereo images during training to learn to estimate depth from single images. Among the various models available for ground-level single-image depth estimation, two are used to learn depth from aerial UAV images: 1) a Convolutional Neural Network (CNN) and 2) a Generative Adversarial model (GAN). These models generate pixel-wise disparity images, which can be converted into depth information. The disparity maps generated by these models are evaluated for their internal quality using various error metrics. The results show that the CNN model generates smoother images with higher disparity ranges, while the GAN model generates sharper images with a lower disparity range. The disparity images are then converted to depth information and compared with point clouds obtained using Pix4D. The CNN model is found to perform better than the GAN model and produces depth similar to that of Pix4D. This comparison helps streamline efforts to produce depth from a single aerial image.
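
As a rough illustration of the disparity-to-depth conversion mentioned in this abstract, the sketch below applies the standard rectified-stereo relation Z = f * B / d. It is not the authors' code: the function name `disparity_to_depth`, the focal length, and the baseline are hypothetical values chosen for the example, and disparities are assumed to be given in pixels.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disp=1e-6):
    """Convert a 2D disparity map (pixels) to depth (metres) via Z = f * B / d.

    Pixels whose disparity is not above min_disp are mapped to infinity,
    since zero disparity corresponds to a point infinitely far away.
    """
    depth = []
    for row in disparity_px:
        depth.append([
            focal_px * baseline_m / d if d > min_disp else float("inf")
            for d in row
        ])
    return depth


# Hypothetical 2x2 disparity map, focal length (pixels) and baseline (metres).
disparity = [[64.0, 32.0], [16.0, 0.0]]
depth = disparity_to_depth(disparity, focal_px=720.0, baseline_m=0.5)
print(depth)
```

Networks that output disparity normalised by image width would need that value multiplied by the width in pixels before this conversion.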


2021 ◽  
Vol 8 ◽  
Author(s):  
Nicola A. Piga ◽  
Fabrizio Bottarel ◽  
Claudio Fantacci ◽  
Giulia Vezzani ◽  
Ugo Pattacini ◽  
...  

Tracking the 6D pose and velocity of objects is a fundamental requirement for modern robotic manipulation tasks. This paper proposes a 6D object pose tracking algorithm, called MaskUKF, that combines deep object segmentation networks and depth information with a serial Unscented Kalman Filter to track the pose and the velocity of an object in real time. MaskUKF achieves, and in most cases surpasses, state-of-the-art performance on the YCB-Video pose estimation benchmark without the need for expensive ground truth pose annotations at training time. Closed-loop control experiments on the iCub humanoid platform in simulation show that joint pose and velocity tracking helps achieve higher precision and reliability than one-shot deep pose estimation networks. A video of the experiments is available as Supplementary Material.


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4434 ◽  
Author(s):  
Sangwon Kim ◽  
Jaeyeal Nam ◽  
Byoungchul Ko

Depth estimation is a crucial and fundamental problem in computer vision. Conventional methods reconstruct scenes using feature points extracted from multiple images; because they require multiple images, these approaches are not easily implemented in many real-time applications. Moreover, the special equipment required by hardware-based approaches using 3D sensors is expensive. Software-based methods that estimate depth from a single image using machine learning or deep learning are therefore emerging as alternatives. In this paper, we propose an algorithm that generates a depth map in real time from a single image using an optimized lightweight efficient neural network (L-ENet) instead of physical equipment such as an infrared sensor or a multi-view camera. Because depth values are continuous and can produce locally ambiguous results, pixel-wise prediction with ordinal depth range classification is applied in this study. In addition, various convolution techniques are applied to extract a dense feature map, and the number of parameters is greatly reduced by reducing the number of network layers. The proposed L-ENet quickly generates an accurate depth map from a single image, with depth values close to the ground truth and small errors. Experiments confirm that the proposed L-ENet achieves significantly better estimation performance than state-of-the-art algorithms for single-image depth estimation.
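
To make the idea of ordinal depth range classification more concrete, here is a minimal sketch that turns a continuous depth value into an ordered class label and back. The abstract does not state how L-ENet discretises depth, so the log-spaced bins, the depth range, and the helper names (`log_bin_edges`, `depth_to_ordinal_label`, `label_to_depth`) are all assumptions made for illustration.

```python
import math


def log_bin_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 bin edges spaced uniformly in log-depth."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / num_bins
    return [math.exp(log_min + i * step) for i in range(num_bins + 1)]


def depth_to_ordinal_label(depth, edges):
    """Count the interior edges that depth reaches or exceeds.

    The result lies in 0 .. num_bins - 1, so depths outside the range
    fall into the first or last bin.
    """
    return sum(1 for edge in edges[1:-1] if depth >= edge)


def label_to_depth(label, edges):
    """Decode a label to the geometric centre of its depth bin."""
    return math.sqrt(edges[label] * edges[label + 1])


# Assumed range of 1 m to 16 m split into 4 ordered bins.
edges = log_bin_edges(1.0, 16.0, 4)
label = depth_to_ordinal_label(3.0, edges)
print(label, label_to_depth(label, edges))
```

Log spacing is one common choice because it gives finer bins to nearby depths, where errors usually matter more; uniform bins would work with the same label functions.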


2018 ◽  
Vol 30 (8) ◽  
pp. 1383 ◽  
Author(s):  
Tianteng Bi ◽  
Yue Liu ◽  
Dongdong Weng ◽  
Yongtian Wang

Sensors ◽  
2019 ◽  
Vol 19 (7) ◽  
pp. 1708 ◽  
Author(s):  
Daniel Stanley Tan ◽  
Chih-Yuan Yao ◽  
Conrado Ruiz ◽  
Kai-Lung Hua

Depth is a valuable piece of information for perception tasks such as robot grasping, obstacle avoidance, and navigation, which are essential for developing smart homes and smart cities. However, not all applications have the luxury of using depth sensors or multiple cameras to obtain depth information. In this paper, we tackle the problem of estimating per-pixel depth from a single image. Inspired by recent work on generative neural network models, we formulate depth estimation as a generative task in which we synthesize an image of the depth map from a single Red, Green, and Blue (RGB) input image. We propose a novel generative adversarial network that has an encoder-decoder type generator with residual transposed convolution blocks trained with an adversarial loss. Quantitative and qualitative experimental results demonstrate the effectiveness of our approach over several depth estimation works.


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Lalith Sharan ◽  
Lukas Burger ◽  
Georgii Kostiuchik ◽  
Ivo Wolf ◽  
Matthias Karck ◽  
...  

Abstract. In endoscopy, depth estimation is a task that potentially helps in quantifying visual information for better scene understanding. A plethora of depth estimation algorithms have been proposed in the computer vision community. The endoscopic domain, however, differs from the typical depth estimation scenario due to differences in the setup and the nature of the scene. Furthermore, it is infeasible to obtain ground truth depth information owing to the unsuitable detection range of off-the-shelf depth sensors and the difficulty of setting up a depth sensor in a surgical environment. In this paper, an existing self-supervised approach from the field of autonomous driving, called Monodepth [1], is applied to a novel dataset of stereo-endoscopic images from reconstructive mitral valve surgery. While it is already known that endoscopic scenes are more challenging than outdoor driving scenes, the paper performs experiments to quantify the comparison and describes the domain gap and the challenges involved in transferring these methods.


Sensors ◽  
2020 ◽  
Vol 20 (16) ◽  
pp. 4580
Author(s):  
Javier Pérez ◽  
Mitch Bryson ◽  
Stefan B. Williams ◽  
Pedro J. Sanz

Estimating depth from a single image is a challenging problem, but it is also of interest because of its large number of applications, such as underwater image dehazing. In this paper, a new perspective is provided: because underwater haze can provide a strong cue to the depth of the scene, a neural network can be used to estimate depth from it. The resulting depth map can then be used in a dehazing method to enhance the image and recover the original colors, offering a better input to image recognition algorithms and thus improving robot performance during vision-based tasks such as object detection and characterization of the seafloor. Experiments are conducted on different datasets that cover a wide variety of textures and conditions, using a dense stereo depth map as ground truth for training, validation, and testing. The results show that the neural network outperforms alternatives such as dark channel prior methods and, after a training stage with depth information, is able to accurately estimate depth from a single image.
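
The abstract does not say which dehazing method consumes the depth map. As one plausible illustration, the sketch below inverts the widely used haze formation model I = J * t + A * (1 - t) with transmission t = exp(-beta * d), where d is the estimated depth. The function name `dehaze_pixel`, the veiling-light colour A, and the per-channel attenuation coefficients beta are hypothetical values, not taken from the paper.

```python
import math


def dehaze_pixel(observed, depth_m, veiling_light, beta, t_min=0.1):
    """Recover scene radiance J for one RGB pixel from I = J*t + A*(1 - t).

    observed, veiling_light and beta are per-channel sequences (R, G, B)
    with intensities in [0, 1]. Transmission is t = exp(-beta * depth),
    floored at t_min so that noise is not amplified at large depths.
    """
    restored = []
    for i_c, a_c, b_c in zip(observed, veiling_light, beta):
        t = max(math.exp(-b_c * depth_m), t_min)
        j_c = (i_c - a_c) / t + a_c
        restored.append(min(max(j_c, 0.0), 1.0))  # clamp to valid range
    return restored


# Round trip with assumed water parameters: haze a known colour, then restore it.
true_colour = [0.8, 0.5, 0.3]
veil = [0.1, 0.4, 0.6]       # assumed bluish-green veiling light
beta = [0.4, 0.1, 0.05]      # assumed attenuation: red fades fastest
depth_m = 2.0
hazy = [j * math.exp(-b * depth_m) + a * (1.0 - math.exp(-b * depth_m))
        for j, a, b in zip(true_colour, veil, beta)]
print(dehaze_pixel(hazy, depth_m, veil, beta))
```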


Electronics ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1179 ◽  
Author(s):  
Tao Huang ◽  
Shuanfeng Zhao ◽  
Longlong Geng ◽  
Qian Xu

Most existing monocular depth estimation methods based on supervised learning require vast quantities of corresponding ground-truth depth data for training. To avoid this requirement and take full advantage of the information in images captured by drones, an unsupervised monocular depth estimation model for drones is proposed, based on a residual neural network with coarse–refined feature extraction. Inspired by the principle of binocular depth estimation, a virtual camera is introduced through a deep residual convolutional neural network with coarse–refined feature extraction, which turns unsupervised monocular depth estimation into an image reconstruction problem. To improve the performance of the model, the following innovations are proposed. First, pyramid processing of the input image is proposed to build the topological relationship between the resolution and the depth of the input image, which improves sensitivity to depth information from a single image and reduces the impact of input resolution on depth estimation. Second, the residual neural network with coarse–refined feature extraction for corresponding image reconstruction is designed to improve the accuracy of feature extraction and to resolve the trade-off between computation time and the number of network layers. In addition, to predict highly detailed output depth maps, long skip connections are added between corresponding layers of the coarse feature extraction network and the deconvolutional network for refined feature extraction. Third, the corresponding image reconstruction loss based on the structural similarity index (SSIM), the approximate disparity smoothness loss, and the depth map loss are combined into a novel training loss to better train the model. The experimental results show that, compared with state-of-the-art monocular depth estimation methods, the model performs better on the KITTI dataset (composed of corresponding left and right views) and on the Make3D dataset (composed of images and corresponding ground-truth depth maps), and that, when trained on KITTI, it basically meets the requirements for depth information from images captured by drones.
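
The SSIM-based reconstruction term described in this abstract can be sketched as follows. This is a simplified stand-in rather than the paper's loss: SSIM is computed once over the whole patch instead of over local windows, the blend with an L1 term and the weight `alpha = 0.85` are assumptions borrowed from common practice in self-supervised stereo methods, and the smoothness and depth map terms are omitted.

```python
def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM between two flat intensity lists in [0, 1], using global statistics."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def reconstruction_loss(target, reconstructed, alpha=0.85):
    """Blend a structural term, (1 - SSIM) / 2, with a mean absolute error term."""
    ssim_term = (1.0 - global_ssim(target, reconstructed)) / 2.0
    l1_term = sum(abs(a - b) for a, b in zip(target, reconstructed)) / len(target)
    return alpha * ssim_term + (1.0 - alpha) * l1_term


# Hypothetical flattened patches: the real view and a view warped from the other camera.
left = [0.2, 0.4, 0.6, 0.8]
warped = [0.25, 0.35, 0.65, 0.7]
print(reconstruction_loss(left, warped))
```

The loss is zero when the reconstructed view matches the target exactly and grows as either structure or intensity diverges.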


Author(s):  
M. Hermann ◽  
B. Ruf ◽  
M. Weinmann ◽  
S. Hinz

Abstract. Methods for monocular depth estimation based on supervised learning usually require large amounts of extensively annotated training data. In the case of aerial imagery, this ground truth is particularly difficult to acquire. In this paper, we therefore present a self-supervised learning method for monocular depth estimation from aerial imagery that does not require annotated training data. We use only an image sequence from a single moving camera and learn to estimate depth and pose information simultaneously. By sharing the weights between pose and depth estimation, we obtain a relatively small model, which favors real-time application. We evaluate our approach on three diverse datasets and compare the results to conventional methods that estimate depth maps based on multi-view geometry. We achieve an accuracy (δ < 1.25) of up to 93.5 %. In addition, we pay particular attention to the generalization of a trained model to unknown data and to the self-improving capabilities of our approach. We conclude that, even though the results of monocular depth estimation are inferior to those achieved by conventional methods, they are well suited to provide a good initialization for methods that rely on image matching, or to provide estimates in regions where image matching fails, e.g. occluded or texture-less regions.
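
For readers unfamiliar with the accuracy figure quoted above, the threshold accuracy is conventionally the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth. The sketch below computes it for flat lists of depths; the function name `threshold_accuracy` and the sample values are illustrative and not taken from the paper.

```python
def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred / gt, gt / pred) < threshold.

    Pixels with a non-positive prediction or ground truth are skipped,
    since the ratio is undefined for them.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not valid:
        raise ValueError("no valid pixels to evaluate")
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)


# Hypothetical depths in metres; only the third pixel is off by more than 25 %.
predicted = [1.0, 2.0, 3.0, 10.0]
ground_truth = [1.1, 2.0, 4.0, 8.5]
print(threshold_accuracy(predicted, ground_truth))  # 0.75
```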

