Real-time 3D Perception of Scene with Monocular Camera

Depth is a vital prerequisite for tasks such as perception, navigation, and planning. Estimating depth from a single image is challenging because no analytic mapping exists between an intensity image and its depth, and the contextual feature cues that would help are usually absent from a single image. Furthermore, most current research relies on supervised learning for depth estimation, which requires recorded ground-truth depth at training time; collecting it is difficult and costly. This study presents two approaches (unsupervised learning and semi-supervised learning) that learn depth information from only a single RGB image. The main objective of depth estimation is to extract a representation of the spatial structure of the environment and to restore the 3D shape and visual appearance of objects in imagery.

DEEP LEARNING FOR MONOCULAR DEPTH ESTIMATION FROM UAV IMAGES

ISPRS Annals of Photogrammetry Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-annals-v-2-2020-451-2020 ◽

2020 ◽

Vol V-2-2020 ◽

pp. 451-458

Author(s):

L. Madhuanand ◽

F. Nex ◽

M. Y. Yang

Keyword(s):

Deep Learning ◽

Ground Level ◽

Depth Estimation ◽

Aerial Images ◽

Aerial Image ◽

Depth Information ◽

Single Image ◽

Monocular Depth ◽

Uav Images ◽

Image Depth

Abstract. Depth is an essential component for various scene understanding tasks and for reconstructing the 3D geometry of the scene. Estimating depth from stereo images requires multiple views of the same scene to be captured, which is often not possible when exploring new environments with a UAV. To overcome this, monocular depth estimation has become a topic of interest with the recent advancements in computer vision and deep learning techniques. This research has largely focused on indoor scenes or outdoor scenes captured at ground level. Single image depth estimation from aerial images has been limited due to additional complexities arising from increased camera distance and wider area coverage with many occlusions. A new aerial image dataset is prepared specifically for this purpose, combining Unmanned Aerial Vehicle (UAV) images covering different regions, features and points of view. The single image depth estimation is based on image reconstruction techniques that use stereo images for learning to estimate depth from single images. Among the various available models for ground-level single image depth estimation, two models, 1) a Convolutional Neural Network (CNN) and 2) a Generative Adversarial model (GAN), are used to learn depth from aerial images from UAVs. These models generate pixel-wise disparity images which can be converted into depth information. The generated disparity maps from these models are evaluated for their internal quality using various error metrics. The results show higher disparity ranges with smoother images generated by the CNN model and sharper images with a smaller disparity range generated by the GAN model. The produced disparity images are converted to depth information and compared with point clouds obtained using Pix4D. It is found that the CNN model performs better than the GAN and produces depth similar to that of Pix4D. This comparison helps in streamlining the efforts to produce depth from a single aerial image.
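
The disparity-to-depth conversion mentioned in the abstract can be illustrated with the standard rectified-stereo relation depth = focal length × baseline / disparity. The following is a minimal sketch, not the authors' implementation: the function name, the parameter values, and the assumption that disparity is expressed in pixels are all hypothetical.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) into depth (metres) for a rectified stereo pair.

    Assumes disparity is given in pixels; pixels with (near-)zero disparity
    are mapped to infinity because they correspond to points at infinite range.
    """
    depth = []
    for row in disparity_px:
        depth.append([
            focal_length_px * baseline_m / d if d > min_disparity else float("inf")
            for d in row
        ])
    return depth


# Hypothetical 2x2 disparity map, focal length and baseline.
disparity = [[8.0, 4.0], [2.0, 0.0]]
depth = disparity_to_depth(disparity, focal_length_px=1000.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, small disparity errors at long range (as in aerial imagery) translate into large depth errors, which is one reason the disparity range of each model matters.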

MaskUKF: An Instance Segmentation Aided Unscented Kalman Filter for 6D Object Pose and Velocity Tracking

Frontiers in Robotics and AI ◽

10.3389/frobt.2021.594583 ◽

2021 ◽

Vol 8 ◽

Author(s):

Nicola A. Piga ◽

Fabrizio Bottarel ◽

Claudio Fantacci ◽

Giulia Vezzani ◽

Ugo Pattacini ◽

...

Keyword(s):

Kalman Filter ◽

Pose Estimation ◽

Unscented Kalman Filter ◽

Ground Truth ◽

Depth Information ◽

Loop Control ◽

Training Time ◽

Pose Tracking ◽

Supplementary Material ◽

Velocity Tracking

Tracking the 6D pose and velocity of objects is a fundamental requirement for modern robotic manipulation tasks. This paper proposes a 6D object pose tracking algorithm, called MaskUKF, that combines deep object segmentation networks and depth information with a serial Unscented Kalman Filter to track the pose and the velocity of an object in real time. MaskUKF achieves and in most cases surpasses state-of-the-art performance on the YCB-Video pose estimation benchmark without the need for expensive ground truth pose annotations at training time. Closed loop control experiments on the iCub humanoid platform in simulation show that joint pose and velocity tracking helps achieve higher precision and reliability than one-shot deep pose estimation networks. A video of the experiments is available as Supplementary Material.
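
For readers unfamiliar with the Unscented Kalman Filter, its core building block is the unscented transform, which propagates a small set of sigma points through a nonlinear function instead of linearizing it. The sketch below shows that step for a scalar state only. It is an illustration under simplifying assumptions (one dimension, the basic kappa-weighted formulation, a hypothetical function name) and does not reproduce MaskUKF, which tracks a 6D pose and velocity.

```python
import math


def unscented_transform_1d(mean, variance, transform, kappa=2.0):
    """Propagate a scalar Gaussian (mean, variance) through `transform`.

    Uses three sigma points: the mean and two points placed symmetrically
    at sqrt((n + kappa) * variance) from it, with n = 1 for a scalar state.
    """
    n = 1
    spread = math.sqrt((n + kappa) * variance)
    sigma_points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]

    propagated = [transform(p) for p in sigma_points]
    new_mean = sum(w * y for w, y in zip(weights, propagated))
    new_variance = sum(w * (y - new_mean) ** 2 for w, y in zip(weights, propagated))
    return new_mean, new_variance


# Sanity check with a linear map, where the exact answer is known:
# mean -> 2 * mean + 1 and variance -> 4 * variance.
m, v = unscented_transform_1d(1.0, 0.25, lambda x: 2.0 * x + 1.0)
```

In a full filter this transform is applied twice per step, once through the motion model and once through the measurement model.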

Fast Depth Estimation in a Single Image Using Lightweight Efficient Neural Network

Sensors ◽

10.3390/s19204434 ◽

2019 ◽

Vol 19 (20) ◽

pp. 4434 ◽

Cited By ~ 1

Author(s):

Sangwon Kim ◽

Jaeyeal Nam ◽

Byoungchul Ko

Keyword(s):

Neural Network ◽

Real Time ◽

Fundamental Problem ◽

Depth Map ◽

Ground Truth ◽

Depth Estimation ◽

Depth Range ◽

Single Image ◽

Special Equipment ◽

Multiple Images

Depth estimation is a crucial and fundamental problem in the computer vision field. Conventional methods reconstruct scenes using feature points extracted from multiple images; however, these approaches require multiple images and thus are not easily implemented in various real-time applications. Moreover, the special equipment required by hardware-based approaches using 3D sensors is expensive. Therefore, software-based methods for estimating depth from a single image using machine learning or deep learning are emerging as new alternatives. In this paper, we propose an algorithm that generates a depth map in real time using a single image and an optimized lightweight efficient neural network (L-ENet) algorithm instead of physical equipment, such as an infrared sensor or multi-view camera. Because depth values have a continuous nature and can produce locally ambiguous results, pixel-wise prediction with ordinal depth range classification was applied in this study. In addition, in our method various convolution techniques are applied to extract a dense feature map, and the number of parameters is greatly reduced by reducing the number of network layers. By using the proposed L-ENet algorithm, an accurate depth map can be generated quickly from a single image, with depth values that differ from the ground truth by only small errors. Experiments confirmed that the proposed L-ENet can achieve a significantly improved estimation performance over the state-of-the-art algorithms in depth estimation based on a single image.
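
Ordinal depth range classification replaces regression on a continuous depth value with classification over ordered depth intervals. The sketch below shows one common way to set this up, using log-spaced bins and a cumulative (ordinal) target encoding. The bin spacing, bin count, and function names are assumptions made for illustration and are not taken from the L-ENet paper.

```python
import math


def log_spaced_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 bin edges spaced uniformly in log-depth."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    return [math.exp(log_min + (log_max - log_min) * i / num_bins)
            for i in range(num_bins + 1)]


def depth_to_bin(depth, edges):
    """Index of the bin containing `depth`, clamped to the valid range."""
    num_bins = len(edges) - 1
    if depth <= edges[0]:
        return 0
    for i in range(num_bins):
        if depth < edges[i + 1]:
            return i
    return num_bins - 1


def ordinal_targets(bin_index, num_bins):
    """Ordinal encoding: entry k is 1 when the true bin lies beyond bin k."""
    return [1 if k < bin_index else 0 for k in range(num_bins - 1)]


def bin_to_depth(bin_index, edges):
    """Representative depth of a bin: geometric mean of its two edges."""
    return math.sqrt(edges[bin_index] * edges[bin_index + 1])


edges = log_spaced_edges(1.0, 16.0, 4)   # edges near 1, 2, 4, 8, 16 metres
```

Log spacing gives finer bins at close range, where a fixed depth error matters more, and the ordinal encoding penalizes a prediction more the further its bin is from the true one.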

Survey on Supervised Learning Based Depth Estimation from a Single Image

Journal of Computer-Aided Design & Computer Graphics ◽

10.3724/sp.j.1089.2018.16882 ◽

2018 ◽

Vol 30 (8) ◽

pp. 1383 ◽

Cited By ~ 1

Author(s):

Tianteng Bi ◽

Yue Liu ◽

Dongdong Weng ◽

Yongtian Wang

Keyword(s):

Supervised Learning ◽

Depth Estimation ◽

Single Image

Single-Image Depth Inference Using Generative Adversarial Networks

Sensors ◽

10.3390/s19071708 ◽

2019 ◽

Vol 19 (7) ◽

pp. 1708 ◽

Cited By ~ 1

Author(s):

Daniel Stanley Tan ◽

Chih-Yuan Yao ◽

Conrado Ruiz ◽

Kai-Lung Hua

Keyword(s):

Smart Cities ◽

Depth Map ◽

Depth Estimation ◽

Input Image ◽

Generative Adversarial Networks ◽

Depth Information ◽

Single Image ◽

Neural Network Models ◽

Generative Adversarial Network ◽

Depth Sensors

Depth has been a valuable piece of information for perception tasks such as robot grasping, obstacle avoidance, and navigation, which are essential tasks for developing smart homes and smart cities. However, not all applications have the luxury of using depth sensors or multiple cameras to obtain depth information. In this paper, we tackle the problem of estimating the per-pixel depths from a single image. Inspired by the recent works on generative neural network models, we formulate the task of depth estimation as a generative task where we synthesize an image of the depth map from a single Red, Green, and Blue (RGB) input image. We propose a novel generative adversarial network that has an encoder-decoder type generator with residual transposed convolution blocks trained with an adversarial loss. Quantitative and qualitative experimental results demonstrate the effectiveness of our approach over several depth estimation works.

Domain gap in adapting self-supervised depth estimation methods for stereo-endoscopy

Current Directions in Biomedical Engineering ◽

10.1515/cdbme-2020-0004 ◽

2020 ◽

Vol 6 (1) ◽

Author(s):

Lalith Sharan ◽

Lukas Burger ◽

Georgii Kostiuchik ◽

Ivo Wolf ◽

Matthias Karck ◽

...

Keyword(s):

Visual Information ◽

Ground Truth ◽

Depth Estimation ◽

Autonomous Driving ◽

Mitral Valve Surgery ◽

Estimation Methods ◽

Depth Information ◽

Depth Sensor ◽

Detection Range ◽

Depth Sensors

Abstract. In endoscopy, depth estimation is a task that potentially helps in quantifying visual information for better scene understanding. A plethora of depth estimation algorithms have been proposed in the computer vision community. The endoscopic domain, however, differs from the typical depth estimation scenario due to differences in the setup and nature of the scene. Furthermore, it is infeasible to obtain ground truth depth information owing to an unsuitable detection range of off-the-shelf depth sensors and difficulties in setting up a depth sensor in a surgical environment. In this paper, an existing self-supervised approach, called Monodepth [1], from the field of autonomous driving is applied to a novel dataset of stereo-endoscopic images from reconstructive mitral valve surgery. While it is already known that endoscopic scenes are more challenging than outdoor driving scenes, the paper performs experiments to quantify the comparison, and describes the domain gap and challenges involved in the transfer of these methods.

Recovering Depth from Still Images for Underwater Dehazing Using Deep Learning

Sensors ◽

10.3390/s20164580 ◽

2020 ◽

Vol 20 (16) ◽

pp. 4580

Author(s):

Javier Pérez ◽

Mitch Bryson ◽

Stefan B. Williams ◽

Pedro J. Sanz

Keyword(s):

Neural Network ◽

Ground Truth ◽

Depth Information ◽

Single Image ◽

Still Images ◽

Underwater Image ◽

The Neural Network ◽

Training Stage ◽

Robot Performance ◽

New Perspective

Estimating depth from a single image is a challenging problem, but it is also interesting due to its large number of applications, such as underwater image dehazing. In this paper, a new perspective is provided: because underwater haze may provide a strong cue to the depth of the scene, a neural network can be used to estimate depth from it. With this approach, the depth map can be used in a dehazing method to enhance the image and recover original colors, offering a better input to image recognition algorithms and, thus, improving robot performance during vision-based tasks such as object detection and characterization of the seafloor. Experiments are conducted on different datasets that cover a wide variety of textures and conditions, while using a dense stereo depth map as ground truth for training, validation and testing. The results show that the neural network outperforms other alternatives, such as the dark channel prior methods, and that it is able to accurately estimate depth from a single image after a training stage with depth information.
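
To illustrate how a depth map can feed a dehazing step, the sketch below inverts the widely used haze image formation model I = J·t + A·(1 − t) with transmission t = exp(−β·d). Whether the paper uses exactly this model is an assumption; the function name, the attenuation coefficient `beta`, the `airlight` value, and the transmission floor `t_min` are all hypothetical.

```python
import math


def dehaze_pixel(observed, depth_m, airlight, beta, t_min=0.1):
    """Recover scene radiance J for one channel of one pixel.

    observed : hazy intensity I in [0, 1]
    depth_m  : estimated distance to the scene point, in metres
    airlight : veiling light A in [0, 1] (assumed known)
    beta     : attenuation coefficient per metre (assumed known)
    t_min    : floor on transmission to avoid amplifying noise at long range
    """
    transmission = max(math.exp(-beta * depth_m), t_min)
    restored = (observed - airlight * (1.0 - transmission)) / transmission
    return min(max(restored, 0.0), 1.0)


# Round trip: synthesize a hazy pixel from a known radiance, then invert it.
true_radiance, depth_m, airlight, beta = 0.6, 2.0, 0.8, 0.3
t = math.exp(-beta * depth_m)
hazy = true_radiance * t + airlight * (1.0 - t)
recovered = dehaze_pixel(hazy, depth_m, airlight, beta)
```

In water the attenuation coefficient differs per color channel, so a practical version would apply this per channel with separate `beta` values.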

Unsupervised Monocular Depth Estimation Based on Residual Neural Network of Coarse–Refined Feature Extractions for Drone

Electronics ◽

10.3390/electronics8101179 ◽

2019 ◽

Vol 8 (10) ◽

pp. 1179 ◽

Cited By ~ 1

Author(s):

Tao Huang ◽

Shuanfeng Zhao ◽

Longlong Geng ◽

Qian Xu

Keyword(s):

Neural Network ◽

Image Reconstruction ◽

Depth Map ◽

Ground Truth ◽

Depth Estimation ◽

Input Image ◽

Superior Performance ◽

Estimation Methods ◽

Depth Information ◽

Monocular Depth

Most existing monocular depth estimation methods based on supervised learning require vast quantities of corresponding ground truth depth data for training. To take full advantage of the information in images captured by drones without that requirement, a model of unsupervised monocular depth estimation based on a residual neural network of coarse–refined feature extractions for drones is proposed. A virtual camera is introduced through a deep residual convolution neural network based on coarse–refined feature extractions, inspired by the principle of binocular depth estimation, so that unsupervised monocular depth estimation becomes an image reconstruction problem. To improve the performance of our model for monocular depth estimation, the following innovations are proposed. First, pyramid processing of the input image is proposed to build the topological relationship between the resolution of the input image and its depth, which can improve the sensitivity to depth information from a single image and reduce the impact of input image resolution on depth estimation. Second, the residual neural network of coarse–refined feature extractions for corresponding image reconstruction is designed to improve the accuracy of feature extraction and to resolve the trade-off between calculation time and the number of network layers. In addition, to predict highly detailed output depth maps, long skip connections are designed between corresponding layers in the neural network of coarse feature extractions and the deconvolution neural network of refined feature extractions. Third, the loss of corresponding image reconstruction based on the structural similarity index (SSIM), the loss of approximate disparity smoothness and the loss of depth map are combined into a novel training loss to better train our model. The experimental results show that, compared to the state-of-the-art monocular depth estimation methods, our model has superior performance on the KITTI dataset, composed of corresponding left and right views, and on the Make3D dataset, composed of images and corresponding ground truth depth maps, and basically meets the requirements for depth information of images captured by drones when trained on KITTI.
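
Two of the loss terms named in the abstract, an SSIM-based reconstruction loss and a disparity smoothness loss, can be sketched as follows. This is a simplified illustration rather than the paper's formulation: SSIM is computed over a single flat patch instead of sliding windows, the smoothness term is a plain mean absolute gradient along one row, and the blending weight `alpha` is a commonly used value assumed here, not one reported in the abstract.

```python
def mean(values):
    return sum(values) / len(values)


def ssim_patch(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM between two equally sized patches (flat lists, intensities in [0, 1])."""
    mx, my = mean(x), mean(y)
    var_x = mean([(a - mx) ** 2 for a in x])
    var_y = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (var_x + var_y + c2))


def reconstruction_loss(target, reconstructed, alpha=0.85):
    """Blend of an SSIM dissimilarity term and an L1 term (alpha is an assumption)."""
    ssim_term = (1.0 - ssim_patch(target, reconstructed)) / 2.0
    l1_term = mean([abs(a - b) for a, b in zip(target, reconstructed)])
    return alpha * ssim_term + (1.0 - alpha) * l1_term


def smoothness_loss(disparity_row):
    """Mean absolute horizontal gradient of one row of a disparity map."""
    return mean([abs(b - a) for a, b in zip(disparity_row, disparity_row[1:])])


patch = [0.2, 0.4, 0.6, 0.8]
shifted = [0.3, 0.5, 0.5, 0.9]
```

A training objective would sum such terms with weights, for example `reconstruction_loss(...) + lam * smoothness_loss(...)`, where `lam` is a hypothetical hyperparameter.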

A supervised learning approach to far range depth estimation using a consumer-grade RGB-D camera

2013 IEEE International Conference on Electronics, Computing and Communication Technologies ◽

10.1109/conecct.2013.6469299 ◽

2013 ◽

Author(s):

Prabhakar Mishra ◽

Anirudh Viswanathan ◽

Aditi Srinivasan

Keyword(s):

Supervised Learning ◽

Depth Estimation ◽

Learning Approach

SELF-SUPERVISED LEARNING FOR MONOCULAR DEPTH ESTIMATION FROM AERIAL IMAGERY

ISPRS Annals of Photogrammetry Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-annals-v-2-2020-357-2020 ◽

2020 ◽

Vol V-2-2020 ◽

pp. 357-364

Author(s):

M. Hermann ◽

B. Ruf ◽

M. Weinmann ◽

S. Hinz

Keyword(s):

Supervised Learning ◽

Image Matching ◽

Ground Truth ◽

Depth Estimation ◽

Training Data ◽

Aerial Imagery ◽

Small Model ◽

Conventional Methods ◽

Monocular Depth ◽

Real Time Application

Abstract. Supervised learning based methods for monocular depth estimation usually require large amounts of extensively annotated training data. In the case of aerial imagery, this ground truth is particularly difficult to acquire. Therefore, in this paper, we present a method for self-supervised learning for monocular depth estimation from aerial imagery that does not require annotated training data. For this, we only use an image sequence from a single moving camera and learn to simultaneously estimate depth and pose information. By sharing the weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application. We evaluate our approach on three diverse datasets and compare the results to conventional methods that estimate depth maps based on multi-view geometry. We achieve an accuracy δ < 1.25 of up to 93.5 %. In addition, we have paid particular attention to the generalization of a trained model to unknown data and the self-improving capabilities of our approach. We conclude that, even though the results of monocular depth estimation are inferior to those achieved by conventional methods, they are well suited to provide a good initialization for methods that rely on image matching or to provide estimates in regions where image matching fails, e.g. occluded or texture-less regions.
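
The threshold accuracy reported above is a standard depth evaluation metric: the fraction of pixels whose predicted depth is within a factor of 1.25 of the reference depth. A minimal sketch of how it is computed is shown below; the function name and the sample values are hypothetical.

```python
def threshold_accuracy(predicted, ground_truth, threshold=1.25):
    """Fraction of valid pixels with max(pred / gt, gt / pred) < threshold.

    Pixels with a non-positive predicted or reference depth are skipped,
    since the ratio is undefined for them.
    """
    valid = [(p, g) for p, g in zip(predicted, ground_truth) if p > 0 and g > 0]
    if not valid:
        raise ValueError("no valid depth pairs to evaluate")
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)


# Hypothetical flattened depth maps in metres; the third pixel is off by 2x.
predicted = [1.0, 2.0, 10.0, 4.0]
reference = [1.1, 2.0, 5.0, 4.5]
accuracy = threshold_accuracy(predicted, reference)
```

Because the metric uses a ratio, it treats a 1 m error at 5 m range as worse than a 1 m error at 50 m range.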
