Efficient Monocular Depth Estimation with Transfer Feature Enhancement

Author(s):  
Ming Yin

Estimating the depth of a scene from a monocular image is an essential step for image semantic understanding. In practice, existing methods for this highly ill-posed problem still lack robustness and efficiency. This paper proposes a novel end-to-end depth estimation model that uses skip connections from a pre-trained Xception model for dense feature extraction, with three new modules designed to improve the upsampling process. In addition, ELU activations and convolutions with smaller kernel sizes are added to improve the pixel-wise regression process. The experimental results show that our model has fewer network parameters and a lower error rate than the most advanced networks, and requires only half the training time. The evaluation is based on the NYU v2 dataset, and our proposed model achieves clearer boundary details with state-of-the-art accuracy and robustness.
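
A minimal sketch (in PyTorch, not the authors' released code) of the decoder idea described above: bilinear upsampling, concatenation with an encoder skip feature, and small-kernel convolutions with ELU activations. The module name, channel sizes, and feature shapes are illustrative assumptions; a pre-trained Xception backbone is assumed to supply the skip features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Upsample a decoder feature and fuse it with an encoder skip feature."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 3x3 kernels (rather than larger ones), per the design choice above
        self.conv1 = nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ELU()  # ELU instead of ReLU for smoother pixel-wise regression

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)  # skip connection from the encoder
        return self.act(self.conv2(self.act(self.conv1(x))))

# Usage with hypothetical feature shapes from a pre-trained Xception encoder:
dec = UpBlock(in_ch=256, skip_ch=128, out_ch=128)
x = torch.randn(1, 256, 15, 20)     # coarse decoder feature
skip = torch.randn(1, 128, 30, 40)  # encoder skip feature
print(dec(x, skip).shape)           # torch.Size([1, 128, 30, 40])
```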

Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2272 ◽  
Author(s):  
Faisal Khan ◽  
Saqib Salahuddin ◽  
Hossein Javidnia

Monocular depth estimation from Red-Green-Blue (RGB) images is a well-studied ill-posed problem in computer vision which has been investigated intensively over the past decade using Deep Learning (DL) approaches. Recent approaches to monocular depth estimation mostly rely on Convolutional Neural Networks (CNNs). Estimating depth from two-dimensional images plays an important role in various applications, including scene reconstruction, 3D object detection, robotics, and autonomous driving. This survey provides a comprehensive overview of this research topic, including the problem representation and a short description of traditional methods for depth estimation. Relevant datasets and 13 state-of-the-art deep learning-based approaches for monocular depth estimation are reviewed, evaluated, and discussed. We conclude with a perspective on open challenges in monocular depth estimation that require further investigation.


2021 ◽  
Vol 38 (5) ◽  
pp. 1485-1493
Author(s):  
Yasasvy Tadepalli ◽  
Meenakshi Kollati ◽  
Swaraja Kuraparthi ◽  
Padmavathi Kora

Monocular depth estimation is a hot research topic in autonomous driving. In the proposed work, deep convolutional neural networks (DCNNs) comprising an encoder and a decoder, with transfer learning, are exploited for monocular depth map estimation of two-dimensional images. CNN features extracted in the early stages are later upsampled using a sequence of bilinear upsampling and convolution layers to reconstruct the depth map. The encoder forms the feature extraction part, and the decoder forms the image reconstruction part. EfficientNet-B0, a recent architecture, is used with pretrained weights as the encoder; it achieves higher efficiency with fewer model parameters than state-of-the-art pretrained networks. EfficientNet-B0 is compared with two other pretrained networks, DenseNet-121 and ResNet-50. Each of these three models is used in the encoding stage for feature extraction, followed by bilinear upsampling in the decoder. Monocular depth estimation is an ill-posed problem and is thus treated as a regression problem, so the metrics used in the proposed work are the F1-score, Jaccard score, and Mean Absolute Error (MAE) between the original and the reconstructed image. The results convey that EfficientNet-B0 outperforms the DenseNet-121 and ResNet-50 models in validation loss, F1-score, and Jaccard score.
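
A minimal sketch, assuming PyTorch and torchvision, of the encoder-decoder pattern described above: an ImageNet-pretrained EfficientNet-B0 backbone as the encoder and a decoder built from bilinear upsampling and convolution stages. The decoder layout is an illustrative assumption, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: pretrained EfficientNet-B0 feature extractor (transfer learning)
        self.encoder = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1).features
        # Decoder: repeated (bilinear upsample x2, conv, ReLU) stages, then a 1-channel head
        def up(in_ch, out_ch):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            )
        self.decoder = nn.Sequential(
            up(1280, 512), up(512, 256), up(256, 128), up(128, 64),
            nn.Conv2d(64, 1, 3, padding=1),  # depth map head
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DepthNet()
depth = model(torch.randn(1, 3, 224, 224))
print(depth.shape)  # torch.Size([1, 1, 112, 112]) -- half the input resolution here
```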


Sensors ◽  
2021 ◽
Vol 21 (1) ◽  
pp. 54
Author(s):  
Peng Liu ◽  
Zonghua Zhang ◽  
Zhaozong Meng ◽  
Nan Gao

Depth estimation is a crucial component in many 3D vision applications. Monocular depth estimation is gaining increasing interest due to its flexible use and extremely low system requirements, but its inherently ill-posed and ambiguous nature still causes unsatisfactory estimation results. This paper proposes a new deep convolutional neural network for monocular depth estimation. The network applies joint attention feature distillation and a wavelet-based loss function to recover the depth information of a scene. Compared with previous methods, two improvements were achieved. First, we combined feature distillation and joint attention mechanisms to boost the discrimination of feature modulation. The network extracts hierarchical features using a progressive feature distillation and refinement strategy and aggregates them using a joint attention operation. Second, we adopted a wavelet-based loss function for network training, which improves the effectiveness of the loss function by capturing more structural detail. The experimental results on challenging indoor and outdoor benchmark datasets verified the proposed method's superiority over current state-of-the-art methods.
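
A sketch of a wavelet-based depth loss in the spirit of the one described above (the authors' exact formulation may differ): a single-level Haar decomposition of the predicted and ground-truth depth maps, with an L1 penalty on every subband so that high-frequency structure such as edges contributes explicitly to the loss.

```python
import torch
import torch.nn.functional as F

def haar_subbands(x):
    """Single-level 2D Haar transform of a (B, 1, H, W) tensor; H and W even."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2  # approximation (low-frequency) subband
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def wavelet_loss(pred, target):
    # L1 on every subband, so edge structure is penalized explicitly
    return sum(F.l1_loss(p, t) for p, t in zip(haar_subbands(pred), haar_subbands(target)))

loss = wavelet_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
print(loss.item())
```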


2021 ◽  
Vol 11 (12) ◽  
pp. 5383
Author(s):  
Huachen Gao ◽  
Xiaoyu Liu ◽  
Meixia Qu ◽  
Shijie Huang

In recent studies, self-supervised learning methods have been explored for monocular depth estimation. They minimize the reconstruction loss of images, instead of depth information, as a supervised signal. However, existing methods usually assume that corresponding points in different views have the same color, which leads to unreliable unsupervised signals and ultimately degrades the reconstruction loss during training. Meanwhile, in low-texture regions, the disparity values of pixels cannot be predicted correctly because few features can be extracted. To solve these issues, we propose a network, PDANet, that integrates perceptual consistency and data augmentation consistency, which are more reliable unsupervised signals, into a regular unsupervised depth estimation model. Specifically, we apply a reliable data augmentation mechanism to minimize the loss between the disparity maps generated from the original image and the augmented image, respectively, which enhances the robustness of the prediction to color fluctuations. At the same time, we aggregate the features of different layers extracted by a pre-trained VGG16 network to explore higher-level perceptual differences between the input image and the generated one. Ablation studies demonstrate the effectiveness of each component, and PDANet shows high-quality depth estimation results on the KITTI benchmark, improving the absolute relative error of the state-of-the-art method from 0.114 to 0.084.
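
A minimal sketch, assuming PyTorch and torchvision, of the perceptual-consistency term described above: features from several layers of a frozen, ImageNet-pretrained VGG16 are compared between the input image and the reconstructed one. The chosen layer indices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(3, 8, 15, 22)):  # relu1_2, relu2_2, relu3_3, relu4_3
        super().__init__()
        self.feats = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in self.feats.parameters():
            p.requires_grad = False       # VGG16 is a fixed feature extractor only
        self.layer_ids = set(layer_ids)

    def forward(self, x, y):
        loss = 0.0
        for i, layer in enumerate(self.feats):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:       # aggregate differences across several depths
                loss = loss + F.l1_loss(x, y)
            if i >= max(self.layer_ids):  # no need to run deeper layers
                break
        return loss

ploss = PerceptualLoss()
print(ploss(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)).item())
```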


Sensors ◽  
2021 ◽  
Vol 21 (20) ◽  
pp. 6780
Author(s):  
Zhitong Lai ◽  
Rui Tian ◽  
Zhiguo Wu ◽  
Nannan Ding ◽  
Linjian Sun ◽  
...  

Pyramid architecture is a useful strategy for fusing multi-scale features in deep monocular depth estimation approaches. However, most pyramid networks fuse features only between adjacent stages of the pyramid structure. To take full advantage of the pyramid structure, inspired by the success of DenseNet, this paper presents DCPNet, a densely connected pyramid network that fuses multi-scale features from multiple stages of the pyramid structure. DCPNet performs feature fusion not only between adjacent stages but also between non-adjacent stages. To fuse these features, we design a simple and effective dense connection module (DCM). In addition, we offer a new consideration of the common upscale operation in our approach. We believe DCPNet offers a more efficient way to fuse features from multiple scales in a pyramid-like network. We perform extensive experiments on both outdoor and indoor benchmark datasets (the KITTI and NYU Depth V2 datasets), and DCPNet achieves state-of-the-art results.
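
A sketch of a dense-connection fusion module in the spirit of the DCM described above (the internal design is an assumption): features from every earlier pyramid stage are resized to a common resolution, concatenated, and fused with a 1x1 convolution, so non-adjacent stages interact directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseConnectionModule(nn.Module):
    def __init__(self, in_chs, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_chs), out_ch, kernel_size=1),  # cheap channel-wise fusion
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):
        # Resize all stage features to the spatial size of the finest one
        size = feats[-1].shape[2:]
        resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                   for f in feats]
        return self.fuse(torch.cat(resized, dim=1))

# Features from three pyramid stages (coarse to fine), all fused at once:
feats = [torch.randn(1, 256, 8, 8), torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32)]
dcm = DenseConnectionModule(in_chs=[256, 128, 64], out_ch=64)
print(dcm(feats).shape)  # torch.Size([1, 64, 32, 32])
```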


2020 ◽  
Vol 34 (07) ◽  
pp. 12257-12264 ◽  
Author(s):  
Xinlong Wang ◽  
Wei Yin ◽  
Tao Kong ◽  
Yuning Jiang ◽  
Lei Li ◽  
...  

Monocular depth estimation enables 3D perception from a single 2D image and has thus attracted much research attention for years. Almost all methods treat foreground and background regions ("things and stuff") in an image equally. However, not all pixels are equal: the depth of foreground objects plays a crucial role in 3D object recognition and localization. To date, how to boost the depth prediction accuracy of foreground objects has rarely been discussed. In this paper, we first analyze the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, which estimates foreground and background depth using separate optimization objectives and decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among monocular methods. Code will be available at: https://github.com/WXinlong/ForeSeE.
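
A minimal sketch of the foreground-background separation described above (not the released ForeSeE code): a shared encoder feeds two decoders, and each decoder is optimized only on its own region via a foreground mask. The tiny convolution stacks stand in for real sub-networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForeBackNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, padding=1)      # placeholder backbone
        self.fg_decoder = nn.Conv2d(16, 1, 3, padding=1)   # foreground depth head
        self.bg_decoder = nn.Conv2d(16, 1, 3, padding=1)   # background depth head

    def forward(self, x):
        f = F.relu(self.encoder(x))
        return self.fg_decoder(f), self.bg_decoder(f)

def separated_loss(fg_pred, bg_pred, depth_gt, fg_mask):
    # Each decoder receives gradients only from the pixels of its own region
    fg = F.l1_loss(fg_pred[fg_mask], depth_gt[fg_mask])
    bg = F.l1_loss(bg_pred[~fg_mask], depth_gt[~fg_mask])
    return fg + bg

net = ForeBackNet()
img, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
mask = torch.zeros(1, 1, 64, 64, dtype=torch.bool)
mask[..., 16:48, 16:48] = True  # hypothetical foreground-object region
fg_pred, bg_pred = net(img)
print(separated_loss(fg_pred, bg_pred, gt, mask).item())
```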


Author(s):  
Takuo Hamaguchi ◽  
Hidekazu Oiwa ◽  
Masashi Shimbo ◽  
Yuji Matsumoto

Knowledge base completion (KBC) aims to predict missing information in a knowledge base. In this paper, we address the out-of-knowledge-base (OOKB) entity problem in KBC: how to answer queries concerning test entities not observed at training time. Existing embedding-based KBC models assume that all test entities are available at training time, making it unclear how to obtain embeddings for new entities without costly retraining. To solve the OOKB entity problem without retraining, we use graph neural networks (Graph-NNs) to compute the embeddings of OOKB entities, exploiting the limited auxiliary knowledge provided at test time. The experimental results show the effectiveness of our proposed model in the OOKB setting. Additionally, in the standard KBC setting in which OOKB entities are not involved, our model achieves state-of-the-art performance on the WordNet dataset.
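
A minimal sketch of the graph-NN idea described above, with the propagation model heavily simplified from the paper: the embedding of an out-of-knowledge-base entity is computed at test time by pooling the embeddings of its known neighbours, transformed per relation, so no retraining is required. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class OOKBEmbedder(nn.Module):
    def __init__(self, n_entities, n_relations, dim=64):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)     # trained entity embeddings
        self.rel_w = nn.Embedding(n_relations, dim)  # per-relation transform (diagonal)

    def embed_new(self, triples):
        """triples: list of (relation_id, known_entity_id) pairs linking the
        OOKB entity to entities that were seen at training time."""
        msgs = [self.rel_w(torch.tensor(r)) * self.ent(torch.tensor(e))
                for r, e in triples]
        return torch.stack(msgs).mean(dim=0)  # pooled neighbourhood message

model = OOKBEmbedder(n_entities=1000, n_relations=20)
# A new entity connected to known entities 3 and 42 via relations 1 and 7:
vec = model.embed_new([(1, 3), (7, 42)])
print(vec.shape)  # torch.Size([64])
```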


2019 ◽  
Vol 9 (7) ◽  
pp. 1330 ◽  
Author(s):  
Yalong Jiang ◽  
Zheru Chi

Although state-of-the-art performance has been achieved in pixel-specific tasks such as saliency prediction and depth estimation, convolutional neural networks (CNNs) still perform unsatisfactorily in human parsing, where semantic information about detailed regions must be perceived under variations in viewpoint, pose, and occlusion. In this paper, we propose to improve the robustness of human parsing modules by introducing a depth-estimation module. A novel scheme is proposed for integrating the depth-estimation module with a human-parsing module, and the robustness of the overall model is improved with the automatically obtained depth labels. As another major concern, computational efficiency is also addressed: our proposed human parsing module with 24 layers can achieve performance similar to that of the baseline CNN model with over 100 layers, and the number of parameters in the overall model is less than that in the baseline model. Furthermore, we propose to reduce the computational burden by replacing a conventional CNN layer with a stack of simplified sub-layers, further reducing the overall number of trainable parameters. Experimental results show that the integration of the two modules improves human parsing without additional human labeling. The proposed model outperforms the benchmark solutions, and its capacity is better matched to the complexity of the task.
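
A sketch of the parameter-reduction idea described above, using the common depthwise-separable factorization as one concrete way to replace a conventional convolution with a stack of simpler sub-layers; the paper's exact sub-layer design may differ.

```python
import torch.nn as nn

def standard_conv(in_ch, out_ch):
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

def separable_conv(in_ch, out_ch):
    # Depthwise 3x3 (one filter per channel) followed by a pointwise 1x1 mix
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard_conv(128, 128)))   # 147584 parameters
print(count(separable_conv(128, 128)))  # 17792 parameters -- roughly 8x fewer
```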


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2691
Author(s):  
Seung-Jun Hwang ◽  
Sung-Jun Park ◽  
Gyu-Min Kim ◽  
Joong-Hwan Baek

A colonoscopy is a medical examination used to check for disease or abnormalities in the large intestine. If necessary, polyps or adenomas are removed through the scope during the colonoscopy, which can prevent colorectal cancer. However, the polyp detection rate differs depending on the condition and skill level of the endoscopist; some endoscopists have a 90% chance of missing an adenoma. Artificial intelligence and robot technologies for colonoscopy are being studied to compensate for these problems. In this study, we propose self-supervised monocular depth estimation using spatiotemporal consistency in the colon environment. Our contributions are a loss function for reconstruction errors between adjacent predicted depths and a depth feedback network that uses the predicted depth of the previous frame to predict the depth of the next frame. We performed quantitative and qualitative evaluations of our approach, and the proposed FBNet (depth FeedBack Network) outperforms state-of-the-art unsupervised depth estimation results on the UCL datasets.
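
A minimal sketch of the feedback idea described above (module shapes and the backbone are assumptions): the depth predicted for frame t-1 is fed back as an extra input channel when predicting frame t, and adjacent predictions are penalized for inconsistency. Warping by camera motion is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(  # placeholder for a real backbone
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, frame, prev_depth):
        # Feed back the previous prediction as an extra input channel
        return self.net(torch.cat([frame, prev_depth], dim=1))

net = FeedbackDepthNet()
frames = torch.rand(4, 1, 3, 64, 64)   # a short clip: T x B x C x H x W
depth = torch.zeros(1, 1, 64, 64)      # initial feedback for the first frame
consistency = 0.0
for t in range(frames.shape[0]):
    new_depth = net(frames[t], depth)
    if t > 0:  # reconstruction error between adjacent predicted depths
        consistency = consistency + F.l1_loss(new_depth, depth)
    depth = new_depth
print(consistency.item())
```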


2021 ◽  
Vol 309 ◽  
pp. 01070
Author(s):  
K. Swaraja ◽  
K. Naga Siva Pavan ◽  
S. Suryakanth Reddy ◽  
K. Ajay ◽  
P. Uday Kiran Reddy ◽  
...  

In several applications, such as scene interpretation and reconstruction, precise depth measurement from images is a significant challenge. Current depth estimation techniques frequently produce blurry, low-resolution estimates. Using transfer learning, this research implements a convolutional neural network for generating a high-resolution depth map from a single RGB image. With a typical encoder-decoder architecture, we initialize the encoder with features extracted from high-performing pre-trained networks, together with augmentation and training procedures that lead to more accurate outcomes. We demonstrate how, even with a very basic decoder, our approach can produce complete high-resolution depth maps. A wide range of deep learning approaches have recently been proposed and have shown significant promise in dealing with this classical ill-posed problem. The studies are carried out on KITTI and NYU Depth v2, two widely used public datasets. We also examine the errors produced by various models in order to expose the shortcomings of present approaches, and our method achieves viable performance on both KITTI and NYU Depth v2.
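
A sketch of the standard monocular-depth error metrics commonly reported on KITTI and NYU Depth v2 (absolute relative error, RMSE, and the delta-threshold accuracy), which are the kind of errors examined when comparing the models above.

```python
import torch

def depth_metrics(pred, gt):
    abs_rel = (torch.abs(pred - gt) / gt).mean()        # absolute relative error
    rmse = torch.sqrt(((pred - gt) ** 2).mean())        # root mean squared error
    ratio = torch.maximum(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean()              # fraction within 25% of GT
    return {"abs_rel": abs_rel.item(), "rmse": rmse.item(), "delta1": delta1.item()}

# Toy example with strictly positive depths:
pred, gt = torch.rand(1, 1, 64, 64) + 0.5, torch.rand(1, 1, 64, 64) + 0.5
print(depth_metrics(pred, gt))
```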

