Efficient Monocular Depth Estimation with Transfer Feature Enhancement

Author(s):  
Ming Yin

Estimating the depth of a scene from a monocular image is an essential step for image semantic understanding. In practice, existing methods for this highly ill-posed problem still lack robustness and efficiency. This paper proposes a novel end-to-end depth estimation model that uses skip connections from a pre-trained Xception model for dense feature extraction, with three new modules designed to improve the upsampling process. In addition, ELU activations and convolutions with smaller kernel sizes are added to improve the pixel-wise regression process. The experimental results show that our model has fewer network parameters and a lower error rate than the most advanced networks, and requires only half the training time. The evaluation is based on the NYU v2 dataset, and our proposed model achieves clearer boundary details with state-of-the-art accuracy and robustness.
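
A minimal sketch (in PyTorch, not the authors' released code) of the decoder idea described above: bilinear upsampling, concatenation with an encoder skip feature, and small-kernel convolutions with ELU activations. The module name, channel sizes, and feature shapes are illustrative assumptions; a pre-trained Xception backbone is assumed to supply the skip features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Upsample a decoder feature and fuse it with an encoder skip feature."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 3x3 kernels (rather than larger ones), per the design choice above
        self.conv1 = nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ELU()  # ELU instead of ReLU for smoother pixel-wise regression

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)  # skip connection from the encoder
        return self.act(self.conv2(self.act(self.conv1(x))))

# Usage with hypothetical feature shapes from a pre-trained Xception encoder:
dec = UpBlock(in_ch=256, skip_ch=128, out_ch=128)
x = torch.randn(1, 256, 15, 20)     # coarse decoder feature
skip = torch.randn(1, 128, 30, 40)  # encoder skip feature
print(dec(x, skip).shape)           # torch.Size([1, 128, 30, 40])
```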

Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2272 ◽  
Author(s):  
Faisal Khan ◽  
Saqib Salahuddin ◽  
Hossein Javidnia

Monocular depth estimation from Red-Green-Blue (RGB) images is a well-studied ill-posed problem in computer vision which has been investigated intensively over the past decade using Deep Learning (DL) approaches. Recent approaches to monocular depth estimation mostly rely on Convolutional Neural Networks (CNNs). Estimating depth from two-dimensional images plays an important role in various applications, including scene reconstruction, 3D object detection, robotics, and autonomous driving. This survey provides a comprehensive overview of this research topic, including the problem representation and a short description of traditional methods for depth estimation. Relevant datasets and 13 state-of-the-art deep learning-based approaches for monocular depth estimation are reviewed, evaluated, and discussed. We conclude with a perspective on open challenges in monocular depth estimation that require further investigation.


2021 ◽  
Vol 38 (5) ◽  
pp. 1485-1493
Author(s):  
Yasasvy Tadepalli ◽  
Meenakshi Kollati ◽  
Swaraja Kuraparthi ◽  
Padmavathi Kora

Monocular depth estimation is a hot research topic in autonomous driving. In the proposed work, deep convolutional neural networks (DCNNs) comprising an encoder and a decoder, with transfer learning, are exploited for monocular depth map estimation of two-dimensional images. CNN features extracted in the early stages are later upsampled using a sequence of bilinear upsampling and convolution layers to reconstruct the depth map. The encoder forms the feature extraction part, and the decoder forms the image reconstruction part. EfficientNet-B0, a recent architecture, is used with pretrained weights as the encoder; it achieves higher efficiency with fewer model parameters than state-of-the-art pretrained networks. EfficientNet-B0 is compared with two other pretrained networks, DenseNet-121 and ResNet-50. Each of these three models is used in the encoding stage for feature extraction, followed by bilinear upsampling in the decoder. Monocular depth estimation is an ill-posed problem and is thus treated as a regression problem, so the metrics used in the proposed work are the F1-score, Jaccard score, and Mean Absolute Error (MAE) between the original and the reconstructed image. The results convey that EfficientNet-B0 outperforms the DenseNet-121 and ResNet-50 models in validation loss, F1-score, and Jaccard score.
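
A minimal sketch, assuming PyTorch and torchvision, of the encoder-decoder pattern described above: an ImageNet-pretrained EfficientNet-B0 backbone as the encoder and a decoder built from bilinear upsampling and convolution stages. The decoder layout is an illustrative assumption, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: pretrained EfficientNet-B0 feature extractor (transfer learning)
        self.encoder = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1).features
        # Decoder: repeated (bilinear upsample x2, conv, ReLU) stages, then a 1-channel head
        def up(in_ch, out_ch):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            )
        self.decoder = nn.Sequential(
            up(1280, 512), up(512, 256), up(256, 128), up(128, 64),
            nn.Conv2d(64, 1, 3, padding=1),  # depth map head
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DepthNet()
depth = model(torch.randn(1, 3, 224, 224))
print(depth.shape)  # torch.Size([1, 1, 112, 112]) -- half the input resolution here
```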


Sensors ◽  
2021 ◽
Vol 21 (1) ◽  
pp. 54
Author(s):  
Peng Liu ◽  
Zonghua Zhang ◽  
Zhaozong Meng ◽  
Nan Gao

Depth estimation is a crucial component in many 3D vision applications. Monocular depth estimation is gaining increasing interest due to its flexible use and extremely low system requirements, but its inherently ill-posed and ambiguous nature still causes unsatisfactory estimation results. This paper proposes a new deep convolutional neural network for monocular depth estimation. The network applies joint attention feature distillation and a wavelet-based loss function to recover the depth information of a scene. Compared with previous methods, two improvements were achieved. First, we combined feature distillation and joint attention mechanisms to boost the discrimination of feature modulation. The network extracts hierarchical features using a progressive feature distillation and refinement strategy and aggregates them using a joint attention operation. Second, we adopted a wavelet-based loss function for network training, which improves the effectiveness of the loss function by capturing more structural detail. The experimental results on challenging indoor and outdoor benchmark datasets verified the proposed method's superiority over current state-of-the-art methods.
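
A sketch of a wavelet-based depth loss in the spirit of the one described above (the authors' exact formulation may differ): a single-level Haar decomposition of the predicted and ground-truth depth maps, with an L1 penalty on every subband so that high-frequency structure such as edges contributes explicitly to the loss.

```python
import torch
import torch.nn.functional as F

def haar_subbands(x):
    """Single-level 2D Haar transform of a (B, 1, H, W) tensor; H and W even."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2  # approximation (low-frequency) subband
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def wavelet_loss(pred, target):
    # L1 on every subband, so edge structure is penalized explicitly
    return sum(F.l1_loss(p, t) for p, t in zip(haar_subbands(pred), haar_subbands(target)))

loss = wavelet_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
print(loss.item())
```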


2021 ◽  
Vol 11 (12) ◽  
pp. 5383
Author(s):  
Huachen Gao ◽  
Xiaoyu Liu ◽  
Meixia Qu ◽  
Shijie Huang

In recent studies, self-supervised learning methods have been explored for monocular depth estimation. They minimize the reconstruction loss of images, instead of depth information, as a supervised signal. However, existing methods usually assume that corresponding points in different views have the same color, which leads to unreliable unsupervised signals and ultimately degrades the reconstruction loss during training. Meanwhile, in low-texture regions, the disparity values of pixels cannot be predicted correctly because few features can be extracted. To solve these issues, we propose a network, PDANet, that integrates perceptual consistency and data augmentation consistency, which are more reliable unsupervised signals, into a regular unsupervised depth estimation model. Specifically, we apply a reliable data augmentation mechanism to minimize the loss between the disparity maps generated from the original image and the augmented image, respectively, which enhances the robustness of the prediction to color fluctuations. At the same time, we aggregate the features of different layers extracted by a pre-trained VGG16 network to explore higher-level perceptual differences between the input image and the generated one. Ablation studies demonstrate the effectiveness of each component, and PDANet shows high-quality depth estimation results on the KITTI benchmark, improving the absolute relative error of the state-of-the-art method from 0.114 to 0.084.
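
A minimal sketch, assuming PyTorch and torchvision, of the perceptual-consistency term described above: features from several layers of a frozen, ImageNet-pretrained VGG16 are compared between the input image and the reconstructed one. The chosen layer indices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(3, 8, 15, 22)):  # relu1_2, relu2_2, relu3_3, relu4_3
        super().__init__()
        self.feats = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in self.feats.parameters():
            p.requires_grad = False       # VGG16 is a fixed feature extractor only
        self.layer_ids = set(layer_ids)

    def forward(self, x, y):
        loss = 0.0
        for i, layer in enumerate(self.feats):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:       # aggregate differences across several depths
                loss = loss + F.l1_loss(x, y)
            if i >= max(self.layer_ids):  # no need to run deeper layers
                break
        return loss

ploss = PerceptualLoss()
print(ploss(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)).item())
```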


Sensors ◽  
2021 ◽  
Vol 21 (20) ◽  
pp. 6780
Author(s):  
Zhitong Lai ◽  
Rui Tian ◽  
Zhiguo Wu ◽  
Nannan Ding ◽  
Linjian Sun ◽  
...  

Pyramid architecture is a useful strategy for fusing multi-scale features in deep monocular depth estimation approaches. However, most pyramid networks fuse features only between adjacent stages of the pyramid structure. To take full advantage of the pyramid structure, inspired by the success of DenseNet, this paper presents DCPNet, a densely connected pyramid network that fuses multi-scale features from multiple stages of the pyramid structure. DCPNet performs feature fusion not only between adjacent stages but also between non-adjacent stages. To fuse these features, we design a simple and effective dense connection module (DCM). In addition, we offer a new consideration of the common upscale operation in our approach. We believe DCPNet offers a more efficient way to fuse features from multiple scales in a pyramid-like network. We perform extensive experiments on both outdoor and indoor benchmark datasets (the KITTI and NYU Depth V2 datasets), and DCPNet achieves state-of-the-art results.
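
A sketch of a dense-connection fusion module in the spirit of the DCM described above (the internal design is an assumption): features from every earlier pyramid stage are resized to a common resolution, concatenated, and fused with a 1x1 convolution, so non-adjacent stages interact directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseConnectionModule(nn.Module):
    def __init__(self, in_chs, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_chs), out_ch, kernel_size=1),  # cheap channel-wise fusion
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):
        # Resize all stage features to the spatial size of the finest one
        size = feats[-1].shape[2:]
        resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                   for f in feats]
        return self.fuse(torch.cat(resized, dim=1))

# Features from three pyramid stages (coarse to fine), all fused at once:
feats = [torch.randn(1, 256, 8, 8), torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32)]
dcm = DenseConnectionModule(in_chs=[256, 128, 64], out_ch=64)
print(dcm(feats).shape)  # torch.Size([1, 64, 32, 32])
```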


2020 ◽  
Vol 34 (07) ◽  
pp. 12257-12264 ◽  
Author(s):  
Xinlong Wang ◽  
Wei Yin ◽  
Tao Kong ◽  
Yuning Jiang ◽  
Lei Li ◽  
...  

Monocular depth estimation enables 3D perception from a single 2D image and has thus attracted much research attention for years. Almost all methods treat foreground and background regions ("things and stuff") in an image equally. However, not all pixels are equal: the depth of foreground objects plays a crucial role in 3D object recognition and localization. To date, how to boost the depth prediction accuracy of foreground objects has rarely been discussed. In this paper, we first analyze the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, which estimates foreground and background depth using separate optimization objectives and decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among monocular methods. Code will be available at: https://github.com/WXinlong/ForeSeE.
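
A minimal sketch of the foreground-background separation described above (not the released ForeSeE code): a shared encoder feeds two decoders, and each decoder is optimized only on its own region via a foreground mask. The tiny convolution stacks stand in for real sub-networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForeBackNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, padding=1)      # placeholder backbone
        self.fg_decoder = nn.Conv2d(16, 1, 3, padding=1)   # foreground depth head
        self.bg_decoder = nn.Conv2d(16, 1, 3, padding=1)   # background depth head

    def forward(self, x):
        f = F.relu(self.encoder(x))
        return self.fg_decoder(f), self.bg_decoder(f)

def separated_loss(fg_pred, bg_pred, depth_gt, fg_mask):
    # Each decoder receives gradients only from the pixels of its own region
    fg = F.l1_loss(fg_pred[fg_mask], depth_gt[fg_mask])
    bg = F.l1_loss(bg_pred[~fg_mask], depth_gt[~fg_mask])
    return fg + bg

net = ForeBackNet()
img, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
mask = torch.zeros(1, 1, 64, 64, dtype=torch.bool)
mask[..., 16:48, 16:48] = True  # hypothetical foreground-object region
fg_pred, bg_pred = net(img)
print(separated_loss(fg_pred, bg_pred, gt, mask).item())
```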


Author(s):  
Takuo Hamaguchi ◽  
Hidekazu Oiwa ◽  
Masashi Shimbo ◽  
Yuji Matsumoto

Knowledge base completion (KBC) aims to predict missing information in a knowledge base. In this paper, we address the out-of-knowledge-base (OOKB) entity problem in KBC: how to answer queries concerning test entities not observed at training time. Existing embedding-based KBC models assume that all test entities are available at training time, making it unclear how to obtain embeddings for new entities without costly retraining. To solve the OOKB entity problem without retraining, we use graph neural networks (Graph-NNs) to compute the embeddings of OOKB entities, exploiting the limited auxiliary knowledge provided at test time. The experimental results show the effectiveness of our proposed model in the OOKB setting. Additionally, in the standard KBC setting in which OOKB entities are not involved, our model achieves state-of-the-art performance on the WordNet dataset.
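
A minimal sketch of the graph-NN idea described above, with the propagation model heavily simplified from the paper: the embedding of an out-of-knowledge-base entity is computed at test time by pooling the embeddings of its known neighbours, transformed per relation, so no retraining is required. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class OOKBEmbedder(nn.Module):
    def __init__(self, n_entities, n_relations, dim=64):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)     # trained entity embeddings
        self.rel_w = nn.Embedding(n_relations, dim)  # per-relation transform (diagonal)

    def embed_new(self, triples):
        """triples: list of (relation_id, known_entity_id) pairs linking the
        OOKB entity to entities that were seen at training time."""
        msgs = [self.rel_w(torch.tensor(r)) * self.ent(torch.tensor(e))
                for r, e in triples]
        return torch.stack(msgs).mean(dim=0)  # pooled neighbourhood message

model = OOKBEmbedder(n_entities=1000, n_relations=20)
# A new entity connected to known entities 3 and 42 via relations 1 and 7:
vec = model.embed_new([(1, 3), (7, 42)])
print(vec.shape)  # torch.Size([64])
```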


2019 ◽  
Vol 9 (7) ◽  
pp. 1330 ◽  
Author(s):  
Yalong Jiang ◽  
Zheru Chi

Although state-of-the-art performance has been achieved in pixel-specific tasks such as saliency prediction and depth estimation, convolutional neural networks (CNNs) still perform unsatisfactorily in human parsing, where semantic information about detailed regions must be perceived under variations in viewpoint, pose, and occlusion. In this paper, we propose to improve the robustness of human parsing modules by introducing a depth-estimation module. A novel scheme is proposed for integrating the depth-estimation module with a human-parsing module, and the robustness of the overall model is improved with the automatically obtained depth labels. As another major concern, computational efficiency is also addressed: our proposed human parsing module with 24 layers can achieve performance similar to that of the baseline CNN model with over 100 layers, and the number of parameters in the overall model is less than that in the baseline model. Furthermore, we propose to reduce the computational burden by replacing a conventional CNN layer with a stack of simplified sub-layers, further reducing the overall number of trainable parameters. Experimental results show that the integration of the two modules improves human parsing without additional human labeling. The proposed model outperforms the benchmark solutions, and its capacity is better matched to the complexity of the task.
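
A sketch of the parameter-reduction idea described above, using the common depthwise-separable factorization as one concrete way to replace a conventional convolution with a stack of simpler sub-layers; the paper's exact sub-layer design may differ.

```python
import torch.nn as nn

def standard_conv(in_ch, out_ch):
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

def separable_conv(in_ch, out_ch):
    # Depthwise 3x3 (one filter per channel) followed by a pointwise 1x1 mix
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard_conv(128, 128)))   # 147584 parameters
print(count(separable_conv(128, 128)))  # 17792 parameters -- roughly 8x fewer
```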


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2691
Author(s):  
Seung-Jun Hwang ◽  
Sung-Jun Park ◽  
Gyu-Min Kim ◽  
Joong-Hwan Baek

A colonoscopy is a medical examination used to check for disease or abnormalities in the large intestine. If necessary, polyps or adenomas are removed through the scope during the colonoscopy, which can prevent colorectal cancer. However, the polyp detection rate differs depending on the condition and skill level of the endoscopist; some endoscopists have a 90% chance of missing an adenoma. Artificial intelligence and robot technologies for colonoscopy are being studied to compensate for these problems. In this study, we propose self-supervised monocular depth estimation using spatiotemporal consistency in the colon environment. Our contributions are a loss function for reconstruction errors between adjacent predicted depths and a depth feedback network that uses the predicted depth of the previous frame to predict the depth of the next frame. We performed quantitative and qualitative evaluations of our approach, and the proposed FBNet (depth FeedBack Network) outperforms state-of-the-art unsupervised depth estimation results on the UCL datasets.
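
A minimal sketch of the feedback idea described above (module shapes and the backbone are assumptions): the depth predicted for frame t-1 is fed back as an extra input channel when predicting frame t, and adjacent predictions are penalized for inconsistency. Warping by camera motion is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(  # placeholder for a real backbone
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, frame, prev_depth):
        # Feed back the previous prediction as an extra input channel
        return self.net(torch.cat([frame, prev_depth], dim=1))

net = FeedbackDepthNet()
frames = torch.rand(4, 1, 3, 64, 64)   # a short clip: T x B x C x H x W
depth = torch.zeros(1, 1, 64, 64)      # initial feedback for the first frame
consistency = 0.0
for t in range(frames.shape[0]):
    new_depth = net(frames[t], depth)
    if t > 0:  # reconstruction error between adjacent predicted depths
        consistency = consistency + F.l1_loss(new_depth, depth)
    depth = new_depth
print(consistency.item())
```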


2021 ◽  
Vol 309 ◽  
pp. 01070
Author(s):  
K. Swaraja ◽  
K. Naga Siva Pavan ◽  
S. Suryakanth Reddy ◽  
K. Ajay ◽  
P. Uday Kiran Reddy ◽  
...  

In several applications, such as scene interpretation and reconstruction, precise depth measurement from images is a significant challenge. Current depth estimation techniques frequently produce blurry, low-resolution estimates. Using transfer learning, this research implements a convolutional neural network for generating a high-resolution depth map from a single RGB image. With a typical encoder-decoder architecture, we initialize the encoder with features extracted from high-performing pre-trained networks, together with augmentation and training procedures that lead to more accurate outcomes. We demonstrate how, even with a very basic decoder, our approach can produce complete high-resolution depth maps. A wide range of deep learning approaches have recently been proposed and have shown significant promise in dealing with this classical ill-posed problem. The studies are carried out on KITTI and NYU Depth v2, two widely used public datasets. We also examine the errors produced by various models in order to expose the shortcomings of present approaches, and our method achieves viable performance on both KITTI and NYU Depth v2.
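
A sketch of the standard monocular-depth error metrics commonly reported on KITTI and NYU Depth v2 (absolute relative error, RMSE, and the delta-threshold accuracy), which are the kind of errors examined when comparing the models above.

```python
import torch

def depth_metrics(pred, gt):
    abs_rel = (torch.abs(pred - gt) / gt).mean()        # absolute relative error
    rmse = torch.sqrt(((pred - gt) ** 2).mean())        # root mean squared error
    ratio = torch.maximum(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean()              # fraction within 25% of GT
    return {"abs_rel": abs_rel.item(), "rmse": rmse.item(), "delta1": delta1.item()}

# Toy example with strictly positive depths:
pred, gt = torch.rand(1, 1, 64, 64) + 0.5, torch.rand(1, 1, 64, 64) + 0.5
print(depth_metrics(pred, gt))
```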

