SFA-MDEN: Semantic-Feature-Aided Monocular Depth Estimation Network Using Dual Branches

Sensors, 2021, Vol. 21 (16), pp. 5476
Author(s): Rui Wang, Jialing Zou, James Zhiqing Wen

Monocular depth estimation based on unsupervised learning has attracted great attention due to the rising demand for lightweight monocular vision sensors. Inspired by multi-task learning, semantic information has been used to improve monocular depth estimation models. However, multi-task learning is still limited by the need for multiple types of annotations, and to the best of our knowledge there are scarcely any large public datasets that provide all the necessary information. Therefore, we propose a novel network architecture, the Semantic-Feature-Aided Monocular Depth Estimation Network (SFA-MDEN), which extracts multi-resolution depth features and semantic features that are merged and fed into the decoder, with the goal of predicting depth with the support of semantics. Instead of using loss functions to relate semantics and depth, the fusion of semantic and depth feature maps is employed to predict monocular depth. As a result, two accessible datasets with similar topics, one for depth estimation and one for semantic segmentation, suffice as training sets for SFA-MDEN. We explored the performance of the proposed SFA-MDEN with experiments on different datasets, including KITTI, Make3D, and our own dataset BHDE-v1. The experimental results demonstrate that SFA-MDEN achieves competitive accuracy and generalization capacity compared to state-of-the-art methods.
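
The feature-level fusion described here can be pictured with a minimal sketch: two encoder branches produce depth and semantic feature maps that are concatenated before decoding. All module names, layer depths, and channel sizes below are illustrative assumptions, not the authors' actual SFA-MDEN configuration.

```python
import torch
import torch.nn as nn

class DualBranchDepthNet(nn.Module):
    def __init__(self, depth_channels=64, sem_channels=64):
        super().__init__()
        # Stand-ins for the two encoder branches (depth / semantics).
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(3, depth_channels, 3, padding=1), nn.ReLU(inplace=True))
        self.semantic_encoder = nn.Sequential(
            nn.Conv2d(3, sem_channels, 3, padding=1), nn.ReLU(inplace=True))
        # The decoder consumes the fused (concatenated) feature maps.
        self.decoder = nn.Sequential(
            nn.Conv2d(depth_channels + sem_channels, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1))  # one-channel depth prediction

    def forward(self, image):
        f_depth = self.depth_encoder(image)
        f_sem = self.semantic_encoder(image)
        fused = torch.cat([f_depth, f_sem], dim=1)  # feature-level fusion, no extra loss term
        return self.decoder(fused)
```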

2021, Vol. 13 (9), pp. 1673
Author(s): Wanpeng Xu, Ling Zou, Lingda Wu, Zhipeng Fu

For the task of monocular depth estimation, self-supervised learning supervises training by computing the pixel difference between the target image and the warped reference image, obtaining results comparable to those of full supervision. However, problematic pixels in low-texture regions are ignored, since most approaches assume that, with stereo pairs as input, no pixels violate the camera-motion assumption; this leads to an optimization problem in these regions. To tackle this problem, we instead compute the photometric loss on the lowest-level feature maps and apply first- and second-order smoothing to the depth, ensuring consistent gradients during optimization. Given the shortcomings of ResNet as the backbone, we propose a new depth estimation network architecture to improve edge localization accuracy and obtain clear outline information even at smoothed low-texture boundaries. To acquire more stable and reliable quantitative evaluation results, we introduce a virtual dataset into the self-supervised task, since it provides dense depth maps with pixel-by-pixel correspondence. Taking stereo pairs as input, we achieve performance exceeding that of prior methods on both the Eigen split of KITTI and the VKITTI2 dataset.
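
A common form of the first- and second-order depth smoothness terms mentioned above can be sketched as follows. The edge-aware weighting by image gradients is a standard choice assumed here and may differ from the paper's exact formulation.

```python
import torch

def smoothness_loss(depth, image):
    """First- and second-order depth smoothness, relaxed at image edges.
    A widely used formulation assumed here, not the paper's exact one.
    depth: (B, 1, H, W); image: (B, 3, H, W)."""
    # First-order depth gradients.
    d_dx = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    d_dy = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])
    # Second-order terms (gradients of the gradients).
    d_dx2 = torch.abs(d_dx[:, :, :, :-1] - d_dx[:, :, :, 1:])
    d_dy2 = torch.abs(d_dy[:, :, :-1, :] - d_dy[:, :, 1:, :])
    # Edge-aware weights: smooth less where the image has strong gradients.
    i_dx = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), 1, keepdim=True)
    i_dy = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), 1, keepdim=True)
    first = (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
    second = d_dx2.mean() + d_dy2.mean()
    return first + second
```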


Author(s): P. Bodani, K. Shreshtha, S. Sharma

Abstract. This paper addresses the task of semantic segmentation of orthoimagery using multimodal data, e.g., optical RGB, infrared, and digital surface models. We propose a deep convolutional neural network architecture termed OrthoSeg for semantic segmentation using multimodal, orthorectified and coregistered data. We also propose a training procedure for supervised training of OrthoSeg. The training procedure complements the inherent architectural characteristics of OrthoSeg for preventing complex co-adaptations of learned features, which may arise due to probable high dimensionality and spatial correlation in multimodal and/or multispectral coregistered data. OrthoSeg consists of parallel encoding networks for independent encoding of multimodal feature maps and a decoder designed for efficiently fusing independently encoded multimodal feature maps. A softmax layer at the end of the network uses the features generated by the decoder for pixel-wise classification. The decoder fuses feature maps from the parallel encoders locally as well as contextually at multiple scales to generate per-pixel feature maps for final pixel-wise classification, resulting in segmented output. We experimentally show the merits of OrthoSeg by demonstrating state-of-the-art accuracy on the ISPRS Potsdam 2D Semantic Segmentation dataset. Adaptability is one of the key motivations behind OrthoSeg, so that it serves as a useful architectural option for a wide range of problems involving the task of semantic segmentation of coregistered multimodal and/or multispectral imagery. Hence, OrthoSeg is designed to enable independent scaling of the parallel encoder networks and the decoder network to better match application requirements, such as the number of input channels, the effective field-of-view, and model capacity.
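
The parallel-encoder pattern can be illustrated with a small sketch: one independent encoder per modality, with the decoder operating on the fused feature maps. Channel counts, depths, and class count are assumptions for illustration, not the OrthoSeg configuration.

```python
import torch
import torch.nn as nn

class ParallelEncoderSeg(nn.Module):
    """One small encoder per modality (e.g. RGB, infrared, DSM), fused by a
    shared decoder. Sizes here are illustrative assumptions."""
    def __init__(self, modality_channels=(3, 1, 1), feat=32, num_classes=6):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, feat, 3, padding=1), nn.ReLU(inplace=True))
            for c in modality_channels)
        self.decoder = nn.Sequential(
            nn.Conv2d(feat * len(modality_channels), feat, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, num_classes, 1))  # per-pixel class scores

    def forward(self, modalities):  # list of tensors, one per modality
        feats = [enc(x) for enc, x in zip(self.encoders, modalities)]
        # Fusion by channel concatenation; softmax is applied in the loss.
        return self.decoder(torch.cat(feats, dim=1))
```

Because each encoder is an independent module, its width or depth can be scaled per modality without touching the others, which is the adaptability argument made above.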


Author(s): M. R. Bayanlou, M. Khoshboresh-Masouleh

Abstract. Single-task learning in artificial neural networks can fit a given model very well, but the benefits brought by transferring knowledge are then limited. When the number of tasks increases (e.g., semantic segmentation, panoptic segmentation, monocular depth estimation, and 3D point cloud processing), duplicate information may exist across tasks, and the improvement becomes less significant. Multi-task learning has emerged as a solution to such knowledge-transfer issues: it is an approach to scene understanding that involves multiple related tasks, each with potentially limited training data, and it improves generalization by leveraging the domain-specific information contained in the training data of related tasks. In urban management applications such as infrastructure development, traffic monitoring, smart 3D cities, and change detection, automated multi-task data analysis for scene understanding, based on semantic, instance, and panoptic annotation as well as monocular depth estimation, is required to generate precise urban models. In this study, a common framework for the performance assessment of multi-task learning methods on fixed-wing UAV images for 2D/3D city modelling is presented.
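
The multi-task setup described here usually amounts to hard parameter sharing: a shared encoder feeds one head per task, and the per-task losses are combined with scalar weights. The sketch below assumes two tasks (segmentation and depth) and illustrative layer sizes; it is not the framework evaluated in the paper.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Hard parameter sharing: one shared encoder, one head per task."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.seg_head = nn.Conv2d(32, num_classes, 1)   # semantic segmentation
        self.depth_head = nn.Conv2d(32, 1, 1)           # monocular depth

    def forward(self, x):
        f = self.shared(x)                              # features reused by all tasks
        return self.seg_head(f), self.depth_head(f)

def multitask_loss(seg_logits, seg_target, depth_pred, depth_target,
                   w_seg=1.0, w_depth=1.0):
    # Per-task losses combined with (assumed) fixed scalar weights.
    seg_loss = nn.functional.cross_entropy(seg_logits, seg_target)
    depth_loss = nn.functional.l1_loss(depth_pred, depth_target)
    return w_seg * seg_loss + w_depth * depth_loss
```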


2020, Vol. 13 (1), pp. 56
Author(s): Wei Liu, Xingyu Chen, Jiangjun Ran, Lin Liu, Qiang Wang, ...

Variations in lake area and shoreline can effectively indicate hydrological and climatic changes. Accordingly, how to automatically and simultaneously extract lake area and shoreline from remote sensing images attracts our attention. In this paper, we formulate lake area and shoreline extraction as a multitask learning problem. Different from existing models that take a deep and complex network architecture as the backbone to extract feature maps, we present LaeNet, a novel end-to-end lightweight multitask fully convolutional network with no downsampling, to automatically extract lake area and shoreline from remote sensing images. Landsat-8 images over Selenco and its vicinity in the Tibetan Plateau are utilized to train and evaluate our model. Experimental results over the testing image patches achieve an accuracy of 0.9962, precision of 0.9912, recall of 0.9982, F1-score of 0.9941, and mIoU of 0.9879, on par with or better than mainstream semantic segmentation models (UNet, DeepLabV3+, etc.). Notably, the running time of each epoch and the size of our model are only 6 s and 0.047 megabytes, a significant reduction compared to the other models. Finally, we conducted fieldwork to collect the in-situ shoreline position for one typical part of lake Selenco in order to further evaluate the performance of our model. The validation indicates high accuracy in our results (DRMSE: 30.84 m, DMAE: 22.49 m, DSTD: 21.11 m), only about one pixel of deviation for Landsat-8 images. LaeNet can potentially be extended to area segmentation and edge extraction tasks in other application fields.
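
A no-downsampling multitask FCN of the kind described can be very small, as the model-size figure suggests. The sketch below keeps full resolution with stride-1 convolutions and emits an area mask and an edge map from two 1x1 heads; the layer widths and the number of input bands are assumptions, not LaeNet's actual settings.

```python
import torch.nn as nn

class LightweightNoDownsampleFCN(nn.Module):
    """Sketch in the spirit of LaeNet: stride-1 convs preserve resolution,
    two heads share one tiny body. Sizes are illustrative assumptions."""
    def __init__(self, in_channels=3, feat=16):  # input band count is assumed
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.area_head = nn.Conv2d(feat, 1, 1)  # lake-area segmentation
        self.edge_head = nn.Conv2d(feat, 1, 1)  # shoreline (edge) extraction

    def forward(self, x):
        f = self.body(x)                # no pooling/striding: output keeps H x W
        return self.area_head(f), self.edge_head(f)
```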


Author(s): Ge Su, Bo Lin, Wei Luo, Jianwei Yin, Shuiguang Deng, ...

Parkinson’s disease is the second most common neurodegenerative disorder, commonly affecting elderly people over the age of 65. Its cardinal manifestation, hypomimia, referring to impairments in normal facial expressions, stays covert; even experienced doctors may miss these subtle changes, especially at a mild stage of the disease. Existing methods for hypomimia recognition are mainly dominated by statistical variable-based methods built on traditional machine learning algorithms. Despite their success in recognizing hypomimia, they show limited accuracy and lack the capability to perform semantic analysis. Therefore, developing a computer-aided diagnostic method for semantically recognizing hypomimia is appealing. In this article, we propose a Semantic-Feature-based Hypomimia Recognition network, named SFHR-NET, to recognize hypomimia from facial videos. First, a Semantic Feature Classifier (SF-C) is proposed to adaptively adjust feature maps salient to hypomimia, which leads the encoder and classifier to focus more on areas of hypomimia interest. In SF-C, a progressive confidence strategy (PCS) ensures more reliable semantic features. Then, a two-stream framework is introduced to fuse the spatial data stream and the temporal optical-flow stream, allowing the encoder to semantically and progressively characterize the rigid process of hypomimia. Finally, to improve the interpretability of the model, Gradient-weighted Class Activation Mapping (Grad-CAM) is integrated to generate attention maps that cast our engineered features into hypomimia-interest regions. These highlighted regions provide visual explanations for the decisions of our network. Experimental results on real-world data demonstrate the effectiveness of our method in detecting hypomimia.
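
Grad-CAM itself is a generic technique: a convolutional layer's activations are weighted by the spatially pooled gradients of the target class score. The sketch below is a minimal standalone version assuming a model that returns (batch, classes) logits; it is not the SFHR-NET-specific integration.

```python
import torch

def grad_cam(model, layer, image, class_idx):
    """Minimal Grad-CAM over one conv layer; `model` is assumed to map an
    image batch to (B, num_classes) logits."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    score = model(image)[0, class_idx]        # target class score
    model.zero_grad()
    score.backward()                          # populates the hooked gradients
    h1.remove(); h2.remove()
    weights = grads['v'].mean(dim=(2, 3), keepdim=True)  # pool grads per channel
    cam = torch.relu((weights * acts['v']).sum(dim=1))   # weighted activation sum
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
```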


Sensors, 2019, Vol. 19 (14), pp. 3224
Author(s): Pablo R. Palafox, Johannes Betz, Felix Nobis, Konstantin Riedl, Markus Lienkamp

Typically, lane departure warning systems rely on lane lines being present on the road. However, in many scenarios, e.g., secondary roads or some streets in cities, lane lines are either not present or not sufficiently well signaled. In this work, we present a vision-based method to locate a vehicle within the road when no lane lines are present, using only RGB images as input. To this end, we propose to fuse the outputs of a semantic segmentation and a monocular depth estimation architecture to locally reconstruct a semantic 3D point cloud of the viewed scene. We only retain points belonging to the road and, additionally, to any kind of fences or walls that might be present right at the sides of the road. We then compute the width of the road at a certain point on the planned trajectory and, additionally, what we denote as the fence-to-fence distance. Our system is suited to any kind of motoring scenario and is especially useful when lane lines are not present on the road or do not signal the path correctly. The additional fence-to-fence distance computation is complementary to the road’s width estimation. We quantitatively test our method on a set of images featuring streets of the city of Munich that contain a road-fence structure, so as to compare our two proposed variants, namely the road’s width and the fence-to-fence distance computation. In addition, we validate our system qualitatively on the Stuttgart sequence of the publicly available Cityscapes dataset, where no fences or walls are present at the sides of the road, thus demonstrating that our system can be deployed in a standard city-like environment. For the benefit of the community, we make our software open source.
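
The core geometric step, back-projecting road-labelled pixels to 3D with the pinhole model and reading off a lateral extent, can be sketched as follows. This is a simplified stand-in for the paper's method; the function name, the single-row sampling, and the inputs are assumptions.

```python
import numpy as np

def road_width_at_row(depth, seg, K, road_id, row):
    """Back-project one image row to 3D using a metric depth map and camera
    intrinsics K, keep road-labelled pixels, and take their lateral extent
    as a width estimate. depth, seg: (H, W) arrays; K: 3x3 intrinsics."""
    h, w = depth.shape
    us = np.arange(w)
    mask = seg[row] == road_id                 # road pixels in this row
    z = depth[row][mask]                       # metric depth along the row
    x = (us[mask] - K[0, 2]) * z / K[0, 0]     # pinhole: x = (u - cx) * z / fx
    if x.size < 2:
        return 0.0                             # not enough road points here
    return float(x.max() - x.min())            # lateral extent = width estimate
```

The fence-to-fence variant would run the same computation with the fence/wall class IDs instead of the road ID.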


2021
Author(s): Feng Wei, XingHui Yin, Jie Shen, HuiBin Wang

Abstract. With the development of deep learning, the accuracy and efficiency of algorithms applied to monocular depth estimation have greatly improved, but existing algorithms require substantial computing resources. How to deploy these algorithms on UAVs and other small robotic platforms is therefore an urgent need. Based on a fully convolutional neural network and the KITTI dataset, this paper uses depthwise separable convolutions to optimize the network architecture, reducing the number of training parameters and improving computing speed. Experimental results show that our method is effective and offers a useful reference for the development of monocular depth estimation algorithms.
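
A depthwise separable convolution factorizes a standard convolution into a per-channel spatial convolution followed by a 1x1 pointwise convolution. The sketch below shows the standard formulation the abstract refers to; the exact layer settings (normalization, activation, widths) used in the paper are not given here and are assumed.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv (groups == in channels) followed by a 1x1
    pointwise conv; BN + ReLU are a common, assumed choice."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```

For a 3x3 kernel this factorization needs roughly in_ch·9 + in_ch·out_ch weights instead of in_ch·out_ch·9 for a standard convolution, which is where the parameter and speed savings come from.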


Author(s): Bin Wang, Guojun Qi, Sheng Tang, Tianzhu Zhang, Yunchao Wei, ...

Semantic segmentation suffers from the fact that densely annotated masks are expensive to obtain. To tackle this problem, we aim to learn to segment by leveraging only scribbles, which are much easier to collect as supervision. To fully exploit the limited pixel-level annotations from scribbles, we present a novel Boundary Perception Guidance (BPG) approach, which consists of two basic components, i.e., prediction refinement and boundary regression. Specifically, the prediction refinement progressively produces a better segmentation by adopting an iterative upsampling and semantic feature enhancement strategy. In the boundary regression, we employ class-agnostic edge maps for supervision to effectively guide the segmentation network in localizing the boundaries between different semantic regions, producing finer-grained feature maps for semantic segmentation. Experimental results on PASCAL VOC 2012 demonstrate that the proposed BPG achieves an mIoU of 73.2% without a fully connected Conditional Random Field (CRF) and 76.0% with CRF, setting the new state of the art in the literature.
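
One way to picture the boundary regression component: derive a class-agnostic edge target wherever the label mask changes class, and train a predicted edge map against it. The sketch below derives edges from a dense mask for simplicity, whereas BPG works from scribble annotations, so the paper's exact edge-map construction certainly differs.

```python
import torch
import torch.nn.functional as F

def class_agnostic_edge_loss(edge_logits, seg_mask):
    """Hedged sketch of class-agnostic edge supervision.
    edge_logits: (B, 1, H, W) predicted edge map; seg_mask: (B, H, W) labels."""
    # A pixel is an edge if its label differs from its right or lower neighbour.
    dx = (seg_mask[:, :, :-1] != seg_mask[:, :, 1:])
    dy = (seg_mask[:, :-1, :] != seg_mask[:, 1:, :])
    edge = torch.zeros_like(seg_mask, dtype=torch.float32)
    edge[:, :, :-1] = torch.maximum(edge[:, :, :-1], dx.float())
    edge[:, :-1, :] = torch.maximum(edge[:, :-1, :], dy.float())
    # Binary cross-entropy between predicted and derived edge maps.
    return F.binary_cross_entropy_with_logits(edge_logits.squeeze(1), edge)
```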


2021, Vol. 13 (19), pp. 3900
Author(s): Haoran Wei, Xiangyang Xu, Ni Ou, Xinru Zhang, Yaping Dai

Remote sensing is now widely used in various fields, and research on automatic land-cover segmentation methods for remote sensing imagery is significant to the development of remote sensing technology. Deep learning methods, which are developing rapidly in the field of semantic segmentation, have been widely applied to remote sensing imagery segmentation. In this work, a novel deep learning network, the Dual Encoder with Attention Network (DEANet), is proposed. In this network, a dual-branch encoder structure, whose first branch is used to generate a rough guidance feature map as area attention to help re-encode feature maps in the second branch, is proposed to improve the encoding ability of the network, and an improved pyramid partial decoder (PPD), based on the parallel partial decoder, is put forward to make fuller use of the features from the encoder along with the receptive field block (RFB). In addition, an edge attention module using transfer learning is introduced to explicitly improve segmentation performance in edge areas. Beyond the architecture, a loss function composed of a weighted Cross-Entropy (CE) loss and a weighted Union-subtract-Intersection (UsI) loss is designed for training, where the UsI loss is a new region-aware loss that replaces the IoU loss to adapt to multi-classification tasks. Furthermore, a detailed training strategy for the network is introduced. Extensive experiments on three public datasets verify the effectiveness of each proposed module in our framework and demonstrate that our method outperforms several state-of-the-art methods.
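
Reading "Union subtract Intersection" literally suggests a soft region loss of the form (union − intersection) / union, averaged over classes, which is zero when prediction and target coincide. The sketch below is speculative: the paper's exact UsI definition and per-class weighting are not reproduced here.

```python
import torch

def usi_loss(probs, target_onehot, eps=1e-6):
    """Speculative 'Union subtract Intersection' region loss.
    probs: (B, C, H, W) softmax outputs; target_onehot: same shape."""
    dims = (0, 2, 3)
    inter = (probs * target_onehot).sum(dims)                          # soft intersection
    union = (probs + target_onehot - probs * target_onehot).sum(dims)  # soft union
    # 0 when prediction matches the target exactly, 1 when fully disjoint.
    return ((union - inter) / (union + eps)).mean()
```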

