Unsupervised Learning of Depth and Camera Pose with Feature Map Warping

Sensors ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 923
Author(s):  
Ente Guo ◽  
Zhifeng Chen ◽  
Yanlin Zhou ◽  
Dapeng Oliver Wu

Estimating image depth and agent egomotion is important for autonomous vehicles and robots to understand the surrounding environment and avoid collisions. Most existing unsupervised methods estimate depth and camera egomotion by minimizing the photometric error between adjacent frames. However, photometric consistency sometimes does not hold in real situations, such as under brightness changes, moving objects, and occlusion. To reduce the influence of brightness changes, we propose a feature pyramid matching loss (FPML), which captures the trainable feature error between the current frame and adjacent frames and is therefore more robust than the photometric error. In addition, we propose the occlusion-aware mask (OAM) network, which can indicate occlusion according to changes in the masks and thus improve the estimation accuracy of depth and camera pose. The experimental results verify that the proposed unsupervised approach is highly competitive against the state-of-the-art methods, both qualitatively and quantitatively. Specifically, our method reduces the absolute relative error (Abs Rel) by 0.017–0.088.
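The contrast between a plain photometric error and a pyramid-level feature error can be illustrated with a minimal numpy sketch. This is not the paper's trainable FPML: the "features" here are just average-pooled copies of the image, and the function names are illustrative assumptions.

```python
import numpy as np

def photometric_error(frame_a, frame_b):
    """Mean absolute intensity difference between two aligned frames."""
    return np.mean(np.abs(frame_a - frame_b))

def feature_pyramid_error(frame_a, frame_b, levels=3):
    """Average per-level error over a simple image pyramid.

    Stand-in for a trainable feature extractor: each pyramid level is a
    2x2 average-pooled copy of the previous one (assumes even dimensions).
    """
    total = 0.0
    a, b = frame_a, frame_b
    for _ in range(levels):
        total += np.mean(np.abs(a - b))
        # 2x2 average pooling to form the next, coarser pyramid level
        a = a.reshape(a.shape[0] // 2, 2, a.shape[1] // 2, 2).mean(axis=(1, 3))
        b = b.reshape(b.shape[0] // 2, 2, b.shape[1] // 2, 2).mean(axis=(1, 3))
    return total / levels
```

In the actual method the per-level features come from a learned network, which is what makes the loss robust to brightness changes; the pooling here only shows the multi-scale structure of the comparison.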

Sensors ◽  
2020 ◽  
Vol 20 (13) ◽  
pp. 3737
Author(s):  
Lu Xiong ◽  
Yongkun Wen ◽  
Yuyao Huang ◽  
Junqiao Zhao ◽  
Wei Tian

We propose a completely unsupervised approach to simultaneously estimate scene depth, ego-pose, ground segmentation, and the ground normal vector from only monocular RGB video sequences. In our approach, the estimates for different scene structures mutually benefit each other through joint optimization. Specifically, we use a mutual information loss to pre-train the ground segmentation network before adding the corresponding self-supervised labels obtained by a geometric method. By exploiting the static nature of the ground and its normal vector, the scene depth and ego-motion can be efficiently learned through the self-supervised learning procedure. Extensive experimental results on both the Cityscapes and KITTI benchmarks demonstrate the significant improvement in estimation accuracy for both scene depth and ego-pose achieved by our approach. We also achieve an average error of about 3° for the estimated ground normal vectors. By deploying our proposed geometric constraints, the IoU accuracy of unsupervised ground segmentation is increased by 35% on the Cityscapes dataset.
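A ground normal vector of the kind evaluated above (average error of about 3°) can be fitted to back-projected ground points by least squares. The sketch below is a generic plane fit via SVD, not the paper's geometric method; both helper names are hypothetical.

```python
import numpy as np

def fit_ground_normal(points):
    """Fit a plane to 3D ground points and return its unit normal.

    The least-squares normal of the centered point cloud is the right
    singular vector associated with the smallest singular value.
    """
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    n = vt[-1]
    return n / np.linalg.norm(n)

def angular_error_deg(n_est, n_gt):
    """Angle in degrees between two plane normals (sign-invariant)."""
    cos = abs(np.dot(n_est, n_gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

The sign-invariant comparison matters because a plane normal is only defined up to orientation.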


2020 ◽  
Vol 10 (4) ◽  
pp. 1467
Author(s):  
Chao Sheng ◽  
Shuguo Pan ◽  
Wang Gao ◽  
Yong Tan ◽  
Tao Zhao

Traditional Simultaneous Localization and Mapping (SLAM) (with loop closure detection) and Visual Odometry (VO) (without loop closure detection) are based on the static-environment assumption. When working in dynamic environments, they perform poorly whether using direct methods or indirect (feature-point) methods. In this paper, Dynamic-DSO, a semantic monocular direct visual odometry system based on DSO (Direct Sparse Odometry), is proposed. The proposed system is implemented entirely with the direct method, in contrast to most current dynamic systems, which combine the indirect method with deep learning. Firstly, convolutional neural networks (CNNs) are applied to the original RGB image to generate pixel-wise semantic information for dynamic objects. Then, based on this semantic information, dynamic candidate points are filtered out during keyframe candidate-point extraction; only static candidate points are retained in the tracking and optimization module to achieve accurate camera pose estimation in dynamic environments. The photometric errors contributed by projection points in the dynamic regions of subsequent frames are removed from the overall photometric error in the pyramid motion-tracking model. Finally, sliding-window optimization, which neglects the photometric error in the dynamic region of each keyframe, is applied to obtain the precise camera pose. Experiments on the public TUM dynamic dataset and the modified EuRoC dataset show that the positioning accuracy and robustness of the proposed Dynamic-DSO are significantly higher than those of state-of-the-art direct methods in dynamic environments, and the semi-dense cloud map constructed by Dynamic-DSO is clearer and more detailed.
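The candidate-point filtering step described above amounts to masking out points that fall inside the semantic dynamic-object regions. A minimal sketch, with illustrative names and a boolean mask standing in for the CNN output:

```python
import numpy as np

def filter_dynamic_points(candidates, dynamic_mask):
    """Keep only candidate pixels that fall outside the dynamic-object mask.

    candidates   : (N, 2) integer array of (row, col) candidate points
    dynamic_mask : (H, W) boolean array, True where the semantic
                   segmentation network detected a dynamic object
    """
    keep = ~dynamic_mask[candidates[:, 0], candidates[:, 1]]
    return candidates[keep]
```

In the full system the same mask would also be used to exclude photometric residuals from dynamic regions during tracking and sliding-window optimization.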


2021 ◽  
Vol 11 (2) ◽  
pp. 645
Author(s):  
Xujie Kang ◽  
Jing Li ◽  
Xiangtao Fan ◽  
Hongdeng Jian ◽  
Chen Xu

Visual simultaneous localization and mapping (SLAM) is challenging in dynamic environments, as moving objects can impair camera pose tracking and mapping. This paper introduces a method for robust dense object-level SLAM in dynamic environments that takes a live stream of RGB-D frame data as input, detects moving objects, and segments the scene into different objects while simultaneously tracking and reconstructing their 3D structures. The approach provides a new method of dynamic object detection that integrates prior knowledge from a pre-constructed object model database, object-oriented 3D tracking against the camera pose, and the association between instance segmentation results on the current frame and the object database to find dynamic objects in the current frame. By leveraging the 3D static model for frame-to-model alignment, as well as dynamic object culling, the camera motion estimation reduces the overall drift. Based on the camera pose accuracy and instance segmentation results, an object-level semantic map representation is constructed for the world map. Experimental results obtained using the TUM RGB-D dataset, comparing the proposed method to related state-of-the-art approaches, demonstrate that our method achieves similar performance in static scenes and improved accuracy and robustness in dynamic scenes.
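The association step between current-frame instance segmentations and the object database is commonly done by greedy mask-overlap matching. The sketch below is one such generic scheme (IoU over projected masks), assumed for illustration; the paper's actual association criterion may differ, and all names are hypothetical.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def associate_instances(frame_masks, db_masks, iou_thresh=0.5):
    """Greedily match each current-frame instance mask to the database
    object whose (projected) mask overlaps it best, above a threshold."""
    matches = {}
    for i, fm in enumerate(frame_masks):
        best_j, best_iou = -1, iou_thresh
        for j, dm in enumerate(db_masks):
            iou = mask_iou(fm, dm)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            matches[i] = best_j
    return matches
```

Instances left unmatched would then be candidates for new or dynamic objects in the current frame.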


2019 ◽  
Author(s):  
Holger Meuel

Abstract: This work deals with affine motion-compensated prediction (MCP) in video coding. Using rate-distortion theory and the displacement estimation error caused by inaccurate motion parameter estimation, the minimum bit rate for encoding the prediction error is derived. Similarly, a 4-parameter simplified affine model, as considered for the upcoming video coding standard VVC, is analyzed. Both models provide valuable information about the minimum bit rate for encoding the prediction error as a function of the motion estimation accuracy. Although the bit rate in MCP can be reduced by using a motion model capable of describing the motion in the scene, the total video bit rate may remain high. Thus, a codec-independent coding system is proposed for aerial videos, which exploits the planarity of such sequences. Only newly emerging areas and moving objects in each frame are encoded. From these, the decoder reconstructs a mosaic, from which video frames are extracted again. Th...
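The 4-parameter simplified affine model restricts the full 6-parameter affine motion to rotation, zoom, and translation. A minimal sketch of applying such a model to 2D pixel coordinates, assuming a (scale, angle, translation) parameterization for illustration rather than VVC's control-point form:

```python
import numpy as np

def affine4_warp(points, scale, theta, tx, ty):
    """Apply a 4-parameter affine motion model (zoom, rotation,
    translation) to an (N, 2) array of 2D points."""
    a = scale * np.cos(theta)
    b = scale * np.sin(theta)
    # Rotation-plus-scale matrix; the two parameters a, b encode both.
    A = np.array([[a, -b],
                  [b,  a]])
    return points @ A.T + np.array([tx, ty])
```

With scale = 1, theta = 0, and zero translation this reduces to the identity, i.e. pure translational MCP is a special case of the model.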


Electronics ◽  
2021 ◽  
Vol 10 (24) ◽  
pp. 3177
Author(s):  
Venkat Anil Adibhatla ◽  
Yu-Chieh Huang ◽  
Ming-Chung Chang ◽  
Hsu-Chi Kuo ◽  
Abhijeet Utekar ◽  
...  

Deep learning methods are currently used in industry to improve the efficiency and quality of products. Detecting defects on printed circuit boards (PCBs) is a challenging task and is usually addressed by automated visual inspection, automated optical inspection, manual inspection, or supervised learning methods such as the you-only-look-once (YOLO) family (tiny YOLO, YOLOv2, YOLOv3, YOLOv4, and YOLOv5). Previously described methods for defect detection in PCBs require large numbers of labeled images, which makes training computationally expensive and demands a great deal of human effort to label the data. This paper introduces a new unsupervised learning method for the detection of defects in PCBs using student–teacher feature pyramid matching, in which a pre-trained image classification model is used to learn the distribution of anomaly-free images. This knowledge is then distilled into a student network with the same architecture as the teacher network. The one-step transfer retains key clues as much as possible. In addition, we incorporated a multi-scale feature matching strategy into the framework. A mixture of multi-level knowledge from the feature pyramid is passed to the student network through hierarchical feature alignment, allowing it to detect anomalies of various sizes. A scoring function reflects the probability of the occurrence of anomalies. This framework helped us to achieve accurate anomaly detection. Apart from accuracy, its inference speed also reached around 100 frames per second.
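The scoring function in student–teacher feature matching typically measures how far the student's features drift from the teacher's at each pyramid level, since the student was trained only on anomaly-free images. A minimal sketch of such a multi-scale score, with pre-extracted feature maps standing in for the two networks (the spatial up-sampling and normalization used in practice are omitted):

```python
import numpy as np

def anomaly_score(teacher_feats, student_feats):
    """Multi-scale anomaly score: mean squared feature distance per
    pyramid level, averaged over levels. Higher = more anomalous."""
    scores = [np.mean((t - s) ** 2)
              for t, s in zip(teacher_feats, student_feats)]
    return float(np.mean(scores))
```

On anomaly-free inputs the student has learned to mimic the teacher, so the score stays near zero; defects produce feature mismatches and a larger score.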


2005 ◽  
Vol 19 (3) ◽  
pp. 216-231 ◽  
Author(s):  
Albertus A. Wijers ◽  
Maarten A.S. Boksem

Abstract. We recorded event-related potentials (ERPs) in an illusory conjunction task in which subjects were cued on each trial to search for a particular colored letter in a subsequently presented test array consisting of three different letters in three different colors. In a proportion of trials the target letter was present, and in other trials none of the relevant features were present. In still other trials one of the features (color or letter identity) was present, or both features were present but not combined in the same display element. When relevant features were present, this resulted in an early posterior selection negativity (SN) and a frontal selection positivity (FSP). When a target was presented, this resulted in an FSP that was enhanced after 250 ms compared to when both relevant features were present but not combined in the same display element, suggesting that this effect reflects an extra process of attending to both features bound to the same object. There were no differences between the ERPs in feature-error and conjunction-error trials, contrary to the idea that these two types of errors are due to different (perceptual and attentional) mechanisms. The P300 in conjunction-error trials was much reduced relative to the P300 in correct target-detection trials. A similar, error-related-negativity-like component was visible in the response-locked averages in correct target-detection trials, feature-error trials, and conjunction-error trials. Dipole modeling of this component yielded a source in a deep medial-frontal location. These results suggest that this type of task induces a high level of response conflict, in which decision-related processes may play a major role.

