Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow

2020 ◽  
Vol 34 (07) ◽  
pp. 10713-10720
Author(s):  
Mingyu Ding ◽  
Zhe Wang ◽  
Bolei Zhou ◽  
Jianping Shi ◽  
Zhiwu Lu ◽  
...  

A major challenge for video semantic segmentation is the lack of labeled data. In most benchmark datasets, only one frame of a video clip is annotated, which makes most supervised methods fail to utilize information from the rest of the frames. To exploit the spatio-temporal information in videos, many previous works use pre-computed optical flows, which encode the temporal consistency to improve the video segmentation. However, the video segmentation and optical flow estimation are still considered as two separate tasks. In this paper, we propose a novel framework for joint video semantic segmentation and optical flow estimation. Semantic segmentation brings semantic information to handle occlusion for more robust optical flow estimation, while the non-occluded optical flow provides accurate pixel-level temporal correspondences to guarantee the temporal consistency of the segmentation. Moreover, our framework is able to utilize both labeled and unlabeled frames in the video through joint training, while no additional calculation is required in inference. Extensive experiments show that the proposed model makes the video semantic segmentation and optical flow estimation benefit from each other and outperforms existing methods under the same settings in both tasks.

Author(s):  
Shanshan Zhao ◽  
Xi Li ◽  
Omar El Farouk Bourahla

As an important and challenging problem in computer vision, learning based optical flow estimation aims to discover the intrinsic correspondence structure between two adjacent video frames through statistical learning. Therefore, a key issue to solve in this area is how to effectively model the multi-scale correspondence structure properties in an adaptive end-to-end learning fashion. Motivated by this observation, we propose an end-to-end multi-scale correspondence structure learning (MSCSL) approach for optical flow estimation. In principle, the proposed MSCSL approach is capable of effectively capturing the multi-scale inter-image-correlation correspondence structures within a multi-level feature space from deep learning. Moreover, the proposed MSCSL approach builds a spatial Conv-GRU neural network model to adaptively model the intrinsic dependency relationships among these multi-scale correspondence structures. Finally, the above procedures for correspondence structure learning and multi-scale dependency modeling are implemented in a unified end-to-end deep learning framework. Experimental results on several benchmark datasets demonstrate the effectiveness of the proposed approach.
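The matching operation at the heart of such correspondence-structure learning can be illustrated with a plain cost-volume computation between two feature maps. The sketch below is an illustrative assumption, not the paper's MSCSL implementation: `correlation_volume` computes a normalised dot-product correlation over a small displacement window, in the spirit of FlowNet-style matching layers.

```python
import numpy as np

def correlation_volume(f1, f2, max_disp):
    """Correlate two (C, H, W) feature maps over a (2d+1)^2 displacement
    window; output has one channel per candidate displacement."""
    C, H, W = f1.shape
    d = max_disp
    f2p = np.pad(f2, ((0, 0), (d, d), (d, d)))      # zero-pad spatial dims
    out = np.zeros(((2 * d + 1) ** 2, H, W))
    k = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = f2p[:, d + dy:d + dy + H, d + dx:d + dx + W]
            out[k] = (f1 * shifted).sum(axis=0) / C  # normalised dot product
            k += 1
    return out

vol = correlation_volume(np.ones((2, 3, 3)), np.ones((2, 3, 3)), 1)
# vol.shape == (9, 3, 3); the zero-displacement channel is vol[4]
```

In a learned model this volume would be produced at several feature scales and fed to the recurrent aggregation stage; here it only shows the correlation structure being modelled.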


2012 ◽  
Vol 24 (4) ◽  
pp. 686-698 ◽  
Author(s):  
Lei Chen ◽  
Hua Yang ◽  
Takeshi Takaki ◽  
Idaku Ishii

In this paper, we propose a novel method for accurate optical flow estimation in real time for both high-speed and low-speed moving objects based on High-Frame-Rate (HFR) videos. We introduce a multiframe-straddling function to select several pairs of images with different frame intervals from an HFR image sequence, even when the estimated optical flow must be output at standard video rates (NTSC at 30 fps and PAL at 25 fps). The multiframe-straddling function can remarkably improve the measurable range of velocities in optical flow estimation without heavy computation by adaptively selecting a small frame interval for high-speed objects and a large frame interval for low-speed objects. On the basis of the relationship between the frame intervals and the accuracies of the optical flows estimated by the Lucas–Kanade method, we devise a method to determine multiple frame intervals in optical flow estimation and select an optimal frame interval from these intervals according to the amplitude of the estimated optical flow. Our method was implemented in software on a high-speed vision platform, IDP Express. The estimated optical flows were accurately output at intervals of 40 ms in real time by using three pairs of 512×512 images; these images were selected by frame-straddling a 2000-fps video with intervals of 0.5, 1.5, and 5 ms. Several experiments were performed on high-speed movements to verify that our method can remarkably improve the measurable range of velocities in optical flow estimation, compared to optical flows estimated for 25-fps videos with the Lucas–Kanade method.
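The adaptive interval selection described above can be sketched as a simple rule: pick the largest candidate interval whose expected per-pair displacement stays within the tracker's reliable range. The candidate intervals mirror the paper's 0.5/1.5/5 ms setting, but the displacement bound and helper name are illustrative assumptions, not the paper's calibration.

```python
# Candidate frame-straddling intervals (ms) from a 2000-fps stream.
INTERVALS_MS = [0.5, 1.5, 5.0]

# Hypothetical bound on the per-pair displacement (pixels) that a
# Lucas-Kanade tracker can resolve reliably.
MAX_DISPLACEMENT_PX = 8.0

def select_interval(velocity_px_per_ms):
    """Choose the largest interval keeping displacement trackable:
    large gaps for slow motion (better precision), small gaps for
    fast motion (stays within the measurable range)."""
    for dt in reversed(INTERVALS_MS):          # try 5.0, then 1.5, then 0.5
        if velocity_px_per_ms * dt <= MAX_DISPLACEMENT_PX:
            return dt
    return INTERVALS_MS[0]                     # fastest motion: smallest gap

print(select_interval(0.5))    # slow object  -> 5.0
print(select_interval(3.0))    # medium speed -> 1.5
print(select_interval(30.0))   # fast object  -> 0.5
```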


Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1150
Author(s):  
Jun Nagata ◽  
Yusuke Sekikawa ◽  
Yoshimitsu Aoki

In this work, we propose a novel method of estimating optical flow from event-based cameras by matching the time surface of events. The proposed loss function measures the timestamp consistency between the time surface formed by the latest timestamp of each pixel and one that is slightly shifted in time. This makes it possible to estimate dense optical flow with high accuracy without restoring luminance or using additional sensor information. In our experiments, we show that the gradient is more accurate and the loss landscape more stable than those of the variance loss used in the motion-compensation approach. In addition, we show that the optical flow can be estimated with high accuracy by optimization with L1 smoothness regularization on publicly available datasets.
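A time surface, as used here, is simply a per-pixel map of the most recent event timestamp. The minimal sketch below is illustrative (the `time_surface` helper and the `(t, x, y)` event format are assumptions for this example, not the paper's data layout):

```python
import numpy as np

def time_surface(events, height, width):
    """Build a time surface: each pixel stores the timestamp of the
    most recent event at that location (0 where no event occurred).
    `events` is an iterable of (t, x, y) tuples, assumed time-ordered."""
    ts = np.zeros((height, width), dtype=np.float64)
    for t, x, y in events:
        ts[y, x] = t          # later events overwrite earlier timestamps
    return ts

events = [(0.1, 0, 0), (0.2, 1, 1), (0.3, 0, 0)]
surf = time_surface(events, 2, 2)
# surf[0, 0] == 0.3 (latest event wins), surf[1, 1] == 0.2
```

The proposed loss compares such a surface against one built from a slightly shifted time window; matching the two under a candidate flow is what drives the estimation.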


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Jiangyun Li ◽  
Yikai Zhao ◽  
Xingjian He ◽  
Xinxin Zhu ◽  
Jing Liu

A major challenge for semantic video segmentation is how to exploit the spatiotemporal information and produce consistent results for a video sequence. Many previous works utilize the precomputed optical flow to warp the feature maps across adjacent frames. However, the imprecise optical flow and the warping operation without any learnable parameters may not achieve accurate feature warping and only bring a slight improvement. In this paper, we propose a novel framework named Dynamic Warping Network (DWNet) to adaptively warp the interframe features for improving the accuracy of warping-based models. Firstly, we design a flow refinement module (FRM) to optimize the precomputed optical flow. Then, we propose a flow-guided convolution (FG-Conv) to achieve the adaptive feature warping based on the refined optical flow. Furthermore, we introduce the temporal consistency loss including the feature consistency loss and prediction consistency loss to explicitly supervise the warped features instead of simple feature propagation and fusion, which guarantees the temporal consistency of video segmentation. Note that our DWNet adopts extra constraints to improve the temporal consistency in the training phase, while no additional calculation and postprocessing are required during inference. Extensive experiments show that our DWNet can achieve consistent improvement over various strong baselines and achieves state-of-the-art accuracy on the Cityscapes and CamVid benchmark datasets.
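Warping a previous-frame feature map with an optical flow field, the operation that FG-Conv makes adaptive, can be sketched as backward sampling. The snippet below uses nearest-neighbour sampling for brevity and single-channel features; learned warping-based models typically use differentiable bilinear sampling, and the function name is an assumption for illustration.

```python
import numpy as np

def warp_backward(feat, flow):
    """Sample previous-frame features at positions displaced by the
    flow, producing a current-frame-aligned feature map.
    feat: (H, W); flow: (H, W, 2) giving (dx, dy) per pixel."""
    H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return feat[src_y, src_x]

feat = np.arange(9.0).reshape(3, 3)
flow = np.zeros((3, 3, 2))
flow[..., 0] = 1.0                 # sample one pixel to the right
warped = warp_backward(feat, flow)
```

A fixed warp like this has no learnable parameters, which is exactly the limitation the abstract points out; DWNet's contribution is to refine the flow and make the sampling adaptive.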


Sensors ◽  
2019 ◽  
Vol 19 (11) ◽  
pp. 2459 ◽  
Author(s):  
Ji-Hun Mun ◽  
Moongu Jeon ◽  
Byung-Geun Lee

Herein, we propose an unsupervised learning architecture under coupled consistency conditions to estimate depth, ego-motion, and optical flow. Learning techniques previously developed in computer vision rely on large amounts of ground-truth data for network training. A ground-truth dataset of depth and optical flow collected from the real world requires tremendous pre-processing effort due to exposure to noise artifacts. In this paper, we propose a framework that trains networks using different types of data with combined losses derived from a coupled consistency structure. The core concept is composed of two parts. First, we compare the optical flows estimated by both the depth-plus-ego-motion pipeline and the flow estimation network. Subsequently, to suppress artifacts in occluded regions of the estimated optical flow, we compute flow local consistency along the forward–backward directions. Second, synthesis consistency enables the exploration of the geometric correlation between the spatial and temporal domains in a stereo video. We perform extensive experiments on depth, ego-motion, and optical flow estimation on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset. We verify that the flow local consistency loss improves the optical flow accuracy in occluded regions. Furthermore, we show that the view-synthesis-based photometric loss enhances the depth and ego-motion accuracy via scene projection. The experimental results exhibit competitive performance of the estimated depth and optical flow; moreover, the induced ego-motion is comparable to that obtained from other unsupervised methods.
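The forward–backward local consistency check can be sketched as follows: a pixel is flagged as occluded when its forward flow plus the backward flow sampled at the displaced position do not cancel out. The function name, nearest-neighbour sampling, and threshold below are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def fb_consistency_mask(flow_fw, flow_bw, thresh=1.0):
    """Return a boolean non-occluded mask: True where forward and
    backward flows are mutually consistent.
    flow_fw, flow_bw: (H, W, 2) flow fields in (dx, dy) order."""
    H, W = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    # Position each pixel lands on under the forward flow (clipped).
    tx = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    ty = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    bw_at_target = flow_bw[ty, tx]
    # Consistent flows cancel: fw(p) + bw(p + fw(p)) should be ~0.
    err = np.linalg.norm(flow_fw + bw_at_target, axis=-1)
    return err < thresh

mask_ok = fb_consistency_mask(np.zeros((2, 2, 2)), np.zeros((2, 2, 2)))
# all True: zero flows are trivially consistent
```

Pixels failing the check are typically excluded from photometric losses, which is how such a mask "prevents the effects of the artifacts of the occluded regions" during training.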


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 400
Author(s):  
Sheng Lu ◽  
Zhaojie Luo ◽  
Feng Gao ◽  
Mingjie Liu ◽  
KyungHi Chang ◽  
...  

Lane detection is a significant technology for autonomous driving. In recent years, a number of lane detection methods have been proposed. However, fast and slim methods do not perform satisfactorily in sophisticated scenarios, and some robust methods are not fast enough. Consequently, we propose a fast and robust lane detection method that combines a semantic segmentation network and an optical flow estimation network. Specifically, the work is divided into three parts: lane segmentation, lane discrimination, and mapping. For lane segmentation, a robust semantic segmentation network segments key frames, and a fast and slim optical flow estimation network tracks non-key frames. In the second part, density-based spatial clustering of applications with noise (DBSCAN) is adopted to discriminate individual lanes. Finally, we propose a mapping method that maps lane pixels from the pixel coordinate system to the camera coordinate system and fits lane curves in the camera coordinate system, providing feedback for autonomous driving. Experimental results verify that the proposed method can speed up the robust semantic segmentation network by up to three times while the accuracy drops by at most 2%. In the best case, the fitted lane curve yields a feedback error of 3%.
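The DBSCAN step groups segmented lane pixels into individual lanes. Below is a minimal, self-contained sketch of the algorithm on toy lane pixels; in practice a library implementation (e.g. scikit-learn's `DBSCAN`) would be used, and the `eps`/`min_samples` values here are illustrative.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: grow clusters from core points (points with at
    least `min_samples` neighbours within `eps`, self included);
    unreached points keep the noise label -1."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_samples:
            continue                           # visited, or not a core point
        labels[i] = cluster
        stack = [i]                            # grow a new cluster from i
        while stack:
            j = stack.pop()
            if len(neighbours[j]) >= min_samples:   # expand only from cores
                for k in neighbours[j]:
                    if labels[k] == -1:
                        labels[k] = cluster
                        stack.append(k)
        cluster += 1
    return labels

# Two well-separated pixel groups: expect two clusters, no noise.
pts = [(0, 0), (0, 1), (0, 2), (10, 0), (10, 1), (10, 2)]
labels = dbscan(pts, eps=1.5, min_samples=2)
```

Each resulting cluster of pixels is then fitted with a lane curve in the camera coordinate system.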


Sensors ◽  
2020 ◽  
Vol 20 (14) ◽  
pp. 3855
Author(s):  
Konstantinos Karageorgos ◽  
Anastasios Dimou ◽  
Federico Alvarez ◽  
Petros Daras

In this paper, two novel and practical regularizing methods are proposed to improve existing neural network architectures for monocular optical flow estimation. The proposed methods aim to alleviate deficiencies of current methods, such as flow leakage across objects and motion consistency within rigid objects, by exploiting contextual information. More specifically, the first regularization method utilizes semantic information during the training process to explicitly regularize the produced optical flow field. The novelty of this method lies in the use of semantic segmentation masks to teach the network to implicitly identify the semantic edges of an object and better reason on the local motion flow. A novel loss function is introduced that takes into account the objects’ boundaries as derived from the semantic segmentation mask to selectively penalize motion inconsistency within an object. The method is architecture agnostic and can be integrated into any neural network without modifying or adding complexity at inference. The second regularization method adds spatial awareness to the input data of the network in order to improve training stability and efficiency. The coordinates of each pixel are used as an additional feature, breaking the invariance properties of the neural network architecture. The additional features are shown to implicitly regularize the optical flow estimation enforcing a consistent flow, while improving both the performance and the convergence time. Finally, the combination of both regularization methods further improves the performance of existing cutting edge architectures in a complementary way, both quantitatively and qualitatively, on popular flow estimation benchmark datasets.
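The second regularization, adding spatial awareness via pixel coordinates, amounts to concatenating normalised coordinate channels onto the feature tensor, in the spirit of CoordConv-style inputs. The helper below is a sketch with an assumed name and (C, H, W) layout, not the paper's exact implementation.

```python
import numpy as np

def add_coord_channels(feats):
    """Append normalised (x, y) pixel-coordinate channels to a
    (C, H, W) feature tensor so the network can condition on where
    each pixel is, breaking translation invariance."""
    C, H, W = feats.shape
    ys = np.linspace(-1.0, 1.0, H)[:, None].repeat(W, axis=1)  # row coords
    xs = np.linspace(-1.0, 1.0, W)[None, :].repeat(H, axis=0)  # col coords
    return np.concatenate([feats, xs[None], ys[None]], axis=0)

out = add_coord_channels(np.zeros((3, 4, 5)))   # shape (5, 4, 5)
```

Because the change is confined to the input features, it adds no inference-time complexity to the flow network itself, matching the abstract's claim.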


Author(s):  
Claudio S. Ravasio ◽  
Theodoros Pissas ◽  
Edward Bloch ◽  
Blanca Flores ◽  
Sepehr Jalali ◽  
...  

Abstract Purpose Sustained delivery of regenerative retinal therapies by robotic systems requires intra-operative tracking of the retinal fundus. We propose a supervised deep convolutional neural network to densely predict semantic segmentation and optical flow of the retina as mutually supportive tasks, implicitly inpainting retinal flow information missing due to occlusion by surgical tools. Methods As manual annotation of optical flow is infeasible, we propose a flexible algorithm for generation of large synthetic training datasets on the basis of given intra-operative retinal images. We evaluate optical flow estimation by tracking a grid and sparsely annotated ground truth points on a benchmark of challenging real intra-operative clips obtained from an extensive internally acquired dataset encompassing representative vitreoretinal surgical cases. Results The U-Net-based network trained on the synthetic dataset is shown to generalise well to the benchmark of real surgical videos. When used to track retinal points of interest, our flow estimation outperforms variational baseline methods on clips containing tool motions which occlude the points of interest, as is routinely observed in intra-operatively recorded surgery videos. Conclusions The results indicate that complex synthetic training datasets can be used to specifically guide optical flow estimation. Our proposed algorithm therefore lays the foundation for a robust system which can assist with intra-operative tracking of moving surgical targets even when occluded.

