DSTnet: Deformable Spatio-Temporal Convolutional Residual Network for Video Super-Resolution

Mathematics ◽  
2021 ◽  
Vol 9 (22) ◽  
pp. 2873
Author(s):  
Anusha Khan ◽  
Allah Bux Sargano ◽  
Zulfiqar Habib

Video super-resolution (VSR) aims at generating high-resolution (HR) video frames with plausible and temporally consistent details from their low-resolution (LR) counterparts and neighboring frames. The key challenge for VSR lies in the effective exploitation of intra-frame spatial relations and the temporal dependency between consecutive frames. Many existing techniques utilize spatial and temporal information separately and compensate motion via alignment; these methods cannot fully exploit the spatio-temporal information that significantly affects the quality of the resulting HR videos. In this work, a novel deformable spatio-temporal convolutional residual network (DSTnet) is proposed to overcome the issues of separate motion estimation and compensation methods for VSR. The proposed framework consists of 3D convolutional residual blocks decomposed into spatial and temporal (2+1)D streams. This decomposition utilizes the input video's spatial and temporal features simultaneously, without a separate motion estimation and compensation module. Furthermore, deformable convolution layers are used in the proposed model to enhance its motion-awareness capability. Our contribution is twofold: first, the proposed approach overcomes the challenges of modeling complex motions by efficiently using spatio-temporal information; second, the proposed model has fewer parameters to learn than state-of-the-art methods, making it a computationally lean and efficient framework for VSR. Experiments are conducted on the benchmark Vid4 dataset to evaluate the efficacy of the proposed approach. The results demonstrate that it achieves superior quantitative and qualitative performance compared to state-of-the-art methods.
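One way to see why the (2+1)D decomposition yields a leaner model is to count weights: a dense t×k×k 3D kernel is replaced by a 1×k×k spatial stream followed by a t×1×1 temporal stream. The sketch below is illustrative only; the channel sizes are assumptions, not values from the DSTnet paper.

```python
# Illustrative weight counts for a full 3D convolution versus its
# (2+1)D decomposition. Channel/kernel sizes here are assumptions.

def conv3d_params(c_in, c_out, t, k):
    # Weight count of a dense t x k x k 3D kernel (bias ignored).
    return c_in * c_out * t * k * k

def conv2plus1d_params(c_in, c_mid, c_out, t, k):
    # Spatial stream: 1 x k x k kernel mapping c_in -> c_mid channels.
    spatial = c_in * c_mid * k * k
    # Temporal stream: t x 1 x 1 kernel mapping c_mid -> c_out channels.
    temporal = c_mid * c_out * t
    return spatial + temporal

full = conv3d_params(64, 64, t=3, k=3)            # 110592 weights
lean = conv2plus1d_params(64, 64, 64, t=3, k=3)   # 49152 weights
print(full, lean)
```

With the intermediate width fixed at the input width, the decomposed block carries well under half the weights of the full 3D kernel, while also inserting an extra nonlinearity between the two streams.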


2018 ◽  
Vol 4 (9) ◽  
pp. 107 ◽  
Author(s):  
Mohib Ullah ◽  
Ahmed Mohammed ◽  
Faouzi Alaya Cheikh

Articulation modeling, feature extraction, and classification are the important components of pedestrian segmentation. Usually, these components are modeled independently from each other and then combined sequentially. However, this approach is prone to poor segmentation if any individual component is weakly designed. To cope with this problem, we propose a spatio-temporal convolutional neural network named PedNet which exploits temporal information for spatial segmentation. The backbone of PedNet consists of an encoder–decoder network for downsampling and upsampling the feature maps, respectively. The input to the network is a set of three frames and the output is a binary mask of the segmented regions in the middle frame. Unlike classical deep models where the convolution layers are followed by a fully connected layer for classification, PedNet is a Fully Convolutional Network (FCN). It is trained end-to-end and the segmentation is achieved without the need for any pre- or post-processing. The main characteristic of PedNet is its unique design: it performs segmentation on a frame-by-frame basis but uses temporal information from the previous and future frames to segment the pedestrian in the current frame. Moreover, to combine the low-level features with the high-level semantic information learned by the deeper layers, we use long-skip connections from the encoder to the decoder network and concatenate the output of the low-level layers with the higher-level layers. This approach helps obtain segmentation maps with sharp boundaries. To show the potential benefits of temporal information, we also visualized different layers of the network. The visualization showed that the network learned different information from the consecutive frames and then combined the information optimally to segment the middle frame.
We evaluated our approach on eight challenging datasets where humans are involved in different activities with severe articulation (football, road crossing, surveillance). On the widely used CamVid dataset, PedNet is evaluated against seven state-of-the-art methods. Performance is reported in terms of precision/recall, F1, F2, and mIoU. The qualitative and quantitative results show that PedNet achieves promising results against state-of-the-art methods, with substantial improvement in terms of all the performance metrics.
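The long-skip connection described above amounts to concatenating an encoder feature map, channel-wise, with the upsampled decoder output so that low-level detail survives into the later layers. A minimal sketch, with illustrative shapes that are not PedNet's actual configuration:

```python
import numpy as np

# Long-skip fusion sketch: concatenate encoder features with the
# nearest-neighbour-upsampled decoder features along the channel axis.
# All shapes are illustrative assumptions.

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

encoder_feat = np.random.rand(32, 64, 64)   # low-level features (skip path)
decoder_feat = np.random.rand(64, 32, 32)   # high-level features

fused = np.concatenate([encoder_feat, upsample2x(decoder_feat)], axis=0)
print(fused.shape)  # (96, 64, 64)
```

The concatenation (rather than addition) lets the subsequent convolutions learn how to weight fine detail against semantic context.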



Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2085
Author(s):  
Lei Han ◽  
Cien Fan ◽  
Ye Yang ◽  
Lian Zou

Recently, convolutional neural networks have achieved remarkable performance in video super-resolution. However, how to exploit the spatial and temporal information of video efficiently and effectively remains challenging. In this work, we design a bidirectional temporal-recurrent propagation unit. The bidirectional temporal-recurrent propagation unit allows temporal information to flow in an RNN-like manner from frame to frame, which avoids complex motion estimation modeling and motion compensation. To better fuse the information of the two temporal-recurrent propagation units, we use channel attention mechanisms. Additionally, we adopt progressive up-sampling instead of one-step up-sampling and find that it yields better experimental results. Extensive experiments show that our algorithm outperforms several recent state-of-the-art video super-resolution (VSR) methods with a smaller model size.
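Channel attention of the kind used to fuse the two propagation units can be sketched in squeeze-and-excitation style: global average pooling produces one descriptor per channel, a small bottleneck produces per-channel gates, and the feature map is rescaled. The weights below are random stand-ins for what would be learned; shapes are assumptions, not the paper's.

```python
import numpy as np

# Minimal squeeze-and-excitation style channel attention sketch.
# w1/w2 form a bottleneck (C -> C/4 -> C); weights are random here.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    # x: (C, H, W) feature map, e.g. from one propagation unit.
    s = x.mean(axis=(1, 2))                     # squeeze: (C,)
    a = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))   # excitation gates: (C,)
    return x * a[:, None, None]                 # rescale each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((4, 16))   # reduction to C/4
w2 = rng.standard_normal((16, 4))
y = channel_attention(x, w1, w2)
print(y.shape)  # (16, 8, 8)
```

Because the gates lie in (0, 1), the mechanism can only attenuate channels, letting the network emphasize whichever propagation direction carries more useful information for a given frame.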



2021 ◽  
Vol 11 (8) ◽  
pp. 3636
Author(s):  
Faria Zarin Subah ◽  
Kaushik Deb ◽  
Pranab Kumar Dhar ◽  
Takeshi Koshiba

Autism spectrum disorder (ASD) is a complex neuro-developmental disorder. Most existing methods utilize functional magnetic resonance imaging (fMRI) to detect ASD on very limited datasets, which provides high accuracy but results in poor generalization. To overcome this limitation and to enhance the performance of the automated autism diagnosis model, in this paper, we propose an ASD detection model using functional connectivity features of resting-state fMRI data. Our proposed model utilizes two commonly used brain atlases, Craddock 200 (CC200) and Automated Anatomical Labelling (AAL), and two less commonly used atlases, Bootstrap Analysis of Stable Clusters (BASC) and Power. A deep neural network (DNN) classifier is used to perform the classification task. Simulation results indicate that the proposed model outperforms state-of-the-art methods in terms of accuracy. The mean accuracy of the proposed model was 88%, whereas the mean accuracy of the state-of-the-art methods ranged from 67% to 85%. The sensitivity, F1-score, and area under the receiver operating characteristic curve (AUC) score of the proposed model were 90%, 87%, and 96%, respectively. Comparative analysis of various scoring strategies shows the superiority of the BASC atlas over the other aforementioned atlases in distinguishing ASD from controls.
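Functional connectivity features of the kind described above are typically built by correlating ROI time series under a chosen atlas and vectorizing the upper triangle of the correlation matrix. A minimal sketch, with an illustrative region count of 200 (as in CC200) and an assumed series length:

```python
import numpy as np

# Sketch of a functional-connectivity feature vector: Pearson
# correlations between ROI time series, upper triangle flattened.
# Time-series values here are random placeholders for real fMRI data.

rng = np.random.default_rng(42)
timeseries = rng.standard_normal((200, 150))  # 200 ROIs x 150 time points

conn = np.corrcoef(timeseries)                # (200, 200) correlation matrix
iu = np.triu_indices_from(conn, k=1)          # drop diagonal and duplicates
features = conn[iu]                           # flat vector for the classifier

print(features.shape)  # (19900,) = 200*199/2 entries
```

The resulting fixed-length vector is what a DNN classifier of the kind described would consume, one vector per subject.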



Author(s):  
Guoan Cheng ◽  
Ai Matsune ◽  
Huaijuan Zang ◽  
Toru Kurihara ◽  
Shu Zhan

In this paper, we propose an enhanced dual path attention network (EDPAN) for image super-resolution. ResNet excels at implicitly reusing extracted features, while DenseNet excels at exploring new ones. The Dual Path Network (DPN) combines ResNet and DenseNet to create a more accurate architecture than either alone. We experimentally show that the residual network performs best when each block consists of two convolutions, and the dense network performs best when each micro-block consists of one convolution. Following these ideas, our EDPAN exploits the advantages of both the residual structure and the dense structure. Besides, to allocate computation across features more effectively, we introduce an attention mechanism into EDPAN. Moreover, to reduce the parameter burden, we also utilize recursive learning to propose a lightweight model. In the experiments, we demonstrate the effectiveness and robustness of the proposed EDPAN under different degradation settings. The quantitative results and visual comparisons indicate that EDPAN achieves favorable performance over state-of-the-art frameworks.



2009 ◽  
Vol 42 (2) ◽  
pp. 267-282 ◽  
Author(s):  
W. Ren ◽  
S. Singh ◽  
M. Singh ◽  
Y.S. Zhu


Author(s):  
Liangchen Luo ◽  
Wenhao Huang ◽  
Qi Zeng ◽  
Zaiqing Nie ◽  
Xu Sun

Most existing works on dialog systems consider only conversation content while neglecting the personality of the user the bot is interacting with, which leaves several issues unsolved. In this paper, we present a personalized end-to-end model in an attempt to leverage personalization in goal-oriented dialogs. We first introduce a PROFILE MODEL which encodes user profiles into distributed embeddings and refers to conversation history from other similar users. Then a PREFERENCE MODEL captures user preferences over knowledge base entities to handle ambiguity in user requests. The two models are combined into the PERSONALIZED MEMN2N. Experiments show that the proposed model achieves qualitative performance improvements over state-of-the-art methods. In human evaluation, it also outperforms other approaches in terms of task completion rate and user satisfaction.



2015 ◽  
Vol 6 (1) ◽  
Author(s):  
Luca Lanzanò ◽  
Iván Coto Hernández ◽  
Marco Castello ◽  
Enrico Gratton ◽  
Alberto Diaspro ◽  
...  


2020 ◽  
Author(s):  
Jawad Khan

Activity recognition is a topic undergoing massive research in the field of computer vision. Applications of activity recognition include sports summaries, human-computer interaction, violence detection, surveillance, etc. In this paper, we propose a modification of the standard local binary patterns (LBP) descriptor to obtain a concatenated histogram of lower dimensionality. This encodes the spatial and temporal information of various actions happening in a frame. The method helps overcome the dimensionality problem that occurs with LBP, and the results show that the proposed method performs comparably with state-of-the-art methods.
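For context, the standard LBP descriptor that the paper modifies encodes each pixel by comparing its eight 3×3 neighbours against the centre and summarizing the resulting codes as a histogram. The block-wise concatenation and dimensionality reduction described above are specific to the paper and are not reproduced in this minimal sketch:

```python
import numpy as np

# Standard 3x3 LBP sketch: an 8-bit code per interior pixel, then a
# 256-bin histogram. The input frame is a random placeholder image.

def lbp_codes(img):
    # img: 2D grayscale array; returns codes for the interior pixels.
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= ((nb >= c).astype(np.uint8) << bit)
    return codes

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(32, 32))
hist, _ = np.histogram(lbp_codes(frame), bins=256, range=(0, 256))
print(hist.sum())  # one code per interior pixel: 30*30 = 900
```

The 256-bin histogram per block is exactly where the dimensionality problem arises once histograms from many blocks and frames are concatenated, which motivates the reduced descriptor the paper proposes.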



2020 ◽  
Vol 34 (07) ◽  
pp. 11966-11973
Author(s):  
Hao Shao ◽  
Shengju Qian ◽  
Yu Liu

For a long time, the vision community has tried to learn spatio-temporal representations by combining convolutional neural networks with various temporal models, such as Markov chains, optical flow, RNNs, and temporal convolutions. However, these pipelines consume enormous computing resources due to the alternating learning process for spatial and temporal information. One natural question is whether the temporal information can be embedded into the spatial information so that the two domains are learned jointly in a single pass. In this work, we answer this question by presenting a simple yet powerful operator: the temporal interlacing network (TIN). Instead of learning temporal features, TIN fuses the two kinds of information by interlacing spatial representations from the past to the future, and vice versa. A differentiable interlacing target can be learned to control the interlacing process. In this way, a heavy temporal model is replaced by a simple interlacing operator. We theoretically prove that with a learnable interlacing target, TIN performs equivalently to the regularized temporal convolution network (r-TCN), but gains 4% more accuracy with 6x less latency on 6 challenging benchmarks. These results push the state-of-the-art performance of video understanding by a considerable margin. Not surprisingly, the ensemble model of the proposed TIN won 1st place in the ICCV19 Multi Moments in Time challenge. Code is made available to facilitate further research.
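The interlacing idea can be illustrated with a fixed-shift simplification: one group of channels is shifted toward the past and another toward the future, so each frame's representation mixes its neighbours' features. TIN itself learns the shift offsets differentiably; the hard ±1 shift below is an assumption in the spirit of temporal-shift modules, not the paper's learned operator.

```python
import numpy as np

# Fixed-shift sketch of temporal interlacing on (T, C) per-frame
# features: shift C//frac channels backward and another C//frac forward.

def interlace(x, frac=4):
    t, c = x.shape
    g = c // frac
    out = x.copy()
    out[1:, :g] = x[:-1, :g]             # bring information from the past
    out[:-1, g:2 * g] = x[1:, g:2 * g]   # bring information from the future
    return out

x = np.arange(12, dtype=float).reshape(3, 4)  # 3 frames, 4 channels
print(interlace(x))
```

After the shift, an ordinary 2D spatial layer sees a mixture of adjacent frames, which is how the heavy temporal model gets replaced by a nearly free operator.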



2021 ◽  
Author(s):  
Marzieh Zare

Fusing a low spatial resolution hyperspectral image (HSI) with a high spatial resolution multispectral image (MSI) to produce a fused image of high spatio-spectral resolution, referred to as HSI super-resolution, has recently attracted increasing research interest. In this paper, a new method based on coupled non-negative tensor decomposition (CNTD) is proposed. The proposed method applies Tucker tensor factorization to the low resolution hyperspectral image (LR-HSI) and the high resolution multispectral image (HR-MSI) under the constraint of non-negative tensor decomposition (NTD). The conventional non-negative matrix factorization (NMF) approach essentially loses spatio-spectral joint structure information when unfolding 3D data into matrix form; in NMF-based methods, the spectral, spatial, or joint structure must be imposed externally as a constraint to make the problem well posed. In contrast, the proposed CNTD method inherently preserves the spatio-spectral joint structure of HSIs. In this paper, the NTD is imposed directly on the coupled tensor of the HSI and MSI. Hence, the intrinsic spatio-spectral joint structure of the HSI can be losslessly expressed and interdependently exploited. Furthermore, multilinear interactions between the different modes of the HSI can be exactly modeled by means of the core tensor of the Tucker decomposition. The proposed method is completely straightforward and easy to implement. Unlike other state-of-the-art methods, the complexity of the proposed CNTD method is linear in the size of the HSI cube. Experiments on two well-known datasets show that, compared with state-of-the-art methods, the proposed approach gives promising results with lower complexity.
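The Tucker model underlying the method represents the HSI cube as a compact core tensor multiplied along each mode by a factor matrix, so spatial and spectral structure stay coupled in a single object rather than being unfolded into a matrix. A minimal sketch of the reconstruction, with illustrative (not dataset-specific) dimensions:

```python
import numpy as np

# Tucker reconstruction sketch for an HSI cube:
# X[i,j,k] = sum_{p,q,r} core[p,q,r] * A[i,p] * B[j,q] * C[k,r].
# Factors are random placeholders; in CNTD they would be estimated
# jointly from the LR-HSI and HR-MSI under non-negativity constraints.

rng = np.random.default_rng(1)
core = rng.standard_normal((5, 5, 8))   # compact core tensor
A = rng.standard_normal((40, 5))        # mode-1 (height) factor
B = rng.standard_normal((40, 5))        # mode-2 (width) factor
C = rng.standard_normal((100, 8))       # mode-3 (spectral) factor

X = np.einsum('pqr,ip,jq,kr->ijk', core, A, B, C)
print(X.shape)  # (40, 40, 100)
```

The core tensor is what captures the multilinear interactions between modes that the abstract refers to: every spatial-spectral combination is weighted explicitly rather than being flattened away.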


