Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition

Sensors ◽  
2020 ◽  
Vol 20 (11) ◽  
pp. 3126
Author(s):  
Jianyu Chen ◽  
Jun Kong ◽  
Hui Sun ◽  
Hui Xu ◽  
Xiaoli Liu ◽  
...  

Action recognition is a significant and challenging topic in the fields of sensing and computer vision. Two-stream convolutional neural networks (CNNs) and 3D CNNs are the two mainstream deep learning architectures for video action recognition. To combine them into one framework and further improve performance, we propose a novel deep network named the spatiotemporal interaction residual network with pseudo3D (STINP). The STINP possesses three advantages. First, the STINP consists of two branches built on residual networks (ResNets) that simultaneously learn the spatial and temporal information of the video. Second, the STINP integrates the pseudo3D block into the residual units of the spatial branch, which ensures that this branch not only learns the appearance features of the objects and scene in the video, but also captures the potential interaction information among consecutive frames. Finally, the STINP adopts a simple but effective multiplication operation to fuse the spatial and temporal branches, which guarantees that the learned spatial and temporal representations interact with each other throughout training. Experiments were conducted on two classic action recognition datasets, UCF101 and HMDB51. The results show that the proposed STINP provides better performance for video action recognition than other state-of-the-art algorithms.
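The two ingredients named above, the pseudo3D block and the multiplicative fusion, can be illustrated concretely. Below is a minimal PyTorch sketch, assuming the common P3D factorization of a 3x3x3 convolution into a 2D spatial and a 1D temporal convolution; the layer sizes and the fuse helper are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn as nn

    class Pseudo3DUnit(nn.Module):
        """Factorizes a 3x3x3 convolution into a 2D spatial conv (1x3x3)
        followed by a 1D temporal conv (3x1x1), P3D-style."""
        def __init__(self, channels):
            super().__init__()
            self.spatial = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
            self.temporal = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
            self.bn1 = nn.BatchNorm3d(channels)
            self.bn2 = nn.BatchNorm3d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):                      # x: (N, C, T, H, W)
            out = self.relu(self.bn1(self.spatial(x)))
            out = self.relu(self.bn2(self.temporal(out)))
            return self.relu(x + out)              # residual connection

    # Multiplicative fusion as described in the abstract: the temporal
    # feature modulates the spatial feature element-wise.
    def fuse(spatial_feat, temporal_feat):
        return spatial_feat * temporal_feat

The element-wise product lets gradients from either branch flow through the other, which is what allows the two representations to interact during training rather than being combined only at the final classifier.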

Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 325
Author(s):  
Zhihao Wu ◽  
Baopeng Zhang ◽  
Tianchen Zhou ◽  
Yan Li ◽  
Jianping Fan

In this paper, we develop a practical approach for the automatic detection of discrimination actions in social images. Firstly, an image set is established in which various discrimination actions and relations are manually labeled. To the best of our knowledge, this is the first work to create a dataset for discrimination action recognition and relationship identification. Secondly, a practical approach is developed for automatically detecting and identifying discrimination actions and relationships in social images. Thirdly, the task of relationship identification is seamlessly integrated with the task of discrimination action recognition into a single network, the Co-operative Visual Translation Embedding++ network (CVTransE++). We compared our proposed method with numerous state-of-the-art methods, and our experimental results demonstrate that it significantly outperforms them.
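The network's name points to a visual translation embedding (VTransE-style) formulation, in which a relation acts as a translation between subject and object embeddings. The sketch below shows only that generic scoring idea; the class, projections, and dimensions are illustrative assumptions, not the CVTransE++ implementation.

    import torch
    import torch.nn as nn

    class TranslationEmbedding(nn.Module):
        """Scores (subject, predicate, object) triples with the
        TransE-style constraint: subj + pred should be close to obj."""
        def __init__(self, feat_dim, embed_dim, num_predicates):
            super().__init__()
            self.proj = nn.Linear(feat_dim, embed_dim)        # shared visual projection
            self.pred = nn.Embedding(num_predicates, embed_dim)

        def forward(self, subj_feat, obj_feat, pred_id):
            s = self.proj(subj_feat)
            o = self.proj(obj_feat)
            p = self.pred(pred_id)
            # smaller distance = more plausible relation
            return torch.norm(s + p - o, dim=-1)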


2019 ◽  
Vol 9 (12) ◽  
pp. 2535
Author(s):  
Di Fan ◽  
Hyunwoo Kim ◽  
Junmo Kim ◽ 
Yunhui Liu ◽  
Qiang Huang

Face attribute prediction has a growing number of applications in human–computer interaction, face verification, and video surveillance. Various studies show that dependencies exist among face attributes. A multi-task learning architecture can build synergy among correlated tasks through parameter sharing in the shared layers. However, most multi-task learning architectures ignore the dependencies between tasks in the task-specific layers. Thus, how to further boost the performance of individual tasks by exploiting task dependencies among face attributes remains challenging. In this paper, we propose a multi-task learning architecture that uses task dependencies for face attribute prediction, and we evaluate its performance on the tasks of smile and gender prediction. Attention modules designed into the task-specific layers of our architecture learn task-dependent disentangled representations. The experimental results demonstrate the effectiveness of the proposed network in comparison with a traditional multi-task learning architecture and state-of-the-art methods on the Faces of the World (FotW) and Labeled Faces in the Wild-a (LFWA) datasets.
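To make the idea of task-specific attention concrete, here is a minimal PyTorch sketch of a per-task channel-attention head on top of a shared backbone; the module names and sizes are illustrative assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class TaskAttentionHead(nn.Module):
        """Task-specific head: channel attention over shared features,
        letting each task emphasize the channels relevant to it."""
        def __init__(self, channels, num_classes):
            super().__init__()
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels, 1),
                nn.Sigmoid(),
            )
            self.classifier = nn.Linear(channels, num_classes)

        def forward(self, shared):                 # shared: (N, C, H, W)
            weighted = shared * self.attn(shared)  # reweight shared features
            pooled = weighted.mean(dim=(2, 3))     # global average pool
            return self.classifier(pooled)

    # Two heads on one shared backbone, e.g.:
    # smile_logits = smile_head(backbone(x)); gender_logits = gender_head(backbone(x))

Because each head applies its own attention to the same shared tensor, the tasks can specialize their representations while still sharing parameters, which is the disentangling effect the abstract describes.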


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ying Li ◽  
Hang Sun ◽  
Shiyao Feng ◽  
Qi Zhang ◽  
Siyu Han ◽  
...  

Abstract Background Long noncoding RNAs (lncRNAs) play important roles in multiple biological processes. Identifying lncRNA–protein interactions (LPIs) is key to understanding lncRNA functions. Although some computational methods for LPI prediction have been developed, the problem remains challenging. How to integrate multimodal features from more perspectives and to build deep learning architectures with better recognition performance has long been a focus of LPI research. Results We present Capsule-LPI, a novel multichannel capsule network framework that integrates multimodal features for LPI prediction. Capsule-LPI integrates four groups of multimodal features: sequence features, motif information, physicochemical properties, and secondary structure features. It is composed of four feature-learning subnetworks and one capsule subnetwork. Through comprehensive experimental comparisons and evaluations, we demonstrate that both the multimodal features and the multichannel capsule architecture significantly improve LPI prediction. The experimental results show that Capsule-LPI performs better than existing state-of-the-art tools: its precision is 87.3%, a 1.7% improvement, and its F-value is 92.2%, a 1.4% improvement. Conclusions This study provides a novel and feasible LPI prediction tool based on the integration of multimodal features and a capsule network. A web server (http://csbg-jlu.site/lpc/predict) has been developed for the convenience of users.
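The multichannel design can be sketched as one small subnetwork per feature group feeding a shared capsule layer. The following is a minimal PyTorch illustration with the standard capsule squashing function; dimensions and layer choices are assumptions, not Capsule-LPI's actual configuration.

    import torch
    import torch.nn as nn

    def squash(v, dim=-1, eps=1e-8):
        """Capsule squashing: keeps direction, maps length into [0, 1)."""
        norm2 = (v ** 2).sum(dim=dim, keepdim=True)
        return (norm2 / (1.0 + norm2)) * v / torch.sqrt(norm2 + eps)

    class MultichannelCapsuleFusion(nn.Module):
        """One subnetwork per feature group (sequence, motif,
        physicochemical, secondary structure); outputs are stacked
        into primary capsules and squashed."""
        def __init__(self, dims, caps_dim=16):
            super().__init__()
            self.channels = nn.ModuleList([
                nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, caps_dim))
                for d in dims
            ])

        def forward(self, feats):                  # feats: list of (N, d_i) tensors
            caps = torch.stack(
                [net(f) for net, f in zip(self.channels, feats)], dim=1)
            return squash(caps)                    # (N, num_groups, caps_dim)

Keeping each modality in its own channel until the capsule stage lets the downstream routing weigh the feature groups per example instead of mixing them in a single flat vector.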


Author(s):  
Jianhai Zhang ◽  
Zhiyong Feng ◽  
Yong Su ◽  
Meng Xing

Owing to the merits of high-order statistics and Riemannian geometry, the covariance matrix has become a generic feature representation for action recognition: an individual action can be represented by empirical statistics over all of its pose samples. Covariance has two major problems: (1) it is prone to being singular, so actions may fail to be represented properly, and (2) it lacks global action/pose-aware information, so its expressive and discriminative power is limited. In this article, we propose a novel Bayesian covariance representation with a prior regularization method to solve these problems. Specifically, covariance is viewed as the parametric maximum likelihood estimate of a Gaussian distribution over the local poses of an individual action. Then, a Global Informative Prior (GIP) with sufficient statistics is generated over global poses to regularize the covariance. In this way, (1) singularity is greatly alleviated thanks to the sufficient statistics, and (2) the global pose information of the GIP makes the Bayesian covariance theoretically equivalent to a saliency-weighted covariance over global action poses, so the discriminative characteristics of actions can be represented more clearly. Experimental results show that our Bayesian covariance with GIP effectively improves action recognition performance. On some databases, it outperforms state-of-the-art variants based on kernels, temporal-order structures, and saliency-weighting attention, among others.
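The prior-regularized estimate described here resembles a standard MAP covariance update that blends an empirical covariance with a global prior. A minimal numpy sketch under that assumption follows; the blend weight and function name are illustrative, not the paper's exact formulation.

    import numpy as np

    def regularized_covariance(poses, prior_cov, lam=10.0):
        """MAP-style covariance: blend the empirical covariance of one
        action's poses with a global prior covariance.
        poses: (N, D) local pose descriptors of a single action.
        prior_cov: (D, D) covariance from global pose statistics (the GIP role).
        lam: pseudo-count controlling the strength of the prior."""
        n = poses.shape[0]
        emp = np.cov(poses, rowvar=False, bias=True)   # empirical MLE covariance
        # convex blend: the prior dominates when n is small, the data when n is large
        return (n * emp + lam * prior_cov) / (n + lam)

    # Even when n < D (singular empirical covariance), the blended
    # estimate stays positive definite as long as prior_cov is.

This shows why the prior relieves singularity: the global term contributes full-rank mass to the estimate regardless of how few pose samples one action provides.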


2021 ◽  
Author(s):  
Shreya Mishra ◽  
Raghav Awasthi ◽  
Frank Papay ◽  
Kamal Maheshawari ◽  
Jacek B Cywinski ◽  
...  

Question answering (QA) is one of the oldest research areas in AI and computational linguistics. QA has seen significant progress with the development of state-of-the-art models and benchmark datasets over the last few years. However, pre-trained QA models perform poorly on clinical QA tasks, presumably due to the complexity of electronic healthcare data. With the digitization of healthcare data and the increasing volume of unstructured data, it is extremely important for healthcare providers to have a mechanism for querying the data to find appropriate answers. Since diagnosis is central to any decision-making for clinicians and patients, we have created a pipeline to develop diagnosis-specific QA datasets and curated a QA database for Cerebrovascular Accident (CVA). CVA, also commonly known as stroke, is an important and commonly occurring diagnosis among critically ill patients. Compared against clinician validation, our method achieved an accuracy of 0.90 (90% CI [0.82, 0.99]). Using our method, we hope to overcome the key challenges of building and validating a highly accurate QA dataset in a semi-automated manner, which can help improve the performance of QA models.


2018 ◽  
Vol 5 (2) ◽  
pp. 258-270
Author(s):  
Aris Budianto

Automatic License Plate Recognition (ALPR) has become a growing trend in transportation-system automation: a vehicle's license plate can be extracted without human intervention. Although the technology has been widely adopted in developed countries, developing countries remain far from implementing sophisticated image and video recognition, for several reasons. This paper discusses the challenges and possibilities of implementing Automatic License Plate Recognition under Indonesia's circumstances. Prior findings from the literature and the state of the art in automatic recognition technology are gathered for consideration in future research and practice.


2020 ◽  
Vol 34 (07) ◽  
pp. 10941-10948
Author(s):  
Fei He ◽  
Naiyu Gao ◽  
Qiaozhe Li ◽  
Senyao Du ◽  
Xin Zhao ◽  
...  

Video object detection is a challenging task because of appearance deterioration in certain video frames. One typical solution is to aggregate neighboring features to enhance per-frame appearance features. However, such a method ignores the temporal relations between the aggregated frames, which are critical for improving video recognition accuracy. To handle the appearance deterioration problem, this paper proposes a temporal context enhanced network (TCENet) that exploits temporal context information through temporal aggregation for video object detection. To handle the displacement of objects in videos, a novel DeformAlign module is proposed to align spatial features from frame to frame. Instead of adopting a fixed-length window fusion strategy, a temporal stride predictor is proposed to adaptively select video frames for aggregation, which facilitates exploiting variable temporal information and requires fewer video frames to achieve better results. Our TCENet achieves state-of-the-art performance on the ImageNet VID dataset with a faster runtime. Without bells and whistles, TCENet achieves 80.3% mAP by aggregating only 3 frames.
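The adaptive-stride aggregation can be sketched as a small predictor that chooses how far apart the aggregated frames lie. The PyTorch snippet below is a simplified stand-in: plain averaging replaces the paper's DeformAlign alignment and learned fusion, and all names are illustrative.

    import torch
    import torch.nn as nn

    class AdaptiveAggregator(nn.Module):
        """Aggregates features of 3 frames: the current frame plus one
        neighbor on each side, at a stride chosen per frame."""
        def __init__(self, channels, max_stride=4):
            super().__init__()
            self.max_stride = max_stride
            # predicts a stride from the current frame's pooled feature
            self.stride_head = nn.Linear(channels, max_stride)

        def forward(self, feats, t):               # feats: (T, C, H, W)
            pooled = feats[t].mean(dim=(1, 2))     # (C,)
            stride = int(self.stride_head(pooled).argmax()) + 1
            lo = max(t - stride, 0)
            hi = min(t + stride, feats.shape[0] - 1)
            # plain average stands in for feature alignment + weighted fusion
            return (feats[lo] + feats[t] + feats[hi]) / 3.0

Choosing the stride per frame is what lets the network reach further in time when motion is slow and stay local when it is fast, while still aggregating only 3 frames.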


2019 ◽  
Vol 277 ◽  
pp. 02034
Author(s):  
Sophie Aubry ◽  
Sohaib Laraba ◽  
Joëlle Tilmanne ◽  
Thierry Dutoit

In this paper, a methodology for recognizing actions from RGB videos is proposed that takes advantage of recent breakthroughs in deep learning. Following the development of Convolutional Neural Networks (CNNs), research was conducted on transforming skeletal motion data into 2D images. In this work, a solution is proposed that requires only RGB videos instead of RGB-D videos; it builds on multiple works studying the conversion of RGB-D data into 2D images. From a video stream (RGB images), a two-dimensional skeleton of 18 joints is extracted for each detected body with a DNN-based human pose estimator called OpenPose. The skeleton data are encoded into the red, green, and blue channels of images, and different ways of encoding motion data into images were studied. We successfully use state-of-the-art deep neural networks designed for image classification to recognize actions. Based on a study of related work, we chose the image classification models SqueezeNet, AlexNet, DenseNet, ResNet, Inception, and VGG, and retrained them to perform action recognition. For all tests, the NTU RGB+D database is used. The highest accuracy is obtained with ResNet: 83.317% cross-subject and 88.780% cross-view, which outperforms most state-of-the-art results.
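One common joints-to-image encoding, of the kind compared in this work, maps each joint's (x, y, confidence) triple to the red, green, and blue channels, with frames along one image axis. The numpy sketch below assumes that scheme; the normalization is illustrative, since the paper studies several encodings.

    import numpy as np

    def skeleton_to_image(seq):
        """Encode a skeleton sequence into an RGB image.
        seq: (T, 18, 3) array of (x, y, confidence) per joint per frame.
        Returns a (T, 18, 3) uint8 image: rows = frames, cols = joints."""
        img = np.zeros_like(seq, dtype=np.uint8)
        for c in range(3):                          # x -> R, y -> G, confidence -> B
            channel = seq[:, :, c]
            lo, hi = channel.min(), channel.max()
            img[:, :, c] = (255 * (channel - lo) / (hi - lo + 1e-8)).astype(np.uint8)
        return img   # resize and feed to an image classifier such as ResNet

Once motion is rendered this way, the temporal pattern of an action becomes a spatial texture, which is why off-the-shelf image classification networks can be retrained to recognize actions.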

