Semantic Adversarial Network with Multi-Scale Pyramid Attention for Video Classification

Author(s):  
De Xie ◽  
Cheng Deng ◽  
Hao Wang ◽  
Chao Li ◽  
Dapeng Tao

Two-stream architectures have shown strong performance on video classification tasks. The key idea is to learn spatiotemporal features by fusing convolutional networks spatially and temporally. However, such architectures have several problems. First, they rely on optical flow to model temporal information, which is often expensive to compute and store. Second, they have limited ability to capture details and local context information in video data. Third, they lack explicit semantic guidance, which greatly decreases classification performance. In this paper, we propose a new two-stream-based deep framework for video classification that discovers spatial and temporal information from RGB frames alone; moreover, a multi-scale pyramid attention (MPA) layer and a semantic adversarial learning (SAL) module are introduced and integrated into the framework. The MPA enables the network to capture global and local features and generate a comprehensive video representation, and the SAL makes this representation gradually approximate the real video semantics in an adversarial manner. Experimental results on two public benchmarks demonstrate that the proposed method achieves state-of-the-art results on standard video datasets.
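A minimal PyTorch sketch of a multi-scale pyramid attention layer in the spirit the abstract describes: features are pooled at several pyramid scales, per-scale attention maps re-weight the original features, and the results are fused. The module structure, scales, and layer sizes are illustrative assumptions, not the authors' exact MPA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePyramidAttention(nn.Module):
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One 1x1 conv per pyramid level to produce attention logits.
        self.att_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in scales
        )

    def forward(self, x):
        n, c, h, w = x.shape
        fused = 0
        for scale, conv in zip(self.scales, self.att_convs):
            # Pool to a coarser grid to capture context at this scale.
            pooled = F.adaptive_avg_pool2d(x, (max(h // scale, 1), max(w // scale, 1)))
            att = torch.sigmoid(conv(pooled))            # attention map at this scale
            att = F.interpolate(att, size=(h, w), mode="bilinear", align_corners=False)
            fused = fused + att * x                      # re-weight features
        return fused / len(self.scales)                  # average over scales

# Usage on a batch of frame features:
feats = torch.randn(2, 256, 14, 14)
out = MultiScalePyramidAttention(256)(feats)
print(out.shape)  # torch.Size([2, 256, 14, 14])
```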

2020 ◽  
Vol 10 (3) ◽  
pp. 966
Author(s):  
Zeyu Jiao ◽  
Guozhu Jia ◽  
Yingjie Cai

In this study, we consider fully automated action recognition based on deep learning in the industrial environment. In contrast to most existing methods, which rely on professional knowledge to construct complex hand-crafted features, or which use only basic deep-learning methods such as convolutional neural networks (CNNs) to extract information from images in the production process, we propose a novel and effective method that integrates multiple deep-learning networks, including CNNs, spatial transformer networks (STNs), and graph convolutional networks (GCNs), to process video data in industrial workflows. The proposed method extracts both spatial and temporal information from video data. The spatial information is extracted by estimating the human pose in each frame, yielding a skeleton image of the human body per frame. Multi-frame skeleton images are then processed by a GCN to obtain temporal information, so that action recognition results are predicted automatically. By training on a large human action dataset, Kinetics, we apply the proposed method to a real-world industrial environment and achieve superior performance compared with existing methods.
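A minimal sketch of the skeleton-to-GCN step the pipeline describes: per-frame joint coordinates are propagated over the body graph by a graph convolution. The five-joint toy graph and layer sizes are illustrative assumptions, not the paper's actual skeleton topology.

```python
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    def __init__(self, in_feats, out_feats, adj):
        super().__init__()
        # Normalized adjacency with self-loops: D^-1 (A + I).
        a = adj + torch.eye(adj.size(0))
        self.register_buffer("a_norm", a / a.sum(dim=1, keepdim=True))
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, x):                    # x: (batch, joints, in_feats)
        x = torch.matmul(self.a_norm, x)     # aggregate neighboring joints
        return torch.relu(self.linear(x))

# Toy 5-joint chain (head-torso-hips-knee-ankle) as the body graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj = torch.zeros(5, 5)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

layer = SkeletonGCNLayer(2, 16, adj)         # input: 2-D joint coordinates
frame_joints = torch.randn(8, 5, 2)          # batch of 8 frames
print(layer(frame_joints).shape)             # torch.Size([8, 5, 16])
```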


2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Zhang Min-qing ◽  
Li Wen-ping

There are many different types of sports training videos, and categorizing them can be difficult. This research therefore introduces an automatic video content classification system that makes managing large amounts of video data easier. It presents a video feature extraction approach built on a support vector machine (SVM) classification algorithm and a combination of dual-mode video and audio features. The system automates the classification of cartoons, advertisements, music, news, and sports videos, as well as the detection of terrorist and violent scenes in films. First, a new feature expression scheme, an MPEG-7 visual descriptor subcombination, is proposed based on an analysis of existing video classification algorithms and of the visual differences among the five video categories, with the goal of addressing the problems in those algorithms. The model extracts nine descriptors covering four characteristics, namely color, texture, shape, and motion, yielding a new overall visual feature with good results. The results suggest that the algorithm improves video segmentation by highlighting differences in feature selection between different categories of video. Second, the SVM's multi-class video classification performance is improved by an enhanced secondary prediction method. Finally, a comparison experiment with current related algorithms was conducted. The proposed method outperformed the alternatives in classification accuracy across the five video types, as well as in the recognition of terrorist and violent scenes.
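A minimal sketch of the SVM classification stage, assuming the nine MPEG-7-style descriptors have already been extracted and concatenated into one feature vector per video; the data here is synthetic and the 90-dimensional feature size is a stand-in.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 90))          # 500 videos x concatenated descriptors
y = rng.integers(0, 5, size=500)        # 5 classes: cartoon/ad/music/news/sports

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# RBF-kernel SVM with feature standardization; one-vs-one multiclass
# handling is scikit-learn's default for SVC.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```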


Sensors ◽  
2020 ◽  
Vol 20 (10) ◽  
pp. 2907 ◽  
Author(s):  
Chih-Yang Lin ◽  
Yi-Cheng Chiu ◽  
Hui-Fuang Ng ◽  
Timothy K. Shih ◽  
Kuan-Hung Lin

Semantic segmentation of street view images is an important step in scene understanding for autonomous vehicle systems. Recent works have made significant progress in pixel-level labeling using the Fully Convolutional Network (FCN) framework and local multi-scale context information. Rich global context information is also essential in the segmentation process, but a systematic way to utilize both global and local contextual information in a single network has not been fully investigated. In this paper, we propose a global-and-local network architecture (GLNet) that incorporates global spatial information and dense local multi-scale context information to model the relationships between objects in a scene, thus reducing segmentation errors. A channel attention module is designed to further refine the segmentation results using low-level features from the feature map. Experimental results demonstrate that the proposed GLNet achieves 80.8% test accuracy on the Cityscapes test dataset, comparing favorably with existing state-of-the-art methods.
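A minimal sketch of a channel attention module of the kind described, using squeeze-and-excitation-style channel re-weighting; GLNet's exact design may differ, so treat this as an illustrative assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                  # global average pool -> (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # channel-wise re-weighting

low_level = torch.randn(1, 64, 128, 256)        # low-level feature map
print(ChannelAttention(64)(low_level).shape)    # torch.Size([1, 64, 128, 256])
```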


Sensors ◽  
2021 ◽  
Vol 21 (13) ◽  
pp. 4365
Author(s):  
Kwangyong Jung ◽  
Jae-In Lee ◽  
Nammoon Kim ◽  
Sunjin Oh ◽  
Dong-Wook Seo

Radar target classification is an important task in missile defense systems. State-of-the-art studies using the micro-Doppler frequency have been conducted to classify space object targets. However, existing studies rely heavily on feature extraction methods, so the generalization performance of the classifier is limited and there is room for improvement. Recently, popular approaches for improving classification performance have been to build a convolutional neural network (CNN) architecture with the help of transfer learning and to use a generative adversarial network (GAN) to enlarge the training datasets. However, these methods still have drawbacks. First, they use only one feature to train the network, so they cannot guarantee that the classifier learns robust target characteristics. Second, it is difficult to obtain large amounts of data that accurately mimic real-world target features by performing data augmentation via a GAN instead of simulation. To mitigate these problems, we propose a transfer learning-based parallel network that takes the spectrogram and the cadence velocity diagram (CVD) as inputs. In addition, we build an EM simulation-based dataset: the radar-received signal is simulated under a variety of dynamics using the concept of shooting and bouncing rays with relative aspect angles, rather than the scattering center reconstruction method. Evaluated on our generated dataset, the proposed method achieved about 0.01 to 0.39% higher accuracy than pre-trained networks with a single input feature.
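A minimal sketch of a two-branch parallel classifier taking a spectrogram and a CVD as inputs and fusing them by concatenation; the small convolutional branches here are stand-ins for the pre-trained backbones used in the paper.

```python
import torch
import torch.nn as nn

def branch(out_dim=64):
    # Tiny CNN stand-in for a transfer-learned backbone.
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, out_dim),
    )

class ParallelRadarNet(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.spec_branch = branch()               # processes the spectrogram
        self.cvd_branch = branch()                # processes the CVD
        self.head = nn.Linear(128, num_classes)   # fuse by concatenation

    def forward(self, spec, cvd):
        z = torch.cat([self.spec_branch(spec), self.cvd_branch(cvd)], dim=1)
        return self.head(z)

spec = torch.randn(2, 1, 128, 128)
cvd = torch.randn(2, 1, 128, 128)
print(ParallelRadarNet()(spec, cvd).shape)  # torch.Size([2, 4])
```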


2021 ◽  
Vol 2021 (1) ◽  
Author(s):  
Wenyi Wang ◽  
Jun Hu ◽  
Xiaohong Liu ◽  
Jiying Zhao ◽  
Jianwen Chen

In this paper, we propose a hybrid super-resolution method that combines global and local dictionary training in the sparse domain. To represent and differentiate the feature mapping at different scales, a global dictionary set is trained at multiple structure scales, and a non-linear function is used to choose the appropriate dictionary to initially reconstruct the high-resolution (HR) image. In addition, we introduce Gaussian blur to the low-resolution (LR) images to eliminate a widely used but inappropriate assumption that LR images are generated from HR images by bicubic interpolation. To deal with the Gaussian blur, a local dictionary is generated and iteratively updated by K-means principal component analysis (K-PCA) and gradient descent (GD) to model the blur effect during down-sampling. Compared with state-of-the-art SR algorithms, the experimental results reveal that the proposed method produces sharper boundaries and suppresses undesired artifacts in the presence of Gaussian blur. This implies that our method could be more effective in real applications and that the HR-LR mapping relation is more complicated than bicubic interpolation.
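A minimal sketch of the degradation model the paper argues for: a low-resolution image is formed by Gaussian blur followed by down-sampling, rather than by bicubic interpolation. The blur sigma and scale factor are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale=2, sigma=1.2):
    """Blur the HR image with a Gaussian kernel, then subsample."""
    blurred = gaussian_filter(hr, sigma=sigma)
    return blurred[::scale, ::scale]

hr = np.random.rand(64, 64)          # stand-in high-resolution patch
lr = degrade(hr)
print(hr.shape, "->", lr.shape)      # (64, 64) -> (32, 32)
```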


Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 653
Author(s):  
Ruihua Zhang ◽  
Fan Yang ◽  
Yan Luo ◽  
Jianyi Liu ◽  
Jinbin Li ◽  
...  

Thorax disease classification is a challenging task due to complex pathologies and subtle texture changes, and it has been extensively studied for years, largely because of its wide application in computer-aided diagnosis. Most existing methods learn global feature representations directly from whole chest X-ray (CXR) images without considering in depth the richer visual cues lying around informative local regions. These methods therefore often produce sub-optimal classification performance because they ignore the highly informative pathological changes around organs. In this paper, we propose a novel Part-Aware Mask-Guided Attention Network (PMGAN) that learns complementary global and local feature representations from the all-organ region and multiple single-organ regions simultaneously for thorax disease classification. Specifically, multiple soft attention modules are designed to progressively guide feature learning toward the globally informative regions of the whole CXR image. A mask-guided attention module is designed to further search for informative regions and visual cues within the all-organ or single-organ images, where attention is regularized by automatically generated organ masks without introducing extra computation at the inference stage. In addition, a multi-task learning strategy is designed to effectively maximize the learning of complementary local and global representations. The proposed PMGAN has been evaluated on the ChestX-ray14 dataset, and the experimental results demonstrate its superior thorax disease classification performance against state-of-the-art methods.
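A minimal sketch of mask-guided attention in the spirit described above: a soft attention map re-weights the features and, during training only, is regularized toward an automatically generated organ mask by an auxiliary loss, so inference adds no extra computation. Shapes and the loss form are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.att_conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats, organ_mask=None):
        att = torch.sigmoid(self.att_conv(feats))        # (N, 1, H, W)
        out = feats * att                                # attended features
        if organ_mask is not None:                       # training only
            # Pull attention toward the organ mask; at inference the mask
            # branch is skipped, adding no extra computation.
            reg_loss = F.binary_cross_entropy(att, organ_mask)
            return out, reg_loss
        return out, None

feats = torch.randn(2, 128, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()
out, loss = MaskGuidedAttention(128)(feats, mask)
print(out.shape, loss.item())
```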


2018 ◽  
Vol 10 (8) ◽  
pp. 80
Author(s):  
Lei Zhang ◽  
Xiaoli Zhi

Convolutional neural networks (CNNs) have made great progress in face detection. They mostly take computation-intensive networks as the backbone in order to obtain high precision, and they cannot achieve a good detection speed without the support of high-performance GPUs (Graphics Processing Units). This limits CNN-based face detection algorithms in real applications, especially speed-dependent ones. To alleviate this problem, we propose a lightweight face detector that takes a fast residual network as its backbone. Our method can run fast even on cheap, ordinary GPUs. To guarantee detection precision, multi-scale features and multi-context information are fully exploited in efficient ways. Specifically, feature fusion is first used to obtain semantically strong multi-scale features. Then multi-context, including both local and global context, is added to these multi-scale features without extra computational burden: the local context through a depthwise separable convolution-based approach, and the global context through simple global average pooling. Experimental results show that our method runs at about 110 fps on VGA (Video Graphics Array)-resolution images while maintaining competitive precision on the WIDER FACE and FDDB (Face Detection Data Set and Benchmark) datasets compared with state-of-the-art counterparts.
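A minimal sketch of the context-enrichment step described above: local context is added via a depthwise separable convolution and global context via global average pooling broadcast back over the feature map. Layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextEnrichment(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Depthwise separable conv: per-channel 3x3 followed by 1x1 pointwise.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        local_ctx = self.local(x)
        global_ctx = x.mean(dim=(2, 3), keepdim=True)   # global average pooling
        return x + local_ctx + global_ctx               # fuse both contexts

x = torch.randn(1, 64, 40, 40)
print(ContextEnrichment(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```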


Author(s):  
Peng Han ◽  
Shuo Shang ◽  
Aixin Sun ◽  
Peilin Zhao ◽  
Kai Zheng ◽  
...  
