Dependency Exploitation: A Unified CNN-RNN Approach for Visual Emotion Recognition

Author(s):  
Xinge Zhu ◽  
Liang Li ◽  
Weigang Zhang ◽  
Tianrong Rao ◽  
Min Xu ◽  
...  

Visual emotion recognition aims to associate images with appropriate emotions. Different visual stimuli, from low-level to high-level, can affect human emotion, such as color, texture, parts, and objects. However, most existing methods treat different levels of features as independent entities and lack an effective method for feature fusion. In this paper, we propose a unified CNN-RNN model that predicts emotion from features fused across different levels by exploiting the dependencies among them. Our architecture leverages a convolutional neural network (CNN) with multiple layers to extract different levels of features within a multi-task learning framework, in which two related loss functions are introduced to learn the feature representation. Considering the dependencies between the low-level and high-level features, a new bidirectional recurrent neural network (RNN) is proposed to integrate the learned features from different layers of the CNN model. Extensive experiments on both Internet image and art photo datasets demonstrate that our method outperforms the state-of-the-art methods by at least 7%.
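As a rough illustration of the fusion idea, the sketch below treats feature maps pooled from successive CNN stages as a low-to-high-level sequence and runs a bidirectional GRU over it; the VGG16 backbone, stage boundaries, and dimensions are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiLevelCnnRnn(nn.Module):
    """Sketch: pool features from several CNN stages, feed the
    low-to-high-level sequence through a bidirectional GRU, and
    classify the fused representation. Sizes are illustrative."""
    def __init__(self, num_emotions=8, hidden=256):
        super().__init__()
        vgg = models.vgg16(weights=None).features
        # Split VGG16 into stages so each stage yields one "level" of features.
        self.stages = nn.ModuleList(
            [vgg[:5], vgg[5:10], vgg[10:17], vgg[17:24], vgg[24:]])
        self.pool = nn.AdaptiveAvgPool2d(1)
        dims = [64, 128, 256, 512, 512]
        # Project every level to a common size so they form a valid RNN sequence.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.rnn = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_emotions)

    def forward(self, x):
        feats = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            feats.append(proj(self.pool(x).flatten(1)))
        seq = torch.stack(feats, dim=1)   # (batch, levels, hidden)
        out, _ = self.rnn(seq)            # models dependencies across levels
        return self.head(out[:, -1])      # emotion logits

logits = MultiLevelCnnRnn()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 8])
```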

2018 ◽  
Vol 8 (12) ◽  
pp. 2367 ◽  
Author(s):  
Hongling Luo ◽  
Jun Sang ◽  
Weiqun Wu ◽  
Hong Xiang ◽  
Zhili Xiang ◽  
...  

In recent years, trampling events due to overcrowding have occurred frequently, creating a demand for crowd counting in high-density environments. At present, there are few studies on monitoring crowds in large-scale crowded environments, and existing technology suffers from drawbacks and a lack of mature systems. To solve the high-density crowd counting problem in complex environments, a feature-fusion-based deep convolutional neural network method, FF-CNN (Feature Fusion of Convolutional Neural Network), was proposed in this paper. The proposed FF-CNN maps a crowd image to its crowd density map and then obtains the head count by integration. Geometry-adaptive kernels were adopted to generate the high-quality density maps used as ground truth for network training. The deconvolution technique was used to fuse high-level and low-level features into richer features, and two loss functions, i.e., density map loss and absolute count loss, were used for joint optimization. To increase sample diversity, the original images were randomly cropped at each iteration. The experimental results of FF-CNN on the ShanghaiTech public dataset showed that fusing low-level and high-level features extracts richer features, improving the precision of density map estimation and, in turn, the accuracy of crowd counting.
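A minimal sketch of the joint objective described above: a pixel-wise density map loss plus an absolute count loss obtained by integrating (summing) the density maps. The `count_weight` balancing factor is an assumption, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def joint_crowd_loss(pred_density, gt_density, count_weight=0.1):
    """Two-term objective: density map loss plus absolute count loss.
    count_weight is an assumed balancing factor."""
    density_loss = F.mse_loss(pred_density, gt_density)
    # Integrating (summing) each density map yields the head count.
    pred_count = pred_density.sum(dim=(1, 2, 3))
    gt_count = gt_density.sum(dim=(1, 2, 3))
    count_loss = F.l1_loss(pred_count, gt_count)
    return density_loss + count_weight * count_loss

pred = torch.rand(4, 1, 96, 96, requires_grad=True)  # predicted density maps
gt = torch.rand(4, 1, 96, 96)                        # ground-truth density maps
print(joint_crowd_loss(pred, gt))
```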


2019 ◽  
Vol 9 (20) ◽  
pp. 4209 ◽  
Author(s):  
Yongmei Ren ◽  
Jie Yang ◽  
Qingnian Zhang ◽  
Zhiqiang Guo

The appearance of ships is easily affected by external factors such as illumination, weather conditions, and sea state, which makes ship classification a challenging task. To improve ship classification performance, this study proposes a ship classification method based on multi-feature fusion with a convolutional neural network (CNN). First, an improved CNN with shallow layers and few parameters is proposed to learn high-level features and capture structural information. Second, handcrafted histogram of oriented gradients (HOG) and local binary pattern (LBP) features are combined with the high-level features extracted by the improved CNN in the last fully connected layer to obtain a discriminative feature representation; the handcrafted features supplement the edge and spatial texture information of the ship images. Then, the Softmax function is used to classify the different types of ships in the output layer. The effectiveness of the proposed method is evaluated on two datasets: one self-built and the other publicly available, called visible and infrared spectrums (VAIS). The proposed method achieved average classification accuracies of 97.50% and 93.60%, respectively, on these datasets. Additionally, results in terms of the F1-score and confusion matrix show the proposed method to be superior to some state-of-the-art methods.
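The fusion step can be sketched as follows: handcrafted HOG and LBP descriptors are concatenated with CNN features in the last fully connected layer before Softmax. The descriptor parameters and dimensions are illustrative, not the paper's exact settings.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import hog, local_binary_pattern

def handcrafted_features(gray_img):
    """HOG + LBP-histogram descriptor for one grayscale ship image."""
    hog_vec = hog(gray_img, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2))
    lbp = local_binary_pattern(gray_img, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([hog_vec, lbp_hist]).astype(np.float32)

class FusionHead(nn.Module):
    """Concatenate CNN features with the handcrafted vector in the
    last fully connected layer, then classify with Softmax."""
    def __init__(self, cnn_dim, hand_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(cnn_dim + hand_dim, num_classes)

    def forward(self, cnn_feat, hand_feat):
        fused = torch.cat([cnn_feat, hand_feat], dim=1)
        return torch.softmax(self.fc(fused), dim=1)

img = np.random.rand(128, 128)              # stand-in grayscale ship image
hand = handcrafted_features(img)
head = FusionHead(cnn_dim=512, hand_dim=hand.shape[0], num_classes=6)
probs = head(torch.randn(1, 512), torch.from_numpy(hand).unsqueeze(0))
print(probs.shape)  # torch.Size([1, 6])
```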


2021 ◽  
Vol 11 (3) ◽  
pp. 1223
Author(s):  
Ilshat Khasanshin

This work aimed to automate the measurement of boxers' punch speed during shadow boxing using inertial measurement units (IMUs) and an artificial neural network (ANN). In boxing, effective development of an athlete requires constant monitoring of punch speed. However, even with modern means of measuring kinematic parameters, it is necessary to record the circumstances under which each punch was performed: the type of punch (jab, cross, hook, or uppercut) and the type of activity (shadow boxing, single punch, or series of punches). Therefore, to eliminate errors and accelerate, i.e., automate, the measurement process, the use of an ANN in the form of a multilayer perceptron (MLP) is proposed. During the experiments, IMUs were installed on the boxers' wrists. The input parameters of the ANN were the absolute acceleration and angular velocity. The experiment was conducted with three groups of boxers with different levels of training. The developed model showed a high level of punch recognition for all groups, and it can be concluded that the ANN significantly accelerates the collection of data on the kinetic characteristics of boxers' punches and allows this process to be automated.
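A minimal sketch of the proposed setup: an MLP maps a window of wrist-IMU readings (absolute acceleration and angular velocity) to one of the four punch types. The window length and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

PUNCHES = ["jab", "cross", "hook", "uppercut"]
WINDOW = 50      # assumed samples per punch window
CHANNELS = 2     # |acceleration|, |angular velocity|

# MLP classifier over a flattened IMU window; sizes are illustrative.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(WINDOW * CHANNELS, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, len(PUNCHES)),
)

imu_window = torch.randn(1, WINDOW, CHANNELS)   # one wrist-IMU window
print(PUNCHES[mlp(imu_window).argmax(dim=1).item()])
```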


2018 ◽  
Vol 2018 ◽  
pp. 1-11 ◽  
Author(s):  
Hai Wang ◽  
Lei Dai ◽  
Yingfeng Cai ◽  
Long Chen ◽  
Yong Zhang

Traditional salient object detection models are divided into several classes based on low-level features and contrast between pixels. In this paper, we propose a model based on a multilevel deep pyramid (MLDP), which fuses multiple features at different levels. First, the MLDP feeds the original image into a VGG16 model to extract high-level features and form an initial saliency map. Next, the MLDP further extracts high-level features to form a saliency map based on a deep pyramid. Then, the MLDP extracts low-level features to obtain a saliency map fused with superpixels. After that, the MLDP applies background noise filtering to this superpixel-fused saliency map to filter out interference from background noise and form a foreground-based saliency map. Lastly, the MLDP combines the superpixel-fused saliency map with the foreground-based saliency map to produce the final saliency map. Because the MLDP is not limited to low-level features but fuses multiple features, it achieves good results when extracting salient targets. As shown in our experiment section, the MLDP outperforms seven other state-of-the-art models across three public saliency datasets, demonstrating its superiority and wide applicability in the extraction of salient targets.
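The last two stages can be sketched as simple map operations: thresholding as a crude stand-in for the background noise filtering, and a weighted combination as the final fusion. Both the threshold and the mixing weight `alpha` are assumed values, not taken from the paper.

```python
import torch

def filter_background(sal_map, thresh=0.1):
    """Crude background noise filtering: suppress low-confidence responses."""
    return torch.where(sal_map > thresh, sal_map, torch.zeros_like(sal_map))

def combine_saliency(sal_superpixel, sal_foreground, alpha=0.5):
    """Final fusion step: weighted combination of the superpixel-fused map
    and the foreground-based map, renormalized to [0, 1]."""
    fused = alpha * sal_superpixel + (1 - alpha) * sal_foreground
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)

sp_map = torch.rand(224, 224)       # superpixel-fused saliency map
fg_map = filter_background(sp_map)  # foreground-based saliency map
final_map = combine_saliency(sp_map, fg_map)
print(final_map.shape)  # torch.Size([224, 224])
```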


PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0250782
Author(s):  
Bin Wang ◽  
Bin Xu

With the rapid development of unmanned aerial vehicles, vehicle detection in aerial images plays an important role in many applications. Compared with general object detection problems, vehicle detection in aerial images remains a challenging research topic, since it is plagued by unique factors, e.g., varying camera angles, small vehicle sizes, and complex backgrounds. In this paper, a feature fusion deep-projection convolutional neural network is proposed to enhance the ability to detect small vehicles in aerial images. The backbone of the proposed framework utilizes a novel residual block, named the stepwise res-block, to explore high-level semantic features while conserving low-level detail features. A specially designed feature fusion module is adopted to further balance the features obtained from different levels of the backbone. A deep-projection deconvolution module is used to minimize the information contamination introduced by down-sampling/up-sampling processes. The proposed framework has been evaluated on the UCAS-AOD, VEDAI, and DOTA datasets. According to the evaluation results, it outperforms other state-of-the-art vehicle detection algorithms for aerial images.
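One plausible reading of the fusion module, sketched below: a low-level (detail) feature map and an up-sampled high-level (semantic) map are projected to a common width and mixed by convolution. This is an illustrative design, not the paper's exact stepwise res-block or fusion module.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of balancing a low-level (detail) map with an up-sampled
    high-level (semantic) map; an illustrative design, not the paper's."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, out_ch, 1)
        self.reduce_high = nn.Conv2d(high_ch, out_ch, 1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.mix = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, low, high):
        high = self.up(self.reduce_high(high))          # match spatial size
        fused = torch.cat([self.reduce_low(low), high], dim=1)
        return self.mix(fused)

low = torch.randn(1, 64, 64, 64)    # fine details, small vehicles
high = torch.randn(1, 256, 32, 32)  # coarse semantics
print(FeatureFusion(64, 256, 128)(low, high).shape)  # (1, 128, 64, 64)
```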


2020 ◽  
Author(s):  
Haider Al-Tahan ◽  
Yalda Mohsenzadeh

Abstract
While vision evokes a dense network of feedforward and feedback neural processes in the brain, visual processes are primarily modeled with feedforward hierarchical neural networks, leaving the computational role of feedback processes poorly understood. Here, we developed a generative autoencoder neural network model and adversarially trained it on a categorically diverse data set of images. We hypothesized that the feedback processes in the ventral visual pathway can be represented by reconstruction of the visual information performed by the generative model. We compared the representational similarity of the activity patterns in the proposed model with temporal (magnetoencephalography) and spatial (functional magnetic resonance imaging) visual brain responses. The proposed generative model identified two segregated neural dynamics in the visual brain: a temporal hierarchy of processes transforming low-level visual information into high-level semantics in the feedforward sweep, and a temporally later dynamics of inverse processes reconstructing low-level visual information from a high-level latent representation in the feedback sweep. Our results add to previous studies on neural feedback processes by presenting new insight into the algorithmic function of, and the information carried by, the feedback processes in the ventral visual pathway.

Author summary
It has been shown that the ventral visual cortex consists of a dense network of regions with feedforward and feedback connections. The feedforward path processes visual inputs along a hierarchy of cortical areas that starts in early visual cortex (an area tuned to low-level features, e.g., edges and corners) and ends in inferior temporal cortex (an area that responds to higher-level categorical contents, e.g., faces and objects). The feedback connections, in turn, modulate neuronal responses in this hierarchy by broadcasting information from higher to lower areas. In recent years, deep neural network models trained on object recognition tasks have achieved human-level performance and shown activation patterns similar to those of the visual brain. In this work, we developed a generative neural network model that consists of encoding and decoding sub-networks. By comparing this computational model with human brain temporal (magnetoencephalography) and spatial (functional magnetic resonance imaging) response patterns, we found that the encoder processes resemble the brain's feedforward processing dynamics and the decoder shares similarity with the brain's feedback processing dynamics. These results provide algorithmic insight into the spatiotemporal dynamics of feedforward and feedback processes in biological vision.
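The model-brain comparison uses representational similarity analysis, which can be sketched in a few lines: build representational dissimilarity matrices (RDMs) for model activations and brain responses over the same stimuli, then correlate them. The stimulus and feature counts below are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(activations):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between activation patterns for every pair of stimuli (condensed form)."""
    return pdist(activations, metric="correlation")

def rsa(model_acts, brain_acts):
    """Spearman correlation between model and brain RDMs, the standard
    comparison in representational similarity analysis."""
    return spearmanr(rdm(model_acts), rdm(brain_acts)).correlation

model_layer = np.random.rand(92, 512)  # 92 stimuli x model units (illustrative)
meg_window = np.random.rand(92, 306)   # 92 stimuli x MEG sensors (illustrative)
print(rsa(model_layer, meg_window))
```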


Author(s):  
Guoliang Fan ◽  
Yi Ding

Semantic event detection is an active and interesting research topic in the field of video mining. The major challenge is the semantic gap between low-level features and high-level semantics. In this chapter, we advance a new sports video mining framework in which a hybrid generative-discriminative approach is used for event detection. Specifically, we propose a three-layer semantic space by which event detection is converted into two inter-related statistical inference procedures that involve semantic analysis at different levels. The first infers mid-level semantic structures from low-level visual features via generative models; these structures serve as building blocks for high-level semantic analysis. The second detects high-level semantics, which are of direct interest to users, from the mid-level semantic structures using discriminative models. In this framework we can explicitly represent and detect semantics at different levels. The use of generative and discriminative approaches at the two stages proves effective and appropriate for event detection in sports video. Experimental results on a set of American football video data demonstrate that the proposed framework offers promising results compared with traditional approaches.
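A toy sketch of the two-stage inference, using an HMM as a stand-in generative model and an SVM as the stand-in discriminative model (the chapter names the approach classes, not these specific estimators): mid-level structure labels are inferred from low-level features, then clip-level structure histograms are classified into high-level events.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.svm import SVC

# Stage 1 (generative): infer mid-level semantic structures from
# frame-wise low-level visual features. Sizes are illustrative.
low_level = np.random.rand(500, 8)                 # 500 frames x 8 features
hmm = GaussianHMM(n_components=3).fit(low_level)   # 3 mid-level structures
mid_level = hmm.predict(low_level)                 # structure label per frame

# Stage 2 (discriminative): detect high-level events from structure
# statistics, here a per-clip histogram of mid-level labels.
def clip_histogram(labels, n_states=3):
    return np.bincount(labels, minlength=n_states) / len(labels)

X = np.stack([clip_histogram(mid_level[i:i + 100]) for i in range(0, 500, 100)])
y = np.array([0, 1, 0, 1, 0])                      # illustrative event labels
svm = SVC().fit(X, y)
print(svm.predict(X))
```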


Sensors ◽  
2020 ◽  
Vol 20 (14) ◽  
pp. 4021 ◽  
Author(s):  
Mustansar Fiaz ◽  
Arif Mahmood ◽  
Soon Ki Jung

We propose to improve visual object tracking by introducing a soft-mask-based low-level feature fusion technique, further strengthened by integrating channel and spatial attention mechanisms. The proposed approach is integrated within a Siamese framework to demonstrate its effectiveness for visual object tracking. The soft mask gives more importance to target regions than to other regions, enabling effective target feature representation and increasing discriminative power. The low-level feature fusion improves the tracker's robustness against distractors. Channel attention identifies the more discriminative channels for better target representation, while spatial attention complements the soft-mask-based approach to better localize target objects in challenging tracking scenarios. We evaluated our approach on five publicly available benchmark datasets and performed extensive comparisons with 39 state-of-the-art tracking algorithms. The proposed tracker demonstrates excellent performance compared to existing state-of-the-art trackers.
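One way to read the soft mask plus channel attention combination, sketched with a squeeze-and-excitation-style block; this is an illustrative interpretation, not the tracker's exact design.

```python
import torch
import torch.nn as nn

class SoftMaskChannelAttention(nn.Module):
    """Sketch: weight target-region features with a learned soft mask, then
    re-weight channels with squeeze-and-excitation-style attention."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mask_conv = nn.Conv2d(channels, 1, 3, padding=1)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, feat):
        soft_mask = torch.sigmoid(self.mask_conv(feat))  # emphasize target region
        feat = feat * soft_mask
        w = self.se(feat).unsqueeze(-1).unsqueeze(-1)    # channel attention weights
        return feat * w

print(SoftMaskChannelAttention(64)(torch.randn(1, 64, 25, 25)).shape)
```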

