Orthogonal Representations of Object Shape and Category in Deep Convolutional Neural Networks and Human Visual Cortex

Abstract

Deep Convolutional Neural Networks (CNNs) are gaining traction as the benchmark model of visual object recognition, with performance now surpassing that of humans. While CNNs can accurately assign one image to potentially thousands of categories, network performance could be the result of layers that are tuned to represent the visual shape of objects, rather than object category, since the two are often confounded in natural images. Using two stimulus sets that explicitly dissociate shape from category, we correlate these two types of information with each layer of multiple CNNs. We also compare CNN output with fMRI activation along the human ventral visual stream by correlating artificial with biological representations. We find that CNNs encode category information independently from shape, peaking at the final fully connected layer in all tested CNN architectures. Comparing CNNs with fMRI brain data, we find that early visual cortex (V1) and early layers of CNNs encode shape information. Anterior ventral temporal cortex encodes category information, which correlates best with the final layer of CNNs. The interaction between shape and category that is found along the human ventral visual pathway is echoed in multiple deep networks. Our results suggest CNNs represent category information independently from shape, much like the human visual system.
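The abstract does not say exactly how shape and category information are correlated with each layer. One common approach is representational similarity analysis, sketched below under that assumption. The toy stimuli, the labels, the `layer_activations` values, and the choice of Euclidean distance with Pearson correlation are all hypothetical and are not taken from the paper.

```python
import math

def pairwise_rdm(vectors):
    """Upper-triangle Euclidean dissimilarities between all stimulus pairs."""
    n = len(vectors)
    return [math.dist(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def label_rdm(labels):
    """Model dissimilarities: 0 if two stimuli share a label, 1 otherwise."""
    n = len(labels)
    return [0.0 if labels[i] == labels[j] else 1.0
            for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 2x2 design: every shape appears in every category.
shape_labels = [0, 0, 1, 1]
category_labels = [0, 1, 0, 1]

# Hypothetical activations of one layer that separates stimuli by category only.
layer_activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

layer_rdm = pairwise_rdm(layer_activations)
category_fit = pearson(layer_rdm, label_rdm(category_labels))
shape_fit = pearson(layer_rdm, label_rdm(shape_labels))
print(category_fit, shape_fit)
```

In this toy case the layer matches the category model perfectly (correlation 1.0) and correlates negatively with the shape model (-0.5). Repeating the comparison for every layer, or for an fMRI region in place of a layer, would give the kind of layer-by-layer profile the abstract describes.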

Deep Convolutional Neural Networks Based on Image Data Augmentation for Visual Object Recognition

Intelligent Data Engineering and Automated Learning – IDEAL 2019 - Lecture Notes in Computer Science ◽

10.1007/978-3-030-33607-3_51 ◽

2019 ◽

pp. 476-485

Author(s):

Khaoula Jayech

Keyword(s):

Neural Networks ◽

Object Recognition ◽

Convolutional Neural Networks ◽

Data Augmentation ◽

Image Data ◽

Visual Object ◽

Visual Object Recognition ◽

Deep Convolutional Neural Networks

Occluded Visual Object Recognition Using Deep Conditional Generative Adversarial Nets and Feedforward Convolutional Neural Networks

2020 International Conference on Machine Vision and Image Processing (MVIP) ◽

10.1109/mvip49855.2020.9116887 ◽

2020 ◽

Author(s):

Vahid Reza Khazaie ◽

Alireza AkhavanPour ◽

Reza Ebrahimpour

Keyword(s):

Neural Networks ◽

Object Recognition ◽

Convolutional Neural Networks ◽

Visual Object ◽

Visual Object Recognition

Orthogonal Representations of Object Shape and Category in Deep Convolutional Neural Networks and Human Visual Cortex

Scientific Reports ◽

10.1038/s41598-020-59175-0 ◽

2020 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Astrid A. Zeman ◽

J. Brendan Ritchie ◽

Stefania Bracci ◽

Hans Op de Beeck

Keyword(s):

Neural Networks ◽

Visual Cortex ◽

Convolutional Neural Networks ◽

Deep Convolutional Neural Networks ◽

Object Shape ◽

Human Visual Cortex ◽

Orthogonal Representations

Integrating Flexible Normalization into Midlevel Representations of Deep Convolutional Neural Networks

Neural Computation ◽

10.1162/neco_a_01226 ◽

2019 ◽

Vol 31 (11) ◽

pp. 2138-2176 ◽

Cited By ~ 2

Author(s):

Luis Gonzalo Sánchez Giraldo ◽

Odelia Schwartz

Keyword(s):

Neural Networks ◽

Visual Cortex ◽

Convolutional Neural Networks ◽

Neural Responses ◽

Spatial Normalization ◽

Deep Convolutional Neural Networks ◽

Cortical Areas ◽

Classical Receptive Field ◽

Divisive Normalization ◽

Spatial Dependencies

Deep convolutional neural networks (CNNs) are becoming increasingly popular models to predict neural responses in visual cortex. However, contextual effects, which are prevalent in neural processing and in perception, are not explicitly handled by current CNNs, including those used for neural prediction. In primary visual cortex, neural responses are modulated by stimuli spatially surrounding the classical receptive field in rich ways. These effects have been modeled with divisive normalization approaches, including flexible models, where spatial normalization is recruited only to the degree that responses from center and surround locations are deemed statistically dependent. We propose a flexible normalization model applied to midlevel representations of deep CNNs as a tractable way to study contextual normalization mechanisms in midlevel cortical areas. This approach captures nontrivial spatial dependencies among midlevel features in CNNs, such as those present in textures and other visual stimuli, that arise from tiling high-order features geometrically. We expect that the proposed approach can make predictions about when spatial normalization might be recruited in midlevel cortical areas. We also expect this approach to be useful as part of the CNN tool kit, therefore going beyond more restrictive fixed forms of normalization.
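The idea of recruiting surround normalization only to the degree that center and surround are dependent can be illustrated with a deliberately simplified sketch. This is not the authors' model. The function name, the gating weight `p_dependent` (set by hand here, where a full model would infer it from the data), and the constant `sigma` are assumptions made for illustration.

```python
import math

def flexible_normalization(center, surround, p_dependent, sigma=1.0):
    """Divide the center response by a pool that includes the surround
    only in proportion to p_dependent (0 = independent, 1 = dependent)."""
    surround_energy = sum(s * s for s in surround) / len(surround)
    pool = sigma ** 2 + center ** 2 + p_dependent * surround_energy
    return center / math.sqrt(pool)

center = 2.0
surround = [2.0, 2.0, 2.0, 2.0]

# Surround ignored: the response is normalized by the center alone.
independent = flexible_normalization(center, surround, p_dependent=0.0)
# Surround fully recruited: the response is suppressed further.
dependent = flexible_normalization(center, surround, p_dependent=1.0)
print(independent, dependent)
```

A fixed form of divisive normalization corresponds to holding `p_dependent` at 1 for every location. Letting it vary is what makes the scheme flexible.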

Facial Expressions Recognition for Human–Robot Interaction Using Deep Convolutional Neural Networks with Rectified Adam Optimizer

Sensors ◽

10.3390/s20082393 ◽

2020 ◽

Vol 20 (8) ◽

pp. 2393 ◽

Cited By ~ 2

Author(s):

Daniel Octavian Melinte ◽

Luige Vladareanu

Keyword(s):

Neural Network ◽

Neural Networks ◽

Convolutional Neural Network ◽

Emotion Recognition ◽

Convolutional Neural Networks ◽

Human Robot Interaction ◽

Fine Tuning ◽

Visual Object ◽

Single Shot ◽

Deep Convolutional Neural Networks

This paper presents the interaction between humans and an NAO robot using deep convolutional neural networks (CNNs), based on an innovative end-to-end pipeline method that applies two optimized CNNs, one for face recognition (FR) and one for facial expression recognition (FER), in order to obtain real-time inference speed for the entire process. Two different models for FR are considered: one known to be very accurate but slow at inference (faster region-based convolutional neural network), and one that is less accurate but fast at inference (single shot detector convolutional neural network). For emotion recognition, transfer learning and fine-tuning of three CNN models (VGG, Inception V3 and ResNet) were used. The overall results show that the single shot detector convolutional neural network (SSD CNN) and faster region-based convolutional neural network (Faster R-CNN) models for face detection reach almost the same accuracy: 97.8% for Faster R-CNN on the PASCAL visual object classes (PASCAL VOC) evaluation metrics and 97.42% for SSD Inception. In terms of FER, ResNet obtained the highest training accuracy (90.14%), while the visual geometry group (VGG) network had 87% accuracy and Inception V3 reached 81%. The results show improvements of over 10% when using two serialized CNNs instead of only the FER CNN, while the recent optimization method, rectified adaptive moment optimization (RAdam), led to better generalization and an accuracy improvement of 3%-4% on each emotion recognition CNN.

Optimization of FireNet for Liver Lesion Classification

Electronics ◽

10.3390/electronics9081237 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1237

Author(s):

Gedeon Kashala Kabe ◽

Yuqing Song ◽

Zhe Liu

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Liver Lesion ◽

Superior Performance ◽

Visual Object ◽

Visual Object Recognition ◽

Residual Function ◽

Learning Techniques ◽

Model Size

In recent years, deep learning techniques, and in particular convolutional neural network (CNN) methods, have demonstrated superior performance in image classification and visual object recognition. In this work, we propose a classification of four types of liver lesions, namely hepatocellular carcinoma, metastases, hemangiomas, and healthy tissues, using convolutional neural networks with a succinct model called FireNet. We improved classification speed and decreased the model size and the number of parameters by using Fire modules from SqueezeNet. We added bypass connections around the Fire modules to learn a residual function between input and output and to address the vanishing gradient problem. We propose a new Particle Swarm Optimization (NPSO) to optimize the network parameters in order to further boost the performance of the proposed FireNet. The experimental results show that FireNet has 9.5 times fewer parameters than GoogLeNet, 51.6 times fewer than AlexNet, and 75.8 times fewer than ResNet. The size of FireNet is 16.6 times smaller than GoogLeNet, 75 times smaller than AlexNet, and 76.6 times smaller than ResNet. The final accuracy of our proposed FireNet model was 89.2%.
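The parameter savings attributed to Fire modules can be illustrated with simple weight-count arithmetic. The sketch below assumes the standard SqueezeNet Fire layout (a 1x1 squeeze layer feeding parallel 1x1 and 3x3 expand layers) and ignores biases. The channel sizes are illustrative values, not FireNet's actual configuration, which the abstract does not give.

```python
def fire_module_weights(in_channels, squeeze, expand_1x1, expand_3x3):
    """Weight count of a Fire module, biases ignored."""
    squeeze_weights = in_channels * squeeze          # 1x1 squeeze layer
    expand_1x1_weights = squeeze * expand_1x1        # 1x1 expand branch
    expand_3x3_weights = 9 * squeeze * expand_3x3    # 3x3 expand branch
    return squeeze_weights + expand_1x1_weights + expand_3x3_weights

def conv3x3_weights(in_channels, out_channels):
    """Weight count of a plain 3x3 convolution, biases ignored."""
    return 9 * in_channels * out_channels

# Illustrative sizes: 128 channels in, 64 + 64 = 128 channels out.
fire = fire_module_weights(in_channels=128, squeeze=16,
                           expand_1x1=64, expand_3x3=64)
plain = conv3x3_weights(in_channels=128, out_channels=128)
ratio = plain / fire
print(fire, plain, ratio)
```

Because the input and output channel counts match in this example, an element-wise bypass connection (output = input + module(input)) can be added around the module without introducing any extra weights.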

Local features and global shape information in object classification by deep convolutional neural networks

Vision Research ◽

10.1016/j.visres.2020.04.003 ◽

2020 ◽

Vol 172 ◽

pp. 46-61 ◽

Cited By ~ 1

Author(s):

Nicholas Baker ◽

Hongjing Lu ◽

Gennady Erlikhman ◽

Philip J. Kellman

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Object Classification ◽

Local Features ◽

Deep Convolutional Neural Networks ◽

Shape Information ◽

Global Shape

Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex

Communications Biology ◽

10.1038/s42003-018-0110-y ◽

2018 ◽

Vol 1 (1) ◽

Cited By ~ 18

Author(s):

Ilya Kuzovkin ◽

Raul Vicente ◽

Mathilde Petton ◽

Jean-Philippe Lachaux ◽

Monica Baciu ◽

...

Keyword(s):

Neural Networks ◽

Visual Cortex ◽

Convolutional Neural Networks ◽

Gamma Band ◽

Deep Convolutional Neural Networks ◽

Human Visual Cortex ◽

Gamma Band Activity ◽

Band Activity

Recurrent convolutional neural networks: a better model of biological object recognition

10.1101/133330 ◽

2017 ◽

Cited By ~ 3

Author(s):

Courtney J. Spoerer ◽

Patrick McClure ◽

Nikolaus Kriegeskorte

Keyword(s):

Neural Networks ◽

Object Recognition ◽

Convolutional Neural Networks ◽

Recurrent Neural Networks ◽

Feedforward Control ◽

Recognition Performance ◽

Feedforward Neural Networks ◽

Visual Object ◽

Visual Object Recognition ◽

Feedback Connections

Feedforward neural networks provide the dominant model of how the brain performs visual object recognition. However, these networks lack the lateral and feedback connections, and the resulting recurrent neuronal dynamics, of the ventral visual pathway in the human and nonhuman primate brain. Here we investigate recurrent convolutional neural networks with bottom-up (B), lateral (L), and top-down (T) connections. Combining these types of connections yields four architectures (B, BT, BL, and BLT), which we systematically test and compare. We hypothesized that recurrent dynamics might improve recognition performance in the challenging scenario of partial occlusion. We introduce two novel occluded object recognition tasks to test the efficacy of the models, digit clutter (where multiple target digits occlude one another) and digit debris (where target digits are occluded by digit fragments). We find that recurrent neural networks outperform feedforward control models (approximately matched in parametric complexity) at recognising objects, both in the absence of occlusion and in all occlusion conditions. Recurrent networks were also found to be more robust to the inclusion of additive Gaussian noise. Recurrent neural networks are better in two respects: (1) they are more neurobiologically realistic than their feedforward counterparts; (2) they are better in terms of their ability to recognise objects, especially under challenging conditions. This work shows that computer vision can benefit from using recurrent convolutional architectures and suggests that the ubiquitous recurrent connections in biological brains are essential for task performance.
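How the bottom-up (B), lateral (L), and top-down (T) connections combine can be sketched with a toy two-layer network unrolled in time. This is a minimal illustration, not the authors' implementation: each layer is reduced to a single scalar unit in place of a convolutional feature map, and the weights `w_b`, `w_l`, and `w_t` are hypothetical values chosen by hand.

```python
def relu(x):
    return max(0.0, x)

def run_network(x, steps, w_b=1.0, w_l=0.0, w_t=0.0):
    """Unroll a two-layer network for a number of time steps.

    w_b scales bottom-up input, w_l scales a layer's own previous state,
    and w_t scales feedback from layer 2 to layer 1. Setting w_l or w_t
    to zero removes that connection type.
    """
    h1 = h2 = 0.0  # states from the previous time step
    for _ in range(steps):
        new_h1 = relu(w_b * x + w_l * h1 + w_t * h2)
        new_h2 = relu(w_b * new_h1 + w_l * h2)
        h1, h2 = new_h1, new_h2
    return h1, h2

# The four architectures differ only in which connections are switched on.
architectures = {
    "B":   dict(w_l=0.0, w_t=0.0),
    "BT":  dict(w_l=0.0, w_t=0.25),
    "BL":  dict(w_l=0.5, w_t=0.0),
    "BLT": dict(w_l=0.5, w_t=0.25),
}
results = {name: run_network(1.0, steps=2, **weights)
           for name, weights in architectures.items()}
print(results)
```

With only bottom-up connections the output is the same at every time step, so unrolling adds nothing. Once lateral or top-down connections are enabled, each layer's state depends on the previous time step, which is the recurrent dynamics the abstract refers to.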
