Capsule Networks as Recurrent Models of Grouping and Segmentation

2019
Author(s):  
Adrien Doerig ◽  
Lynn Schmittwilken ◽  
Bilge Sayim ◽  
Mauro Manassi ◽  
Michael H. Herzog

Abstract
Classically, visual processing is described as a cascade of local feedforward computations. Feedforward Convolutional Neural Networks (ffCNNs) have shown how powerful such models can be. However, using visual crowding as a well-controlled challenge, we previously showed that no classic model of vision, including ffCNNs, can explain human global shape processing (1). Here, we show that Capsule Neural Networks (CapsNets; 2), combining ffCNNs with recurrent grouping and segmentation, solve this challenge. We also show that ffCNNs and standard recurrent CNNs do not, suggesting that the grouping and segmentation capabilities of CapsNets are crucial. Furthermore, we provide psychophysical evidence that grouping and segmentation are implemented recurrently in humans, and show that CapsNets reproduce these results well. We discuss why recurrence seems needed to implement grouping and segmentation efficiently. Together, we provide mutually reinforcing psychophysical and computational evidence that a recurrent grouping and segmentation process is essential to understand the visual system and create better models that harness global shape computations.

Author Summary
Feedforward Convolutional Neural Networks (ffCNNs) have revolutionized computer vision and are deeply transforming neuroscience. However, ffCNNs only roughly mimic human vision. There is a rapidly expanding body of literature investigating differences between humans and ffCNNs. Several findings suggest that, unlike humans, ffCNNs rely mostly on local visual features. Furthermore, ffCNNs lack recurrent connections, which abound in the brain. Here, we use visual crowding, a well-known psychophysical phenomenon, to investigate recurrent computations in global shape processing. Previously, we showed that no model based on the classic feedforward framework of vision can explain global effects in crowding. Here, we show that Capsule Neural Networks (CapsNets), combining ffCNNs with recurrent grouping and segmentation, solve this challenge. ffCNNs and recurrent CNNs with lateral and top-down recurrent connections do not, suggesting that grouping and segmentation are crucial for human-like global computations. Based on these results, we hypothesize that one computational function of recurrence is to efficiently implement grouping and segmentation. We provide psychophysical evidence that, indeed, grouping and segmentation are based on time-consuming recurrent processes in the human brain. CapsNets reproduce these results too. Together, we provide mutually reinforcing computational and psychophysical evidence that a recurrent grouping and segmentation process is essential to understand the visual system and create better models that harness global shape computations.
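The grouping-and-segmentation machinery that distinguishes CapsNets from ffCNNs rests on routing-by-agreement between capsule layers (Sabour et al., cited above as 2). Below is a minimal numpy sketch of that routing step, not the authors' model; the capsule counts, vector dimensions, and iteration count are illustrative:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: shrinks short vectors toward 0, long ones toward unit length."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, n_iter=3):
    """Routing-by-agreement.
    u_hat: (n_in, n_out, dim) prediction vectors from lower- to higher-level capsules.
    Returns (n_out, dim) output capsule vectors."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                               # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = (c[:, :, None] * u_hat).sum(axis=0)               # weighted sum per output capsule
        v = squash(s)                                         # output capsules
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)          # agreement updates the logits
    return v

rng = np.random.default_rng(0)
v = route(rng.normal(size=(8, 4, 16)))   # 8 input capsules, 4 output capsules, 16-D poses
```

Lower-level capsules whose predictions agree on a higher-level capsule's output come to dominate its input, which is what lets such a network group parts into wholes and segment overlapping objects.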

2019
Author(s):  
A. Doerig ◽  
A. Bornet ◽  
O. H. Choung ◽  
M. H. Herzog

Abstract
Feedforward Convolutional Neural Networks (ffCNNs) have become state-of-the-art models both in computer vision and neuroscience. However, human-like performance of ffCNNs does not necessarily imply human-like computations. Previous studies have suggested that current ffCNNs do not make use of global shape information. However, it is currently unclear whether this reflects fundamental differences between ffCNN and human processing or is merely an artefact of how ffCNNs are trained. Here, we use visual crowding as a well-controlled, specific probe to test global shape computations. Our results provide evidence that ffCNNs cannot produce human-like global shape computations for principled architectural reasons. We lay out approaches that may address shortcomings of ffCNNs to provide better models of the human visual system.


Author(s):  
Md. Anwar Hossain ◽  
Md. Mohon Ali

Humans can see and visually sense the world around them by using their eyes and brains. Computer vision works on enabling computers to see and process images in the same way that human vision does. Several algorithms have been developed in the area of computer vision to recognize images. The goal of our work is to create a model that can identify and determine the handwritten digit from its image with better accuracy. We aim to accomplish this by using the concepts of Convolutional Neural Networks and the MNIST dataset. We will also show how MatConvNet can be used to implement our model with CPU-only training and reduced training time. Though the goal is to create a model which can recognize digits, it can be extended to letters and then to a person’s handwriting. Through this work, we aim to learn and practically apply the concepts of Convolutional Neural Networks.
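As a concrete illustration of the core operation such a digit-recognition network applies, here is a minimal numpy sketch of a single convolution-plus-ReLU stage on a 28x28 image (a sketch only, not the MatConvNet implementation the work uses; the image and filter are made up):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D 'valid' convolution (really cross-correlation, as in most CNN libraries)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: zero out negative responses."""
    return np.maximum(x, 0.0)

# A 28x28 toy "digit" image with a vertical edge, and one 3x3 edge-detecting filter:
img = np.zeros((28, 28))
img[:, 14:] = 1.0                      # vertical edge at column 14
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])     # responds to dark-to-bright vertical edges
fmap = relu(conv2d_valid(img, kernel)) # 26x26 feature map, active only near the edge
```

A real digit classifier stacks several such convolution layers with learned filters, pooling, and a final fully connected softmax layer.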


Author(s):  
Halbast Rashid Ismael ◽  
Adnan Mohsin Abdulazeez ◽  
Dathar Abas Hasan

A major cause of human vision loss worldwide is diabetic retinopathy (DR). The disease requires early screening to slow its progression. However, in low-resource settings, where few ophthalmologists are available to care for all patients with diabetes, the clinical diagnosis of DR is a considerable challenge. This paper reviews the most recent studies on the detection of DR using one of the most efficient deep learning algorithms, Convolutional Neural Networks (CNNs), which are widely used to detect DR features in retinal images. The CNN approach to DR detection saves time and expense, and is more efficient and accurate than manual diagnosis. Therefore, CNNs are essential and beneficial for DR detection.


2020
Author(s):  
JohnMark Taylor ◽  
Yaoda Xu

Abstract
To interact with real-world objects, any effective visual system must jointly code the unique features defining each object. Despite decades of neuroscience research, we still lack a firm grasp on how the primate brain binds visual features. Here we apply a novel network-based stimulus-rich representational similarity approach to study color and shape binding in five convolutional neural networks (CNNs) with varying architecture, depth, and presence/absence of recurrent processing. All CNNs showed near-orthogonal color and shape processing in early layers, but increasingly interactive feature coding in higher layers, with this effect being much stronger for networks trained for object classification than untrained networks. These results characterize for the first time how multiple visual features are coded together in CNNs. The approach developed here can be easily implemented to characterize whether a similar coding scheme may serve as a viable solution to the binding problem in the primate brain.
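The representational similarity approach described above boils down to comparing representational dissimilarity matrices (RDMs) across layers or networks. A minimal numpy sketch, using Pearson correlation throughout for simplicity (published RSA work often uses Spearman on the RDM upper triangles); the stimulus and unit counts are arbitrary:

```python
import numpy as np

def rdm(acts):
    """Representational dissimilarity matrix: 1 - Pearson correlation between
    activation patterns for every pair of stimuli.
    acts: (n_stimuli, n_units)."""
    return 1.0 - np.corrcoef(acts)

def rsa_score(acts_a, acts_b):
    """Compare two representations by correlating the upper triangles of their RDMs."""
    iu = np.triu_indices(acts_a.shape[0], k=1)   # off-diagonal pairs only
    return np.corrcoef(rdm(acts_a)[iu], rdm(acts_b)[iu])[0, 1]

rng = np.random.default_rng(1)
layer = rng.normal(size=(20, 50))                # 20 stimuli x 50 units
same = rsa_score(layer, layer)                   # identical representations: score 1
diff = rsa_score(layer, rng.normal(size=(20, 50)))  # unrelated representations: near 0
```

Because RDMs abstract away from the units themselves, the same score can compare a CNN layer against another layer, another network, or a brain region.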


2020
Author(s):  
Marin Dujmović ◽  
Gaurav Malhotra ◽  
Jeffrey Bowers

Abstract
Deep convolutional neural networks (DCNNs) are frequently described as promising models of human and primate vision. An obvious challenge to this claim is the existence of adversarial images that fool DCNNs but are uninterpretable to humans. However, recent research has suggested that there may be similarities in how humans and DCNNs interpret these seemingly nonsense images. In this study, we reanalysed data from a high-profile paper and conducted four experiments controlling for different ways in which these images can be generated and selected. We show that agreement between humans and DCNNs is much weaker and more variable than previously reported, and that the weak agreement is contingent on the choice of adversarial images and the design of the experiment. Indeed, it is easy to generate images with no agreement. We conclude that adversarial images still challenge the claim that DCNNs constitute promising models of human and primate vision.
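For readers unfamiliar with how adversarial images are generated: a minimal sketch of a gradient-sign attack on a toy logistic classifier, where the input gradient is analytic. This illustrates the general technique only; the study above concerns images crafted for deep networks, not this toy model:

```python
import numpy as np

def predict(w, b, x):
    """Toy logistic classifier: probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(w, b, x, y, eps):
    """Fast-gradient-sign perturbation. For logistic loss the input gradient is
    (p - y) * w, so the attack steps along the sign of that gradient, i.e. in
    the direction that increases the loss, bounded by eps per pixel."""
    p = predict(w, b, x)
    grad = (p - y) * w
    return x + eps * np.sign(grad)

rng = np.random.default_rng(2)
w = rng.normal(size=100)
b = 0.0
x = 0.1 * np.sign(w)              # an input the model classifies confidently as class 1
clean = predict(w, b, x)          # high probability before the attack
adv = fgsm(w, b, x, y=1.0, eps=0.25)
attacked = predict(w, b, adv)     # a small per-pixel budget flips the prediction
```

For deep networks the gradient comes from backpropagation rather than a closed form, but the principle, a tiny input change that crosses a decision boundary, is the same.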


2021
Vol 12 (1)
Author(s):  
Yaoda Xu ◽  
Maryam Vaziri-Pashkam

Abstract
Convolutional neural networks (CNNs) are increasingly used to model human vision due to their high object categorization capabilities and general correspondence with human brain responses. Here we evaluate the performance of 14 different CNNs compared with human fMRI responses to natural and artificial images using representational similarity analysis. Despite the presence of some CNN-brain correspondence and CNNs’ impressive ability to fully capture lower-level visual representations of real-world objects, we show that CNNs do not fully capture higher-level visual representations of real-world objects, nor those of artificial objects, either at lower or higher levels of visual representations. The latter is particularly critical, as the processing of both real-world and artificial visual stimuli engages the same neural circuits. We report similar results regardless of differences in CNN architecture, training, or the presence of recurrent processing. This indicates some fundamental differences exist in how the brain and CNNs represent visual information.


2021
Author(s):  
Hojin Jang ◽  
Frank Tong

Although convolutional neural networks (CNNs) provide a promising model for understanding human vision, most CNNs lack robustness to challenging viewing conditions such as image blur, whereas human vision is much more reliable. Might robustness to blur be attributable to vision during infancy, given that acuity is initially poor but improves considerably over the first several months of life? Here, we evaluated the potential consequences of such early experiences by training CNN models on face and object recognition tasks while gradually reducing the amount of blur applied to the training images. For CNNs trained on blurry to clear faces, we observed sustained robustness to blur, consistent with a recent report by Vogelsang and colleagues (2018). By contrast, CNNs trained with blurry to clear objects failed to retain robustness to blur. Further analyses revealed that the spatial frequency tuning of the two CNNs was profoundly different. The blurry to clear face-trained network successfully retained a preference for low spatial frequencies, whereas the blurry to clear object-trained CNN exhibited a progressive shift toward higher spatial frequencies. Our findings provide novel computational evidence showing how face recognition, unlike object recognition, allows for more holistic processing. Moreover, our results suggest that blurry vision during infancy is insufficient to account for the robustness of adult vision to blurry objects.
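The blurry-to-clear training regime described above can be sketched as a blur schedule applied to training images. A minimal numpy version, assuming a linear decrease of Gaussian blur width over epochs (the actual schedule and blur levels used by the authors may differ):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """1-D Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable Gaussian blur (rows then columns, output same size as input)."""
    if sigma <= 0:
        return image
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def blur_schedule(epoch, n_epochs, sigma_start=4.0):
    """Linearly reduce blur from sigma_start to 0 over training, mimicking the
    gradual acuity improvement of infant vision."""
    return sigma_start * (1.0 - epoch / (n_epochs - 1))

img = np.zeros((32, 32))
img[16, 16] = 1.0                                   # toy image: a single bright point
sigmas = [blur_schedule(e, 10) for e in range(10)]  # 4.0 down to 0.0 over 10 epochs
early = blur(img, sigmas[0])                        # heavily blurred, start of training
late = blur(img, sigmas[-1])                        # final epoch: unblurred input
```

In an actual training loop, `blur` would be applied to each minibatch with the epoch's scheduled sigma before the forward pass.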


2019
Author(s):  
Marek A. Pedziwiatr ◽  
Matthias Kümmerer ◽  
Thomas S.A. Wallis ◽  
Matthias Bethge ◽  
Christoph Teufel

Abstract
Eye movements are vital for human vision, and it is therefore important to understand how observers decide where to look. Meaning maps (MMs), a technique to capture the distribution of semantic importance across an image, have recently been proposed to support the hypothesis that meaning rather than image features guide human gaze. MMs have the potential to be an important tool far beyond eye-movements research. Here, we examine central assumptions underlying MMs. First, we compared the performance of MMs in predicting fixations to saliency models, showing that DeepGaze II – a deep neural network trained to predict fixations based on high-level features rather than meaning – outperforms MMs. Second, we show that whereas human observers respond to changes in meaning induced by manipulating object-context relationships, MMs and DeepGaze II do not. Together, these findings challenge central assumptions underlying the use of MMs to measure the distribution of meaning in images.
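Fixation-prediction performance of the kind compared above is commonly scored with metrics such as normalized scanpath saliency (NSS); the abstract does not specify which metrics this study used, so the sketch below is purely illustrative:

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized scanpath saliency: z-score the predicted map, then average its
    values at the human fixation locations. Chance level is approximately 0.
    saliency: 2-D map; fixations: list of (row, col) fixation coordinates."""
    z = (saliency - saliency.mean()) / saliency.std()
    return float(np.mean([z[r, c] for r, c in fixations]))

sal = np.zeros((10, 10))
sal[5, 5] = 1.0                 # predictor puts all salience at one location
hit = nss(sal, [(5, 5)])        # fixation lands on the predicted peak: high NSS
miss = nss(sal, [(0, 0)])       # fixation elsewhere: near or below zero
```

A model like DeepGaze II or an MM-based predictor outputs such a saliency map per image, and metrics of this kind quantify how well the map anticipates where observers actually fixate.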

