scholarly journals Limits to visual representational correspondence between convolutional neural networks and the human brain

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Yaoda Xu ◽  
Maryam Vaziri-Pashkam

AbstractConvolutional neural networks (CNNs) are increasingly used to model human vision due to their high object categorization capabilities and general correspondence with human brain responses. Here we evaluate the performance of 14 different CNNs compared with human fMRI responses to natural and artificial images using representational similarity analysis. Despite the presence of some CNN-brain correspondence and CNNs’ impressive ability to fully capture lower level visual representation of real-world objects, we show that CNNs do not fully capture higher level visual representations of real-world objects, nor those of artificial objects, either at lower or higher levels of visual representations. The latter is particularly critical, as the processing of both real-world and artificial visual stimuli engages the same neural circuits. We report similar results regardless of differences in CNN architecture, training, or the presence of recurrent processing. This indicates some fundamental differences exist in how the brain and CNNs represent visual information.

2020 ◽  
Vol 34 (04) ◽  
pp. 5281-5288 ◽  
Author(s):  
Satoshi Nishida ◽  
Yusuke Nakano ◽  
Antoine Blanc ◽  
Naoya Maeda ◽  
Masataka Kado ◽  
...  

The human brain can effectively learn a new task from a small number of samples, which indicates that the brain can transfer its prior knowledge to solve tasks in different domains. This function is analogous to transfer learning (TL) in the field of machine learning. TL uses a well-trained feature space in a specific task domain to improve performance in new tasks with insufficient training data. TL with rich feature representations, such as features of convolutional neural networks (CNNs), shows high generalization ability across different task domains. However, such TL is still insufficient in making machine learning attain generalization ability comparable to that of the human brain. To examine if the internal representation of the brain could be used to achieve more efficient TL, we introduce a method for TL mediated by human brains. Our method transforms feature representations of audiovisual inputs in CNNs into those in activation patterns of individual brains via their association learned ahead using measured brain responses. Then, to estimate labels reflecting human cognition and behavior induced by the audiovisual inputs, the transformed representations are used for TL. We demonstrate that our brain-mediated TL (BTL) shows higher performance in the label estimation than the standard TL. In addition, we illustrate that the estimations mediated by different brains vary from brain to brain, and the variability reflects the individual variability in perception. Thus, our BTL provides a framework to improve the generalization ability of machine-learning feature representations and enable machine learning to estimate human-like cognition and behavior, including individual variability.


2019 ◽  
Author(s):  
Adrien Doerig ◽  
Lynn Schmittwilken ◽  
Bilge Sayim ◽  
Mauro Manassi ◽  
Michael H. Herzog

AbstractClassically, visual processing is described as a cascade of local feedforward computations. Feedforward Convolutional Neural Networks (ffCNNs) have shown how powerful such models can be. However, using visual crowding as a well-controlled challenge, we previously showed that no classic model of vision, including ffCNNs, can explain human global shape processing (1). Here, we show that Capsule Neural Networks (CapsNets; 2), combining ffCNNs with recurrent grouping and segmentation, solve this challenge. We also show that ffCNNs and standard recurrent CNNs do not, suggesting that the grouping and segmentation capabilities of CapsNets are crucial. Furthermore, we provide psychophysical evidence that grouping and segmentation are implemented recurrently in humans, and show that CapsNets reproduce these results well. We discuss why recurrence seems needed to implement grouping and segmentation efficiently. Together, we provide mutually reinforcing psychophysical and computational evidence that a recurrent grouping and segmentation process is essential to understand the visual system and create better models that harness global shape computations.Author SummaryFeedforward Convolutional Neural Networks (ffCNNs) have revolutionized computer vision and are deeply transforming neuroscience. However, ffCNNs only roughly mimic human vision. There is a rapidly expanding body of literature investigating differences between humans and ffCNNs. Several findings suggest that, unlike humans, ffCNNs rely mostly on local visual features. Furthermore, ffCNNs lack recurrent connections, which abound in the brain. Here, we use visual crowding, a well-known psychophysical phenomenon, to investigate recurrent computations in global shape processing. Previously, we showed that no model based on the classic feedforward framework of vision can explain global effects in crowding. Here, we show that Capsule Neural Networks (CapsNets), combining ffCNNs with recurrent grouping and segmentation, solve this challenge. ffCNNs and recurrent CNNs with lateral and top-down recurrent connections do not, suggesting that grouping and segmentation are crucial for human-like global computations. Based on these results, we hypothesize that one computational function of recurrence is to efficiently implement grouping and segmentation. We provide psychophysical evidence that, indeed, grouping and segmentation is based on time consuming recurrent processes in the human brain. CapsNets reproduce these results too. Together, we provide mutually reinforcing computational and psychophysical evidence that a recurrent grouping and segmentation process is essential to understand the visual system and create better models that harness global shape computations.


2019 ◽  
Vol 35 (17) ◽  
pp. 3208-3210 ◽  
Author(s):  
Yangzhen Wang ◽  
Feng Su ◽  
Shanshan Wang ◽  
Chaojuan Yang ◽  
Yonglu Tian ◽  
...  

Abstract Motivation Functional imaging at single-neuron resolution offers a highly efficient tool for studying the functional connectomics in the brain. However, mainstream neuron-detection methods focus on either the morphologies or activities of neurons, which may lead to the extraction of incomplete information and which may heavily rely on the experience of the experimenters. Results We developed a convolutional neural networks and fluctuation method-based toolbox (ImageCN) to increase the processing power of calcium imaging data. To evaluate the performance of ImageCN, nine different imaging datasets were recorded from awake mouse brains. ImageCN demonstrated superior neuron-detection performance when compared with other algorithms. Furthermore, ImageCN does not require sophisticated training for users. Availability and implementation ImageCN is implemented in MATLAB. The source code and documentation are available at https://github.com/ZhangChenLab/ImageCN. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 6 (30) ◽  
pp. eaba7830
Author(s):  
Laurianne Cabrera ◽  
Judit Gervain

Speech perception is constrained by auditory processing. Although at birth infants have an immature auditory system and limited language experience, they show remarkable speech perception skills. To assess neonates’ ability to process the complex acoustic cues of speech, we combined near-infrared spectroscopy (NIRS) and electroencephalography (EEG) to measure brain responses to syllables differing in consonants. The syllables were presented in three conditions preserving (i) original temporal modulations of speech [both amplitude modulation (AM) and frequency modulation (FM)], (ii) both fast and slow AM, but not FM, or (iii) only the slowest AM (<8 Hz). EEG responses indicate that neonates can encode consonants in all conditions, even without the fast temporal modulations, similarly to adults. Yet, the fast and slow AM activate different neural areas, as shown by NIRS. Thus, the immature human brain is already able to decompose the acoustic components of speech, laying the foundations of language learning.


2017 ◽  
Vol 22 (S3) ◽  
pp. 7123-7134 ◽  
Author(s):  
Ruifan Li ◽  
Fangxiang Feng ◽  
Ibrar Ahmad ◽  
Xiaojie Wang

Sensors ◽  
2021 ◽  
Vol 22 (1) ◽  
pp. 72
Author(s):  
Sanghun Jeon ◽  
Ahmed Elsharkawy ◽  
Mun Sang Kim

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.


2021 ◽  
Vol 15 ◽  
Author(s):  
Chi Zhang ◽  
Xiao-Han Duan ◽  
Lin-Yuan Wang ◽  
Yong-Li Li ◽  
Bin Yan ◽  
...  

Despite the remarkable similarities between convolutional neural networks (CNN) and the human brain, CNNs still fall behind humans in many visual tasks, indicating that there still exist considerable differences between the two systems. Here, we leverage adversarial noise (AN) and adversarial interference (AI) images to quantify the consistency between neural representations and perceptual outcomes in the two systems. Humans can successfully recognize AI images as the same categories as their corresponding regular images but perceive AN images as meaningless noise. In contrast, CNNs can recognize AN images similar as corresponding regular images but classify AI images into wrong categories with surprisingly high confidence. We use functional magnetic resonance imaging to measure brain activity evoked by regular and adversarial images in the human brain, and compare it to the activity of artificial neurons in a prototypical CNN—AlexNet. In the human brain, we find that the representational similarity between regular and adversarial images largely echoes their perceptual similarity in all early visual areas. In AlexNet, however, the neural representations of adversarial images are inconsistent with network outputs in all intermediate processing layers, providing no neural foundations for the similarities at the perceptual level. Furthermore, we show that voxel-encoding models trained on regular images can successfully generalize to the neural responses to AI images but not AN images. These remarkable differences between the human brain and AlexNet in representation-perception association suggest that future CNNs should emulate both behavior and the internal neural presentations of the human brain.


2019 ◽  
Author(s):  
K. Seeliger ◽  
R. P. Sommers ◽  
U. Güçlü ◽  
S. E. Bosch ◽  
M. A. J. van Gerven

AbstractVisual and auditory representations in the human brain have been studied with encoding, decoding and reconstruction models. Representations from convolutional neural networks have been used as explanatory models for these stimulus-induced hierarchical brain activations. However, none of the fMRI datasets currently available has adequate amounts of data for sufficiently sampling their representations. We recorded a densely sampled large fMRI dataset (TR=700 ms) in a single individual exposed to spatiotemporal visual and auditory naturalistic stimuli (30 episodes of BBC’s Doctor Who). The data consists of 120.830 whole-brain volumes (approx. 23 h) of single-presentation data (full episodes, training set) and 1.178 volumes (11 min) of repeated narrative short episodes (test set, 22 repetitions), recorded with fixation over a period of six months. This rich dataset can be used widely to study the way the brain represents audiovisual input across its sensory hierarchies.


Sign in / Sign up

Export Citation Format

Share Document