Semantic representation for visual reasoning

2019 ◽  
Vol 277 ◽  
pp. 02006 ◽  
Author(s):  
Xubin Ni ◽  
Lirong Yin ◽  
Xiaobing Chen ◽  
Shan Liu ◽  
Bo Yang ◽  
...  

In the field of visual reasoning, image features are widely used as the input of neural networks to obtain answers. However, image features are too redundant for regular networks to learn accurate characterizations from. In human reasoning, by contrast, an abstract description is usually constructed to avoid irrelevant details. Inspired by this, a higher-level representation named semantic representation is introduced in this paper to make visual reasoning more efficient. The idea of the Gram matrix used in neural style transfer research is transferred here to build a relation matrix, which enables the relational information between objects to be better represented. The model using semantic representation as input outperforms the same model using image features as input, which verifies that more accurate results can be obtained through the introduction of high-level semantic representation in the field of visual reasoning.
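The relation matrix described above follows the Gram-matrix formulation borrowed from neural style transfer. Below is a minimal sketch of that computation, assuming the per-image object features have already been extracted into a matrix of shape (num_objects, feature_dim); the variable names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def relation_matrix(object_features: np.ndarray) -> np.ndarray:
    """Gram-matrix-style relation matrix over per-object semantic features.

    object_features: array of shape (num_objects, feature_dim), one row per
    detected object. The result is a (num_objects, num_objects) matrix whose
    (i, j) entry is the inner product of object i's and object j's features,
    capturing pairwise relatedness the same way a Gram matrix captures style
    correlations between feature channels.
    """
    return object_features @ object_features.T

# Hypothetical usage: 5 objects with 16-dimensional semantic features.
feats = np.random.rand(5, 16).astype(np.float32)
rel = relation_matrix(feats)
print(rel.shape)  # (5, 5)
```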

Author(s):  
Bo Wang ◽  
Xiaoting Yu ◽  
Chengeng Huang ◽  
Qinghong Sheng ◽  
Yuanyuan Wang ◽  
...  

The excellent feature extraction ability of deep convolutional neural networks (DCNNs) has been demonstrated in many image processing tasks, by which image classification can achieve high accuracy with only raw input images. However, the specific image features that influence the classification results are not readily determinable, and what lies behind the predictions is unclear. This study proposes a method combining the Sobel and Canny operators and an Inception module for ship classification. The Sobel and Canny operators obtain enhanced edge features from the input images. A convolutional layer is replaced with the Inception module, which can automatically select the proper convolution kernel for ship objects in different image regions. The principle is that the high-level features abstracted by the DCNN, and the features obtained by multi-convolution concatenation of the Inception module, must ultimately derive from the edge information of the preprocessed input images. This indicates that the classification results are based on the input edge features, which indirectly interpret the classification results to some extent. Experimental results show that the combination of the edge features and the Inception module improves DCNN ship classification performance. The original model with the raw dataset has an average accuracy of 88.72%, while when using enhanced edge features as input, it achieves the best performance of 90.54% among all models. The model that replaces the fifth convolutional layer with the Inception module has the best performance of 89.50%. It performs close to VGG-16 on the raw dataset and is significantly better than other deep neural networks. The results validate the functionality and feasibility of the idea posited.
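As a rough illustration of the edge-feature preprocessing step, the sketch below uses OpenCV's Sobel and Canny operators to build an edge-enhanced input image. The thresholds, kernel size, and channel arrangement are assumptions for the example, not the authors' settings.

```python
import cv2
import numpy as np

def edge_enhanced_input(image_path: str) -> np.ndarray:
    """Stack grayscale, Sobel-magnitude, and Canny channels as network input."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Sobel gradients in x and y, combined into a gradient magnitude map.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    sobel_mag = cv2.magnitude(gx, gy)
    sobel_mag = cv2.normalize(sobel_mag, None, 0, 255, cv2.NORM_MINMAX)

    # Canny edges with illustrative thresholds.
    canny = cv2.Canny(gray, 100, 200)

    # Three-channel tensor: raw intensity, Sobel edges, Canny edges.
    return np.stack([gray, sobel_mag.astype(np.uint8), canny], axis=-1)
```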


Sensors ◽  
2020 ◽  
Vol 20 (5) ◽  
pp. 1495
Author(s):  
Hyun Kwon ◽  
Hyunsoo Yoon ◽  
Ki-Woong Park

Mobile devices such as sensors are used to connect to the Internet and provide services to users. Web services are vulnerable to automated attacks, which can restrict mobile devices from accessing websites. To prevent such automated attacks, CAPTCHAs are widely used as a security solution. However, when a high level of distortion has been applied to a CAPTCHA to make it resistant to automated attacks, the CAPTCHA becomes difficult for a human to recognize. In this work, we propose a method for generating a CAPTCHA image that will resist recognition by machines while maintaining its recognizability to humans. The method utilizes style transfer and creates a new image, called a style-plugged-CAPTCHA image, by incorporating the styles of other images while keeping the content of the original CAPTCHA. In our experiment, we used the TensorFlow machine learning library and six CAPTCHA datasets in use on actual websites. The experimental results show that the proposed scheme reduces the rate of recognition by the DeCAPTCHA system to 3.5% and 3.2% using one style image and two style images, respectively, while maintaining recognizability by humans.
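The style-plugged-CAPTCHA images are produced with a style transfer approach. The sketch below shows the standard content/style loss on which such methods rest, written in TensorFlow (which the abstract says was used); the layer choices, weights, and function names are placeholder assumptions rather than the authors' configuration.

```python
import tensorflow as tf

def gram(features):
    """Gram matrix of a (1, H, W, C) feature map: channel-channel correlations."""
    _, h, w, c = features.shape
    flat = tf.reshape(features, (h * w, c))
    return tf.matmul(flat, flat, transpose_a=True) / tf.cast(h * w, tf.float32)

def style_transfer_loss(content_feat, gen_content_feat,
                        style_feats, gen_style_feats,
                        content_weight=1.0, style_weight=1e-2):
    """Content term keeps the CAPTCHA text readable; style term pulls the
    generated image toward the style image's Gram statistics."""
    content_loss = tf.reduce_mean(tf.square(gen_content_feat - content_feat))
    style_loss = tf.add_n([
        tf.reduce_mean(tf.square(gram(g) - gram(s)))
        for g, s in zip(gen_style_feats, style_feats)
    ])
    return content_weight * content_loss + style_weight * style_loss
```

Minimizing this loss with respect to the generated image yields an image that still carries the original CAPTCHA content but is statistically dressed in the style images, which is what degrades automated recognition.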


2019 ◽  
Author(s):  
Marek A. Pedziwiatr ◽  
Matthias Kümmerer ◽  
Thomas S.A. Wallis ◽  
Matthias Bethge ◽  
Christoph Teufel

Eye movements are vital for human vision, and it is therefore important to understand how observers decide where to look. Meaning maps (MMs), a technique to capture the distribution of semantic importance across an image, have recently been proposed to support the hypothesis that meaning rather than image features guides human gaze. MMs have the potential to be an important tool far beyond eye-movements research. Here, we examine central assumptions underlying MMs. First, we compared the performance of MMs in predicting fixations to saliency models, showing that DeepGaze II – a deep neural network trained to predict fixations based on high-level features rather than meaning – outperforms MMs. Second, we show that whereas human observers respond to changes in meaning induced by manipulating object-context relationships, MMs and DeepGaze II do not. Together, these findings challenge central assumptions underlying the use of MMs to measure the distribution of meaning in images.
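Comparisons of this kind rest on standard fixation-prediction metrics. As a hedged illustration, the sketch below computes the normalized scanpath saliency (NSS) of a predicted map at observed fixation locations; it is a generic metric implementation, not the specific evaluation pipeline used in the study.

```python
import numpy as np

def nss(prediction: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized scanpath saliency: mean z-scored prediction value at fixations.

    prediction: 2-D map (meaning map, saliency map, or DeepGaze-style density).
    fixations: binary 2-D map of the same shape, 1 where a fixation landed.
    """
    z = (prediction - prediction.mean()) / (prediction.std() + 1e-8)
    return float(z[fixations.astype(bool)].mean())

# Hypothetical usage with a random map and three fixation points.
pred = np.random.rand(480, 640)
fix = np.zeros((480, 640))
fix[100, 200] = fix[240, 320] = fix[400, 500] = 1
print(nss(pred, fix))
```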


2021 ◽  
Author(s):  
Quoc Vuong

Images are extremely effective at eliciting emotional responses in observers and have been frequently used to investigate the neural correlates of emotion. However, the image features producing this emotional response remain unclear. This study sought to use biologically inspired computational models of the brain to test the hypothesis that these emotional responses can be attributed to the estimation of arousal and valence of objects, scenes and facial expressions in the images. Convolutional neural networks were used to extract all, or various combinations, of high-level image features related to objects, scenes and facial expressions. Subsequent deep feedforward neural networks predicted the images’ arousal and valence values. The model was provided with thousands of pre-annotated images to learn the relationship between the high-level features and the images’ arousal and valence values. The relationship between arousal and valence was assessed by comparing models that either learnt the constructs separately or together. The results confirmed the effectiveness of using the features to predict human emotion alongside their ability to augment each other. When utilising the object, scene and facial expression information together, the model classified arousal and valence with accuracies of 88% and 87%, respectively. The effectiveness of our deep neural network of emotion perception strongly suggests that these same high-level features play a critical role in producing humans’ emotional response. Moreover, performance increased across all models when arousal and valence were learnt together, suggesting a dependent relationship between these affective dimensions. These results open up numerous avenues for future work, whilst also bridging the gap between affective neuroscience and computer vision.
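A minimal sketch of the two-stage design described above, assuming Keras, an ImageNet-pretrained backbone standing in for the high-level feature extractor, and illustrative layer sizes; none of these specifics are taken from the study.

```python
import tensorflow as tf

# Stage 1: a pretrained CNN provides high-level image features.
backbone = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                           weights="imagenet")
backbone.trainable = False

# Stage 2: a deep feedforward network maps those features to arousal and valence.
inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)
x = tf.keras.layers.Dense(256, activation="relu")(features)
x = tf.keras.layers.Dense(64, activation="relu")(x)
# Two outputs learned jointly, mirroring the finding that arousal and valence
# benefit from being modeled together.
outputs = tf.keras.layers.Dense(2, name="arousal_valence")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```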


Author(s):  
Ryan Cotterell ◽  
Hinrich Schütze

Much like sentences are composed of words, words themselves are composed of smaller units. For example, the English word questionably can be analyzed as question+able+ly. However, this structural decomposition of the word does not directly give us a semantic representation of the word’s meaning. Since morphology obeys the principle of compositionality, the semantics of the word can be systematically derived from the meaning of its parts. In this work, we propose a novel probabilistic model of word formation that captures both the analysis of a word w into its constituent segments and the synthesis of the meaning of w from the meanings of those segments. Our model jointly learns to segment words into morphemes and compose distributional semantic vectors of those morphemes. We experiment with the model on English CELEX data and German DErivBase (Zeller et al., 2013) data. We show that jointly modeling semantics increases both segmentation accuracy and morpheme F1 by between 3% and 5%. Additionally, we investigate different models of vector composition, showing that recurrent neural networks yield an improvement over simple additive models. Finally, we study the degree to which the representations correspond to a linguist’s notion of morphological productivity.
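To make the composition step concrete, the sketch below contrasts simple additive composition with an RNN-based composition over morpheme vectors, using PyTorch and made-up embeddings for question+able+ly; the paper's actual probabilistic segmentation model is not reproduced here.

```python
import torch
import torch.nn as nn

dim = 50
# Hypothetical distributional vectors for the morphemes of "questionably".
morphemes = {m: torch.randn(dim) for m in ["question", "able", "ly"]}
segments = [morphemes[m] for m in ["question", "able", "ly"]]

# Additive composition: the word vector is the sum of its morpheme vectors.
additive = torch.stack(segments).sum(dim=0)

# Recurrent composition: an RNN reads the morpheme sequence and its final
# hidden state serves as the composed word representation.
rnn = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
seq = torch.stack(segments).unsqueeze(0)        # (1, num_morphemes, dim)
_, hidden = rnn(seq)
recurrent = hidden.squeeze(0).squeeze(0)        # (dim,)

print(additive.shape, recurrent.shape)
```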


2021 ◽  
Vol 2 (3) ◽  
Author(s):  
Gustaf Halvardsson ◽  
Johanna Peterson ◽  
César Soto-Valero ◽  
Benoit Baudry

The automatic interpretation of sign languages is a challenging task, as it requires the usage of high-level vision and high-level motion processing systems for providing accurate image perception. In this paper, we use Convolutional Neural Networks (CNNs) and transfer learning to make computers able to interpret signs of the Swedish Sign Language (SSL) hand alphabet. Our model consists of the implementation of a pre-trained InceptionV3 network, and the usage of the mini-batch gradient descent optimization algorithm. We rely on transfer learning during the pre-training of the model and its data. The final accuracy of the model, based on 8 study subjects and 9400 images, is 85%. Our results indicate that the usage of CNNs is a promising approach to interpret sign languages, and transfer learning can be used to achieve high testing accuracy despite using a small training dataset. Furthermore, we describe the implementation details of our model to interpret signs as a user-friendly web application.
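A hedged sketch of the transfer learning setup described above: a pretrained InceptionV3 backbone with a new classification head, trained with mini-batch SGD. The input size, head layers, class count, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf

NUM_CLASSES = 26  # assumption: one class per hand-alphabet sign in the dataset

# Pretrained InceptionV3 without its ImageNet classifier head.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                          input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # keep the transferred features fixed initially

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Mini-batch gradient descent, as mentioned in the abstract.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(train_ds, validation_data=val_ds, epochs=10)
```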


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Jonathan K. George ◽  
Cesare Soci ◽  
Mario Miscuglio ◽  
Volker J. Sorger

Mirror symmetry is an abundant feature in both nature and technology. Its successful detection is critical for perception procedures based on visual stimuli and requires organizational processes. Neuromorphic computing, utilizing brain-mimicked networks, could be a technology solution providing such perceptual organization functionality, and it has furthermore made tremendous advances in computing efficiency by applying a spiking model of information. Spiking models inherently maximize efficiency in noisy environments by placing the energy of the signal in a minimal time. However, many neuromorphic computing models ignore time delay between nodes, choosing instead to approximate connections between neurons as instantaneous weighting. With this assumption, many complex time interactions of spiking neurons are lost. Here, we show that the coincidence detection property of a spiking-based feed-forward neural network enables mirror symmetry detection. Testing this algorithm, as an example, on geospatial satellite image data sets reveals how symmetry density enables automated recognition of man-made structures over vegetation. We further demonstrate that the addition of noise improves feature detectability of an image through coincidence point generation. The ability to obtain mirror symmetry from spiking neural networks can be a powerful tool for applications in image-based rendering, computer graphics, robotics, photo interpretation, image retrieval, video analysis and annotation, and multimedia, and may help accelerate the brain-machine interconnection. More importantly, it enables a technology pathway toward bridging the gap between low-level incoming sensor stimuli and the high-level interpretation of these inputs as recognized objects and scenes in the world.
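As an illustration of the coincidence detection idea underlying the symmetry measure, the sketch below shows a single coincidence detector that fires only when two input spikes (e.g. from mirrored image locations whose delays cancel at the symmetry axis) arrive within a short window; the parameters and encoding are assumptions, not the authors' network.

```python
import numpy as np

def coincidence_detector(spike_times_a, spike_times_b, window=1.0):
    """Return output spike times fired when inputs a and b spike within `window`.

    In a delay-aware spiking network, mirrored features produce spikes whose
    arrival times coincide at a detector placed on the symmetry axis, so the
    detector fires; asymmetric features do not, so it stays silent.
    """
    out = []
    for ta in spike_times_a:
        for tb in spike_times_b:
            if abs(ta - tb) <= window:
                out.append(max(ta, tb))
    return np.array(sorted(set(out)))

# Hypothetical usage: the first pair of spikes coincides, the second does not.
print(coincidence_detector([2.0, 7.5], [2.3, 12.0], window=1.0))  # [2.3]
```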


2021 ◽  
Vol 13 (4) ◽  
pp. 742
Author(s):  
Jian Peng ◽  
Xiaoming Mei ◽  
Wenbo Li ◽  
Liang Hong ◽  
Bingyu Sun ◽  
...  

Scene understanding of remote sensing images is of great significance in various applications. Its fundamental problem is how to construct representative features. Various convolutional neural network architectures have been proposed for automatically learning features from images. However, is the current practice of configuring the same architecture to learn all the data, while ignoring the differences between images, the right one? It seems contrary to our intuition: it is clear that some images are easier to recognize, and some are harder to recognize. This problem is the gap between the characteristics of the images and the learned features corresponding to specific network structures. Unfortunately, the literature so far lacks an analysis of the two. In this paper, we explore this problem from three aspects: we first build a visual-based evaluation pipeline of scene complexity to characterize the intrinsic differences between images; then, we analyze the relationship between semantic concepts and feature representations, i.e., the scalability and hierarchy of features, which are the essential elements in CNNs of different architectures, for remote sensing scenes of different complexity; thirdly, we introduce CAM, a visualization method that explains feature learning within neural networks, to analyze the relationship between scenes with different complexity and semantic feature representations. The experimental results show that a complex scene needs deeper and multi-scale features, whereas a simpler scene needs lower and single-scale features. Besides, the complex scene concept is more dependent on the joint semantic representation of multiple objects. Furthermore, we propose a framework for predicting the scene complexity of an image and utilize it to design a depth- and scale-adaptive model. It achieves higher performance with fewer parameters than the original model, demonstrating the potential significance of scene complexity.
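CAM, the visualization technique referred to above, projects the classifier weights back onto the last convolutional feature maps. The sketch below is a generic numpy implementation under the usual global-average-pooling assumption; it is not the paper's exact pipeline.

```python
import numpy as np

def class_activation_map(conv_feats: np.ndarray, fc_weights: np.ndarray,
                         class_idx: int) -> np.ndarray:
    """Class activation map for one class.

    conv_feats: last conv-layer output of shape (H, W, C) for one image.
    fc_weights: classifier weight matrix of shape (C, num_classes), assuming the
    network ends with global average pooling followed by a single dense layer.
    Returns an (H, W) map showing which regions drove the chosen class score.
    """
    cam = np.tensordot(conv_feats, fc_weights[:, class_idx], axes=([2], [0]))
    cam = np.maximum(cam, 0)                      # keep positive evidence only
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1]

# Hypothetical usage: 7x7x512 features, 10 scene classes.
feats = np.random.rand(7, 7, 512)
weights = np.random.rand(512, 10)
print(class_activation_map(feats, weights, class_idx=3).shape)  # (7, 7)
```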


2021 ◽  
Vol 11 (2) ◽  
pp. 23
Author(s):  
Duy-Anh Nguyen ◽  
Xuan-Tu Tran ◽  
Francesca Iacopi

Deep Learning (DL) has contributed to the success of many applications in recent years. The applications range from simple ones, such as recognizing tiny images or simple speech patterns, to ones with a high level of complexity, such as playing the game of Go. However, this superior performance comes at a high computational cost, which makes porting DL applications to conventional hardware platforms a challenging task. Many approaches have been investigated, and the Spiking Neural Network (SNN) is one of the promising candidates. The SNN is the third generation of Artificial Neural Networks (ANNs), in which each neuron in the network uses discrete spikes to communicate in an event-based manner. SNNs have the potential advantage of achieving better energy efficiency than their ANN counterparts. While SNN models generally incur some loss of accuracy, new algorithms have helped to close the accuracy gap. For hardware implementations, SNNs have attracted much attention in the neuromorphic hardware research community. In this work, we review the basic background of SNNs, the current state and challenges of training algorithms for SNNs, and the current implementations of SNNs on various hardware platforms.
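For readers unfamiliar with the event-based communication mentioned above, the sketch below simulates a discrete-time leaky integrate-and-fire neuron, the basic building block assumed by most SNN training and hardware work; the constants are illustrative only.

```python
import numpy as np

def lif_neuron(input_spikes: np.ndarray, weight=0.5, leak=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron over a binary input spike train.

    The membrane potential decays by `leak` each step, integrates weighted
    input spikes, and emits an output spike (then resets) when it crosses
    `threshold`: the discrete, event-based communication used in SNNs.
    """
    v = 0.0
    out = np.zeros_like(input_spikes, dtype=np.int8)
    for t, s in enumerate(input_spikes):
        v = leak * v + weight * s
        if v >= threshold:
            out[t] = 1
            v = 0.0  # reset after firing
    return out

spikes_in = np.array([1, 0, 1, 1, 0, 1, 1, 1])
print(lif_neuron(spikes_in))
```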

