A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval

2021
Author(s):
Manh-Duy Nguyen
Binh T. Nguyen
Cathal Gurrin

Conventional approaches to image-text retrieval mainly focus on indexing visual objects appearing in pictures but ignore the interactions between these objects. Such object occurrences and interactions are equally useful and important in this field, as they are usually mentioned in the text. Scene graph representation is a suitable method for the image-text matching challenge and has obtained good results due to its ability to capture inter-relationship information. With it, both images and text are represented at the scene graph level, which formulates the retrieval challenge as a scene graph matching problem. In this paper, we introduce the Local and Global Scene Graph Matching (LGSGM) model, which enhances the state-of-the-art method by integrating an extra graph convolution network to capture the general information of a graph. Specifically, for a pair of scene graphs of an image and its caption, two separate models are used to learn the features of each graph's nodes and edges. A Siamese-structure graph convolution model is then employed to embed the graphs into vector form. We finally combine the graph level and the vector level to calculate the similarity of the image-text pair. Empirical experiments show that our enhancement, combining the two levels, improves the performance of the baseline method, increasing recall by more than 10% on the Flickr30k dataset. Our implementation code can be found at https://github.com/m2man/LGSGM.
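By way of illustration, the following minimal Java sketch shows the kind of fusion step this describes: a weighted sum of a precomputed local (node/edge matching) score and a global score computed from the two Siamese graph embeddings. The class, the variable names, the cosine measure, and the weight alpha are illustrative assumptions rather than the authors' implementation; the actual code is in the linked repository.

// Illustrative sketch only: fuses a precomputed local (node/edge matching) score
// with a global score from the two graph-level embeddings via a weighted sum.
// Variable names and the weighting scheme are assumptions, not the paper's code.
public final class SimilarityFusion {

    // Cosine similarity between the image and caption graph embeddings.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Combine the local (graph-level) and global (vector-level) scores.
    // alpha is a hypothetical trade-off weight tuned on validation data.
    static double fuse(double localScore, double[] imgEmb, double[] capEmb, double alpha) {
        double globalScore = cosine(imgEmb, capEmb);
        return alpha * localScore + (1.0 - alpha) * globalScore;
    }

    public static void main(String[] args) {
        double[] imageEmbedding   = {0.20, 0.70, 0.10};
        double[] captionEmbedding = {0.25, 0.65, 0.05};
        double localScore = 0.8; // assumed output of the node/edge matching stage
        System.out.println("similarity = " + fuse(localScore, imageEmbedding, captionEmbedding, 0.5));
    }
}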

2020
Vol 7 (4)
pp. 445-461
Author(s):
Anna V. Beloedova
Evgeny A. Kozhemyakin
Yan I. Tyazhlov
...

The paper discusses several advantages of a multimodal approach to analyzing the nature and specifics of how media audiences receive texts. The authors ground their ideas in the fundamentally multimodal nature of media communication, which is reflected in the way recipients interpret texts. Various factors thus influence the interpretation of the initial text, and such factors predominantly include other semiotic resources that are not formally affiliated with the basic text. The research hypothesis concerned the impact of verbal comments on recipients' interpretation of visual objects. To this end, the authors selected two groups of respondents, a control group and an experimental group, who were shown a photograph and asked to describe it verbally in a free manner. The experimental group was shown the same photograph with a motivating verbal comment containing general information about the origin and topic of the photograph. The authors compared the two groups' descriptions by matching the verbalized categories of representation, individual evaluations of the photograph, and "the syntax" of the verbal description of the visual media text. The results supported the general supposition: the multimodal approach, aimed at finding, describing, and explicating the meaning effects of semiotic ensembles, contributes to understanding how visual objects are interpreted under the influence of other semiotic (here, verbal) resources. The results thus show that the interpretation of visual objects is motivated by the verbal comment: it topicalizes and contextualizes the visual reception and interpretation of the basic message. In conclusion, the authors outline the prospects both of the research itself and of the multimodal approach to the study of media texts.


Author(s):  
Chi Chung Ko
Chang Dong Cheng

Of all human perceptions, two of the most important are perhaps vision and sound, for which we have developed highly specialized sensors over millions of years of evolution. The creation of a realistic virtual world therefore calls for the development of realistic 3D virtual objects and sceneries supplemented by associated sounds and audio signals. The development of 3D visual objects is of course the main domain of Java 3D. However, as in watching a movie, realistic sound and audio are also essential in some applications. In this chapter, we discuss how sound and audio can be added and supported in Java 3D. The Java 3D API provides functionalities to add and control sound in a 3D spatialized manner. It also allows the rendering of aural characteristics for modeling real-world, synthetic, or special acoustical effects (Warren, 2006). From a programming point of view, the inclusion of sound is similar to the addition of light: both amount to adding nodes to the scene graph of the virtual world. A sound node is created through the abstract Sound class, under which there are three subclasses: BackgroundSound, PointSound, and ConeSound (Osawa, Asai, Takase, & Saito, 2001). Multiple sound sources, each with a reference sound file and associated methods for control and activation, can be included in the scene graph. A sound becomes audible whenever the scheduling bounds associated with its sound node intersect the activation volume of the listener. By creating an AuralAttributes object and attaching it to a Soundscape leaf node in the scene graph, we can also specify the use of certain acoustical effects in the rendering of the sound. This is done by using the various methods that change important acoustic parameters of the AuralAttributes object.
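As a rough illustration of the workflow just described (not code from the chapter), the sketch below builds a branch of a Java 3D scene graph containing a BackgroundSound, a spatialized PointSound, and a Soundscape leaf carrying an AuralAttributes object. The sound file names, bounds, and parameter values are placeholders.

// Minimal sketch: attach sound nodes and a Soundscape to a Java 3D scene graph.
// File names, positions, and acoustic parameter values are illustrative only.
import javax.media.j3d.AuralAttributes;
import javax.media.j3d.BackgroundSound;
import javax.media.j3d.BoundingSphere;
import javax.media.j3d.BranchGroup;
import javax.media.j3d.MediaContainer;
import javax.media.j3d.PointSound;
import javax.media.j3d.Sound;
import javax.media.j3d.Soundscape;
import javax.vecmath.Point3d;

public final class SceneSoundDemo {

    static BranchGroup buildSoundBranch() {
        BranchGroup root = new BranchGroup();
        BoundingSphere bounds = new BoundingSphere(new Point3d(0, 0, 0), 100.0);

        // Ambient sound heard anywhere inside the scheduling bounds.
        BackgroundSound ambient =
                new BackgroundSound(new MediaContainer("file:ambience.wav"), 1.0f);
        ambient.setSchedulingBounds(bounds);
        ambient.setLoop(Sound.INFINITE_LOOPS);
        ambient.setEnable(true);
        root.addChild(ambient);

        // Spatialized point source located at (2, 0, -5) in the virtual world.
        PointSound engine =
                new PointSound(new MediaContainer("file:engine.wav"), 0.8f, 2.0f, 0.0f, -5.0f);
        engine.setSchedulingBounds(bounds);
        engine.setEnable(true);
        root.addChild(engine);

        // Acoustic rendering hints for this region, attached via a Soundscape leaf.
        AuralAttributes attrs = new AuralAttributes();
        attrs.setReverbDelay(40.0f);          // reverb delay in milliseconds
        attrs.setReflectionCoefficient(0.3f); // how reflective the surroundings are
        root.addChild(new Soundscape(bounds, attrs));

        return root;
    }
}

The returned BranchGroup would then be added to the scene graph of a running universe (for example via SimpleUniverse.addBranchGraph) so that the sounds are scheduled against the listener's activation volume.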


Author(s):  
Nihel Kooli
Abdel Belaid
Aurelie Joseph
Vincent Poulain D'Andecy

2020
Author(s):
Koji Takeda
Kanji Tanaka

The graph-based scene model has been receiving increasing attention as a flexible and descriptive model for visual robot self-localization. In a typical self-localization application, the objects, object features, and object relationships in an environment map are described respectively by the nodes, node features, and edges of a scene graph, which is then matched against a query scene graph by a graph matching engine. However, the overhead for computation, storage, and communication is proportional to the number and feature dimensionality of the graph nodes and can become significant in large-scale applications. In this study, we observe that a graph convolutional neural network (GCN) has the potential to become an efficient tool for training and prediction with a graph matching engine. However, it is non-trivial to translate a given visual feature into a proper graph feature that contributes to good self-localization performance. To address this issue, we introduce a new knowledge transfer (KT) framework, which uses an arbitrary self-localization model as a teacher to train the student, a GCN-based self-localization system. Our KT framework enables lightweight storage and communication by using the teacher's compact output signals as training data. Results on the RobotCar datasets show that the proposed method outperforms existing comparison methods as well as the teacher self-localization system.
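As a purely conceptual sketch, the fragment below illustrates what training on the teacher's compact output signals could look like: the student's per-place scores are regressed toward the teacher's scores, and localization picks the highest-scoring place. The names, score layout, and mean-squared-error objective are assumptions for illustration, not the authors' method.

// Conceptual sketch only: a distillation-style regression target for a GCN student
// and the resulting place decision. Shapes and loss are illustrative assumptions.
public final class KnowledgeTransferSketch {

    // Distillation loss: mean squared error against the teacher's compact score vector.
    static double distillationLoss(double[] studentScores, double[] teacherScores) {
        double sum = 0.0;
        for (int i = 0; i < teacherScores.length; i++) {
            double d = studentScores[i] - teacherScores[i];
            sum += d * d;
        }
        return sum / teacherScores.length;
    }

    // Self-localization decision: pick the place with the highest student score.
    static int predictPlace(double[] studentScores) {
        int best = 0;
        for (int i = 1; i < studentScores.length; i++) {
            if (studentScores[i] > studentScores[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] teacher = {0.1, 0.7, 0.2}; // compact teacher output for one query
        double[] student = {0.2, 0.6, 0.2}; // GCN student's current prediction
        System.out.println("loss = " + distillationLoss(student, teacher));
        System.out.println("predicted place = " + predictPlace(student));
    }
}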

