Position Focused Attention Network for Image-Text Matching

Author(s):  
Yaxiong Wang ◽  
Hao Yang ◽  
Xueming Qian ◽  
Lin Ma ◽  
Jing Lu ◽  
...  

Image-text matching tasks have recently attracted a lot of attention in the computer vision field. The key point of this cross-domain problem is how to accurately measure the similarity between the visual and the textual contents, which demands a fine understanding of both modalities. In this paper, we propose a novel position focused attention network (PFAN) to investigate the relation between the visual and the textual views. In this work, we integrate the object position cue to enhance visual-text joint-embedding learning. We first split the images into blocks, from which we infer the relative position of each region in the image. Then, an attention mechanism is proposed to model the relations between image regions and blocks and to generate a valuable position feature, which is further utilized to enhance the region expression and to model a more reliable relationship between the visual image and the textual sentence. Experiments on the popular datasets Flickr30K and MS-COCO show the effectiveness of the proposed method. Besides the public datasets, we also conduct experiments on our collected practical news dataset (Tencent-News) to validate the practical application value of the proposed method. To the best of our knowledge, this is the first attempt to evaluate image-text matching in such a practical application setting. Our method achieves state-of-the-art performance on all three datasets.
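The position attention described in the abstract (a region feature attending over image-block position embeddings, with the attended position feature enhancing the region expression) can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation; all names, dimensions, and the concatenation-based enhancement are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_enhanced_region(region_feat, block_embs):
    """Attend over block-position embeddings using a region feature as query,
    then enhance the region feature with the attended position feature."""
    scores = block_embs @ region_feat       # (K,) relevance of each block
    weights = softmax(scores)               # attention distribution over K blocks
    pos_feat = weights @ block_embs         # (D,) weighted position feature
    return np.concatenate([region_feat, pos_feat])  # (2D,) enhanced expression

rng = np.random.default_rng(0)
region = rng.standard_normal(8)             # one region feature (D = 8)
blocks = rng.standard_normal((16, 8))       # 16 image blocks, 8-dim embeddings
enhanced = position_enhanced_region(region, blocks)
```

The enhanced vector would then feed into the joint-embedding similarity computation between image and sentence.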

2021 ◽  
Author(s):  
Yang Liu ◽  
Huaqiu Wang ◽  
Fanyang Meng ◽  
Mengyuan Liu ◽  
Hong Liu

Sensors ◽  
2021 ◽  
Vol 21 (15) ◽  
pp. 5172
Author(s):  
Yuying Dong ◽  
Liejun Wang ◽  
Shuli Cheng ◽  
Yongming Li

Considerable research and surveys indicate that skin lesions are an early symptom of skin cancer, and segmentation of skin lesions remains a hot research topic. Dermatological datasets used in skin lesion segmentation tasks generate a large number of parameters when the data are augmented, limiting the application of smart assisted medicine in real life. Hence, this paper proposes an effective feedback attention network (FAC-Net). The network is equipped with a feedback fusion block (FFB) and an attention mechanism block (AMB); through the combination of these two modules, we obtain richer and more specific feature maps without data augmentation. We conducted numerous experiments on public datasets (ISIC2018, ISBI2017, ISBI2016), using metrics such as the Jaccard index (JA) and Dice coefficient (DC) to evaluate the segmentation results. On the ISIC2018 dataset, we obtained a DC of 91.19% and a JA of 83.99%; compared with the baseline network, both of these main metrics improved by more than 1%. The metrics also improved on the other two datasets. The experiments demonstrate that, without any augmentation of the datasets, our lightweight model achieves better segmentation performance than most deep learning architectures.
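The abstract does not detail the attention mechanism block (AMB); one common form such a block may resemble is squeeze-and-excitation-style channel attention, sketched below in NumPy. The shapes, weight names, and the SE-style design are assumptions for illustration, not the FAC-Net architecture itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """SE-style channel attention: squeeze spatially, then gate each channel."""
    # feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) projections.
    squeezed = feat.mean(axis=(1, 2))                        # (C,) global average pool
    gates = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))     # (C,) per-channel gates
    return feat * gates[:, None, None]                       # reweighted feature map

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 4, 4))        # hypothetical 8-channel 4x4 map
w1 = rng.standard_normal((2, 8))             # reduction ratio r = 4
w2 = rng.standard_normal((8, 2))
out = channel_attention(feat, w1, w2)
```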


Author(s):  
Lianli Gao ◽  
Pengpeng Zeng ◽  
Jingkuan Song ◽  
Yuan-Fang Li ◽  
Wu Liu ◽  
...  

To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both the spatial and long-range temporal structures of a video as well as the text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of the background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of the query- and video-aware context representations and infers the answers. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 on the Action, Trans., FrameQA and Count tasks. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%, respectively.
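The two-stream idea above (question attending over video segments, and the video attending back over the question words, with the two attended contexts fused) can be sketched minimally in NumPy. This is an illustrative toy under assumed shapes and a simple dot-product attention, not the STA model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    """Dot-product attention of one query vector over a set of key vectors."""
    weights = softmax(keys @ query)      # (N,) attention distribution
    return weights @ keys, weights       # attended context vector and weights

rng = np.random.default_rng(2)
q_text = rng.standard_normal(16)             # encoded question (16-dim, assumed)
segments = rng.standard_normal((10, 16))     # 10 video segment features
words = rng.standard_normal((6, 16))         # 6 question word features

# Stream 1: the question attends over video segments.
# Stream 2: a pooled video vector attends back over the question words.
v_ctx, v_w = attend(q_text, segments)
t_ctx, t_w = attend(segments.mean(axis=0), words)
fused = np.concatenate([v_ctx, t_ctx])       # simple fusion of both streams
```

A real model would feed the fused representation to an answer decoder and learn all projections end to end.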


2022 ◽  
pp. 1-1
Author(s):  
Kun Zhang ◽  
Zhendong Mao ◽  
Anan Liu ◽  
Yongdong Zhang

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-9 ◽  
Author(s):  
Xiaochao Fan ◽  
Hongfei Lin ◽  
Liang Yang ◽  
Yufeng Diao ◽  
Chen Shen ◽  
...  

Humor refers to the quality of being amusing. With the development of artificial intelligence, humor recognition is attracting a lot of research attention. Although phonetics and ambiguity have been introduced by previous studies, existing recognition methods still lack suitable feature designs for neural networks. In this paper, we illustrate that the phonetic structure and the ambiguity associated with confusing words need to be learned as their own representations by the neural network. We then propose the Phonetics and Ambiguity Comprehension Gated Attention network (PACGA) to learn phonetic structures and semantic representations for humor recognition. The PACGA model can well represent phonetic information and semantic information with ambiguous words, which is of great benefit to humor recognition. Experimental results on two public datasets demonstrate the effectiveness of our model.
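A gated combination of a phonetic and a semantic representation, as the gated-attention idea above suggests, can be sketched in NumPy. The gating form (a learned sigmoid gate interpolating the two streams per dimension) and all shapes are assumptions for illustration, not the published PACGA architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(phonetic, semantic, w_g):
    """Per-dimension gate deciding how much phonetic vs. semantic signal to keep."""
    gate = sigmoid(w_g @ np.concatenate([phonetic, semantic]))  # (D,) in (0, 1)
    return gate * phonetic + (1.0 - gate) * semantic

rng = np.random.default_rng(3)
p = rng.standard_normal(8)            # phonetic-structure representation
s = rng.standard_normal(8)            # semantic (ambiguity-aware) representation
w_g = rng.standard_normal((8, 16))    # gate projection (hypothetical)
fused = gated_fusion(p, s, w_g)
```

Because the gate lies in (0, 1), each fused dimension is a convex combination of the corresponding phonetic and semantic values.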


2020 ◽  
Author(s):  
Zhichao Xia ◽  
Cheng Wang ◽  
Roeland Hancock ◽  
Maaike Vandermosten ◽  
Fumiko Hoeft

The importance of (inherited) genetic impact in reading development is well established. De novo mutation is another important contributor that has recently been gathering interest as a major liability in neurodevelopmental disorders, but it has been neglected in reading research to date. Paternal age at childbirth (PatAGE) is known as the most prominent risk factor for de novo mutation, as molecular genetic studies have repeatedly shown. As one of the first efforts, we performed a preliminary investigation of the relationship between PatAGE, offspring's reading, and brain structure in a longitudinal neuroimaging study following 51 children from kindergarten through third grade. The results showed that greater PatAGE was significantly associated with worse reading, explaining an additional 9.5% of the variance after controlling for a number of confounds, including familial factors and cognitive-linguistic reading precursors. Moreover, this effect was mediated by volumetric maturation of the left posterior thalamus from ages 5 to 8. Complementary analyses using offspring's diffusion MRI data, brain atlases, and public datasets indicated that the PatAGE-related thalamic region was most likely located in the pulvinar nuclei and related to the dorsal attention network. Altogether, these findings provide novel insights into the neurocognitive mechanisms underlying the PatAGE effect on reading acquisition during its earliest phase and suggest promising areas of future research.
Highlights:
Paternal age at childbirth (PatAGE) is negatively correlated with reading in offspring.
PatAGE is related to volumetric maturation of the thalamus.
Brain maturation mediates the PatAGE effect on reading.
The PatAGE-related thalamic area is connected to the dorsal attention network.
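The mediation claim above (PatAGE acting on reading through thalamic maturation) follows a standard regression-based decomposition of total effect into indirect and direct paths, sketched here on synthetic data. Variable names, effect sizes, and the simple Baron-Kenny-style estimator are illustrative only and are not the study's analysis pipeline.

```python
import numpy as np

def mediation(x, m, y):
    """Indirect effect a*b from X -> M -> Y, plus the direct X -> Y path."""
    X1 = np.column_stack([np.ones_like(x), x])
    a = np.linalg.lstsq(X1, m, rcond=None)[0][1]     # path a: X -> M
    X2 = np.column_stack([np.ones_like(x), x, m])
    coef = np.linalg.lstsq(X2, y, rcond=None)[0]
    direct, b = coef[1], coef[2]                     # direct X -> Y, path b: M -> Y
    return a * b, direct

rng = np.random.default_rng(4)
pat_age = rng.standard_normal(500)                            # standardized predictor
thalamus = 0.5 * pat_age + 0.1 * rng.standard_normal(500)     # synthetic mediator
reading = -0.8 * thalamus + 0.1 * rng.standard_normal(500)    # synthetic outcome
indirect, direct = mediation(pat_age, thalamus, reading)
```

With this fully mediated synthetic setup, the indirect effect recovers roughly 0.5 * (-0.8) = -0.4 while the direct path stays near zero.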


Author(s):  
N. A.  Simbirtseva ◽  

The article discusses the importance of competent perception and interpretation of a visual image / visual text by a person of the XXI century through the development of critical thinking skills, which are relevant in the study of the humanities. Existing methods and technologies rest on the psychological foundations of visual perception and the mental operations associated with it, which determine the "grasping" of an image in its integrity, the subjectivity of experience, associative thinking, and memorization. The main components of critical thinking are highlighted. The purposeful nature of the method of segment analysis of visual text and its pedagogical significance in the humanities are emphasized. The main results of the study are as follows: the psychological characteristics of the perception of a visual image / visual text and the features of critical thinking in the perception of visual information are revealed, and the method of segment analysis and the practice of its application in humanities education are described.


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 38438-38447
Author(s):  
Zhong Ji ◽  
Zhigang Lin ◽  
Haoran Wang ◽  
Yuqing He
