Visual Question Answering (VQA) is a multimodal research area at the intersection of Computer
Vision (CV) and Natural Language Processing (NLP). The core of the VQA task is to
extract useful information from both the image and the question and to give an
accurate answer. This paper presents a VQA model
based on multimodal encoders and decoders with gate attention (MEDGA). Each
encoder and decoder block in MEDGA applies not only self-attention and
cross-modal attention but also gate attention, so that the model can attend to
inter-modal and intra-modal interactions simultaneously within the visual and
language modalities. Through gate attention, MEDGA further filters out noise that is
irrelevant to the answer and outputs attended features that are closely tied to both
the visual and language representations, which makes answer prediction more accurate.
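As an illustration of the design described above, the following is a minimal sketch of how a single gated attention unit might combine self-attention, cross-modal attention, and a sigmoid gate. The class name, layer choices, and tensor shapes are assumptions made for illustration and are not the authors' actual MEDGA implementation.

```python
# Minimal sketch of a gated attention unit (illustrative only; names, layers,
# and shapes are assumptions, not the MEDGA reference implementation).
import torch
import torch.nn as nn

class GatedAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate: maps concatenated self/cross outputs to per-channel weights in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: features of one modality (e.g. question tokens), shape (B, Nx, dim)
        # y: features of the other modality (e.g. image regions), shape (B, Ny, dim)
        s, _ = self.self_attn(x, x, x)               # intra-modal interaction
        x = self.norm1(x + s)
        c, _ = self.cross_attn(x, y, y)              # inter-modal interaction
        g = self.gate(torch.cat([x, c], dim=-1))     # gate filters irrelevant signal
        return self.norm2(x + g * c)

# Example usage with toy tensors.
block = GatedAttentionBlock(dim=512)
question = torch.randn(2, 14, 512)   # batch of 2 questions, 14 tokens each
image = torch.randn(2, 36, 512)      # 36 region features per image
out = block(question, image)         # shape (2, 14, 512)
```

In this sketch the gate rescales the cross-modal output channel-wise before the residual connection, which is one plausible way to suppress features irrelevant to the answer.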
Experimental evaluations on the VQA 2.0 dataset and ablation studies under
different conditions demonstrate the effectiveness of MEDGA. In addition, MEDGA
reaches an accuracy of 70.11% on the test-std split, exceeding many existing methods.