Generating Natural Video Descriptions via Multimodal Processing

Author(s): Qin Jin, Junwei Liang, Xiaozhu Lin

2015, Vol 95, pp. 107-117
Author(s): R.A. Otte, F.C.L. Donkers, M.A.K.A. Braeken, B.R.H. Van den Bergh

Target, 2020, Vol 32 (1), pp. 37-58
Author(s): Agnieszka Chmiel, Przemysław Janikowski, Agnieszka Lijewska

Abstract: The present study focuses on (in)congruence of input between the visual and the auditory modality in simultaneous interpreting with text. We asked twenty-four professional conference interpreters to simultaneously interpret an aurally and visually presented text with controlled incongruences in three categories (numbers, names and control words), while measuring interpreting accuracy and eye movements. The results provide evidence for the dominance of the visual modality, which goes against the professional standard of following the auditory modality in the case of incongruence. Numbers enjoyed the greatest accuracy across conditions, possibly due to simple cross-language semantic mappings. We found no evidence for a facilitation effect for congruent items, and identified an impeding effect of the presence of the visual text for incongruent items. These results might be interpreted either as evidence for the Colavita effect (in which visual stimuli take precedence over auditory ones) or as strategic behaviour applied by professional interpreters to avoid risk.
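The 2 (congruence) × 3 (item category) accuracy comparison described in this abstract can be made concrete with a small aggregation sketch. The trial records, field names, and values below are hypothetical illustrations, not the study's materials or results:

```python
# Minimal sketch of a 2 (congruence) x 3 (item category) accuracy analysis,
# in the spirit of the design above. All data here are illustrative.
from collections import defaultdict

# One record per scored item per interpreter: item category, whether the
# auditory and visual versions agreed, and whether the rendition was accurate.
trials = [
    {"category": "number", "congruent": False, "accurate": True},
    {"category": "name", "congruent": True, "accurate": True},
    {"category": "control", "congruent": False, "accurate": False},
    # ... further hypothetical records
]

def accuracy_by_cell(trials):
    """Mean accuracy for each (category, congruence) cell of the design."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [accurate count, total]
    for t in trials:
        cell = (t["category"], t["congruent"])
        counts[cell][0] += t["accurate"]
        counts[cell][1] += 1
    return {cell: hits / n for cell, (hits, n) in counts.items()}

for (category, congruent), acc in sorted(accuracy_by_cell(trials).items()):
    label = "congruent" if congruent else "incongruent"
    print(f"{category:8s} {label:12s} accuracy = {acc:.2f}")
```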


Author(s): Tianyun Li, Bicheng Fan

This study sets out to describe simultaneous interpreters' attention-sharing initiatives when exposed to input from both a videotaped speech recording and real-time transcriptions. Dividing mental effort across sources of visual input accords with the human brain's statistical optimization principle, whereby the same property of an object is presented in diverse ways. To examine professional interpreters' initiatives, the authors invited five professional English-Chinese conference interpreters to simultaneously interpret a videotaped speech accompanied by real-time captions generated by a speech recognition engine, while monitoring their eye movements. The results indicate that the professional interpreters preferred to refer to the visually presented captions along with the speaker's facial expressions, and that low-frequency words, proper names, and numbers received greater attention than higher-frequency words. This phenomenon might be explained by working memory theory, in which the central executive enables redundancy gains retrieved from dual-channel information.
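The reported pattern (more attention to numbers, proper names, and low-frequency words in the captions) implies bucketing caption tokens by category and aggregating gaze time per bucket. Below is a hedged sketch of that step; the frequency norms, cutoff, heuristics, and fixation records are hypothetical, not the authors' actual pipeline:

```python
# Illustrative sketch: categorize caption tokens and compare mean fixation
# durations per category. All names and numbers here are assumptions.
import re

# Hypothetical frequency norms (tokens per million); a real analysis would
# use a corpus-derived frequency list.
FREQ_PER_MILLION = {"the": 50000.0, "delegation": 12.0, "Geneva": 3.0}
LOW_FREQ_CUTOFF = 20.0  # illustrative threshold, tokens per million

def categorize(token: str) -> str:
    """Bucket a caption token the way the reported effects are grouped."""
    if re.fullmatch(r"\d[\d.,%]*", token):
        return "number"
    if token[:1].isupper():  # crude proper-name heuristic
        return "proper name"
    if FREQ_PER_MILLION.get(token.lower(), 0.0) < LOW_FREQ_CUTOFF:
        return "low-frequency word"
    return "high-frequency word"

# Hypothetical gaze data: (caption token, total fixation duration in ms).
fixations = [("the", 80), ("delegation", 310), ("Geneva", 420), ("37", 390)]

totals: dict[str, list[int]] = {}
for token, duration in fixations:
    totals.setdefault(categorize(token), []).append(duration)

for category, durations in totals.items():
    mean_ms = sum(durations) / len(durations)
    print(f"{category:20s} mean fixation = {mean_ms:.0f} ms")
```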


Infancy, 2009, Vol 14 (5), pp. 563-578
Author(s): Faraz Farzin, Eric P. Charles, Susan M. Rivera

Neurocase, 2013, Vol 19 (3), pp. 302-312
Author(s): Monique Plaza, Laurent Capelle, Géraldine Maigret, Laurence Chaby
