Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping

Electronics, 2021, Vol. 10 (21), pp. 2654
Author(s): Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu

In violence recognition, accuracy suffers from time-axis misalignment and semantic deviation between the visual and auditory streams of multimedia data. This paper therefore proposes an auditory–visual information fusion method based on autoencoder mapping. First, a feature extraction model based on a CNN–LSTM framework is established, and whole multimedia segments are used as input, which addresses the time-axis misalignment between visual and auditory information. Then, a shared semantic subspace is constructed with an autoencoder mapping model and optimized by semantic correspondence, which resolves the audiovisual semantic deviation and fuses the visual and auditory information at the level of segment features. Finally, the complete network is used to identify violence. Experimental results show that the method exploits the complementarity between modalities: compared with single-modality information, the multimodal method achieves better results.
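The pipeline described in this abstract can be sketched compactly. The code below is a minimal, illustrative PyTorch version and not the authors' implementation: per-modality CNN-LSTM encoders produce segment-level features, an autoencoder maps both into a shared subspace, and the shared codes are fused for a violence/non-violence decision. All layer sizes, input shapes, and the losses mentioned in the closing comment are assumptions.

```python
# Minimal sketch (not the authors' code): per-modality CNN-LSTM encoders whose
# segment-level features are mapped into a shared subspace by an autoencoder,
# then fused for a binary violence/non-violence prediction. Sizes are illustrative.
import torch
import torch.nn as nn

class CNNLSTMEncoder(nn.Module):
    """Encodes a segment of frames (or spectrogram slices) into one feature vector."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.lstm = nn.LSTM(32 * 8 * 8, feat_dim, batch_first=True)

    def forward(self, x):                          # x: (batch, time, C, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).flatten(1)   # per-frame CNN features
        _, (h, _) = self.lstm(f.view(b, t, -1))    # summarize the segment over time
        return h[-1]                               # (batch, feat_dim)

class SharedSubspaceAE(nn.Module):
    """Autoencoder that maps either modality's features into one shared semantic space."""
    def __init__(self, feat_dim=128, shared_dim=64):
        super().__init__()
        self.enc = nn.Linear(feat_dim, shared_dim)
        self.dec = nn.Linear(shared_dim, feat_dim)

    def forward(self, feat):
        z = torch.relu(self.enc(feat))
        return z, self.dec(z)                      # shared code + reconstruction

video_enc, audio_enc = CNNLSTMEncoder(3), CNNLSTMEncoder(1)
ae = SharedSubspaceAE()
classifier = nn.Linear(64 * 2, 2)                  # fused shared codes -> violence / non-violence

video = torch.randn(4, 16, 3, 64, 64)              # dummy video segment (batch, frames, C, H, W)
audio = torch.randn(4, 16, 1, 64, 64)              # dummy audio segment as spectrogram slices
zv, _ = ae(video_enc(video))
za, _ = ae(audio_enc(audio))
logits = classifier(torch.cat([zv, za], dim=1))
# Training (not shown) would plausibly combine a classification loss with reconstruction
# and semantic-correspondence terms (e.g., pulling zv and za together for matched segments).
```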

Sensors, 2018, Vol. 18 (7), pp. 2041
Author(s): David Valiente, Luis Payá, Luis Jiménez, Jose Sebastián, Óscar Reinoso

2019, Vol. 27, pp. 165-173
Author(s): Jung-Hun Kim, Ji-Eun Park, In-Hee Ji, Chul-Ho Won, Jong-Min Lee, ...

2020, pp. 002383091989888
Author(s): Luma Miranda, Marc Swerts, João Moraes, Albert Rilliard

This paper presents the results of three perceptual experiments investigating the role of the auditory and visual channels in the identification of statements and echo questions in Brazilian Portuguese. Ten Brazilian speakers (five male) were video-recorded (frontal view of the face) while producing the sentence “Como você sabe”, either as a statement (meaning “As you know.”) or as an echo question (meaning “As you know?”). Experiments were set up around these two intonation contours. Stimuli were presented with clear or degraded audio, and with congruent or incongruent information across the two channels. Results show that Brazilian listeners were able to distinguish statements from questions both prosodically and visually, with auditory cues dominant over visual ones. In noisy conditions, the visual channel robustly improved the interpretation of prosodic cues, whereas it degraded interpretation when the visual information was incongruent with the auditory information. This study shows that auditory and visual information are integrated during speech perception, including when the cues concern prosodic patterns.


1975, Vol. 69 (5), pp. 226-233
Author(s): Sally Rogow

The blind child builds his perceptions from tactual (haptic) and auditory information. Assumptions on the part of professionals that tactual and visual data are identical can result in misconceptions that may lead to delayed development and distortions of cognitive process in blind children. A review of research on the perception of form and spatial relationships suggests that differences between tactual and visual information result in differences in perceptual organization. However, studies indicate that blind children reach developmental milestones (e.g., conservation) at approximately the same ages as sighted children.


2012, Vol. 25 (0), pp. 148
Author(s): Marcia Grabowecky, Emmanuel Guzman-Martinez, Laura Ortega, Satoru Suzuki

Watching moving lips facilitates auditory speech perception when the mouth is attended. However, recent evidence suggests that visual attention and awareness are mediated by separate mechanisms. We investigated whether lip movements suppressed from visual awareness can still facilitate speech perception. We used a word categorization task in which participants listened to spoken words and determined as quickly and accurately as possible whether or not each word named a tool. While participants listened to the words, they watched a visual display presenting a video clip of the speaker either synchronously articulating the auditorily presented words or articulating different words (asynchronous). Critically, the speaker’s face was either visible (aware trials) or suppressed from awareness using continuous flash suppression. Aware and suppressed trials were randomly intermixed. A secondary probe-detection task ensured that participants attended to the mouth region regardless of whether the face was visible or suppressed. On the aware trials, responses to the tool targets were no faster with synchronous than with asynchronous lip movements, perhaps because the visual information was inconsistent with the auditory information on 50% of the trials. However, on the suppressed trials, responses to the tool targets were significantly faster with synchronous than with asynchronous lip movements. These results demonstrate that even when a random dynamic mask renders a face invisible, lip movements are processed by the visual system with sufficiently high temporal resolution to facilitate speech perception.


2020, Vol. 31 (01), pp. 030-039
Author(s): Aaron C. Moberly, Kara J. Vasil, Christin Ray

Adults with cochlear implants (CIs) are believed to rely more heavily on visual cues during speech recognition tasks than their normal-hearing peers. However, the relationship between auditory and visual reliance during audiovisual (AV) speech recognition is unclear and may depend on an individual’s auditory proficiency, duration of hearing loss (HL), age, and other factors.

The primary purpose of this study was to examine whether visual reliance during AV speech recognition depends on auditory function for adult CI candidates (CICs) and adult experienced CI users (ECIs).

Participants included 44 ECIs and 23 CICs. All participants were postlingually deafened and had met clinical candidacy requirements for cochlear implantation. Participants completed City University of New York sentence recognition testing. Three separate lists of twelve sentences each were presented: the first in the auditory-only (A-only) condition, the second in the visual-only (V-only) condition, and the third in combined AV fashion. Each participant’s amount of “visual enhancement” (VE) and “auditory enhancement” (AE) was computed (i.e., the benefit to AV speech recognition of adding visual or auditory information, respectively, relative to what could potentially be gained). The relative reliance on VE versus AE was also computed as a VE/AE ratio.

The VE/AE ratio was predicted inversely by A-only performance. Visual reliance was not significantly different between ECIs and CICs. Duration of HL and age did not account for additional variance in the VE/AE ratio.

A shift toward visual reliance may be driven by poor auditory performance in ECIs and CICs. The restoration of auditory input through a CI does not necessarily facilitate a shift back toward auditory reliance. Findings suggest that individual listeners with HL may rely on both auditory and visual information during AV speech recognition, to varying degrees based on their own performance and experience, to optimize communication performance in real-world listening situations.
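The enhancement measures described above are typically computed as the gain from adding the second modality, normalized by the room left for improvement. The abstract does not state the exact formula used in the study, so the sketch below assumes percent-correct scores, that standumb normalization is the standard one, and hypothetical example values.

```python
# Minimal sketch of the enhancement measures described above, assuming percent-correct
# scores and the commonly used normalization by available headroom; the exact formula
# used in the study is not given in the abstract.
def visual_enhancement(av, a_only, ceiling=100.0):
    """Benefit of adding vision to audio, relative to the room left to improve."""
    return (av - a_only) / (ceiling - a_only)

def auditory_enhancement(av, v_only, ceiling=100.0):
    """Benefit of adding audio to vision, relative to the room left to improve."""
    return (av - v_only) / (ceiling - v_only)

# Hypothetical listener: 40% correct A-only, 20% V-only, 70% AV.
ve = visual_enhancement(70, 40)      # 0.50
ae = auditory_enhancement(70, 20)    # 0.625
ratio = ve / ae                      # VE/AE ratio: relative reliance on visual information
print(ve, ae, ratio)
```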


2019, Vol. 32 (2), pp. 87-109
Author(s): Galit Buchs, Benedetta Heimler, Amir Amedi

Visual-to-auditory Sensory Substitution Devices (SSDs) are a family of non-invasive devices for visual rehabilitation that aim to convey whole-scene visual information through the intact auditory modality. Although proven effective in lab environments, the use of SSDs has yet to be systematically tested in real-life situations. To start filling this gap, in the present work we tested the ability of expert SSD users to filter out irrelevant background noise while focusing on the relevant audio information. Specifically, nine blind expert users of the EyeMusic visual-to-auditory SSD performed a series of identification tasks via SSDs (i.e., shape, color, and the conjunction of the two features). Their performance was compared in two separate conditions: a silent baseline, and with irrelevant background sounds from real-life situations, using the same stimuli in a pseudo-random balanced design. Although the participants described the background noise as disturbing, no significant performance differences emerged between the two conditions (noisy vs. silent) for any of the tasks. In the conjunction task (shape and color) we found a non-significant trend toward a disruptive effect of the background noise on performance. These findings suggest that visual-to-auditory SSDs can indeed be used successfully in noisy environments and that users can focus on relevant auditory information while inhibiting irrelevant sounds. Our findings take a step towards the actual use of SSDs in real-life situations and may inform rehabilitation of sensory-deprived individuals.
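For readers unfamiliar with visual-to-auditory substitution, the sketch below illustrates the general idea of such a mapping: an image is scanned column by column over time, vertical position is mapped to pitch, and brightness to loudness. This is a generic illustration under assumed parameters, not the EyeMusic encoding used in the study.

```python
# Generic illustration of the visual-to-auditory substitution idea (not the EyeMusic
# algorithm): scan an image column by column over time and map the vertical position
# of bright pixels to pitch. Scan rate, frequency range, and loudness rule are assumptions.
import numpy as np

def image_to_soundscape(img, sr=22050, col_dur=0.05, f_lo=220.0, f_hi=1760.0):
    """img: 2D array with values in [0, 1]; returns a mono waveform sweeping left to right."""
    n_rows, n_cols = img.shape
    t = np.arange(int(sr * col_dur)) / sr
    freqs = np.geomspace(f_hi, f_lo, n_rows)             # higher rows -> higher pitch
    cols = []
    for c in range(n_cols):
        tones = np.sin(2 * np.pi * freqs[:, None] * t)    # one sinusoid per image row
        cols.append((img[:, c:c + 1] * tones).sum(axis=0))  # brightness scales loudness
    wave = np.concatenate(cols)
    return wave / (np.abs(wave).max() + 1e-9)             # normalize to avoid clipping

# A toy "image": a bright diagonal line, which should sound like a pitch glide.
toy = np.eye(24)
wave = image_to_soundscape(toy)
```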

