3D audio-visual speaker tracking with an adaptive particle filter

Author(s):  
Xinyuan Qian ◽  
Alessio Brutti ◽  
Maurizio Omologo ◽  
Andrea Cavallaro

Complexity ◽
2020 ◽  
Vol 2020 ◽  
pp. 1-8
Author(s):  
Yidi Li ◽  
Hong Liu ◽  
Bing Yang ◽  
Runwei Ding ◽  
Yang Chen

For speaker tracking, integrating multimodal information from audio and video provides an effective and promising solution. A key challenge is the construction of a stable observation model. To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning within a two-layer particle filter framework. First, an audio-guided motion model generates candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model is built around a purpose-designed Siamese network, which provides a similarity-based likelihood for computing particle weights. The speaker position is estimated from an optimal particle set that integrates the decisions of the audio and visual particles. Finally, a template update strategy based on a long short-term mechanism is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms single-modal trackers and competing methods, achieving efficient and robust tracking both in 3D space and on the image plane.
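The core particle-weighting step described in this abstract can be illustrated with a minimal sketch. Note this is not the authors' implementation: the Siamese-network likelihood is replaced here by a hypothetical Gaussian similarity kernel, and the two-layer audio/visual hierarchy is collapsed into a single particle set.

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity(particle, template):
    # Stand-in for the Siamese-network likelihood: a Gaussian kernel
    # on the distance between a candidate sample and the template.
    # (Hypothetical; the paper learns this similarity from data.)
    return np.exp(-0.5 * np.sum((particle - template) ** 2) / 0.04)

def estimate_position(particles, template):
    # Weight each candidate sample by its similarity-based likelihood,
    # normalize, and take the weighted mean as the position estimate.
    weights = np.array([similarity(p, template) for p in particles])
    weights /= weights.sum()
    return weights @ particles

# Candidate samples drawn around a hypothetical 3D speaker position
particles = rng.normal(loc=[1.0, 2.0, 1.5], scale=0.2, size=(200, 3))
template = np.array([1.0, 2.0, 1.5])
print(estimate_position(particles, template))
```

In the paper, the weighted combination additionally fuses decisions from separate audio and visual particle layers; the sketch only shows the shared likelihood-to-estimate step.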


Acta Acustica ◽  
2021 ◽  
Vol 5 ◽  
pp. 20
Author(s):  
Matthias Blochberger ◽  
Franz Zotter

Six-degree-of-freedom (6DoF) audio rendering interactively synthesizes spatial audio signals for a variable listener perspective, based on surround recordings taken at multiple perspectives distributed across the listening area of the acoustic scene. Methods that rely on recording-implicit directional information and interpolate the listener perspective without attempting to localize and extract sounds often yield high audio quality but are limited in spatial definition. Methods that perform sound localization, extraction, and rendering typically operate in the time-frequency domain and risk introducing artifacts such as musical noise. We propose to exploit the rich spatial information recorded in the broadband time-domain signals of the many distributed first-order (B-format) recording perspectives. Broadband time-variant signal extraction, which retrieves direct signals and leaves residuals to approximate diffuse and spacious sounds, poses less of a quality risk, as does the broadband re-encoding that enhances the spatial definition of both signal types. To detect and track direct sound objects in this process, we combine the directional data recorded at the individual perspectives into a volumetric multi-perspective activity map for particle-filter tracking. Our technical and perceptual evaluation confirms that this processing enhances the otherwise limited spatial definition of direct-sound objects in other broadband but signal-independent interpolation approaches such as virtual loudspeaker object (VLO) and vector-based intensity panning (VBIP).
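The idea of fusing per-perspective directional data into a volumetric activity map can be sketched as follows. This is an illustrative simplification, not the paper's method: each perspective contributes one direction-of-arrival ray, and candidate positions are scored by their proximity to every ray, so positions where the rays agree accumulate high activity.

```python
import numpy as np

def activity_map(perspectives, directions, grid, sigma=0.1):
    # Score every candidate position by its distance to each
    # perspective's direction-of-arrival ray; positions where many
    # rays agree accumulate high activity. (Illustrative sketch only;
    # the paper's map construction and tracking details differ.)
    score = np.zeros(len(grid))
    for p, d in zip(perspectives, directions):
        v = grid - p                    # perspective -> candidate vectors
        t = np.clip(v @ d, 0.0, None)   # projection onto the ray (forward only)
        closest = p + t[:, None] * d    # nearest point on the ray
        dist = np.linalg.norm(grid - closest, axis=1)
        score += np.exp(-0.5 * (dist / sigma) ** 2)
    return score

# Two recording perspectives whose DOA rays intersect at the source
source = np.array([0.5, 0.5, 0.0])
perspectives = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
directions = source - perspectives
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
grid = np.array([source, [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.0, 0.0]])
scores = activity_map(perspectives, directions, grid)
print(grid[np.argmax(scores)])  # peak at the source position
```

In the paper such a map feeds a particle filter that detects and tracks direct-sound objects over time; the sketch only shows a single static snapshot.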


2015 ◽  
Vol 23 (03) ◽  
pp. 1550010 ◽  
Author(s):  
Qiaoling Zhang ◽  
Zhe Chen ◽  
Fuliang Yin

This paper proposes a speaker tracking method for distributed microphone networks based on the combination of the global coherence field (GCF) and a distributed particle filter (DPF). In the distributed microphone network, each node comprises a microphone pair whose generalized cross-correlation (GCC) function is estimated. From the average over all local GCC observations, a global coherence field-based pseudo-likelihood (GCF-PL) function is developed as the likelihood for the DPF. In the proposed method, all nodes share an identical particle set and perform local particle filtering simultaneously. In each local particle filter, the GCF-PL likelihood for every particle weight is computed with an average consensus algorithm. With an identical particle set and a consistent estimate of GCF-PL for each particle weight, all nodes possess a common particle representation of the global posterior of the speaker state, from which each node estimates the global speaker position. Because the GCF-PL serves as the likelihood for the DPF, no assumptions are required about the independence of node observations or the statistics of the observation noise. Moreover, information is exchanged only locally among neighboring nodes, yet every node ultimately obtains a global estimate of the speaker position. Simulation results demonstrate the validity of the proposed method.
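The average consensus step at the heart of this method can be sketched in a few lines. This is a generic consensus iteration under assumed parameters (ring topology, fixed step size), not the paper's exact protocol: each node repeatedly mixes its local value with its neighbors', and on a connected graph every node converges to the network-wide mean, here standing in for the averaged GCC observation behind the GCF-PL of one particle.

```python
import numpy as np

def average_consensus(local_values, adjacency, steps=50, eps=0.2):
    # Distributed averaging: x_i <- x_i + eps * sum_j a_ij (x_j - x_i).
    # Only neighbor values are used, yet all nodes converge to the
    # network mean (eps must be below 1/max_degree for stability).
    x = np.array(local_values, dtype=float)
    degree = adjacency.sum(axis=1)
    for _ in range(steps):
        x = x + eps * (adjacency @ x - degree * x)
    return x

# Four nodes in a ring, each holding its local GCC observation
# for one particle (hypothetical values)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
local = [0.2, 0.8, 0.5, 0.9]
print(average_consensus(local, A))  # each node ≈ mean(local) = 0.6
```

In the proposed method this averaging is run per particle, so every node ends up with the same GCF-PL weights and hence the same posterior representation, using only neighbor-to-neighbor exchanges.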

