Time-Frequency Masking Based Online Speech Enhancement with Multi-Channel Data Using Convolutional Neural Networks

Author(s):  
Soumitro Chakrabarty ◽  
DeLiang Wang ◽  
Emanuel A. P. Habets
2021 ◽  
Author(s):  
David A. Tovar ◽  
Tijl Grootswagers ◽  
James Jun ◽  
Oakyoon Cha ◽  
Randolph Blake ◽  
...  

Humans are able to recognize objects under a variety of noisy conditions, so models of the human visual system must account for how this feat is accomplished. In this study, we investigated how image perturbations, specifically reducing images to their low spatial frequency (LSF) components, affected correspondence between convolutional neural networks (CNNs) and brain signals recorded using magnetoencephalography (MEG). Using the high temporal resolution of MEG, we found that CNN-Brain correspondence for deeper and more complex layers across CNN architectures emerged earlier for LSF images than for their unfiltered broadband counterparts. The early emergence of LSF components is consistent with the coarse-to-fine theoretical framework for visual image processing, but surprisingly shows that LSF signals from images are more prominent when high spatial frequencies are removed. In addition, we decomposed MEG signals into oscillatory components and found correspondence varied based on frequency bands, painting a full picture of how CNN-Brain correspondence varies with time, frequency, and MEG sensor locations. Finally, we varied image properties of CNN training sets, and found marked changes in CNN processing dynamics and correspondence to brain activity. In sum, we show that image perturbations affect CNN-Brain correspondence in unexpected ways, as well as provide a rich methodological framework for assessing CNN-Brain correspondence across space, time, and frequency.


Sensors ◽  
2020 ◽  
Vol 20 (13) ◽  
pp. 3768
Author(s):  
Chanjun Chun ◽  
Kwang Myung Jeon ◽  
Wooyeol Choi

Deep neural networks (DNNs) have achieved significant advancements in speech processing, and numerous types of DNN architectures have been proposed in the field of sound localization. When a DNN model is deployed for sound localization, a fixed input size is required. This is generally determined by the number of microphones, the fast Fourier transform size, and the frame size. If the number or configuration of the microphones changes, the DNN model must be retrained because the size of the input features changes. In this paper, we propose a configuration-invariant sound localization technique using the azimuth-frequency representation and convolutional neural networks (CNNs). The proposed CNN model receives the azimuth-frequency representation instead of time-frequency features as the input features. The proposed model was evaluated in environments that differed from the microphone configuration with which it was originally trained. For evaluation, a single sound source was simulated using the image method. The evaluations confirmed that the localization performance was superior to that of the conventional steered response power phase transform (SRP-PHAT) and multiple signal classification (MUSIC) methods.
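
The abstract does not say how the azimuth-frequency representation is computed. As a rough illustration of why such an input can keep a fixed size when the array changes, the sketch below steers PHAT-weighted spectra toward a grid of candidate azimuths, which yields a map of shape (azimuths, frequency bins) regardless of the number of microphones. The far-field model, the PHAT-weighted delay-and-sum steering, the 72-direction grid, and all function names are assumptions made for illustration, not the authors' method.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def circular_array(n_mics, radius):
    """Hypothetical helper: mic positions (n_mics, 2) on a circle, in metres."""
    angles = 2 * np.pi * np.arange(n_mics) / n_mics
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)


def far_field_delays(mic_xy, azimuths):
    """Relative arrival time at each mic for plane waves from each azimuth.

    A mic displaced toward the source hears the wave earlier, hence the
    minus sign. Returns an array of shape (n_azimuths, n_mics).
    """
    directions = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)
    return -(directions @ mic_xy.T) / SPEED_OF_SOUND


def simulate_frame(mic_xy, azimuth, fs, n_fft, rng):
    """Hypothetical helper: one anechoic STFT frame (n_mics, n_bins)
    for a single far-field source with a random spectrum."""
    n_bins = n_fft // 2 + 1
    freqs = np.arange(n_bins) * fs / n_fft
    source = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
    tau = far_field_delays(mic_xy, np.array([azimuth]))[0]  # (n_mics,)
    return source[None, :] * np.exp(-2j * np.pi * tau[:, None] * freqs[None, :])


def azimuth_frequency_map(stft_frame, mic_xy, fs, n_fft, n_azimuths=72):
    """Steered response per (azimuth, frequency bin) for one frame.

    The output shape is (n_azimuths, n_bins): it depends on the azimuth
    grid and the FFT size, not on the number of microphones.
    """
    n_mics, n_bins = stft_frame.shape
    freqs = np.arange(n_bins) * fs / n_fft
    azimuths = np.linspace(0.0, 2 * np.pi, n_azimuths, endpoint=False)
    tau = far_field_delays(mic_xy, azimuths)  # (n_azimuths, n_mics)
    # PHAT weighting: keep only the phase of each microphone spectrum.
    phat = stft_frame / np.maximum(np.abs(stft_frame), 1e-12)
    # Undo the hypothesised delays, then sum coherently over microphones.
    steer = np.exp(2j * np.pi * tau[:, :, None] * freqs[None, None, :])
    beam = (steer * phat[None, :, :]).sum(axis=1)  # (n_azimuths, n_bins)
    return np.abs(beam) ** 2 / n_mics ** 2
```

Feeding frames from a four-microphone and a six-microphone array through `azimuth_frequency_map` gives maps of identical shape, so in this sketch a downstream CNN would see the same input size for both arrays.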

