scholarly journals SpecMix : A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features

Author(s):  
Gwantae Kim ◽  
David K. Han ◽  
Hanseok Ko
2021 ◽  
Author(s):  
Gwantae Kim ◽  
David K. Han ◽  
Hanseok Ko

A mixed sample data augmentation strategy is proposed to enhance the performance of models on audio scene classification, sound event classification, and speech enhancement tasks. While there have been several augmentation methods shown to be effective in improving image classification performance, their efficacy toward time-frequency domain features of audio is not assured. We propose a novel audio data augmentation approach named "Specmix" specifically designed for dealing with time-frequency domain features. The augmentation method consists of mixing two different data samples by applying time-frequency masks effective in preserving the spectral correlation of each audio sample. Our experiments on acoustic scene classification, sound event classification, and speech enhancement tasks show that the proposed Specmix improves the performance of various neural network architectures by a maximum of 2.7\%.


2020 ◽  
Author(s):  
Demi Guo ◽  
Yoon Kim ◽  
Alexander Rush

Author(s):  
Wentao Xie ◽  
Qian Zhang ◽  
Jin Zhang

Smart eyewear (e.g., AR glasses) is considered to be the next big breakthrough for wearable devices. The interaction of state-of-the-art smart eyewear mostly relies on the touchpad which is obtrusive and not user-friendly. In this work, we propose a novel acoustic-based upper facial action (UFA) recognition system that serves as a hands-free interaction mechanism for smart eyewear. The proposed system is a glass-mounted acoustic sensing system with several pairs of commercial speakers and microphones to sense UFAs. There are two main challenges in designing the system. The first challenge is that the system is in a severe multipath environment and the received signal could have large attenuation due to the frequency-selective fading which will degrade the system's performance. To overcome this challenge, we design an Orthogonal Frequency Division Multiplexing (OFDM)-based channel state information (CSI) estimation scheme that is able to measure the phase changes caused by a facial action while mitigating the frequency-selective fading. The second challenge is that because the skin deformation caused by a facial action is tiny, the received signal has very small variations. Thus, it is hard to derive useful information directly from the received signal. To resolve this challenge, we apply a time-frequency analysis to derive the time-frequency domain signal from the CSI. We show that the derived time-frequency domain signal contains distinct patterns for different UFAs. Furthermore, we design a Convolutional Neural Network (CNN) to extract high-level features from the time-frequency patterns and classify the features into six UFAs, namely, cheek-raiser, brow-raiser, brow-lower, wink, blink and neutral. We evaluate the performance of our system through experiments on data collected from 26 subjects. The experimental result shows that our system can recognize the six UFAs with an average F1-score of 0.92.


Sign in / Sign up

Export Citation Format

Share Document