Robust Acoustic Scene Classification in the Presence of Active Foreground Speech
We present an iVector-based Acoustic Scene Classification (ASC) system suited for real-life settings in which active foreground speech can be present. In the proposed system, each recording is represented by a fixed-length iVector that models the recording's important properties. A regularized Gaussian backend classifier with class-specific covariance models is used to extract the relevant acoustic scene information from these iVectors. To alleviate the large performance degradation that occurs when a foreground speaker dominates the captured signal, we investigate applying the iVector framework to Mel-Frequency Cepstral Coefficients (MFCCs) derived from an estimate of the noise power spectral density. This noise floor can be extracted in a statistical manner from single-channel recordings. We show that the noise-floor features are complementary to multi-condition training, in which foreground speech is added to the training signals to reduce the mismatch between training and testing conditions. Experimental results on the DCASE 2016 Task 1 dataset show that the noise-floor features and multi-condition training realize significant classification accuracy gains of more than 25 percentage points (absolute) in the most adverse conditions. These promising results can further facilitate the integration of ASC in resource-constrained devices such as hearables.
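The noise-floor MFCC features at the core of the proposal can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: it uses a simple minimum-statistics tracker (a per-bin minimum over a sliding window of past frames) as a stand-in for the statistical single-channel noise PSD estimator, and all parameter values (frame size, hop, number of mel channels and cepstral coefficients) are assumptions chosen for illustration.

```python
import numpy as np

def stft_power(x, n_fft=512, hop=256):
    """Power spectrogram via a Hann-windowed STFT."""
    win = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(x[s:s + n_fft] * win)) ** 2
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)                    # (n_frames, n_fft//2 + 1)

def noise_floor(power, win_frames=50):
    """Minimum-statistics noise-floor estimate: per-bin minimum over a
    sliding window of past frames. Foreground speech is sparse in time,
    so the running minimum tracks the stationary background level."""
    floor = np.empty_like(power)
    for t in range(power.shape[0]):
        lo = max(0, t - win_frames + 1)
        floor[t] = power[lo:t + 1].min(axis=0)
    return floor

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank over the rfft bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def noise_floor_mfcc(x, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """MFCCs computed on the noise-floor PSD estimate instead of the raw
    power spectrum, so active foreground speech is largely suppressed."""
    floor = noise_floor(stft_power(x, n_fft, hop))
    log_mel = np.log(floor @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # DCT-II over the mel channels yields the cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T                     # (n_frames, n_ceps)
```

In the full system these per-frame features would then be pooled into a fixed-length iVector per recording and classified by the regularized Gaussian backend; those stages are omitted here.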