An Ensemble of Convolutional Neural Networks for Audio Classification

2021 · Vol 11 (13) · pp. 5796
Author(s):  
Loris Nanni ◽  
Gianluca Maguolo ◽  
Sheryl Brahnam ◽  
Michelangelo Paci

Research in sound classification and recognition is rapidly advancing in the field of pattern recognition. One important area in this field is environmental sound recognition, whether it concerns the identification of endangered species in different habitats or the type of interfering noise in urban environments. Since environmental audio datasets are often limited in size, a robust model able to perform well across different datasets is of strong research interest. In this paper, we combine ensembles of classifiers that exploit six data augmentation techniques and four signal representations to retrain five pre-trained convolutional neural networks (CNNs); these ensembles are tested on three freely available environmental audio benchmark datasets: (i) bird calls, (ii) cat sounds, and (iii) the Environmental Sound Classification (ESC-50) database for identifying sources of noise in environments. To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification. The best-performing ensembles are shown to either outperform or perform comparably to the best methods reported in the literature on these datasets, including the challenging ESC-50 dataset. We obtained 97% accuracy on the bird dataset, 90.51% on the cat dataset, and 88.65% on ESC-50 using different approaches. In addition, the same ensemble model trained on all three datasets reached the same results on the bird and cat datasets while losing only 0.1% on ESC-50. Thus, we have created an off-the-shelf ensemble that can be trained on different datasets and achieve performance competitive with the state of the art.
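As a concrete illustration of the fusion step, a minimal sketch of score-level ensembling is given below: the softmax outputs of several independently retrained CNNs are averaged (the sum rule) before taking the argmax. The function and model interfaces are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def ensemble_predict(models, batch):
    """Sum-rule fusion: average per-class softmax scores over the ensemble.

    `models` is any iterable of callables mapping a batch of inputs
    (e.g., spectrograms under different signal representations or
    augmentations) to softmax probabilities of shape (batch, classes).
    """
    scores = np.mean([model(batch) for model in models], axis=0)
    return scores.argmax(axis=1)  # fused class predictions
```

Averaging scores rather than votes keeps the fusion differentiable with respect to each member's confidence and tends to be robust when individual members are trained on different representations.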

Author(s):  
Jinfang Zeng ◽  
Youming Li ◽  
Yu Zhang ◽  
Da Chen

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. To date, a variety of signal processing and machine learning techniques have been applied to the ESC task, including matrix factorization, dictionary learning, wavelet filterbanks, and deep neural networks. It has been observed that features extracted from deeper networks tend to achieve higher performance than those extracted from shallow networks. However, in the ESC task, only plain deep convolutional neural networks (CNNs) with a few layers have been used, while residual networks have been ignored, which leads to degraded performance. Meanwhile, a possible explanation for the limited exploration of CNNs and the difficulty of improving on simpler models is the relative scarcity of labeled data for ESC. In this paper, a residual network called EnvResNet is proposed for the ESC task. In addition, we propose to use audio data augmentation to overcome the problem of data scarcity. Experiments are performed on the ESC-50 database. Combined with data augmentation, the proposed model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches in terms of classification accuracy.
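The abstract does not spell out EnvResNet's layer configuration; below is a minimal PyTorch sketch of the basic residual building block such a network would stack, assuming 2-D convolutions over log-mel spectrogram inputs.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic two-convolution residual block over spectrogram 'images'."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity shortcut lets gradients bypass the conv stack,
        # which is what keeps deeper audio networks trainable.
        return self.relu(x + self.body(x))
```

The shortcut connection is the point of the abstract's argument: without it, stacking more convolutional layers on scarce ESC data degrades accuracy rather than improving it.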


Author(s):  
Ke Zhang ◽  
Yu Su ◽  
Jingyu Wang ◽  
Sanyu Wang ◽  
Yanhua Zhang

At present, environmental sound classification and recognition (ESCR) systems mainly identify environmental sounds with deep neural networks and a wide variety of auditory features. It is therefore necessary to analyze which auditory features are most suitable for deep-neural-network-based ESCR systems. In this paper, we chose three sound features based on two widely used filter banks: the Mel and Gammatone filter banks. Subsequently, the hybrid MGCC feature is presented. Finally, a deep convolutional neural network is proposed to verify which features are more suitable for environmental sound classification and recognition tasks. The experimental results show that the signal processing features outperform the spectrogram features in the deep-neural-network-based environmental sound recognition system. Among all the acoustic features, the MGCC feature achieves the best performance. Finally, the MGCC-CNN model proposed in this paper is compared with state-of-the-art environmental sound classification models on the UrbanSound8K dataset. The results show that the proposed model has the best classification accuracy.
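The abstract does not define how MGCC is assembled; a plausible reading is a concatenation of Mel-based and Gammatone-based cepstra. The sketch below computes MFCCs with librosa and leaves the gammatone branch as a labeled placeholder, since librosa has no built-in GFCC (third-party packages such as spafe offer one, but its exact API is an assumption here).

```python
import numpy as np
import librosa

def hybrid_cepstral_feature(y, sr, n_coeff=40):
    """Illustrative hybrid feature: stack Mel- and Gammatone-based
    cepstra along the feature axis (one plausible reading of MGCC)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff)  # (n_coeff, frames)

    # Gammatone cepstral coefficients: librosa has no built-in GFCC.
    # A third-party package could supply them, e.g.
    #   from spafe.features.gfcc import gfcc   # assumption, check the API
    # Placeholder so the sketch runs without that dependency:
    gtcc = np.zeros_like(mfcc)

    return np.concatenate([mfcc, gtcc], axis=0)  # (2 * n_coeff, frames)
```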


2020 · Vol 10 (17) · pp. 5965
Author(s):  
Yu-Kai Lin ◽  
Mu-Chun Su ◽  
Yi-Zeng Hsieh

Neural networks have achieved great results in sound recognition, and many different kinds of acoustic features have been tried as the training input for the network. However, there is still doubt about whether a neural network can efficiently extract features from the raw audio signal input. This study improves on raw-signal-input networks from prior work by using deeper network architectures, allowing the raw signals to be analyzed more effectively. We also present a discussion of several network settings, and with a spectrogram-like conversion, our network reaches an accuracy of 73.55% on the open audio dataset "Dataset for Environmental Sound Classification" (ESC-50). This study also proposes a network architecture that can combine networks fed with different features. With the help of global pooling, a flexible fusion scheme is integrated into the network. Our experiment successfully combined two different networks with different audio feature inputs (the raw audio signal and the log-mel spectrum). Using these settings, the proposed ParallelNet reaches an accuracy of 81.55% on ESC-50, which matches the recognition level of human beings on this benchmark.
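A minimal PyTorch sketch of the two-branch idea follows: a 1-D CNN over the raw waveform and a 2-D CNN over the log-mel spectrogram, each reduced by global average pooling so the two embeddings can be concatenated regardless of input length. Layer sizes are illustrative, not the paper's exact ParallelNet.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Sketch of a two-branch fusion network: raw waveform + log-mel."""

    def __init__(self, n_classes=50):
        super().__init__()
        self.raw = nn.Sequential(            # input: (batch, 1, samples)
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),         # global pooling -> (batch, 64, 1)
        )
        self.mel = nn.Sequential(            # input: (batch, 1, mels, frames)
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),         # global pooling -> (batch, 64, 1, 1)
        )
        self.head = nn.Linear(64 + 64, n_classes)

    def forward(self, wave, mel_spec):
        a = self.raw(wave).flatten(1)        # (batch, 64)
        b = self.mel(mel_spec).flatten(1)    # (batch, 64)
        return self.head(torch.cat([a, b], dim=1))
```

Global pooling is what makes the fusion flexible: each branch collapses to a fixed-size embedding no matter how long the clip is, so branches with very different input shapes can be joined by simple concatenation.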


2021
Author(s):  
George Zhou ◽  
Yunchan Chen ◽  
Candace Chien

Background: The application of machine learning to cardiac auscultation has the potential to improve the accuracy and efficiency of both routine and point-of-care screenings. The use of convolutional neural networks (CNNs) on heart sound spectrograms in particular has defined state-of-the-art performance. However, the relative paucity of patient data remains a significant barrier to creating models that can adapt to the wide range of between-subject variability. To that end, we examined a CNN model's performance on automated heart sound classification, before and after various forms of data augmentation, and aimed to identify the optimal augmentation methods for cardiac spectrogram analysis.

Results: We built a standard CNN model to classify cardiac sound recordings as either normal or abnormal. The baseline control model achieved an ROC AUC of 0.945±0.016. Among the data augmentation techniques explored, horizontal flipping of the spectrogram image improved model performance the most, with an ROC AUC of 0.957±0.009. Principal component analysis (PCA) color augmentation and perturbations of the saturation-value (SV) channels of the hue-saturation-value (HSV) color scale achieved ROC AUCs of 0.949±0.014 and 0.946±0.019, respectively. Time and frequency masking resulted in an ROC AUC of 0.948±0.012. Pitch shifting, time stretching and compressing, noise injection, vertical flipping, and applying random color filters all negatively impacted model performance.

Conclusion: Data augmentation can improve classification accuracy by expanding and diversifying the dataset, which protects against overfitting to random variance. However, data augmentation is necessarily domain specific. For example, methods like noise injection have found success in other areas of automated sound classification, but in the context of cardiac sound analysis, noise injection can mimic the presence of murmurs and worsen model performance. Thus, care should be taken to ensure clinically appropriate forms of data augmentation to avoid negatively impacting model performance.
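For reference, a minimal NumPy sketch of the two augmentations the study found helpful on heart sound spectrograms, horizontal (time-axis) flipping and time/frequency masking, is shown below; mask widths and probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_spectrogram(spec, max_mask=20):
    """Horizontal flip plus time/frequency masking on a spectrogram
    of shape (freq_bins, time_frames). Widths are illustrative."""
    out = spec.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                                   # time-axis flip
    t0 = rng.integers(0, max(1, out.shape[1] - max_mask))
    out[:, t0:t0 + rng.integers(1, max_mask)] = out.mean()   # time mask
    f0 = rng.integers(0, max(1, out.shape[0] - max_mask))
    out[f0:f0 + rng.integers(1, max_mask), :] = out.mean()   # frequency mask
    return out
```

Note that the harmful augmentations the study identifies (noise injection, pitch shifting, vertical flipping) are deliberately absent here, in line with its conclusion that augmentation must preserve clinical plausibility.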

