Exploring Channel Properties to Improve Singing Voice Detection with Convolutional Neural Networks

Singing voice detection remains a challenging task because the voice can be obscured by instruments that occupy the same frequency band, or that even share its timbre by mimicking the mechanism of human singing. Because feature engineering is complex and adapts poorly, there is a recent trend towards feature learning, in which deep neural networks perform both feature extraction and classification. In this paper, we present two methods that explore channel properties in the convolutional neural network to improve singing voice detection by feature learning. First, channel attention learning is presented to measure the importance of each feature map, using two attention mechanisms: scaled dot-product and squeeze-and-excitation. This lets the network place more attention on the more important feature maps. Second, multi-scale representations are fed to the input channels to add information in terms of scale. Generally, different songs are best represented by spectrograms at different scales, and multi-scale representations let the network choose the best one for the task. In experiments on three public datasets, both methods proved effective, with accuracy increasing by up to 2.13 percent over an already high baseline.
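The squeeze-and-excitation form of channel attention mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `squeeze_excite`, the toy feature maps, and the hand-picked weights `w1` and `w2` are all assumptions for illustration, and in a real network the weights would be learned.

```python
import math

def squeeze_excite(feature_maps, w1, w2):
    """Reweight C feature maps (each a 2-D list) with squeeze-and-excitation.

    w1 (hidden x C) and w2 (C x hidden) are hypothetical weights;
    a trained network would learn them.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
                for fmap in feature_maps]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in fmap]
              for g, fmap in zip(gates, feature_maps)]
    return gates, scaled

# Toy input: two 2x2 feature maps (illustrative values only).
maps = [[[1.0, 1.0], [1.0, 1.0]],
        [[0.0, 2.0], [2.0, 0.0]]]
w1 = [[0.5, 0.5]]        # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel gates
gates, scaled = squeeze_excite(maps, w1, w2)
```

With these toy weights the first channel receives the larger gate, so its feature map contributes more to the layers that follow.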


Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing

Electronics, 2020, Vol 9 (9), pp. 1458

DOI: 10.3390/electronics9091458

Author(s): Xulong Zhang, Yi Yu, Yongwei Gao, Xi Chen, Wei Li

Keyword(s): Time Domain, Short Term Memory, Block Size, Detection Algorithm, Singing Voice, Convolutional Network, Frame Size, Voice Detection, Public Datasets

Singing voice detection, or vocal detection, is a classification task that determines whether a given audio segment contains singing voices. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and nonsinging parts, it is still very difficult for machines to do so. Most existing methods combine audio feature engineering with classifiers and therefore rely on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer hearing. To extract essential features that reflect the audio content and characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) for vocal detection. The convolutional layer in the LRCN performs feature extraction, and the long short-term memory (LSTM) layer learns the time sequence relationship. Singing voice and accompaniment separation as preprocessing and time-domain smoothing as postprocessing were combined with the network to form a complete system. Experiments on five public datasets investigated how the fused features, frame size, and block size affect LRCN temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reached the state-of-the-art level on public datasets.
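The time-domain smoothing step used as postprocessing can be sketched as follows. The abstract does not say which smoother the authors used, so the median filter here is an assumption, as are the names `median_smooth` and `to_labels`, the window length, the 0.5 threshold, and the example probabilities.

```python
from statistics import median

def median_smooth(frame_probs, window=5):
    """Median-filter frame-wise vocal probabilities (window shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_probs)):
        lo = max(0, i - half)
        hi = min(len(frame_probs), i + half + 1)
        smoothed.append(median(frame_probs[lo:hi]))
    return smoothed

def to_labels(probs, threshold=0.5):
    """Turn probabilities into binary vocal (1) / non-vocal (0) decisions."""
    return [1 if p >= threshold else 0 for p in probs]

# Hypothetical classifier output for ten consecutive frames.
raw = [0.9, 0.8, 0.1, 0.9, 0.85, 0.2, 0.1, 0.7, 0.15, 0.1]
raw_labels = to_labels(raw)
smoothed_labels = to_labels(median_smooth(raw, window=3))
```

In this toy sequence the isolated one-frame dropout and the isolated one-frame false alarm are both removed, leaving one vocal run followed by one non-vocal run.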


Comparative study of singing voice detection based on deep neural networks and ensemble learning

Human-centric Computing and Information Sciences, 2018, Vol 8 (1)

DOI: 10.1186/s13673-018-0158-1

Cited By ~ 2

Author(s): Shingchern D. You, Chien-Hung Liu, Woei-Kae Chen

Keyword(s): Neural Networks, Comparative Study, Ensemble Learning, Deep Neural Networks, Singing Voice, Voice Detection


ASCNET: Adaptive-Scale Convolutional Neural Networks for Multi-Scale Feature Learning

2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), 2020

DOI: 10.1109/isbi45749.2020.9098354

Cited By ~ 1

Author(s): Mo Zhang, Jie Zhao, Xiang Li, Li Zhang, Quanzheng Li

Keyword(s): Neural Networks, Convolutional Neural Networks, Feature Learning, Scale Feature, Multi Scale


MSF-Net: Multi-Scale Feature Learning Network for Classification of Surface Defects of Multifarious Sizes

Sensors, 2021, Vol 21 (15), pp. 5125

DOI: 10.3390/s21155125

Author(s): Pengcheng Xu, Zhongyuan Guo, Lei Liang, Xiaohang Xu

Keyword(s): Defect Detection, Surface Defects, Receptive Fields, Feature Learning, Learning Ability, Detection Methods, Feature Maps, Scale Feature, Learning Network, Multi Scale

In surface defect detection, defects on a product surface often differ hugely in scale. Existing defect detection methods based on Convolutional Neural Networks (CNNs) tend to express macro-level, abstract features and are less able to express local, small defects, which leaves their feature expression capabilities unbalanced. In this paper, a Multi-Scale Feature Learning Network (MSF-Net) based on a Dual Module Feature (DMF) extractor is proposed. The DMF extractor is mainly composed of optimized Concatenated Rectified Linear Units (CReLUs) and optimized Inception feature extraction modules, which increase the diversity of feature receptive fields while reducing the amount of computation. Feature maps from middle layers with different receptive field sizes are merged to enrich the receptive fields of the last layer of feature maps. Residual shortcut connections, a batch normalization layer, and an average pooling layer replace the fully connected layer to improve training efficiency and to balance the multi-scale feature learning ability. Experiments on two representative multi-scale defect data sets verify the effectiveness of the proposed MSF-Net in detecting surface defects with multi-scale features.
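As background for the CReLU units named in this abstract, here is a minimal sketch of the standard CReLU activation, which concatenates ReLU(x) and ReLU(-x) along the channel axis. The paper's "optimized" variant is not described here, so this shows only the baseline operation; the function name `crelu` and the input values are illustrative.

```python
def crelu(channels):
    """Standard CReLU: C input channels become 2C output channels.

    The first C outputs keep the positive part of each channel, the last C
    keep the magnitude of the negative part, so sign information is not lost.
    """
    positive = [[max(0.0, v) for v in ch] for ch in channels]
    negative = [[max(0.0, -v) for v in ch] for ch in channels]
    return positive + negative

# Two channels of flattened activations (illustrative values only).
x = [[-1.0, 2.0, 0.0],
     [3.0, -4.0, 0.5]]
y = crelu(x)
```

Because one convolution output yields two activation channels, a layer can use half as many filters for the same output width, which is one way such units reduce computation.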


Hyper Spectral Image Classification using Multi Labelled, Multi-Scale and Multi-Angle CNN with MS-MA BT Algorithm

International Journal of Innovative Technology and Exploring Engineering - Special Issue, 2019, Vol 8 (9S3), pp. 1229-1234

DOI: 10.35940/ijitee.i3272.0789s319

Keyword(s): Neural Networks, High Performance, Hyperspectral Image, Support Vector, Data Sets, Feature Maps, Feature Map, Convolution Neural Networks, Multi Scale, Suggested Technique

Convolutional neural networks (CNNs) are widely used to classify hyperspectral images (HSI) because they give high performance. For stronger prediction, this paper presents a new architecture that combines the MS-MA BT (multi-scale multi-angle breaking ties) algorithm with a CNN. It takes the raw image as input and obtains multiple characteristics from it. The algorithm generates relevant feature maps, which are fed into a concatenating layer to form a combined feature map. The combined feature map is then passed to the subsequent stages to estimate the final result for each hyperspectral pixel. The suggested technique not only benefits from the improved extraction of characteristics by CNNs and MS-MA BT, but also allows complete combined use of visual and temporal data. Its performance is evaluated using SAR data sets, and the results indicate that the MS-MA BT-based multi-functional training algorithm considerably increases identification precision. Recently, convolutional neural networks have shown outstanding efficiency on multiple visual activities, including the ranking of common two-dimensional pictures. In this paper, the MS-MA BT multi-scale multi-angle CNN algorithm is used to identify hyperspectral images explicitly in the visual domain. Experimental outcomes on several SAR image data sets show that the suggested technique can attain greater classification efficiency than some traditional techniques, such as support vector machines and conventional deep learning techniques.


A low-latency, real-time-capable singing voice detection method with LSTM recurrent neural networks

2015 23rd European Signal Processing Conference (EUSIPCO), 2015

DOI: 10.1109/eusipco.2015.7362337

Cited By ~ 9

Author(s): Bernhard Lehner, Gerhard Widmer, Sebastian Bock

Keyword(s): Neural Networks, Real Time, Recurrent Neural Networks, Detection Method, Low Latency, Singing Voice, Voice Detection


CSNN: An Augmented Spiking based Framework with Perceptron-Inception

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018

DOI: 10.24963/ijcai.2018/228

Cited By ~ 16

Author(s): Qi Xu, Yu Qi, Hang Yu, Jiangrong Shen, Huajin Tang, ...

Keyword(s): Neural Networks, Feature Extraction, Feature Learning, Classification Performance, Learning Ability, Extraction Ability, Feature Maps, Learning Capabilities, Biological Realism, High Level

Spiking Neural Networks (SNNs) represent and transmit information in spikes, which is considered more biologically realistic and computationally powerful than traditional Artificial Neural Networks. Spiking neurons encode useful temporal information and are highly robust to noise. However, the feature extraction ability of typical SNNs is limited by their shallow structures. This paper focuses on improving the feature extraction ability of SNNs by drawing on the powerful feature extraction ability of Convolutional Neural Networks (CNNs), which extract abstract features through the structure of their convolutional feature maps. We propose a CNN-SNN (CSNN) model that combines the feature learning ability of CNNs with the cognition ability of SNNs. The CSNN model learns the encoded spatio-temporal representations of images in an event-driven way. We evaluate the CSNN model on the MNIST handwritten digit image dataset and its variants, examining learning capabilities, encoding mechanisms, robustness to noisy stimuli, and classification performance. The results show that CSNN performs well compared to other cognitive models while using significantly fewer neurons and training samples. Our work brings more biological realism into modern image classification models, with the hope that these models can inform how the brain performs this high-level vision task.


PEMCNet: An Efficient Multi-Scale Point Feature Fusion Network for 3D LiDAR Point Cloud Classification

Remote Sensing, 2021, Vol 13 (21), pp. 4312

DOI: 10.3390/rs13214312

Author(s): Genping Zhao, Weiguang Zhang, Yeping Peng, Heng Wu, Zhuowei Wang, ...

Keyword(s): Point Cloud, Feature Learning, Point Cloud Data, Cloud Data, Multi Scale, Cloud Classification, Public Datasets, Point Cloud Classification, 3D Lidar, Point Feature

Point cloud classification plays a significant role in Light Detection and Ranging (LiDAR) applications. However, most available multi-scale feature learning networks for large-scale 3D LiDAR point cloud classification are time-consuming. In this paper, an efficient deep neural architecture, the Point Expanded Multi-scale Convolutional Network (PEMCNet), is developed to accurately classify 3D LiDAR point clouds. Unlike traditional networks for point cloud processing, PEMCNet includes successive Point Expanded Grouping (PEG) units and Absolute and Relative Spatial Embedding (ARSE) units for representative point feature learning. The PEG unit progressively increases the receptive field of each observed point and aggregates point cloud features at different scales without increasing computation. The ARSE unit that follows each PEG unit encodes the relationships between points, which effectively preserves the geometric details between them. We evaluate our method on two public datasets (the Urban Semantic 3D (US3D) dataset and the Semantic3D benchmark dataset) and on our newly collected Unmanned Aerial Vehicle (UAV) based LiDAR point cloud data of the campus of Guangdong University of Technology. Compared with four available state-of-the-art methods, our method ranked first in both efficiency and accuracy. On the public datasets, it achieved a 2% increase in classification accuracy together with an efficiency improvement of over 26% compared to the second most efficient method. On the newly collected point cloud data, it reached over 91% classification accuracy with a processing time of 154 ms.


Multi-Scale Explainable Feature Learning for Pathological Image Analysis Using Convolutional Neural Networks

2020 IEEE International Conference on Image Processing (ICIP), 2020

DOI: 10.1109/icip40778.2020.9190693

Author(s): Kazuki Uehara, Masahiro Murakawa, Hirokazu Nosato, Hidenori Sakanashi

Keyword(s): Neural Networks, Image Analysis, Convolutional Neural Networks, Feature Learning, Multi Scale, Pathological Image


Classification-Based Singing Melody Extraction Using Deep Convolutional Neural Networks

2017

DOI: 10.20944/preprints201711.0027.v1

Author(s): Sangeun Kum, Juhan Nam

Keyword(s): Neural Networks, Convolutional Neural Networks, Pitch Contour, Singing Voice, Deep Convolutional Neural Networks, Voice Activity Detector, Proposed Model, Public Datasets, Melody Contour, Extraction Model

Singing melody extraction is the task of identifying the melody pitch contour of the singing voice in polyphonic music. Most traditional melody extraction algorithms calculate salient pitch candidates or separate the melody source from the mixture. Recently, classification-based approaches using deep learning have drawn much attention. In this paper, we present a classification-based singing melody extraction model using deep convolutional neural networks. The proposed model consists of a singing pitch extractor (SPE) and a singing voice activity detector (SVAD). The SPE is trained to predict a high-resolution pitch label of the singing voice from a short segment of spectrogram, which allows the model to predict highly continuous curves. The melody contour is smoothed further by post-processing the output of the melody extractor. The SVAD is trained to determine whether a long segment of mel-spectrogram contains a singing voice. This often produces voice false alarm errors around the boundaries of singing segments, which we reduced by exploiting the output of the SPE. Finally, we evaluate the proposed melody extraction model on several public datasets. The results show that the proposed model is comparable to state-of-the-art algorithms.
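To make the idea of a high-resolution pitch label concrete, the sketch below quantizes a frequency into classification bins finer than a semitone using the standard MIDI conversion (69 + 12 log2(f / 440)). The resolution of 8 bins per semitone, the lowest note `min_midi=36`, and both function names are assumptions for illustration; the abstract does not state the label layout the authors used.

```python
import math

def hz_to_pitch_label(freq_hz, bins_per_semitone=8, min_midi=36):
    """Map a frequency in Hz to an integer class label (hypothetical layout)."""
    midi = 69.0 + 12.0 * math.log2(freq_hz / 440.0)  # standard Hz-to-MIDI formula
    return round((midi - min_midi) * bins_per_semitone)

def pitch_label_to_hz(label, bins_per_semitone=8, min_midi=36):
    """Inverse mapping: centre frequency of a class label."""
    midi = min_midi + label / bins_per_semitone
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)

label_a4 = hz_to_pitch_label(440.0)   # A4
label_a5 = hz_to_pitch_label(880.0)   # one octave higher
```

An octave spans 12 semitones, so with 8 bins per semitone the two labels differ by 96 classes. A finer grid like this is what lets a classifier output a nearly continuous pitch curve.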
