Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing

Electronics, 2020, Vol. 9(9), pp. 1458
Author(s): Xulong Zhang, Yi Yu, Yongwei Gao, Xi Chen, Wei Li

Singing voice detection, or vocal detection, is a classification task that determines whether a given audio segment contains singing voices. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and non-singing parts, it is still very difficult for machines to do so. Most existing methods focus on audio feature engineering with classifiers, which relies on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer audition. To extract essential features that reflect the audio content and characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) to realize vocal detection. The convolutional layers in the LRCN perform feature extraction, and the long short-term memory (LSTM) layers learn the temporal relationships. Singing voice and accompaniment separation as preprocessing and time-domain smoothing as postprocessing were combined with the LRCN to form a complete system. Experiments on five public datasets investigated the impacts of different feature fusions, frame sizes, and block sizes on the LRCN's temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reaches the state-of-the-art level on public datasets.
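
To make the architecture concrete, a minimal sketch of an LRCN-style detector in PyTorch is given below: per-block CNN feature extraction followed by an LSTM over time and a block-wise sigmoid. All layer sizes, the log-mel input, and the module names are illustrative assumptions, not the configuration from the paper.

```python
import torch
import torch.nn as nn

class LRCNVocalDetector(nn.Module):
    def __init__(self, n_mels=80, cnn_channels=32, lstm_hidden=64):
        super().__init__()
        # 2D convolutions extract local spectro-temporal features from each block
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(cnn_channels, cnn_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # the LSTM learns the temporal relationship across consecutive blocks
        self.lstm = nn.LSTM(cnn_channels * 16, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, 1)

    def forward(self, x):
        # x: (batch, blocks, n_mels, frames_per_block) log-mel spectrogram blocks
        b, t = x.shape[:2]
        feats = self.cnn(x.reshape(b * t, 1, *x.shape[2:]))      # per-block CNN features
        out, _ = self.lstm(feats.reshape(b, t, -1))              # temporal context
        return torch.sigmoid(self.classifier(out)).squeeze(-1)   # vocal probability per block
```

A median filter over the predicted block-wise probabilities would then be one simple way to realize the time-domain smoothing used as postprocessing.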

Entropy, 2022, Vol. 24(1), pp. 114
Author(s): Ramy Monir, Daniel Kostrzewa, Dariusz Mrozek

Singing voice detection, or vocal detection, is a classification task that determines whether there is a singing voice in a given audio segment. This process is a crucial preprocessing step that can be used to improve the performance of other tasks, such as automatic lyrics alignment, singing melody transcription, singing voice separation, vocal melody extraction, and many more. This paper presents a survey of singing voice detection techniques, with a deep focus on state-of-the-art algorithms such as the convolutional LSTM and the GRU-RNN. It presents a comparison between existing methods for singing voice detection, based mainly on the Jamendo and RWC datasets. Long-term recurrent convolutional networks have achieved impressive results on public datasets. The main goal of the present paper is to investigate both classical and state-of-the-art approaches to singing voice detection.


2019, Vol. 11(2), pp. 42
Author(s): Sheeraz Arif, Jing Wang, Tehseen Ul Hassan, Zesong Fei

Human activity recognition is an active field of research in computer vision with numerous applications. Recently, deep convolutional networks and recurrent neural networks (RNNs) have received increasing attention in multimedia studies and have yielded state-of-the-art results. In this work, we propose a new framework that intelligently combines 3D-CNN and LSTM networks. First, we integrate the discriminative information from a video into a map called a 'motion map' using a deep 3-dimensional convolutional network (C3D). A motion map and the next video frame can be integrated into a new motion map, and this technique can be trained by iteratively increasing the training video length; the final network can then generate the motion map of the whole video. Next, a linear weighted fusion scheme fuses the network feature maps into spatio-temporal features. Finally, a long short-term memory (LSTM) encoder-decoder produces the final predictions. The method is simple to implement and retains discriminative and dynamic information. The improved results on public benchmark datasets demonstrate the effectiveness and practicability of the proposed method.
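
As a rough illustration of the fusion idea only, the sketch below combines clip-level 3D-CNN features with per-frame features through a single learned weight and feeds the result to an LSTM. The backbone, the fusion weight `alpha`, and all sizes are hypothetical stand-ins; the paper's iterative motion-map training and its LSTM encoder-decoder are not reproduced here.

```python
import torch
import torch.nn as nn

class FusedActivityNet(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, n_classes=51):
        super().__init__()
        # stand-in for a C3D backbone producing one feature vector per time step
        self.c3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((8, 1, 1)),
        )
        self.motion_proj = nn.Linear(32, feat_dim)
        self.frame_proj = nn.Linear(3 * 112 * 112, feat_dim)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learned linear fusion weight
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clip):
        # clip: (batch, 3, T=8, 112, 112) RGB video clip
        b, _, t, _, _ = clip.shape
        motion = self.c3d(clip).flatten(2).transpose(1, 2)    # (b, T, 32) motion features
        frames = clip.transpose(1, 2).reshape(b, t, -1)       # (b, T, 3*112*112) raw frames
        fused = (self.alpha * self.motion_proj(motion)
                 + (1 - self.alpha) * self.frame_proj(frames))  # linear weighted fusion
        out, _ = self.lstm(fused)                             # temporal modeling of fused features
        return self.head(out[:, -1])                          # class logits from the last step
```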


2021, Vol. 11(24), pp. 11838
Author(s): Wenming Gui, Yukun Li, Xian Zang, Jinglan Zhang

Singing voice detection remains a challenging task because the voice can be obscured by instruments occupying the same frequency band, and even sharing the same timbre when they mimic the mechanism of human singing. Because of the poor adaptability and complexity of feature engineering, there is a recent trend towards feature learning, in which deep neural networks take on the roles of feature extraction and classification. In this paper, we present two methods that exploit channel properties in a convolutional neural network to improve singing voice detection through feature learning. First, channel attention learning is presented to measure the importance of each feature map, using two attention mechanisms: the scaled dot-product and squeeze-and-excitation. This method learns the importance of each feature map so that the network can place more attention on the more informative ones. Second, multi-scale representations are fed to the input channels, adding information in terms of scale. Different songs generally need different spectrogram scales to be represented well, and multi-scale representations let the network choose the best one for the task. In the experimental stage, we demonstrated the effectiveness of the two methods on three public datasets, with the accuracy increasing by up to 2.13 percent compared to its already high initial level.
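
Of the two attention mechanisms mentioned, squeeze-and-excitation has a compact canonical form. The sketch below shows a minimal SE block applied to CNN feature maps; the reduction ratio and the spectrogram-shaped tensor layout are chosen for illustration only.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, freq, time) feature maps from a CNN layer
        w = x.mean(dim=(2, 3))            # squeeze: global average pool per channel
        w = self.fc(w)                    # excitation: per-channel importance in [0, 1]
        return x * w[:, :, None, None]    # rescale feature maps channel-wise
```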


2019, Vol. 11(7), pp. 783
Author(s): Chunyong Ma, Siqing Li, Anni Wang, Jie Yang, Ge Chen

Eddies can be identified and tracked from satellite altimeter data. However, few studies have focused on nowcasting the evolution of eddies using remote sensing data. In this paper, an improved convolutional long short-term memory (Conv-LSTM) network named PredNet is used for eddy nowcasting. PredNet, a deep recurrent convolutional network with both bottom-up and top-down connections, can learn the temporal and spatial relationships in time series data and can effectively simulate and reconstruct the spatiotemporal characteristics of future sea level anomaly (SLA) data. Based on the SLA data products provided by Archiving, Validation, and Interpretation of Satellite Oceanographic data (AVISO) from 1993 to 2018, combined with an SLA-based eddy detection algorithm, seven-day eddy nowcasting experiments were conducted on eddies in the South China Sea. The matching ratio is defined as the percentage of true eddies that are successfully predicted by the Conv-LSTM network. On the first day of the nowcast, the matching ratio for eddies with diameters greater than 100 km is 95%, and the average matching ratio over the seven-day nowcast is approximately 60%. Two further experiments were set up to verify the performance of the nowcasting method. A typical anticyclonic eddy shed from the Kuroshio in January 2017 was used to verify the algorithm's performance on a single eddy, with a mean eddy center error of 11.2 km. Moreover, compared with the eddies detected in the Hybrid Coordinate Ocean Model (HYCOM) data set, the eddies predicted by the Conv-LSTM network are closer to those detected in the AVISO SLA data set, indicating that the deep learning method can effectively nowcast eddies.
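
The Conv-LSTM underlying such a nowcasting network replaces the matrix multiplications of a standard LSTM with convolutions, so the hidden state keeps the spatial layout of the SLA fields. Below is a bare-bones cell for illustration; the kernel size and channel counts are placeholders, not the PredNet configuration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.hid_ch = hid_ch
        # one convolution produces all four LSTM gates over the spatial grid
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden and cell maps: (B, hid_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                  # update the spatial cell state
        h = o * torch.tanh(c)                          # new hidden map for the next time step
        return h, (h, c)
```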


Author(s): Yucheng Zhao, Chong Luo, Zheng-Jun Zha, Wenjun Zeng

In this paper, we introduce the Transformer to time-domain methods for single-channel speech separation. The Transformer has the potential to boost speech separation performance because of its strong sequence modeling capability. However, its computational complexity, which grows quadratically with the sequence length, has made it largely inapplicable to speech applications. To tackle this issue, we propose a novel variation of the Transformer, named the multi-scale group Transformer (MSGT). The key ideas are group self-attention, which significantly reduces the complexity, and multi-scale fusion, which retains the Transformer's ability to capture long-term dependencies. We implement two versions of MSGT with different complexities and apply them to a well-known time-domain speech separation method called Conv-TasNet. By simply replacing the original temporal convolutional network (TCN) with MSGT, our approach, called MSGT-TasNet, achieves a large gain over Conv-TasNet on both the WSJ0-2mix and WHAM! benchmarks. Without bells and whistles, the performance of MSGT-TasNet is already on par with SOTA methods.
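
A plausible reading of group self-attention is to split the long feature sequence into fixed-size groups and attend only within each group, so the cost scales with the group size squared rather than the sequence length squared. The sketch below illustrates that idea; the group size, dimensions, and use of `nn.MultiheadAttention` are assumptions, not the MSGT implementation.

```python
import torch
import torch.nn as nn

class GroupSelfAttention(nn.Module):
    def __init__(self, dim=128, heads=4, group_size=100):
        super().__init__()
        self.group_size = group_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, dim); seq_len assumed divisible by group_size here
        b, n, d = x.shape
        g = self.group_size
        x = x.reshape(b * n // g, g, d)     # fold groups into the batch dimension
        out, _ = self.attn(x, x, x)         # self-attention only within each group
        return out.reshape(b, n, d)
```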


2020, Vol. 18(S3), pp. 34-45
Author(s): Zhingtang Zhao, Qingtao Wu

In intelligent computer-aided video abnormal behavior recognition, pedestrian behavior analysis can detect and handle abnormal behaviors in time, which has great practical value for ensuring public safety. We analyze a deep learning video behavior recognition network that has advantages over current research. The network first sparsely samples the input video to obtain a frame from each video segment, then uses a two-dimensional convolutional network to extract the characteristics of each frame, and finally uses a three-dimensional network to fuse them, thereby recognizing both long-term and short-term actions in the video. To overcome the large computational cost of the 3D convolution part of the network, this paper proposes an improved, mobile 3D convolutional network structure for this module. To address the low utilization of long-term motion features in video sequences, this paper constructs a deep residual module by introducing long short-term memory networks and residual connections, so as to fully and effectively exploit the long-term dynamics in video sequences. To address the large intra-class differences and small inter-class differences among similar actions in abnormal behavior videos, this paper proposes a 2CSoftmax function based on a double center loss to optimize the network model, which helps maximize the inter-class distance and minimize the intra-class distance, so as to classify and recognize similar actions and improve recognition accuracy.
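
The 2CSoftmax function is specific to this paper, but it belongs to the family of losses that pair softmax cross-entropy with a center loss. The sketch below shows only that general combination; the double-center formulation itself is not reproduced, and the loss weighting is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

class SoftmaxWithCenterLoss(nn.Module):
    def __init__(self, feat_dim, n_classes, center_weight=0.01):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))  # learnable class centers
        self.ce = nn.CrossEntropyLoss()
        self.center_weight = center_weight

    def forward(self, features, logits, labels):
        # pull each feature toward its class center (small intra-class distance)
        center_loss = (features - self.centers[labels]).pow(2).sum(dim=1).mean()
        # cross-entropy keeps class logits separated (large inter-class distance)
        return self.ce(logits, labels) + self.center_weight * center_loss
```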


2021, Vol. 11(5), pp. 2014
Author(s): Kyeorye Lee, Leonard Sunwoo, Tackeun Kim, Kyong Joon Lee

Blood vessel segmentation (BVS) of 3D medical imaging, such as computed tomography and magnetic resonance angiography (MRA), is an essential task in the clinical field. Automation of 3D BVS using deep supervised learning is being actively researched, and many U-Net-based approaches, considered the standard for medical image segmentation, have been proposed. However, the inherent characteristics of blood vessels, e.g., that they are complex and narrow, as well as the resolution and sensitivity of the imaging modalities, increase the difficulty of 3D BVS. We propose a novel U-Net-based model named Spider U-Net for 3D BVS that considers the connectivity of the blood vessels between axial slices. To achieve this, long short-term memory (LSTM), which can capture the context of consecutive data, is inserted into the baseline model. We also propose a data feeding strategy that augments the data and makes Spider U-Net stable. Spider U-Net outperformed 2D U-Net, 3D U-Net, and the fully convolutional network-recurrent neural network (FCN-RNN) in Dice coefficient score (DSC) by 0.048, 0.077, and 0.041, respectively, on our in-house brain MRA dataset, and also achieved the highest DSC on two public datasets. The results imply that considering inter-slice connectivity with an LSTM improves model performance on the 3D BVS task.
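
One way to picture the inter-slice connectivity is an LSTM running along the slice axis over the encoder's bottleneck features, as in the illustrative sketch below; the channel count and tensor layout are placeholders rather than the actual Spider U-Net design.

```python
import torch
import torch.nn as nn

class InterSliceBottleneck(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # LSTM carries context across consecutive axial slices at the bottleneck
        self.lstm = nn.LSTM(channels, channels, batch_first=True)

    def forward(self, bottleneck):
        # bottleneck: (batch, n_slices, channels, h, w) encoder features per slice
        b, s, c, h, w = bottleneck.shape
        seq = bottleneck.permute(0, 3, 4, 1, 2).reshape(b * h * w, s, c)
        seq, _ = self.lstm(seq)                 # propagate context along the slice axis
        return seq.reshape(b, h, w, s, c).permute(0, 3, 4, 1, 2)
```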


2016, Vol. 39
Author(s): Mary C. Potter

Rapid serial visual presentation (RSVP) of words or pictured scenes provides evidence for a large-capacity conceptual short-term memory (CSTM) that momentarily provides rich associated material from long-term memory, permitting rapid chunking (Potter 1993; 2009; 2012). In perception of scenes, as well as in language comprehension, we make use of knowledge that briefly exceeds the supposed limits of working memory.


2020, Vol. 29(4), pp. 710-727
Author(s): Beula M. Magimairaj, Naveen K. Nagaraj, Alexander V. Sergeev, Natalie J. Benafield

Objectives: School-age children with and without parent-reported listening difficulties (LiD) were compared on auditory processing, language, memory, and attention abilities. The objective was to extend what is known so far in the literature about children with LiD by using multiple measures and selective novel measures across the above areas. Design: Twenty-six children who were reported by their parents as having LiD and 26 age-matched typically developing children completed clinical tests of auditory processing and multiple measures of language, attention, and memory. All children had normal-range pure-tone hearing thresholds bilaterally. Group differences were examined. Results: In addition to significantly poorer speech-perception-in-noise scores, children with LiD had reduced speed and accuracy of word retrieval from long-term memory, as well as poorer short-term memory, sentence recall, and inferencing ability. Statistically significant group differences were of moderate effect size; however, standard test scores of children with LiD were not clinically poor. No statistically significant group differences were observed in attention, working memory capacity, vocabulary, or nonverbal IQ. Conclusions: Mild signal-to-noise ratio loss, as reflected by the group mean of children with LiD, supported the children's functional listening problems. In addition, the children's relative weakness in select areas of language performance, short-term memory, and long-term memory lexical retrieval speed and accuracy adds to previous research on evidence-based areas that need to be evaluated in children with LiD, who almost always have heterogeneous profiles. Importantly, the functional difficulties faced by children with LiD in relation to their test results indicated, to some extent, that commonly used assessments may not be adequately capturing the children's listening challenges. Supplemental Material: https://doi.org/10.23641/asha.12808607

