Singing Voice Detection: A Survey

Entropy ◽  
2022 ◽  
Vol 24 (1) ◽  
pp. 114
Author(s):  
Ramy Monir ◽  
Daniel Kostrzewa ◽  
Dariusz Mrozek

Singing voice detection, or vocal detection, is a classification task that determines whether there is a singing voice in a given audio segment. This process is a crucial preprocessing step that can improve the performance of other tasks such as automatic lyrics alignment, singing melody transcription, singing voice separation, vocal melody extraction, and many more. This paper presents a survey of singing voice detection techniques, with a particular focus on state-of-the-art algorithms such as convolutional LSTM and GRU-RNN. It compares existing methods for singing voice detection, mainly on the Jamendo and RWC datasets. Long-term recurrent convolutional networks have reached impressive results on public datasets. The main goal of the present paper is to investigate both classical and state-of-the-art approaches to singing voice detection.
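As a concrete illustration of the classical approaches the survey covers, the following is a minimal sketch of a feature-engineering pipeline: segment-level MFCC statistics fed to an SVM. The libraries (librosa, scikit-learn), the one-second segment length, and the feature summary are illustrative assumptions, not a method prescribed by the survey.

```python
# Minimal sketch of a classical singing voice detector:
# MFCC features per segment + an SVM classifier.
# The segment length and feature summary are illustrative assumptions.
import numpy as np
import librosa
from sklearn.svm import SVC

def segment_features(path, seg_len=1.0):
    """Split audio into fixed-length segments and summarize MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    hop = int(seg_len * sr)
    feats = []
    for start in range(0, len(y) - hop, hop):
        seg = y[start:start + hop]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20)
        # Mean and std over time give one fixed-size vector per segment.
        feats.append(np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]))
    return np.array(feats)

# X_train: (n_segments, 40) features; y_train: 1 = vocal, 0 = non-vocal
# clf = SVC(kernel="rbf").fit(X_train, y_train)
# pred = clf.predict(segment_features("song.wav"))
```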

Electronics ◽  
2020 ◽  
Vol 9 (9) ◽  
pp. 1458
Author(s):  
Xulong Zhang ◽  
Yi Yu ◽  
Yongwei Gao ◽  
Xi Chen ◽  
Wei Li

Singing voice detection, or vocal detection, is a classification task that determines whether a given audio segment contains singing voices. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and non-singing parts, it is still very difficult for machines to do so. Most existing methods focus on audio feature engineering with classifiers, which relies on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer hearing. To extract essential features that reflect the audio content and characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) to realize vocal detection. The convolutional layers in the LRCN handle feature extraction, while the long short-term memory (LSTM) layer learns the time-sequence relationship. Preprocessing (separating the singing voice from the accompaniment) and postprocessing (time-domain smoothing) were combined with the LRCN to form a complete system. Experiments on five public datasets investigated the impact of different feature fusions, frame sizes, and block sizes on LRCN temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reaches the state-of-the-art level on public datasets.
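The architecture described, convolutional feature extraction followed by LSTM temporal modelling, might be sketched as follows; all layer sizes and the log-mel input shape are assumptions, not the paper's exact configuration.

```python
# Sketch of an LRCN for frame-level vocal detection: a convolutional
# front end extracts features, an LSTM models the temporal context.
# Layer sizes and the (batch, time, mels) input shape are assumptions.
import torch
import torch.nn as nn

class LRCN(nn.Module):
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # per-frame vocal probability

    def forward(self, x):            # x: (batch, time, n_mels) log-mel
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv over time
        z, _ = self.lstm(z)
        return torch.sigmoid(self.head(z)).squeeze(-1)    # (batch, time)

# probs = LRCN()(torch.randn(4, 300, 80))  # 4 clips, 300 frames each
```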


2021 ◽  
Vol 11 (24) ◽  
pp. 11838
Author(s):  
Wenming Gui ◽  
Yukun Li ◽  
Xian Zang ◽  
Jinglan Zhang

Singing voice detection is still a challenging task because the voice can be obscured by instruments occupying the same frequency band, or even sharing the same timbre when they mimic the mechanism of human singing. Because of the poor adaptability and complexity of feature engineering, there is a recent trend towards feature learning, in which deep neural networks take over both feature extraction and classification. In this paper, we present two methods that exploit channel properties in a convolutional neural network to improve the performance of singing voice detection by feature learning. First, channel attention learning is presented to measure the importance of each feature map, using two attention mechanisms: the scaled dot-product and squeeze-and-excitation. This method learns the importance of each feature map so that the network can place more attention on the more informative ones. Second, multi-scale representations are fed to the input channels, aiming to add more information in terms of scale. Different songs generally need different spectrogram scales to be represented well, and multi-scale representations let the network choose the best one for the task. In the experimental stage, we demonstrated the effectiveness of the two methods on three public datasets, with accuracy improving by up to 2.13% over an already high baseline.
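Of the two attention mechanisms, squeeze-and-excitation is the easier to sketch; a minimal version follows, with the reduction ratio as an illustrative assumption.

```python
# Sketch of squeeze-and-excitation channel attention, one of the two
# mechanisms the paper exploits. The reduction ratio r=8 is an
# illustrative assumption.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, H, W)
        # Squeeze: global average pool each feature map to one scalar.
        w = x.mean(dim=(2, 3))            # (batch, channels)
        # Excitation: learn per-channel importance weights in [0, 1].
        w = self.fc(w)
        # Rescale each feature map by its learned importance.
        return x * w.unsqueeze(-1).unsqueeze(-1)
```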


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 298
Author(s):  
Yongwei Gao ◽  
Xulong Zhang ◽  
Wei Li

Vocal melody extraction is an important and challenging task in music information retrieval. One main difficulty is that, most of the time, various instruments and singing voices are mixed according to the harmonic structure, making it hard to identify the fundamental frequency (F0) of the singing voice. Therefore, reducing the interference of the accompaniment benefits pitch estimation of the singing voice. In this paper, we first adopted a high-resolution network (HRNet) to separate vocals from polyphonic music, then designed an encoder-decoder network to estimate the vocal F0 values. Experimental results demonstrate the effectiveness of the HRNet-based singing voice separation method in reducing the interference of the accompaniment with vocal melody extraction, and show that the proposed vocal melody extraction (VME) system outperforms other state-of-the-art algorithms in most cases.
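The abstract does not give the encoder-decoder details, but vocal F0 estimation is commonly framed as frame-wise classification over quantized pitch bins; the sketch below shows that framing, with the bin range and resolution as assumptions.

```python
# Sketch of the common "F0 as classification" framing for vocal melody
# extraction: each frame is classified into one of N quantized pitch
# bins (bin 0 reserved for non-voiced), then bins are decoded back to
# Hz. The range and resolution are assumptions, not the paper's design.
import numpy as np

F_MIN, BINS_PER_OCT, N_BINS = 55.0, 60, 360   # ~55 Hz up to ~3.5 kHz

def bin_to_hz(bins):
    """Map bin indices to frequencies; bin 0 means non-voiced."""
    hz = F_MIN * 2.0 ** ((bins - 1) / BINS_PER_OCT)
    return np.where(bins == 0, 0.0, hz)

def hz_to_bin(f0):
    """Quantize ground-truth F0 (0 = unvoiced) into training targets."""
    bins = np.rint(BINS_PER_OCT * np.log2(np.maximum(f0, F_MIN) / F_MIN)) + 1
    return np.where(f0 <= 0, 0, bins.clip(1, N_BINS)).astype(int)
```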


2020 ◽  
Author(s):  
Amir Hosein Safari ◽  
Nafiseh Sedaghat ◽  
Alpha Forna ◽  
Hooman Zabeti ◽  
Leonid Chindelevitch ◽  
...  

Abstract. Drug resistance in Mycobacterium tuberculosis (MTB) may soon be a leading worldwide cause of death. One way to mitigate the risk of drug resistance is through methods that predict drug resistance in MTB using whole-genome sequencing (WGS) data. Existing machine learning methods for this task featurize the WGS data from a given bacterial isolate by defining one input feature per SNP. Here, we introduce a gene-centric method for predicting drug resistance in TB. We define one feature per gene according to the number of mutations in that gene in a given isolate. This representation greatly decreases the number of model parameters. We further propose a model that considers gene order through a Long-term Recurrent Convolutional Network (LRCN) architecture, which combines convolutional and recurrent layers. We find that using these strategies yields a substantial, statistically significant improvement over the state of the art, and that this improvement is driven by the order of genes in the genome and their organization into operons.
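The gene-centric featurization can be sketched directly from the description: one mutation count per gene, with genes kept in genomic order for the downstream LRCN. The input structures below are assumed formats, not the paper's actual data schema.

```python
# Sketch of the gene-centric featurization described: instead of one
# binary feature per SNP, count the mutations falling in each gene, so
# each isolate becomes a fixed-length vector ordered by genomic
# position. The input structures are assumed formats.
import numpy as np

def gene_features(isolate_snps, snp_to_gene, gene_order):
    """isolate_snps: list of SNP-id sets, one per isolate.
    snp_to_gene: dict mapping SNP id -> gene name.
    gene_order: genes sorted by genomic position (this ordering is what
    lets the convolutional/recurrent layers see operon structure)."""
    idx = {g: i for i, g in enumerate(gene_order)}
    X = np.zeros((len(isolate_snps), len(gene_order)), dtype=np.int32)
    for row, snps in enumerate(isolate_snps):
        for snp in snps:
            gene = snp_to_gene.get(snp)
            if gene is not None:
                X[row, idx[gene]] += 1
    return X  # (n_isolates, n_genes) mutation-count matrix
```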


2018 ◽  
Vol 14 (12) ◽  
pp. 1915-1960 ◽  
Author(s):  
Rudolf Brázdil ◽  
Andrea Kiss ◽  
Jürg Luterbacher ◽  
David J. Nash ◽  
Ladislava Řezníčková

Abstract. The use of documentary evidence to investigate past climatic trends and events has become a recognised approach in recent decades. This contribution presents the state of the art in its application to droughts. The range of documentary evidence is very wide, including general annals, chronicles, memoirs and diaries kept by missionaries, travellers and those specifically interested in the weather; records kept by administrators tasked with keeping accounts and other financial and economic records; legal-administrative evidence; religious sources; letters; songs; newspapers and journals; pictographic evidence; chronograms; epigraphic evidence; early instrumental observations; society commentaries; and compilations and books. These are available from many parts of the world. This variety of documentary information is evaluated with respect to the reconstruction of hydroclimatic conditions (precipitation, drought frequency and drought indices). Documentary-based drought reconstructions are then addressed in terms of long-term spatio-temporal fluctuations, major drought events, relationships with external forcing and large-scale climate drivers, socio-economic impacts and human responses. Documentary-based drought series are also considered from the viewpoint of spatio-temporal variability for certain continents, and their employment together with hydroclimate reconstructions from other proxies (in particular tree rings) is discussed. Finally, conclusions are drawn, and challenges for the future use of documentary evidence in the study of droughts are presented.


2021 ◽  
Vol 11 (15) ◽  
pp. 6975
Author(s):  
Tao Zhang ◽  
Lun He ◽  
Xudong Li ◽  
Guoqing Feng

Lipreading aims to recognize sentences being spoken by a talking face. In recent years, lipreading methods have achieved high accuracy on large datasets and made breakthrough progress. However, lipreading is still far from solved: existing methods tend to have high error rates on in-the-wild data and suffer from vanishing training gradients and slow convergence. To overcome these problems, we propose an efficient end-to-end sentence-level lipreading model, using an encoder based on a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder. More importantly, the proposed architecture incorporates the TCN as a feature learner to decode features. It can partly eliminate the vanishing-gradient and performance limitations of RNNs (LSTM, GRU), yielding notable performance improvement as well as faster convergence. Experiments show that training and convergence are 50% faster than the state-of-the-art method, and accuracy improves by 2.4% on the GRID dataset.
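A dilated temporal convolution block of the kind a TCN stacks might look like the sketch below; channel counts, kernel size, and dilation are illustrative assumptions.

```python
# Sketch of a dilated temporal convolution (TCN) residual block of the
# kind used here in place of an RNN decoder. Channel counts, kernel
# size, and dilation are illustrative assumptions.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) // 2 * dilation   # keep the time length fixed
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad, dilation=dilation),
        )

    def forward(self, x):          # x: (batch, channels, time)
        # The residual connection stabilizes gradients, which is the
        # point of replacing the RNN: no recurrence, no vanishing
        # gradient.
        return torch.relu(x + self.net(x))

# Stacking blocks with dilation 1, 2, 4, ... grows the receptive field
# exponentially; a CTC loss (nn.CTCLoss) then decodes the sequence.
```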


Biomedicines ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 844
Author(s):  
Armando Tripodi

Lupus anticoagulant (LA) is one of the three laboratory parameters (the others being antibodies to either cardiolipin or β2-glycoprotein I) that define the rare but potentially devastating condition known as antiphospholipid syndrome (APS). Testing for LA is a challenging task for the clinical laboratory because specific tests for its detection are not available. However, proper LA detection is paramount for patient management, as persistent positivity in the presence of (previous or current) thrombotic events makes a patient a candidate for long-term anticoagulation. Guidelines for LA detection have been established and updated over the last two decades. Implementation of these guidelines across laboratories and participation in external quality assessment schemes are required to help standardize the diagnostic procedures and support clinicians in the appropriate management of APS. This article aims to review the current state of the art and the challenges that clinical laboratories face in the detection of LA.


2021 ◽  
Vol 10 (8) ◽  
pp. 523
Author(s):  
Nicholus Mboga ◽  
Stefano D’Aronco ◽  
Tais Grippa ◽  
Charlotte Pelletier ◽  
Stefanos Georganos ◽  
...  

Multitemporal environmental and urban studies are essential to guide policy making to ultimately improve human wellbeing in the Global South. Land-cover products derived from historical aerial orthomosaics acquired decades ago can provide important evidence to inform long-term studies. To reduce the manual labelling effort by human experts and to scale to large, meaningful regions, we investigate in this study how domain adaptation techniques and deep learning can help to efficiently map land cover in Central Africa. We propose and evaluate a methodology based on unsupervised adaptation to reduce the cost of generating reference data for several cities and across different dates. We present the first application of domain adaptation based on fully convolutional networks for semantic segmentation of a dataset of historical panchromatic orthomosaics, generating land cover for two focus cities, Goma-Gisenyi and Bukavu. Our experimental evaluation shows that the domain adaptation methods can reach an overall accuracy between 60% and 70% for different regions. Adding a small amount of labelled data from the target domain achieves further performance gains.
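The abstract does not name the adaptation technique, so purely as a generic illustration, the following sketches a gradient reversal layer (as in DANN), a common building block for unsupervised domain adaptation of segmentation networks.

```python
# Generic illustration only: a gradient reversal layer (Ganin &
# Lempitsky's DANN), a common component for unsupervised domain
# adaptation; the paper's actual method is not specified here.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        # Identity on the forward pass, negated (scaled) gradient on
        # the backward pass: the feature extractor learns to fool a
        # domain classifier, aligning source and target features.
        return -ctx.lamb * grad, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)
```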


Cybersecurity ◽  
2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Shushan Arakelyan ◽  
Sima Arasteh ◽  
Christophe Hauser ◽  
Erik Kline ◽  
Aram Galstyan

Abstract. Tackling binary program analysis problems has traditionally implied manually defining rules and heuristics, a tedious and time-consuming task for human analysts. In order to improve automation and scalability, we propose an alternative direction based on distributed representations of binary programs with applicability to a number of downstream tasks. We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs in order to learn a high-dimensional representation of binary executable programs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks: functional algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as published results, and demonstrate improvement over state-of-the-art methods for both tasks. We evaluated Bin2vec on 49,191 binaries for the functional algorithm classification task, and on 30 different CWE-IDs, each with at least 100 CVE entries, for the vulnerability discovery task. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach, while working on binary code. For almost every vulnerability class in our dataset, our prediction accuracy is over 80% (and over 90% in multiple classes).
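A single graph convolution of the Kipf-Welling kind that Bin2vec builds on might be sketched as follows; the dense adjacency and symmetric normalization are illustrative simplifications.

```python
# Sketch of one graph convolution (Kipf & Welling) of the kind applied
# to computational program graphs. The dense adjacency matrix is an
# illustrative simplification.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # adj: (n_nodes, n_nodes) adjacency; add self-loops, then apply
        # symmetric normalization D^-1/2 (A + I) D^-1/2.
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).rsqrt()
        a_norm = d.unsqueeze(1) * a * d.unsqueeze(0)
        # Aggregate neighbor features, then transform.
        return torch.relu(self.lin(a_norm @ x))

# Node embeddings after two layers, pooled to one program vector:
# h = GCNLayer(64, 64)(GCNLayer(32, 64)(x, adj), adj).mean(dim=0)
```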


2021 ◽  
Vol 10 (3) ◽  
pp. 154
Author(s):  
Robert Jeansoulin

Providing long-term data about the evolution of railway networks in Europe may help us understand how European Union (EU) member states behave in the long term, and how they can comply with present EU recommendations. This paper proposes a methodology for collecting data about railway stations at the maximal extent of the French railway network, a century ago. The expected outcome is a geocoded dataset of French railway stations (gares), which: (a) links gares to each other, and (b) links gares with French communes, the basic administrative level for statistical information. Present stations are well documented in public data, but thousands of past stations are sparsely recorded, not geocoded, and often ignored, except in volunteer geographic information (VGI), either collaboratively through Wikipedia or individually. VGI is very valuable in keeping track of that heritage, and remote sensing, including aerial photography, is often the last chance to obtain precise locations. The approach is a series of steps: (1) meta-analysis of the public datasets; (2) three-step fusion (measure, decision, combination) between public datasets; (3) computer-assisted geocoding for gares where fusion fails; (4) integration of additional gares gathered from VGI; (5) automated quality control, indicating where quality is questionable. These five families of methods form a comprehensive computer-assisted reconstruction process (CARP), which constitutes the core of this paper. The outcome is a reliable dataset, in GeoJSON format under an open license, encompassing (by January 2021) more than 10,700 items linked to about 7500 of the 35,500 communes of France: that is 60% more than recorded before. This work demonstrates that: (a) it is possible to reconstruct transport data from the past at a national scale; (b) the value of remote sensing and of VGI is considerable in completing public sources from a historical perspective; (c) data quality can be monitored all along the process; and (d) the geocoded outcome is ready for a large variety of further studies with statistical data (demography, density, space coverage, CO2 simulation, environmental policies, etc.).
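Step (5), automated quality control, might include a distance sanity check of the following kind; the 10 km threshold and the record fields are hypothetical, not the paper's actual rules.

```python
# Hypothetical sketch of one automated quality-control check: flag a
# geocoded station whose coordinates fall implausibly far from the
# centroid of its linked commune. Threshold and fields are assumptions.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def flag_questionable(stations, communes, max_km=10.0):
    """stations: dicts with 'lat', 'lon', 'commune';
    communes: name -> (lat, lon) centroid."""
    for s in stations:
        c = communes.get(s["commune"])
        if c is None or haversine_km(s["lat"], s["lon"], *c) > max_km:
            yield s  # questionable: missing link or far from commune
```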

