A Comparison of Deep Learning Methods for Timbre Analysis in Polyphonic Automatic Music Transcription

Carlos Hernandez-Olivan; Ignacio Zay Pinilla; Carlos Hernandez-Lopez; Jose R. Beltran

doi:10.3390/electronics10070810

Electronics ◽

10.3390/electronics10070810 ◽

2021 ◽

Vol 10 (7) ◽

pp. 810

Author(s):

Carlos Hernandez-Olivan ◽

Ignacio Zay Pinilla ◽

Carlos Hernandez-Lopez ◽

Jose R. Beltran

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Neural Networks ◽

State Of The Art ◽

High Impact ◽

Critical Problem ◽

Music Transcription ◽

Automatic Music Transcription ◽

Music Information ◽

Method Show

Automatic music transcription (AMT) is a critical problem in the field of music information retrieval (MIR). When AMT is approached with deep neural networks, the variety of timbres across instruments is an issue that has not yet been studied in depth. The goal of this work is to address AMT in two steps. A first approach, based on the CREPE neural network, analyzes how timbre affects monophonic transcription. A second approach, based on the Deep Salience model, which operates on the Constant-Q Transform, aims to improve on these results by performing polyphonic transcription of different timbres. The results of the first method show that the timbre and the envelope of the onsets have a high impact on AMT results. The results of the second method show that the developed model is less dependent on the strength of the onsets than other state-of-the-art models that address AMT on piano sounds, such as Google Magenta's Onsets and Frames (OaF). For non-piano instruments, our polyphonic transcription model outperforms this state-of-the-art model; for bass instruments, for example, it reaches an F-score of 0.9516 versus 0.7102. In our latest experiment we also show that adding an onset detector to our model can improve on the results given in this work.
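As a rough illustration of the Constant-Q Transform mentioned in the abstract, the sketch below computes the geometrically spaced center frequencies that define its bins and the constant quality factor Q that gives the transform its name. The starting frequency (32.70 Hz, roughly the note C1), the bin count and the 12 bins per octave are illustrative assumptions, not the settings used in the paper.

```python
def cqt_center_frequencies(f_min, n_bins, bins_per_octave=12):
    """Center frequencies f_k = f_min * 2**(k / bins_per_octave)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def quality_factor(bins_per_octave=12):
    """Ratio of center frequency to bin spacing; identical for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


# Assumed example values: two octaves starting near C1, one bin per semitone.
freqs = cqt_center_frequencies(f_min=32.70, n_bins=25)
print(round(freqs[12], 2), round(freqs[24], 2))  # one and two octaves above f_min
print(round(quality_factor(), 2))
```

With one bin per semitone, each musical pitch maps to its own bin, which is one reason this representation is a common input for pitch-oriented models.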

Interpolation Consistency Training for Semi-supervised Learning

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/504 ◽

2019 ◽

Cited By ~ 39

Author(s):

Vikas Verma ◽

Alex Lamb ◽

Juho Kannala ◽

Yoshua Bengio ◽

David Lopez-Paz

Keyword(s):

Neural Network ◽

Neural Networks ◽

Supervised Learning ◽

Deep Neural Networks ◽

State Of The Art ◽

Data Distribution ◽

Network Architectures ◽

Low Density ◽

Decision Boundary ◽

Classification Problems

We introduce Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training deep neural networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets.
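The consistency idea described above can be sketched in a few lines: compare the prediction at a mixed input with the same mix of the individual predictions. This is a minimal sketch under stated assumptions; the mean-squared-error penalty, the toy models and the use of a single model for both sides of the comparison are simplifications chosen here and are not taken from the abstract.

```python
def mix(a, b, lam):
    """Linear interpolation lam * a + (1 - lam) * b, element by element."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def ict_consistency_loss(model, u1, u2, lam):
    """Mean squared gap between f(mix(u1, u2)) and mix(f(u1), f(u2))."""
    pred_at_mix = model(mix(u1, u2, lam))
    mix_of_preds = mix(model(u1), model(u2), lam)
    squared = [(p - q) ** 2 for p, q in zip(pred_at_mix, mix_of_preds)]
    return sum(squared) / len(squared)


# Hypothetical toy models standing in for a neural network.
def linear_model(x):
    return [2.0 * x[0] - x[1], 0.5 * x[1]]


def nonlinear_model(x):
    return [x[0] ** 2, x[1]]


u1, u2 = [0.0, 0.0], [2.0, 2.0]
print(ict_consistency_loss(linear_model, u1, u2, lam=0.5))     # linear maps are already consistent
print(ict_consistency_loss(nonlinear_model, u1, u2, lam=0.5))  # non-linear maps are penalized
```

A model that behaves linearly between unlabeled points incurs no penalty, so minimizing this term discourages sharp changes of the prediction between data points.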

Automatic onset detection using convolutional neural networks

10.5753/sbcm.2019.10446 ◽

2019 ◽

Author(s):

Willy Cornelissen ◽

Maurício Loureiro

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Ground Truth ◽

Onset Detection ◽

Ground Truth Data ◽

Music Transcription ◽

Automatic Music Transcription ◽

Information Research ◽

Music Research ◽

Music Information

A very significant task in music research is estimating the instants at which meaningful events begin (onsets) and end (offsets). Onset detection is widely applied in many fields, including electrocardiograms, seismographic data, stock market results, and many Music Information Research (MIR) tasks such as Automatic Music Transcription, Rhythm Detection, and Speech Recognition. Automatic Onset Detection (AOD) has recently benefited greatly from Artificial Intelligence (AI) methods, mainly Machine Learning and Deep Learning. In this work, we explore Convolutional Neural Networks (CNNs), adapting the original architecture to automatic onset detection in musical audio signals. We applied a CNN to onset detection on a very general dataset that is well acknowledged by the MIR community, and assessed the accuracy of the method by comparison with the ground truth data published with the dataset. The results are promising and outperform other methods of musical onset detection.
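Comparing detected onsets with ground truth data is commonly done by matching onset times within a small tolerance window and reporting an F-measure. The sketch below shows one way to do this; the 50 ms tolerance, the greedy matching strategy and the example onset times are assumptions made for illustration and are not described in the abstract.

```python
def match_onsets(detected, reference, tolerance=0.05):
    """Count one-to-one matches between onset times (in seconds).

    Greedy matching over sorted lists; simple, though not guaranteed
    to be optimal for very dense onset sequences.
    """
    detected, reference = sorted(detected), sorted(reference)
    i = j = hits = 0
    while i < len(detected) and j < len(reference):
        diff = detected[i] - reference[j]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # detection with no reference nearby: false positive
        else:
            j += 1  # reference with no detection nearby: false negative
    return hits


def onset_f_measure(detected, reference, tolerance=0.05):
    """Harmonic mean of precision and recall over matched onsets."""
    hits = match_onsets(detected, reference, tolerance)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical onset times in seconds.
print(onset_f_measure([0.10, 0.52, 1.30], [0.12, 0.50, 0.90]))
```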

Modular Dynamic Neural Network: A Continual Learning Architecture

Applied Sciences ◽

10.3390/app112412078 ◽

2021 ◽

Vol 11 (24) ◽

pp. 12078

Author(s):

Daniel Turner ◽

Pedro J. S. Cardoso ◽

João M. F. Rodrigues

Keyword(s):

Neural Network ◽

Neural Networks ◽

Feature Extraction ◽

Deep Neural Networks ◽

State Of The Art ◽

Simple Task ◽

Dynamic Neural Network ◽

Main Components ◽

Over Time ◽

Continual Learning

Learning to recognize a new object after having learned to recognize other objects may be a simple task for a human, but not for machines. The present go-to approaches for teaching a machine to recognize a set of objects are based on deep neural networks (DNNs). Intuitively, then, the solution for teaching new objects to a machine on the fly should also be a DNN. The problem is that the trained DNN weights used to classify the initial set of objects are extremely fragile, meaning that any change to those weights can severely damage the capacity to perform the initial recognitions; this phenomenon is known as catastrophic forgetting (CF). This paper presents a new DNN continual learning (CL) architecture that can deal with CF, the modular dynamic neural network (MDNN). The presented architecture consists of two main components: (a) the ResNet50-based feature extraction component as the backbone; and (b) the modular dynamic classification component, which consists of multiple sub-networks and progressively builds itself up in a tree-like structure that rearranges itself as it learns over time in such a way that each sub-network can function independently. The main contribution of the paper is a new architecture that is strongly based on its modular dynamic training feature. This modular structure allows new classes to be added while altering only specific sub-networks, so that previously known classes are not forgotten. Tests on the CORe50 dataset showed results above the state of the art for CL architectures.

ThriftyNets: Convolutional Neural Networks with Tiny Parameter Budget

IoT ◽

10.3390/iot2020012 ◽

2021 ◽

Vol 2 (2) ◽

pp. 222-235

Author(s):

Guillaume Coiffier ◽

Ghouthi Boukli Hacene ◽

Vincent Gripon

Keyword(s):

Neural Network ◽

Machine Learning ◽

Neural Networks ◽

Convolutional Neural Network ◽

Spatial Resolution ◽

Network Architecture ◽

Deep Neural Networks ◽

State Of The Art ◽

Feature Maps ◽

Neural Network Architecture

Deep Neural Networks are state-of-the-art in a large number of challenges in machine learning. However, to reach the best performance they require a huge pool of parameters. Indeed, typical deep convolutional architectures present an increasing number of feature maps as we go deeper in the network, whereas the spatial resolution of inputs is decreased through downsampling operations. This means that most of the parameters lie in the final layers, while a large portion of the computations is performed by a small fraction of the total parameters in the first layers. In an effort to use every parameter of a network to its maximum, we propose a new convolutional neural network architecture, called ThriftyNet. In ThriftyNet, only one convolutional layer is defined and used recursively, leading to a maximal parameter factorization. In addition, normalization, non-linearities, downsampling and shortcuts ensure sufficient expressivity of the model. ThriftyNet achieves competitive performance on a tiny parameter budget, exceeding 91% accuracy on CIFAR-10 with fewer than 40 k parameters in total, 74.3% on CIFAR-100 with fewer than 600 k parameters, and 67.1% on ImageNet ILSVRC 2012 with no more than 4.15 M parameters. However, the proposed method typically requires more computations than existing counterparts.
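The idea of defining one convolutional layer and applying it recursively can be illustrated with a toy one-dimensional example. The sketch below is only an assumption-laden illustration: it uses a single 1-D kernel, a ReLU and an additive shortcut, and it omits the normalization and downsampling steps mentioned in the abstract. All function names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output has the input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = i + k - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def relu(values):
    return [max(0.0, v) for v in values]


def thrifty_forward(signal, kernel, iterations):
    """Apply the SAME kernel `iterations` times, with a shortcut each time."""
    h = list(signal)
    for _ in range(iterations):
        update = relu(conv1d_same(h, kernel))
        h = [a + b for a, b in zip(h, update)]  # additive shortcut
    return h


def shared_parameter_count(kernel_size):
    """One kernel reused at every depth: the count does not grow with depth."""
    return kernel_size


def unshared_parameter_count(kernel_size, depth):
    """A conventional stack with a separate kernel per layer."""
    return kernel_size * depth


print(thrifty_forward([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], iterations=2))
print(shared_parameter_count(3), unshared_parameter_count(3, depth=10))
```

The trade-off noted at the end of the abstract is visible here as well: depth adds computation at every iteration even though it adds no parameters.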

Towards a high robust neural network via feature matching

International Journal of Multimedia Information Retrieval ◽

10.1007/s13735-021-00219-0 ◽

2021 ◽

Author(s):

Jian Li ◽

Yanming Guo ◽

Songyang Lao ◽

Yulun Wu ◽

Liang Bai ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Neural Networks ◽

Feature Matching ◽

Feature Vector ◽

State Of The Art ◽

Model Performance ◽

Image Features ◽

Classification Systems ◽

Adversarial Attack

Image classification systems have been found vulnerable to adversarial attacks, which are imperceptible to humans but can easily fool deep neural networks. Recent research indicates that regularizing the network by introducing randomness can greatly improve the model's robustness against adversarial attacks, but the randomness module normally involves complex calculations and numerous additional parameters, and seriously affects model performance on clean data. In this paper, we propose a feature matching module to regularize the network. Specifically, our model learns a feature vector for each category and imposes additional restrictions on image features. Then, the similarity between image features and category features is used as the basis for classification. Our method does not introduce any network parameters beyond those of the undefended model and can be easily integrated into any neural network. Experiments on the CIFAR10 and SVHN datasets show that our proposed module effectively improves accuracy on both clean and perturbed data in comparison with state-of-the-art defense methods, and outperforms the L2P method by 6.3% and 24% on clean and perturbed data, respectively, using the ResNet-V2(18) architecture.
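Classification by similarity between an image feature and one learned feature vector per category can be sketched as follows. The abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the two-dimensional category vectors are made-up values for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_feature_matching(image_feature, category_features):
    """Return the category whose feature vector is most similar to the image feature."""
    scores = {
        label: cosine_similarity(image_feature, vector)
        for label, vector in category_features.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical category vectors; in the paper these are learned during training.
category_features = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
label, scores = classify_by_feature_matching([0.9, 0.2], category_features)
print(label, scores)
```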

Tri-net for Semi-Supervised Deep Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/278 ◽

2018 ◽

Cited By ~ 11

Author(s):

Dong-Dong Chen ◽

Wei Wang ◽

Wei Gao ◽

Zhi-Hua Zhou

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Learning ◽

Error Rate ◽

Deep Neural Network ◽

Deep Neural Networks ◽

State Of The Art ◽

Fine Tuning ◽

Learning Methods ◽

Model Initialization

Deep neural networks have achieved great success in various real applications, but they require a large amount of labeled data for training. In this paper, we propose tri-net, a deep neural network that is able to use massive unlabeled data to help learning with limited labeled data. We consider model initialization, diversity augmentation and pseudo-label editing simultaneously. In our work, we use output smearing to initialize modules, fine-tune on labeled data to augment diversity, and eliminate unstable pseudo-labels to alleviate the influence of suspicious pseudo-labeled data. Experiments show that our method achieves the best performance in comparison with state-of-the-art semi-supervised deep learning methods. In particular, it achieves an 8.30% error rate on CIFAR-10 using only 4000 labeled examples.

The FaceChannel: A Fast and Furious Deep Neural Network for Facial Expression Recognition

SN Computer Science ◽

10.1007/s42979-020-00325-6 ◽

2020 ◽

Vol 1 (6) ◽

Author(s):

Pablo Barros ◽

Nikhil Churamani ◽

Alessandra Sciutti

Keyword(s):

Neural Network ◽

Neural Networks ◽

Facial Expression ◽

Facial Expression Recognition ◽

Deep Neural Networks ◽

State Of The Art ◽

Facial Features ◽

Expression Recognition ◽

Current State ◽

Benchmark Datasets

Current state-of-the-art models for automatic facial expression recognition (FER) are based on very deep neural networks that are effective but rather expensive to train. Given the dynamic conditions of FER, this characteristic hinders the use of such models for general affect recognition. In this paper, we address this problem by formalizing the FaceChannel, a light-weight neural network that has far fewer parameters than common deep neural networks. We introduce an inhibitory layer that helps to shape the learning of facial features in the last layer of the network, thus improving performance while reducing the number of trainable parameters. To evaluate our model, we perform a series of experiments on different benchmark datasets and demonstrate that the FaceChannel achieves performance comparable to, if not better than, the current state of the art in FER. Our experiments include a cross-dataset analysis to estimate how our model behaves under different affective recognition conditions. We conclude our paper with an analysis of how the FaceChannel learns and adapts the learned facial features to the different datasets.

Leveraging the Bhattacharyya coefficient for uncertainty quantification in deep neural networks

Neural Computing and Applications ◽

10.1007/s00521-021-05789-y ◽

2021 ◽

Author(s):

Pieter Van Molle ◽

Tim Verbelen ◽

Bert Vankeirsbilck ◽

Jonas De Vylder ◽

Bart Diricx ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Neural Networks ◽

State Of The Art ◽

Use Case ◽

Bhattacharyya Coefficient ◽

Output Uncertainty ◽

Novel Approach ◽

Benchmark Datasets ◽

Network Approaches

Modern deep learning models achieve state-of-the-art results for many tasks in computer vision, such as image classification and segmentation. However, their adoption in high-risk applications, e.g. automated medical diagnosis systems, is happening at a slow pace. One of the main reasons for this is that regular neural networks do not capture uncertainty. To assess uncertainty in classification, several techniques have been proposed that cast neural network approaches in a Bayesian setting. Amongst these techniques, Monte Carlo dropout is by far the most popular. This particular technique estimates the moments of the output distribution through sampling with different dropout masks. The output uncertainty of a neural network is then approximated as the sample variance. In this paper, we highlight the limitations of such a variance-based uncertainty metric and propose a novel approach. Our approach is based on the overlap between the output distributions of different classes. We show that our technique leads to a better approximation of the inter-class output confusion. We illustrate the advantages of our method using benchmark datasets. In addition, we apply our metric to skin lesion classification, a real-world use case, and show that this yields promising results.
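The two uncertainty measures contrasted in the abstract can be sketched side by side: the sample variance of repeated stochastic outputs, and the overlap between the output distributions of two classes measured with the Bhattacharyya coefficient, BC(p, q) = sum_i sqrt(p_i * q_i). The histogram-based estimate, the bin count and the sample values below are assumptions for illustration; the abstract does not specify how the paper estimates the distributions.

```python
import math


def sample_variance(samples):
    """Unbiased sample variance of a list of scalar outputs."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((s - mean) ** 2 for s in samples) / (n - 1)


def histogram(samples, n_bins=10):
    """Normalized histogram for values assumed to lie in [0, 1]."""
    counts = [0] * n_bins
    for s in samples:
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def bhattacharyya_coefficient(p, q):
    """Overlap of two discrete distributions: 0 for disjoint, 1 for identical."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Hypothetical softmax outputs of two classes over four dropout samples.
class_a = [0.05, 0.15, 0.05, 0.15]
class_b = [0.85, 0.95, 0.85, 0.95]
print(sample_variance(class_a))
print(bhattacharyya_coefficient(histogram(class_a), histogram(class_b)))
```

In this toy case both classes have the same spread, yet their output distributions do not overlap at all, which is the kind of distinction a variance alone cannot express.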

Timbre Comparison in Note Tracking from Onset, Frames and Pitch Estimation

Jornada de Jóvenes Investigadores del I3A ◽

10.26754/jjii3a.4872 ◽

2020 ◽

Vol 8 ◽

Author(s):

Carlos Hernández Oliván ◽

Ignacio Zay Pinilla ◽

José Ramón Beltrán Blázquez

Keyword(s):

Information Retrieval ◽

Music Information Retrieval ◽

Tracking Algorithm ◽

Pitch Detection ◽

Critical Problem ◽

Pitch Estimation ◽

Music Transcription ◽

Automatic Music Transcription ◽

Music Information

Note Tracking (NT) is a subtask of Automatic Music Transcription (AMT), which is a critical problem in the field of Music Information Retrieval (MIR). The aim of this work is to compare the performance of two models, one that predicts onsets and frames and another that combines pitch detection with a note tracking algorithm, in order to study how different timbres and instrument families behave in note tracking subtasks.

HyperAdam: A Learnable Task-Adaptive Adam for Network Training

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015297 ◽

2019 ◽

Vol 33 ◽

pp. 5297-5304 ◽

Cited By ~ 4

Author(s):

Shipeng Wang ◽

Jian Sun ◽

Zongben Xu

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Neural Networks ◽

State Of The Art ◽

Black Box ◽

Research Topic ◽

Decay Rates ◽

Adaptive Combination ◽

Network Training ◽

Stochastic Optimization Algorithms

Deep neural networks are traditionally trained using human-designed stochastic optimization algorithms, such as SGD and Adam. Recently, the approach of learning to optimize network parameters has emerged as a promising research topic. However, these learned black-box optimizers sometimes do not fully exploit the experience embodied in human-designed optimizers and therefore have limited generalization ability. In this paper, a new optimizer, dubbed HyperAdam, is proposed that combines the idea of "learning to optimize" with the traditional Adam optimizer. Given a network for training, the parameter update generated by HyperAdam in each iteration is an adaptive combination of multiple updates generated by Adam with varying decay rates. The combination weights and decay rates in HyperAdam are adaptively learned depending on the task. HyperAdam is modeled as a recurrent neural network with AdamCell, WeightCell and StateCell. It is shown to be state-of-the-art for training various networks, such as multilayer perceptrons, CNNs and LSTMs.
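The phrase "an adaptive combination of multiple updates generated by Adam with varying decay rates" can be made concrete with a small sketch for a single scalar parameter. This is not the HyperAdam implementation: in the paper the combination weights and decay rates are produced by a learned recurrent network, whereas here they are fixed, hand-picked values, and the class and function names are hypothetical.

```python
import math


class AdamState:
    """Standard Adam moment estimates for one scalar parameter."""

    def __init__(self, beta1, beta2):
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = 0.0  # first-moment estimate
        self.v = 0.0  # second-moment estimate
        self.t = 0    # step counter

    def update(self, grad, lr=0.1, eps=1e-8):
        """Return the bias-corrected Adam step for this gradient."""
        self.t += 1
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad * grad
        m_hat = self.m / (1.0 - self.beta1 ** self.t)
        v_hat = self.v / (1.0 - self.beta2 ** self.t)
        return -lr * m_hat / (math.sqrt(v_hat) + eps)


def combined_update(states, weights, grad):
    """Weighted sum of the candidate updates proposed by each Adam state."""
    return sum(w * s.update(grad) for w, s in zip(weights, states))


# Assumed decay-rate pairs and fixed weights (learned per task in the paper).
states = [AdamState(0.5, 0.9), AdamState(0.9, 0.999), AdamState(0.99, 0.999)]
weights = [0.2, 0.5, 0.3]

x = 5.0
for _ in range(3):
    x += combined_update(states, weights, grad=2.0 * x)  # gradient of x**2
print(x)
```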
