Temporal Pyramid Pooling Convolutional Neural Network for Cover Song Identification

Recently smoothing deep neural network based classifiers via isotropic Gaussian perturbation is shown to be an effective and scalable way to provide state-of-the-art probabilistic robustness guarantee against ℓ2 norm bounded adversarial perturbations. However, how to train a good base classifier that is accurate and robust when smoothed has not been fully investigated. In this work, we derive a new regularized risk, in which the regularizer can adaptively encourage the accuracy and robustness of the smoothed counterpart when training the base classifier. It is computationally efficient and can be implemented in parallel with other empirical defense methods. We discuss how to implement it under both standard (non-adversarial) and adversarial training scheme. At the same time, we also design a new certification algorithm, which can leverage the regularization effect to provide tighter robustness lower bound that holds with high probability. Our extensive experimentation demonstrates the effectiveness of the proposed training and certification approaches on CIFAR-10 and ImageNet datasets.

Download Full-text

A Robust Cover Song Identification System with Two-Level Similarity Fusion and Post-Processing

Applied Sciences ◽

10.3390/app8081383 ◽

2018 ◽

Vol 8 (8) ◽

pp. 1383 ◽

Cited By ~ 2

Author(s):

Mingyu Li ◽

Ning Chen

Keyword(s):

Information Retrieval ◽

State Of The Art ◽

Identification System ◽

Post Processing ◽

Similarity Functions ◽

Processing Level ◽

Retrieval Scheme ◽

The Common ◽

Cover Song ◽

Music Information

Similarity measurement plays an important role in various information retrieval tasks. In this paper, a music information retrieval scheme based on two-level similarity fusion and post-processing is proposed. At the similarity fusion level, to take full advantage of the common and complementary properties among different descriptors and different similarity functions, first, the track-by-track similarity graphs generated from the same descriptor but different similarity functions are fused with the similarity network fusion (SNF) technique. Then, the obtained first-level fused similarities based on different descriptors are further fused with the mixture Markov model (MMM) technique. At the post-processing level, diffusion is first performed on the two-level fused similarity graph to utilize the underlying track manifold contained within it. Then, a mutual proximity (MP) algorithm is adopted to refine the diffused similarity scores, which helps to reduce the bad influence caused by the “hubness” phenomenon contained in the scores. The performance of the proposed scheme is tested in the cover song identification (CSI) task on three cover song datasets (Covers80, Covers40, and Second Hand Songs (SHS)). The experimental results demonstrate that the proposed scheme outperforms state-of-the-art CSI schemes based on single similarity or similarity fusion.

Download Full-text

Domain Adaptation and Domain Generalization with Representation Learning

10.26686/wgtn.17014700 ◽

2021 ◽

Author(s):

◽

Muhammad Ghifary

Keyword(s):

Neural Network ◽

Object Recognition ◽

Domain Adaptation ◽

State Of The Art ◽

Representation Learning ◽

Training Data ◽

Data Representations ◽

Source Data ◽

Target Environment ◽

Target Data

<p>Machine learning has achieved great successes in the area of computer vision, especially in object recognition or classification. One of the core factors of the successes is the availability of massive labeled image or video data for training, collected manually by human. Labeling source training data, however, can be expensive and time consuming. Furthermore, a large amount of labeled source data may not always guarantee traditional machine learning techniques to generalize well; there is a potential bias or mismatch in the data, i.e., the training data do not represent the target environment. To mitigate the above dataset bias/mismatch, one can consider domain adaptation: utilizing labeled training data and unlabeled target data to develop a well-performing classifier on the target environment. In some cases, however, the unlabeled target data are nonexistent, but multiple labeled sources of data exist. Such situations can be addressed by domain generalization: using multiple source training sets to produce a classifier that generalizes on the unseen target domain. Although several domain adaptation and generalization approaches have been proposed, the domain mismatch in object recognition remains a challenging, open problem – the model performance has yet reached to a satisfactory level in real world applications. The overall goal of this thesis is to progress towards solving dataset bias in visual object recognition through representation learning in the context of domain adaptation and domain generalization. Representation learning is concerned with finding proper data representations or features via learning rather than via engineering by human experts. This thesis proposes several representation learning solutions based on deep learning and kernel methods. This thesis introduces a robust-to-noise deep neural network for handwritten digit classification trained on “clean” images only, which we name Deep Hybrid Network (DHN). DHNs are based on a particular combination of sparse autoencoders and restricted Boltzmann machines. The results show that DHN performs better than the standard deep neural network in recognizing digits with Gaussian and impulse noise, block and border occlusions. This thesis proposes the Domain Adaptive Neural Network (DaNN), a neural network based domain adaptation algorithm that minimizes the classification error and the domain discrepancy between the source and target data representations. The experiments show the competitiveness of DaNN against several state-of-the-art methods on a benchmark object dataset. This thesis develops the Multi-task Autoencoder (MTAE), a domain generalization algorithm based on autoencoders trained via multi-task learning. MTAE learns to transform the original image into its analogs in multiple related domains simultaneously. The results show that the MTAE’s representations provide better classification performance than some alternative autoencoder-based models as well as the current state-of-the-art domain generalization algorithms. This thesis proposes a fast kernel-based representation learning algorithm for both domain adaptation and domain generalization, Scatter Component Analysis (SCA). SCA finds a data representation that trades between maximizing the separability of classes, minimizing the mismatch between domains, and maximizing the separability of the whole data points. The results show that SCA performs much faster than some competitive algorithms, while providing state-of-the-art accuracy in both domain adaptation and domain generalization. Finally, this thesis presents the Deep Reconstruction-Classification Network (DRCN), a deep convolutional network for domain adaptation. DRCN learns to classify labeled source data and also to reconstruct unlabeled target data via a shared encoding representation. The results show that DRCN provides competitive or better performance than the prior state-of-the-art model on several cross-domain object datasets.</p>

Download Full-text

A Comparison of Deep Learning Methods for Timbre Analysis in Polyphonic Automatic Music Transcription

Electronics ◽

10.3390/electronics10070810 ◽

2021 ◽

Vol 10 (7) ◽

pp. 810

Author(s):

Carlos Hernandez-Olivan ◽

Ignacio Zay Pinilla ◽

Carlos Hernandez-Lopez ◽

Jose R. Beltran

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Neural Networks ◽

State Of The Art ◽

High Impact ◽

Critical Problem ◽

Music Transcription ◽

Automatic Music Transcription ◽

Music Information ◽

Method Show

Automatic music transcription (AMT) is a critical problem in the field of music information retrieval (MIR). When AMT is faced with deep neural networks, the variety of timbres of different instruments can be an issue that has not been studied in depth yet. The goal of this work is to address AMT transcription by analyzing how timbre affect monophonic transcription in a first approach based on the CREPE neural network and then to improve the results by performing polyphonic music transcription with different timbres with a second approach based on the Deep Salience model that performs polyphonic transcription based on the Constant-Q Transform. The results of the first method show that the timbre and envelope of the onsets have a high impact on the AMT results and the second method shows that the developed model is less dependent on the strength of the onsets than other state-of-the-art models that deal with AMT on piano sounds such as Google Magenta Onset and Frames (OaF). Our polyphonic transcription model for non-piano instruments outperforms the state-of-the-art model, such as for bass instruments, which has an F-score of 0.9516 versus 0.7102. In our latest experiment we also show how adding an onset detector to our model can outperform the results given in this work.

Download Full-text

User-Driven Fine-Tuning for Beat Tracking

Electronics ◽

10.3390/electronics10131518 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1518

Author(s):

António S. Pinto ◽

Sebastian Böck ◽

Jaime S. Cardoso ◽

Matthew E. P. Davies

Keyword(s):

Neural Network ◽

State Of The Art ◽

Fine Tuning ◽

Temporal Region ◽

Audio Signals ◽

Network Training ◽

Beat Tracking ◽

Improved Performance ◽

Music Information ◽

Musical Audio

The extraction of the beat from musical audio signals represents a foundational task in the field of music information retrieval. While great advances in performance have been achieved due the use of deep neural networks, significant shortcomings still remain. In particular, performance is generally much lower on musical content that differs from that which is contained in existing annotated datasets used for neural network training, as well as in the presence of challenging musical conditions such as rubato. In this paper, we positioned our approach to beat tracking from a real-world perspective where an end-user targets very high accuracy on specific music pieces and for which the current state of the art is not effective. To this end, we explored the use of targeted fine-tuning of a state-of-the-art deep neural network based on a very limited temporal region of annotated beat locations. We demonstrated the success of our approach via improved performance across existing annotated datasets and a new annotation-correction approach for evaluation. Furthermore, we highlighted the ability of content-specific fine-tuning to learn both what is and what is not the beat in challenging musical conditions.

Download Full-text

Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i01.5347 ◽

2020 ◽

Vol 34 (01) ◽

pp. 164-172

Author(s):

Sijie Mai ◽

Haifeng Hu ◽

Songlong Xing

Keyword(s):

Neural Network ◽

State Of The Art ◽

Representation Learning ◽

Multimodal Fusion ◽

Multiple Datasets ◽

Multi Stage ◽

Invariant Embedding ◽

Joint Embedding ◽

Adversarial Training ◽

Additional Constraints

Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction loss and classification loss. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.

Download Full-text

Task-Driven Common Representation Learning via Bridge Neural Network

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015573 ◽

2019 ◽

Vol 33 ◽

pp. 5573-5580

Author(s):

Yao Xu ◽

Xueshuang Xiang ◽

Meiyu Huang

Keyword(s):

Neural Network ◽

Canonical Correlation ◽

State Of The Art ◽

Feature Space ◽

Representation Learning ◽

Data Sources ◽

Common Representation ◽

Total Correlation ◽

Pair Matching ◽

Training Objective

This paper introduces a novel deep learning based method, named bridge neural network (BNN) to dig the potential relationship between two given data sources task by task. The proposed approach employs two convolutional neural networks that project the two data sources into a feature space to learn the desired common representation required by the specific task. The training objective with artificial negative samples is introduced with the ability of mini-batch training and it’s asymptotically equivalent to maximizing the total correlation of the two data sources, which is verified by the theoretical analysis. The experiments on the tasks, including pair matching, canonical correlation analysis, transfer learning, and reconstruction demonstrate the state-of-the-art performance of BNN, which may provide new insights into the aspect of common representation learning.

Download Full-text

Improving the accuracy and robustness of RRAM-based in-memory computing against RRAM hardware noise and adversarial attacks

Semiconductor Science and Technology ◽

10.1088/1361-6641/ac461f ◽

2021 ◽

Author(s):

Sai Kiran Cherupally ◽

Jian Meng ◽

Adnan Siraj Rakin ◽

Shihui Yin ◽

Injune Yeo ◽

...

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

State Of The Art ◽

Black Box ◽

Accuracy Improvement ◽

Training Scheme ◽

Injection Scheme ◽

Noise Injection ◽

Noise Data ◽

The Impact

Abstract We present a novel deep neural network (DNN) training scheme and RRAM in-memory computing (IMC) hardware evaluation towards achieving high robustness to the RRAM device/array variations and adversarial input attacks. We present improved IMC inference accuracy results evaluated on state-of-the-art DNNs including ResNet-18, AlexNet, and VGG with binary, 2-bit, and 4-bit activation/weight precision for the CIFAR-10 dataset. These DNNs are evaluated with measured noise data obtained from three different RRAM-based IMC prototype chips. Across these various DNNs and IMC chip measurements, we show that our proposed hardware noise-aware DNN training consistently improves DNN inference accuracy for actual IMC hardware, up to 8% accuracy improvement for the CIFAR-10 dataset. We also analyze the impact of our proposed noise injection scheme on the adversarial robustness of ResNet-18 DNNs with 1-bit, 2-bit, and 4-bit activation/weight precision. Our results show up to 6% improvement in the robustness to black-box adversarial input attacks.

Download Full-text

Label-Attended Hashing for Multi-Label Image Retrieval

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/133 ◽

2020 ◽

Author(s):

Yanzhao Xie ◽

Yu Liu ◽

Yangtao Wang ◽

Lianli Gao ◽

Peng Wang ◽

...

Keyword(s):

Neural Network ◽

Feature Extraction ◽

Image Retrieval ◽

Hash Function ◽

State Of The Art ◽

Image Representation ◽

Back Propagation ◽

Cauchy Distribution ◽

Representation Learning ◽

Hash Codes

For the multi-label image retrieval, the existing hashing algorithms neglect the dependency between objects and thus fail to capture the attention information in the feature extraction, which affects the precision of hash codes. To address this problem, we explore the inter-dependency between objects through their co-occurrence correlation from the label set and adopt Multi-modal Factorized Bilinear (MFB) pooling component so that the image representation learning can capture this attention information. We propose a Label-Attended Hashing (LAH) algorithm which enables an end-to-end hash model with inter-dependency feature extraction. LAH first combines Convolutional Neural Network (CNN) and Graph Convolution Network (GCN) to separately generate the image representation and label co-occurrence embeddings, then adopts MFB to fuse these two modal vectors, finally learns the hash function with a Cauchy distribution based loss function via back propagation. Extensive experiments on public multi-label datasets demonstrate that (1) LAH can achieve the state-of-the-art retrieval results and (2) the usage of co-occurrence relationship and MFB not only promotes the precision of hash codes but also accelerates the hash learning. GitHub address: https://github.com/IDSM-AI/LAH.

Download Full-text

Domain Adaptation and Domain Generalization with Representation Learning

10.26686/wgtn.17014700.v1 ◽

2021 ◽

Author(s):

◽

Muhammad Ghifary

Keyword(s):

Neural Network ◽

Object Recognition ◽

Domain Adaptation ◽

State Of The Art ◽

Representation Learning ◽

Training Data ◽

Data Representations ◽

Source Data ◽

Target Environment ◽

Target Data

<p>Machine learning has achieved great successes in the area of computer vision, especially in object recognition or classification. One of the core factors of the successes is the availability of massive labeled image or video data for training, collected manually by human. Labeling source training data, however, can be expensive and time consuming. Furthermore, a large amount of labeled source data may not always guarantee traditional machine learning techniques to generalize well; there is a potential bias or mismatch in the data, i.e., the training data do not represent the target environment. To mitigate the above dataset bias/mismatch, one can consider domain adaptation: utilizing labeled training data and unlabeled target data to develop a well-performing classifier on the target environment. In some cases, however, the unlabeled target data are nonexistent, but multiple labeled sources of data exist. Such situations can be addressed by domain generalization: using multiple source training sets to produce a classifier that generalizes on the unseen target domain. Although several domain adaptation and generalization approaches have been proposed, the domain mismatch in object recognition remains a challenging, open problem – the model performance has yet reached to a satisfactory level in real world applications. The overall goal of this thesis is to progress towards solving dataset bias in visual object recognition through representation learning in the context of domain adaptation and domain generalization. Representation learning is concerned with finding proper data representations or features via learning rather than via engineering by human experts. This thesis proposes several representation learning solutions based on deep learning and kernel methods. This thesis introduces a robust-to-noise deep neural network for handwritten digit classification trained on “clean” images only, which we name Deep Hybrid Network (DHN). DHNs are based on a particular combination of sparse autoencoders and restricted Boltzmann machines. The results show that DHN performs better than the standard deep neural network in recognizing digits with Gaussian and impulse noise, block and border occlusions. This thesis proposes the Domain Adaptive Neural Network (DaNN), a neural network based domain adaptation algorithm that minimizes the classification error and the domain discrepancy between the source and target data representations. The experiments show the competitiveness of DaNN against several state-of-the-art methods on a benchmark object dataset. This thesis develops the Multi-task Autoencoder (MTAE), a domain generalization algorithm based on autoencoders trained via multi-task learning. MTAE learns to transform the original image into its analogs in multiple related domains simultaneously. The results show that the MTAE’s representations provide better classification performance than some alternative autoencoder-based models as well as the current state-of-the-art domain generalization algorithms. This thesis proposes a fast kernel-based representation learning algorithm for both domain adaptation and domain generalization, Scatter Component Analysis (SCA). SCA finds a data representation that trades between maximizing the separability of classes, minimizing the mismatch between domains, and maximizing the separability of the whole data points. The results show that SCA performs much faster than some competitive algorithms, while providing state-of-the-art accuracy in both domain adaptation and domain generalization. Finally, this thesis presents the Deep Reconstruction-Classification Network (DRCN), a deep convolutional network for domain adaptation. DRCN learns to classify labeled source data and also to reconstruct unlabeled target data via a shared encoding representation. The results show that DRCN provides competitive or better performance than the prior state-of-the-art model on several cross-domain object datasets.</p>

Download Full-text