Effects of Sinusoidal Model on Non-Parallel Voice Conversion with Adversarial Learning

2021 ◽  
Vol 11 (16) ◽  
pp. 7489
Author(s):  
Mohammed Salah Al-Radhi ◽  
Tamás Gábor Csapó ◽  
Géza Németh

Voice conversion (VC) transforms the speaking style of a source speaker into that of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences: earlier approaches mainly learn a mapping between a given source–target speaker pair from similar utterances spoken by the two speakers. However, parallel data are expensive and difficult to collect, and non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that allows non-parallel many-to-many voice conversion using a generative adversarial network. To the best of the authors' knowledge, this is the first study to employ a sinusoidal model with continuous parameters to generate the converted speech signals. Our method requires only a few minutes of training examples, without parallel utterances or time-alignment procedures, and the source–target speakers are entirely unseen in the training dataset. An empirical study was carried out on the publicly available CSTR VCTK corpus. Our results indicate that the proposed method reaches state-of-the-art speaker similarity to the utterances produced by the target speaker, while highlighting structural aspects that merit further expert analysis.
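The sinusoidal model the abstract refers to represents speech as a sum of sinusoids with per-component amplitudes, frequencies, and phases. A minimal frame-level synthesis sketch, with illustrative parameter names and sampling rate not taken from the paper:

```python
import numpy as np

def sinusoidal_synthesis(amps, freqs, phases, n_samples, fs=16000):
    """Synthesize one frame as a sum of sinusoids:
    s[n] = sum_k a_k * cos(2*pi*f_k*n/fs + phi_k),
    where amps, freqs (Hz) and phases (rad) describe each component."""
    t = np.arange(n_samples) / fs
    signal = np.zeros(n_samples)
    for a, f, phi in zip(amps, freqs, phases):
        signal += a * np.cos(2.0 * np.pi * f * t + phi)
    return signal
```

In a continuous-parameter vocoder of this kind, the conversion network predicts the per-frame sinusoidal parameters for the target speaker, and synthesis like the above turns them back into a waveform.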

Author(s):  
Xinyi Li ◽  
Liqiong Chang ◽  
Fangfang Song ◽  
Ju Wang ◽  
Xiaojiang Chen ◽  
...  

This paper focuses on a fundamental question in Wi-Fi-based gesture recognition: "Can we use the knowledge learned from some users to perform gesture recognition for others?" This problem is also known as cross-target recognition. It arises in many practical deployments of Wi-Fi-based gesture recognition, where it is prohibitively expensive to collect training data from every single user. We present CrossGR, a low-cost cross-target gesture recognition system. As a departure from existing approaches, CrossGR does not require prior knowledge (such as who is currently performing a gesture) of the target user. Instead, CrossGR employs a deep neural network to extract user-agnostic but gesture-related Wi-Fi signal characteristics to perform gesture recognition. To provide sufficient training data for an effective deep learning model, CrossGR employs a generative adversarial network to automatically generate a large volume of synthetic training data from a small set of real-world examples collected from a small number of users. This strategy allows CrossGR to minimize user involvement and the associated cost of collecting training examples for building an accurate gesture recognition system. We evaluate CrossGR by applying it to recognize 15 gestures across 10 users. Experimental results show that CrossGR achieves an accuracy of over 82.6% (up to 99.75%), and that it delivers recognition accuracy comparable to state-of-the-art systems while using an order of magnitude fewer training samples collected from end-users.


2020 ◽  
Vol 11 ◽  
Author(s):  
Luning Bi ◽  
Guiping Hu

Traditionally, plant disease recognition has mainly been performed visually by humans, which is often biased, time-consuming, and laborious. Machine learning methods based on plant leaf images have been proposed to improve the disease recognition process, and convolutional neural networks (CNNs) in particular have proven very effective. Despite the good classification accuracy achieved by CNNs, the issue of limited training data remains: in most cases the training dataset is small because data collection and annotation require significant effort, and CNN methods then tend to overfit. In this paper, a Wasserstein generative adversarial network with gradient penalty (WGAN-GP) is combined with label smoothing regularization (LSR) to improve prediction accuracy and address the overfitting problem under limited training data. Experiments show that the proposed WGAN-GP-enhanced classification method can improve the overall classification accuracy of plant diseases by 24.4%, compared to 20.2% using classic data augmentation and 22% using synthetic samples without LSR.
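The label smoothing regularization combined with WGAN-GP here softens one-hot targets so the classifier does not become over-confident on a small, partly synthetic training set. A minimal sketch; the smoothing factor 0.1 is a common default, not the setting reported in the paper:

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing regularization (LSR): redistribute a fraction
    epsilon of each one-hot target uniformly over all K classes, so the
    true class receives 1 - epsilon + epsilon/K and every other class
    receives epsilon/K."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / k
```

The smoothed targets still sum to 1 per sample and plug directly into a standard cross-entropy loss in place of the hard labels.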


Diagnostics ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 1542
Author(s):  
Johannes Haubold ◽  
Aydin Demircioglu ◽  
Jens Matthias Theysohn ◽  
Axel Wetter ◽  
Alexander Radbruch ◽  
...  

Short tau inversion recovery (STIR) sequences are frequently used in magnetic resonance imaging (MRI) of the spine. However, STIR sequences require a significant amount of scanning time. The purpose of the present study was to generate virtual STIR (vSTIR) images from non-contrast, non-fat-suppressed T1- and T2-weighted images using a conditional generative adversarial network (cGAN). The training dataset comprised 612 studies from 514 patients, and the validation dataset comprised 141 studies from 133 patients. For validation, 100 original STIR and corresponding vSTIR series were presented to six senior radiologists (blinded to the STIR type) in independent A/B-testing sessions. Additionally, for 141 real or vSTIR sequences, the testers were required to produce a structured report of 15 different findings. In the A/B-test, most testers could not reliably identify the real STIR (mean error of testers 1–6: 41%; 44%; 58%; 48%; 39%; 45%). In the evaluation of the structured reports, vSTIR was equivalent to real STIR in 13 of 15 categories; in the number of STIR-hyperintense vertebral bodies (p = 0.08) and in the diagnosis of bone metastases (p = 0.055), the vSTIR fell only narrowly short of equivalence. By virtually generating STIR images of diagnostic quality from T1- and T2-weighted images using a cGAN, one can shorten examination times and increase throughput.
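For image-to-image translation of this kind, a common cGAN generator objective (the pix2pix formulation; the paper does not state its exact loss or weights, so this is an assumption) combines an adversarial term with an L1 reconstruction term. A numpy sketch over precomputed discriminator scores:

```python
import numpy as np

def cgan_generator_loss(d_fake, generated, target, lam=100.0):
    """Pix2pix-style cGAN generator objective (assumed, not the paper's
    stated loss):
    - adversarial term -log D(fake), pushing the discriminator's score
      on generated images toward 1;
    - L1 term, the pixel-wise error against the real target image,
      weighted by lam (100 is the pix2pix default)."""
    adversarial = -np.mean(np.log(d_fake + 1e-8))
    l1 = np.mean(np.abs(generated - target))
    return adversarial + lam * l1
```

The heavy L1 weight is what keeps the synthesized STIR anatomically faithful to the T1/T2 inputs rather than merely plausible to the discriminator.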


Author(s):  
Arash Shilandari ◽  
Hossein Marvi ◽  
Hossein Khosravi

With the increasing mechanization of everyday life, speech processing has become crucial for interaction between humans and machines. Deep neural networks require a database with enough data for training: the more features are extracted from the speech signal, the more samples are needed, and adequate training can only be ensured when sufficient and varied data are available for each class. When data are scarce, data augmentation methods can be used to obtain a database with enough samples. One of the obstacles to developing speech emotion recognition systems is precisely this data sparsity problem in each class. The current study focuses on a cycle-consistent generative adversarial network for data augmentation in a speech emotion recognition system. For each of the five emotions employed, a generative adversarial network is designed to generate data that are very similar to the real data of that class while remaining distinguishable from the other emotion classes. These networks are trained adversarially to produce feature vectors resembling each class in the original feature space, which are then added to the training sets in the database to train the classifier network. Instead of the common cross-entropy loss for training the generative adversarial networks, Wasserstein divergence is used to avoid the vanishing gradient problem and to produce high-quality artificial samples. The proposed network was tested on speech emotion recognition using EMODB for the training, testing, and evaluation sets, and the quality of the artificial data was evaluated with Support Vector Machine (SVM) and Deep Neural Network (DNN) classifiers. The results show that, by extracting and reproducing high-level features from acoustic features, speech emotion recognition separating five primary emotions is achieved with acceptable accuracy.
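The cycle constraint underlying this kind of training can be written as an L1 reconstruction penalty: a feature vector mapped to another emotion class and back should return to its starting point. A minimal sketch, with illustrative generator names not taken from the paper:

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 cycle-consistency term of CycleGAN-style training:
    ||G_BA(G_AB(x)) - x||_1, where g_ab maps class-A feature vectors
    toward class B and g_ba maps them back toward class A."""
    reconstructed = g_ba(g_ab(x))
    return np.mean(np.abs(reconstructed - x))
```

This term is added to the adversarial (here, Wasserstein-divergence) objective so the generators preserve the content of the feature vector while changing only its emotion characteristics.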


2021 ◽  
Vol 59 (11) ◽  
pp. 838-847
Author(s):  
In-Kyu Hwang ◽  
Hyun-Ji Lee ◽  
Sang-Jun Jeong ◽  
In-Sung Cho ◽  
Hee-Soo Kim

In this study, we constructed a deep convolutional generative adversarial network (DCGAN) to generate microstructural images that imitate the real microstructures of binary Al-Si cast alloys. We prepared four alloy compositions, Al-6wt%Si, Al-9wt%Si, Al-12wt%Si and Al-15wt%Si, for machine learning. A DCGAN is composed of a generator and a discriminator: the discriminator is a typical convolutional neural network (CNN), and the generator an inverse-shaped CNN. The fake images generated with the DCGAN were similar to real microstructural images, although they showed some strange morphology, including dendrites without directionality and deformed Si crystals. Verification with Inception V3 revealed that the fake images were well classified into the target categories. Even the visually imperfect images from the initial training iterations showed high similarity to the target; it seems that these imperfect images carried enough microstructural characteristics to satisfy the classifier, even though humans cannot recognize them. Cross-validation was carried out using real, fake and other test images. When the training dataset contained fake images only, the real and test images showed high similarities to the target categories; when it contained both real and fake images, the similarities to the target categories were high enough to yield the correct answers. We conclude that the DCGAN developed for microstructural images in this study is highly useful for data augmentation of rare microstructures.
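The generator's "inverse-shaped CNN" is built from fractionally-strided (transposed) convolutions, whose upsampling can be pictured as inserting zeros between feature-map entries before applying an ordinary convolution. A numpy sketch of that upsampling step alone, as an illustration rather than the authors' implementation:

```python
import numpy as np

def zero_insertion_upsample(feature_map, stride=2):
    """Core upsampling step of a transposed convolution, as used in a
    DCGAN generator: insert stride-1 zeros between neighbouring entries,
    growing an (h, w) feature map to (h*stride, w*stride). A learned
    kernel would then be convolved over the result to fill in detail."""
    h, w = feature_map.shape
    out = np.zeros((h * stride, w * stride), dtype=feature_map.dtype)
    out[::stride, ::stride] = feature_map
    return out
```

Stacking several such stride-2 stages is what lets the generator grow a small latent tensor into a full-resolution microstructural image.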


2020 ◽  
Vol 22 (1) ◽  
Author(s):  
Saeed Karimi-Bidhendi ◽  
Arghavan Arafati ◽  
Andrew L. Cheng ◽  
Yilei Wu ◽  
Arash Kheradvar ◽  
...  

Background For the growing patient population with congenital heart disease (CHD), improving clinical workflow, accuracy of diagnosis, and efficiency of analyses are considered unmet clinical needs. Cardiovascular magnetic resonance (CMR) imaging offers non-invasive and non-ionizing assessment of CHD patients. However, although CMR data facilitate reliable analysis of cardiac function and anatomy, the clinical workflow mostly relies on manual analysis of CMR images, which is time consuming. Thus, an automated and accurate segmentation platform exclusively dedicated to pediatric CMR images can significantly improve the clinical workflow, as the present work aims to establish. Methods Training artificial intelligence (AI) algorithms for CMR analysis requires large annotated datasets, which are not readily available for pediatric subjects, particularly CHD patients. To mitigate this issue, we devised a novel method that uses a generative adversarial network (GAN) to synthetically augment the training dataset by generating synthetic CMR images and their corresponding chamber segmentations. In addition, we trained and validated a deep fully convolutional network (FCN) on a dataset consisting of 64 pediatric subjects with complex CHD, which we made publicly available. The Dice metric, Jaccard index and Hausdorff distance, as well as clinically relevant volumetric indices, are reported to assess and compare our platform with other algorithms, including U-Net and the clinically used cvi42. Results On the congenital CMR dataset, our FCN model yields average Dice metrics of 91.0% and 86.8% for the LV at end-diastole and end-systole, respectively, and 84.7% and 80.6% for the RV at end-diastole and end-systole, respectively. Using the same dataset, cvi42 resulted in 73.2%, 71.0%, 54.3% and 53.7%, and the U-Net architecture in 87.4%, 83.9%, 81.8% and 74.8%, for the LV and RV at end-diastole and end-systole, respectively. Conclusions The chamber segmentation results from our fully automated method showed strong agreement with manual segmentation, and no significant statistical difference was found by two independent statistical analyses, whereas the cvi42 and U-Net segmentation results failed to pass the t-test. From these outcomes, it can be inferred that by taking advantage of GANs, our method is clinically relevant and can be used for pediatric and congenital CMR segmentation and analysis.
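The Dice metric reported above measures the overlap between a predicted and a manual segmentation mask. A minimal sketch for binary masks:

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) between two
    binary masks: 1.0 means perfect overlap, 0.0 means none. Two empty
    masks are treated here as a perfect match."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / total
```

The related Jaccard index |A∩B| / |A∪B| can be recovered from Dice as J = D / (2 - D), which is why the two are usually reported together.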

