Incorporating dynamic features into minimum generation error training for HMM-based speech synthesis

Spoof detection is essential for improving the performance of current Automatic Speaker Verification (ASV) systems. Strengthening both the frontend and the backend can make ASV systems more robust. First, this paper compares static and static–dynamic Constant Q Cepstral Coefficients (CQCC) frontend features, using a Long Short-Term Memory (LSTM) with Time Distributed Wrappers model at the backend. Second, it compares ASV systems built with three deep learning backends (LSTM with Time Distributed Wrappers, LSTM, and a Convolutional Neural Network (CNN)), all using static–dynamic CQCC features at the frontend. Third, it describes the implementation of two spoof detection systems for ASV that use the same static–dynamic CQCC features at the frontend and different combinations of deep learning models at the backend. The first is a voting-protocol-based two-level spoof detection system that uses CNN and LSTM models at the first level and an LSTM with Time Distributed Wrappers model at the second level. The second is a two-level spoof detection system with a user identification and verification protocol, which uses an LSTM model for user identification at the first level and an LSTM with Time Distributed Wrappers model for verification at the second level. To implement the proposed work, a variation of the ASVspoof 2019 dataset was used so that all types of spoofing attacks, namely Speech Synthesis (SS), Voice Conversion (VC), and replay, appear in a single dataset. The results show that, at the frontend, static–dynamic CQCC features outperform static CQCC features and that, at the backend, a hybrid combination of deep learning models increases the accuracy of spoof detection systems.
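
The abstract does not say how the dynamic part of the static–dynamic CQCC features is computed. A common choice in speech processing is to append first-order (delta) and second-order (delta-delta) regression coefficients to each static frame. The following is a minimal sketch under that assumption; the function names `delta` and `static_dynamic`, the regression window `width=2`, and the edge handling are illustrative and not taken from the paper.

```python
from typing import List

Frame = List[float]


def delta(frames: List[Frame], width: int = 2) -> List[Frame]:
    """Regression-based delta coefficients along the time axis.

    Assumption: frames beyond either end are replaced by the nearest
    edge frame, and `width` frames are used on each side.
    """
    num_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out: List[Frame] = []
    for t in range(num_frames):
        row: Frame = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def static_dynamic(static: List[Frame]) -> List[Frame]:
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


if __name__ == "__main__":
    # Hypothetical input: 10 frames of 4 static coefficients each.
    static = [[float(t)] * 4 for t in range(10)]
    features = static_dynamic(static)
    print(len(features), len(features[0]))
```

With this layout, each frame passed to the backend model has three times as many coefficients as the static-only frame.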


Tonal Contour Generation for Isarn Speech Synthesis Using Deep Learning and Sampling-Based F0 Representation

Applied Sciences ◽ 10.3390/app10186381 ◽ 2020 ◽ Vol 10 (18) ◽ pp. 6381 ◽ Cited By ~ 1

Author(s): Pongsathon Janyoi ◽ Pusadee Seresangtakul

Keyword(s): Neural Network ◽ Deep Learning ◽ Recurrent Neural Network ◽ Speech Synthesis ◽ Critical Factor ◽ Dynamic Features ◽ Linguistic Features ◽ Proposed Model ◽ F0 Contour ◽ Tonal Contour

The modeling of fundamental frequency (F0) in speech synthesis is a critical factor affecting the intelligibility and naturalness of synthesized speech. In this paper, we focus on improving the modeling of F0 for Isarn speech synthesis. We propose an F0 model for this purpose based on a recurrent neural network (RNN). Values of F0 sampled at the syllable level of continuous Isarn speech are combined with their dynamic features to represent supra-segmental properties of the F0 contour. Different architectures of deep RNNs and different combinations of linguistic features are analyzed to find the conditions that give the best performance. To assess the proposed method, we compared it with several RNN-based baselines. The results of objective and subjective tests indicate that the proposed model significantly outperformed both the baseline RNN model that predicts values of F0 at the frame level and the baseline RNN model that represents the F0 contours of syllables using the discrete cosine transform.
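
The sampling-based representation can be illustrated by resampling the frame-level F0 values of one syllable to a fixed number of equally spaced points, so that every syllable yields a vector of the same length regardless of its duration. This is only a sketch under assumptions: the abstract does not state the number of sampling points or the interpolation method, so `num_points=5`, linear interpolation, and the function name `sample_contour` are hypothetical choices.

```python
from typing import List


def sample_contour(f0: List[float], num_points: int = 5) -> List[float]:
    """Resample one syllable's F0 contour to `num_points` equally spaced values.

    Assumption: `f0` holds voiced (or already interpolated) F0 values in Hz,
    one per frame, and linear interpolation between frames is acceptable.
    """
    if not f0:
        raise ValueError("f0 must contain at least one frame")
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    last = len(f0) - 1
    samples: List[float] = []
    for i in range(num_points):
        pos = i * last / (num_points - 1)  # fractional frame index
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        samples.append((1.0 - frac) * f0[lo] + frac * f0[hi])
    return samples


if __name__ == "__main__":
    # Hypothetical rising contour over seven frames.
    syllable_f0 = [180.0, 184.0, 190.0, 197.0, 205.0, 212.0, 218.0]
    print(sample_contour(syllable_f0, num_points=5))
```

The first and last samples always coincide with the first and last frames of the syllable, so the endpoints of the tone contour are preserved.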


F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation

The International Arab Journal of Information Technology ◽ 10.34028/iajit/17/6/9 ◽ 2020 ◽ Vol 17 (6) ◽ pp. 906-915

Author(s): Pongsathon Janyoi ◽ Pusadee Seresangtakul

Keyword(s): Speech Synthesis ◽ Deep Neural Networks ◽ Markov Models ◽ Feature Representation ◽ Context Dependency ◽ Dynamic Features ◽ Synthesis System ◽ Proposed Model ◽ Training Sets

The generation of the fundamental frequency (F0) plays an important role in speech synthesis and directly influences the naturalness of synthetic speech. In conventional parametric speech synthesis, F0 is predicted frame by frame. This method is insufficient to represent F0 contours in larger units, especially the tone contours of syllables in tonal languages, which deviate as a result of long-term context dependency. This work proposes a syllable-level F0 model that represents F0 contours within syllables, using syllable-level F0 parameters that comprise the sampled F0 points and their dynamic features. A Deep Neural Network (DNN) was used to represent the relationships between syllable-level contextual features and syllable-level F0 parameters. The proposed model was examined using an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms the baseline systems based on hidden Markov models and DNNs that predict F0 values at the frame level.
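
The abstract does not name the objective measures it used. One measure often reported when comparing predicted and reference F0 contours is the root mean square error (RMSE) over frames that are voiced in both. The sketch below assumes that measure and assumes that unvoiced frames are marked with an F0 of 0; the function name `f0_rmse` is hypothetical.

```python
import math
from typing import Sequence


def f0_rmse(reference: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE in Hz over frames where both contours are voiced.

    Assumption: a value of 0 (or less) marks an unvoiced frame, and the
    two contours are already time-aligned frame by frame.
    """
    if len(reference) != len(predicted):
        raise ValueError("contours must have the same number of frames")
    squared = [
        (r - p) ** 2
        for r, p in zip(reference, predicted)
        if r > 0 and p > 0
    ]
    if not squared:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical contours; the second frame is unvoiced in the reference.
    ref = [100.0, 0.0, 120.0]
    pred = [103.0, 110.0, 116.0]
    print(f0_rmse(ref, pred))
```

A lower value means the predicted contour is closer to the reference, which is how a syllable-level model and a frame-level baseline could be compared on the same test utterances.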
