Tibetan speech synthesis based on an improved neural network

MATEC Web of Conferences ◽ 10.1051/matecconf/202133606012 ◽ 2021 ◽ Vol 336 ◽ pp. 06012

Author(s):

Yuntao Ding ◽ Rangzhuoma Cai ◽ Baojia Gong

Keyword(s):

Neural Network ◽ Experimental Data ◽ Speech Synthesis ◽ Synthesis Method ◽ Attention Mechanism ◽ Linear Projection ◽ Convolution Operation

Tibetan speech synthesis based on neural networks has become the mainstream synthesis method. Within this approach, the Griffin-Lim vocoder is widely used because it is relatively simple. To address the low fidelity of the Griffin-Lim vocoder, this paper uses a WaveNet vocoder in its place for Tibetan speech synthesis. The model first uses convolution operations and an attention mechanism to extract sequence features, then uses linear projection and a feature amplification module to predict the mel spectrogram, and finally uses the WaveNet vocoder to synthesize the speech waveform. Experimental data show that our model performs better in Tibetan speech synthesis.
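
The feature-extraction step described in this abstract (a convolution followed by an attention mechanism) can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the function names `conv1d_same` and `self_attention`, the kernel size, the feature dimensions, and the use of plain scaled dot-product self-attention are all assumptions made for illustration.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution over time with zero 'same' padding.

    x:      (T, D)        input sequence (hypothetical character embeddings)
    kernel: (K, D, D_out) convolution weights (random here, learned in practice)
    """
    K = kernel.shape[0]
    pad_left = (K - 1) // 2
    pad_right = K - 1 - pad_left
    xp = np.pad(x, ((pad_left, pad_right), (0, 0)))
    # Slide a window of K frames over the padded sequence.
    return np.stack([
        np.einsum("kd,kde->e", xp[t:t + K], kernel)
        for t in range(x.shape[0])
    ])

def self_attention(h):
    """Scaled dot-product self-attention (an assumed, generic attention form)."""
    d = h.shape[1]
    scores = h @ h.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ h, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))               # 6 time steps, 4 features (made up)
kernel = rng.normal(size=(3, 4, 8)) * 0.1
h = conv1d_same(x, kernel)                # local features, shape (6, 8)
context, weights = self_attention(h)      # sequence-level features, shape (6, 8)
```

In a full system the resulting features would pass through a learned projection to predict mel-spectrogram frames, which a vocoder then converts to a waveform; those stages are omitted here.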


Gated Recurrent Attention for Multi-Style Speech Synthesis

Applied Sciences ◽ 10.3390/app10155325 ◽ 2020 ◽ Vol 10 (15) ◽ pp. 5325

Author(s):

Sung Jun Cheon ◽ Joun Yeop Lee ◽ Byoung Jin Choi ◽ Hyeonseung Lee ◽ Nam Soo Kim

Keyword(s):

Neural Network ◽ Speech Synthesis ◽ Attention Mechanism ◽ Training Data ◽ Attention Model ◽ Synthesis Techniques ◽ Listening Tests ◽ End To End ◽ And Control

End-to-end neural network-based speech synthesis techniques have been developed to represent and synthesize speech in various prosodic styles. Although end-to-end techniques enable the transfer of a style with a single style-representation vector, it has been reported that the speaker similarity of speech synthesized with an unseen speaker-style is low. One reason for this problem is that the attention mechanism in the end-to-end model is overfitted to the training data. To learn and synthesize voices of various styles, an attention mechanism that can preserve longer-term context and control that context is required. In this paper, we propose a novel attention model that employs gates to control the recurrences in the attention. To verify the proposed attention’s style modeling capability, perceptual listening tests were conducted. The experiments show that the proposed attention outperforms location-sensitive attention in both similarity and naturalness.
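
The idea of using a gate to control a recurrence can be sketched generically as follows. The abstract does not give the model's equations, so this is not the proposed attention model: the GRU-style interpolation, the function name `gated_update`, and the parameters `w_gate` and `b_gate` are assumptions chosen only to show how a gate decides how much previous context is carried forward.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_update(prev_state, candidate, w_gate, b_gate):
    """Blend the previous recurrent state with a new candidate state.

    gate close to 1 -> keep the previous (longer-term) context
    gate close to 0 -> overwrite it with the new candidate
    w_gate has shape (2 * D, D) and b_gate has shape (D,); both are
    hypothetical parameters that would be learned in a real model.
    """
    gate = sigmoid(np.concatenate([prev_state, candidate]) @ w_gate + b_gate)
    return gate * prev_state + (1.0 - gate) * candidate, gate

D = 4
prev_state = np.ones(D)
candidate = np.zeros(D)
w_gate = np.zeros((2 * D, D))

# A strongly positive bias holds the gate open and preserves the old context.
kept, _ = gated_update(prev_state, candidate, w_gate, np.full(D, 50.0))
# A strongly negative bias closes the gate and replaces the context.
replaced, _ = gated_update(prev_state, candidate, w_gate, np.full(D, -50.0))
```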


RobuTrans: A Robust Transformer-Based Text-to-Speech Model

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v34i05.6337 ◽ 2020 ◽ Vol 34 (05) ◽ pp. 8228-8235

Author(s):

Naihan Li ◽ Yanqing Liu ◽ Yu Wu ◽ Shujie Liu ◽ Sheng Zhao ◽ ...

Keyword(s):

Neural Network ◽ Speech Synthesis ◽ Neural Model ◽ Attention Mechanism ◽ Maximum Length ◽ Prosodic Features ◽ Text To Speech ◽ Linguistic Features ◽ Excellent Quality ◽ Speech Model

Recently, neural network-based speech synthesis has achieved outstanding results, producing synthesized audio of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue, which results in abnormal audio (bad cases), especially for unusual text (unseen context). To build a neural model that can synthesize audio that is both natural and stable, in this paper we analyze in depth why previous neural TTS models are not robust and, based on that analysis, propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared to TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, and then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. In addition, the position embedding is replaced with a 1-D CNN, since the position embedding constrains the maximum length of synthesized audio. With these modifications, our model not only fixes the robustness problem but also achieves a MOS (4.36) on par with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.
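
One common way to realize a duration-based hard alignment is to repeat each encoder output for as many frames as its duration, which is equivalent to multiplying by a one-hot alignment matrix. The sketch below shows that equivalence; it is an assumption about how such a mechanism can work, not the RobuTrans implementation, and the names `hard_alignment` and `expand_by_duration` and the example durations are hypothetical.

```python
import numpy as np

def hard_alignment(durations):
    """Build a one-hot (frames x phonemes) alignment matrix from durations."""
    durations = np.asarray(durations, dtype=int)
    # Frame i attends to exactly one phoneme: the one whose span covers it.
    phoneme_of_frame = np.repeat(np.arange(len(durations)), durations)
    align = np.zeros((phoneme_of_frame.size, len(durations)))
    align[np.arange(phoneme_of_frame.size), phoneme_of_frame] = 1.0
    return align

def expand_by_duration(encoder_out, durations):
    """Repeat each encoder vector for its number of output frames."""
    durations = np.asarray(durations, dtype=int)
    if durations.shape[0] != encoder_out.shape[0]:
        raise ValueError("need one duration per encoder position")
    return np.repeat(encoder_out, durations, axis=0)

enc = np.arange(6, dtype=float).reshape(3, 2)   # 3 phonemes, 2 features (made up)
durations = [2, 1, 3]                           # frames per phoneme (made up)

frames = expand_by_duration(enc, durations)     # shape (6, 2)
align = hard_alignment(durations)               # shape (6, 3), one 1 per row
```

Because every row of `align` is one-hot, `align @ enc` gives the same result as `frames`. Each output frame depends on exactly one input position, so the alignment cannot skip or repeat text the way a learned soft attention sometimes does.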


A fuzzy neural network approach for modeling the growth kinetics of FeB and Fe2B layers during the boronizing process

Matériaux & Techniques ◽ 10.1051/mattech/2019002 ◽ 2018 ◽ Vol 106 (6) ◽ pp. 603 ◽ Cited By ~ 2

Author(s):

Bendaoud Mebarek ◽ Mourad Keddam

Keyword(s):

Neural Network ◽ Experimental Data ◽ Fuzzy Neural Network ◽ Treatment Time ◽ Calculation Model ◽ Average Error ◽ Network Approach ◽ Neural Network Approach ◽ Fuzzy Neural ◽ Kinetics Of

In this paper, we develop a boronizing process simulation model based on a fuzzy neural network (FNN) approach for estimating the thickness of the FeB and Fe2B layers. The model combines two artificial intelligence techniques: fuzzy logic and neural networks. The characteristics of the fuzzy neural network approach for modelling the boronizing process are presented in this study. To validate the results of our calculation model, we used a learning base of experimental data from the powder-pack boronizing of Fe-15Cr alloy in the temperature range from 800 to 1050 °C and for treatment times ranging from 0.5 to 12 h. The results show that it is possible to estimate the influence of different process parameters. When the results obtained by the artificial neural network are compared with the experimental data, the average error of the fuzzy neural network is 3% for the FeB layer and 3.5% for the Fe2B layer. The results obtained from the fuzzy neural network approach are in agreement with the experimental data. Finally, the fuzzy neural network approach is well suited to modelling the boronizing kinetics of Fe-15Cr alloy.
