Artificial Generation of Realistic Voices
In this paper, we propose an end-to-end text-to-speech system in which a user supplies input text that is synthesized into an artificial voice at the output. The model generates speech from trained datasets and organizes the pipeline into three components: a Speaker Encoder, a Synthesizer, and a Vocoder. Trained on these datasets, the model generates voice while maintaining the naturalness of speech throughout; to preserve naturalness for unseen speakers, we apply a zero-shot adaptation technique. The model's primary capability is voice regeneration, which has a wide range of applications in the domain of speech synthesis. With the help of the speaker encoder, the model can also synthesize speech in a user's own voice: the user records a sample through the microphone provided in the GUI, and the system then generates similar voice waveforms for any input text.
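The three-stage pipeline described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the class names, dimensions, and internals are assumptions standing in for trained neural networks (a real speaker encoder is typically an LSTM trained with a speaker-verification loss, the synthesizer a Tacotron-style attention model, and the vocoder an autoregressive or GAN-based waveform model).

```python
import numpy as np

class SpeakerEncoder:
    """Maps a reference waveform to a fixed-size speaker embedding (toy stand-in)."""
    def __init__(self, embed_dim=256):
        self.embed_dim = embed_dim

    def embed(self, waveform):
        # Frame the waveform, compute simple per-frame statistics, and project
        # to embed_dim; a real encoder would be a trained neural network.
        frames = waveform[: len(waveform) // 160 * 160].reshape(-1, 160)
        stats = np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
        rng = np.random.default_rng(0)           # fixed projection for reproducibility
        proj = rng.standard_normal((self.embed_dim, stats.size))
        e = proj @ stats
        return e / (np.linalg.norm(e) + 1e-8)    # unit-norm embedding

class Synthesizer:
    """Maps (text, speaker embedding) to a mel spectrogram (toy stand-in)."""
    def __init__(self, n_mels=80, frames_per_char=5):
        self.n_mels, self.frames_per_char = n_mels, frames_per_char

    def synthesize(self, text, speaker_embedding):
        n_frames = len(text) * self.frames_per_char
        # Condition every frame on the speaker embedding (here: a fixed projection).
        rng = np.random.default_rng(1)
        proj = rng.standard_normal((self.n_mels, speaker_embedding.size))
        base = proj @ speaker_embedding
        return np.tile(base[:, None], (1, n_frames))   # shape (n_mels, n_frames)

class Vocoder:
    """Maps a mel spectrogram back to a waveform (toy stand-in)."""
    def synthesize(self, mel, hop=160):
        # One hop of audio per mel frame; real vocoders model samples directly.
        return np.repeat(mel.mean(axis=0), hop)

# End-to-end flow: reference voice + text -> waveform in a similar voice.
reference = np.sin(np.linspace(0, 100, 16000))   # 1 s of dummy reference audio
embedding = SpeakerEncoder().embed(reference)    # who should it sound like
mel = Synthesizer().synthesize("hello world", embedding)  # what should be said
audio = Vocoder().synthesize(mel)                # final waveform
```

The key design point this sketch reflects is the separation of concerns: the speaker embedding captures voice identity independently of the text, which is what makes zero-shot adaptation to a new voice (e.g. one recorded through the GUI microphone) possible without retraining the synthesizer or vocoder.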