A Fast and Lightweight Text-To-Speech Model with Spectrum and Waveform Alignment Algorithms

Author(s): Kihyuk Jeong, Huu-Kim Nguyen, Hong-Goo Kang
Author(s): Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, ...

2021
Author(s): Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, ...

2020, Vol 34 (05), pp. 8228-8235
Author(s): Naihan Li, Yanqing Liu, Yu Wu, Shujie Liu, Sheng Zhao, ...

Recently, neural network based speech synthesis has achieved outstanding results: the synthesized audio is of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue that produces abnormal audio (bad cases), especially for unusual text (unseen contexts). To build a neural model that synthesizes both natural and stable audio, in this paper we analyze in depth why previous neural TTS models are not robust, and based on this analysis we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared to TransformerTTS, our model first converts input text to linguistic features, including phonemic and prosodic features, and then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. In addition, the position embedding is replaced with a 1-D CNN, since the former constrains the maximum length of synthesized audio. With these modifications, our model not only fixes the robustness problem but also achieves a MOS (4.36) on par with TransformerTTS (4.37) and Tacotron2 (4.37) on our general test set.
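The duration-based hard attention mentioned in the abstract can be illustrated with a length-regulator-style expansion, where each phoneme's encoder state is repeated for its predicted number of mel frames. This is a minimal numpy sketch of that idea; the function name and shapes are assumptions for illustration, not RobuTrans's actual code.

```python
import numpy as np

def expand_by_duration(encoder_out, durations):
    """Duration-based hard alignment: repeat each phoneme's encoder
    vector durations[i] times so the expanded sequence has exactly
    sum(durations) frames (illustrative sketch, not the paper's code)."""
    assert len(encoder_out) == len(durations)
    return np.repeat(encoder_out, durations, axis=0)

# 3 phonemes with 4-dim encoder states; durations are in mel frames.
enc = np.arange(12, dtype=np.float32).reshape(3, 4)
dur = np.array([2, 1, 3])
frames = expand_by_duration(enc, dur)
print(frames.shape)  # (6, 4): one row per mel frame, sum(dur) frames
```

Because the alignment is a deterministic expansion rather than a learned soft attention, failure modes such as word skipping and repetition cannot occur, which is the robustness argument the abstract makes.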


Author(s): Dhruva Mahajan, Ashish Gapat, Lalita Moharkar, Prathamesh Sawant, ...

In this paper, we propose an end-to-end text-to-speech system in which a user feeds in input text that is synthesized, varied, and altered into an artificial voice at the output. The goal is a text-to-speech model, that is, a model capable of generating speech with the help of trained datasets. The pipeline organizes the entire function into three parts: a Speaker Encoder, a Synthesizer, and a Vocoder. Using these datasets, the model generates voice after prior training and maintains the naturalness of the speech throughout; for naturalness of speech we implement a zero-shot adaptation technique. The primary capability of the model is voice regeneration, which has a variety of applications in advancing the domain of speech synthesis. With the help of the speaker encoder, our model synthesizes a user's own voice if the user wants the output trained on his/her voice, which is fed in through the microphone in the GUI. Regeneration capabilities lie within the Voice Regeneration domain, which generates similar voice waveforms for any text.
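The three-stage flow described above (speaker encoder → synthesizer → vocoder) can be sketched as follows. These are dummy numpy stand-ins showing only the data shapes passed between stages; in the real system each stage is a trained neural network, and all names, dimensions, and the hop size here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_encoder(reference_wav):
    """Map a reference utterance to a fixed-size speaker embedding
    (256-dim here is an illustrative choice)."""
    return np.tanh(rng.standard_normal(256))

def synthesizer(text, speaker_embedding):
    """Produce a mel spectrogram conditioned on text and speaker.
    The 10-frames-per-character length rule is a crude placeholder."""
    n_frames = 10 * len(text)
    return rng.standard_normal((n_frames, 80)) + speaker_embedding[:80]

def vocoder(mel):
    """Convert the mel spectrogram to a waveform (hop size 256 assumed)."""
    return rng.standard_normal(mel.shape[0] * 256)

# One pass through the pipeline: reference audio fixes the voice,
# arbitrary text fixes the content.
wav = vocoder(synthesizer("hello", speaker_encoder(np.zeros(16000))))
print(wav.shape)  # (12800,): 50 mel frames * 256 samples per frame
```

The key design point is that the speaker embedding is computed once from reference audio and then conditions the synthesizer, which is what enables zero-shot adaptation to a new voice without retraining.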


2019
Author(s): Nishant Prateek, Mateusz Łajszczak, Roberto Barra-Chicote, Thomas Drugman, Jaime Lorenzo-Trueba, ...

1982, Vol 13 (2), pp. 129-133
Author(s): A. D. Pellegrini

The paper explores the processes by which children use private speech to regulate their behaviors. The first part of the paper explores the ontological development of self-regulating private speech. The theories of Vygotsky and Luria are used to explain this development. The second part of the paper applies these theories to pedagogical settings. The process by which children are exposed to dialogue strategies that help them solve problems is outlined. The strategy has children posing and answering four questions: What is the problem? How will I solve it? Am I using the plan? How did it work? It is argued that this model helps children systematically mediate their problem solving processes.

