The Kestrel TTS text normalization system

2014 ◽  
Vol 21 (3) ◽  
pp. 333-353 ◽  
Author(s):  
PETER EBDEN ◽  
RICHARD SPROAT

Abstract. This paper describes the Kestrel text normalization system, a component of the Google text-to-speech synthesis (TTS) system. At the core of Kestrel are text-normalization grammars that are compiled into libraries of weighted finite-state transducers (WFSTs). While the use of WFSTs for text normalization is itself not new, Kestrel differs from previous systems in its separation of the initial tokenization and classification phase of analysis from verbalization. Input text is first tokenized and different tokens classified using WFSTs. As part of the classification, detected semiotic classes (expressions such as currency amounts, dates, times, and measure phrases) are parsed into protocol buffers (https://code.google.com/p/protobuf/). The protocol buffers are then verbalized, with possible reordering of the elements, again using WFSTs. This paper describes the architecture of Kestrel and the protocol buffer representations of semiotic classes, and presents some examples of grammars for various languages. We also discuss applications and deployments of Kestrel as part of the Google TTS system, which runs on both the server and client side on multiple devices, and is used daily by millions of people in nineteen languages and counting.
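The two-phase design described above can be sketched in miniature. This is only an illustrative assumption of the data flow, not Kestrel's implementation: the real system compiles grammars into WFSTs and uses protocol buffers, whereas here plain regexes and a dict stand in for both.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0-99; a stand-in for a full number-name grammar."""
    if n < 20:
        return ONES[n]
    tens, rest = divmod(n, 10)
    return TENS[tens] + ("" if rest == 0 else " " + ONES[rest])

def classify(token):
    """Phase 1: tokenize/classify -- parse a semiotic class into a
    structured record (a dict standing in for a protocol buffer)."""
    m = re.fullmatch(r"\$(\d+)\.(\d{2})", token)
    if m:
        return {"class": "money", "currency": "usd",
                "units": int(m.group(1)), "cents": int(m.group(2))}
    return {"class": "plain", "text": token}

def verbalize(rec):
    """Phase 2: verbalize the record. Note the reordering relative to the
    written form: the currency word of "$2.50" moves after the units."""
    if rec["class"] == "money":
        return (f'{number_to_words(rec["units"])} dollars '
                f'{number_to_words(rec["cents"])} cents')
    return rec["text"]
```

Separating the two phases means the same classification grammar can feed different verbalizers, which is the design point the abstract emphasizes.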

2013 ◽  
Vol 11 (2) ◽  
pp. 2241-2249
Author(s):  
Dr. K.V.N. Sunitha ◽  
P.Sunitha Devi

Most areas of language and speech technology require, directly or indirectly, the handling of unrestricted text, and text-to-speech systems in particular must work on real text. To build a natural-sounding speech synthesis system, it is essential that the text processing component produce an appropriate sequence of phonemic units corresponding to an arbitrary input text. We use an approach in which the input text is tokenized and each token is classified by type; token sense disambiguation is achieved through the semantic nature of the language, and expansion rules are then applied to obtain the normalized text. However, little work has been done on text normalization for Telugu. In this paper we describe our efforts to design a rule-based system for text normalization in the context of building a Telugu text-to-speech system.
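A minimal sketch of the rule-based pipeline described above: classify each token by type, disambiguate ambiguous tokens from context, then apply per-type expansion rules. English expansions and the specific context words are assumptions for illustration; the actual system operates on Telugu.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digit(n):
    """Spell out 0-99 (the sketch handles only this range)."""
    if n < 20:
        return ONES[n]
    t, r = divmod(n, 10)
    return TENS[t] + ("" if r == 0 else " " + ONES[r])

def classify_token(token, prev_word):
    """Token-type classification with context-based sense disambiguation:
    a four-digit number after 'in'/'since'/'by' is read as a year."""
    if re.fullmatch(r"\d{4}", token) and prev_word in {"in", "since", "by"}:
        return "year"
    if re.fullmatch(r"\d+", token):
        return "cardinal"
    return "word"

def expand(token, ttype):
    """Per-type expansion rules (years read as two digit pairs)."""
    if ttype == "year":
        return f"{two_digit(int(token[:2]))} {two_digit(int(token[2:]))}"
    if ttype == "cardinal":
        return two_digit(int(token))
    return token

def normalize(text):
    words, prev = [], ""
    for tok in text.split():
        words.append(expand(tok, classify_token(tok, prev)))
        prev = tok
    return " ".join(words)
```

The same token ("1998") thus receives different expansions depending on its classified sense, which is the disambiguation step the abstract refers to.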


1996 ◽  
Vol 2 (4) ◽  
pp. 369-380 ◽  
Author(s):  
RICHARD SPROAT

We present a model of text analysis for text-to-speech (TTS) synthesis based on (weighted) finite state transducers, which serves as the text analysis module of the multilingual Bell Labs TTS system. The transducers are constructed using a lexical toolkit that allows declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules, inter alia. To date, the model has been applied to eight languages: Spanish, Italian, Romanian, French, German, Russian, Mandarin and Japanese.
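The declarative style described above can be hinted at with a toy example: numeral-expansion rules written as per-language data tables, interpreted by one shared engine rather than hard-coded per language. The tables and template functions here are illustrative assumptions (the actual toolkit compiles such descriptions into weighted FSTs); German is used because its two-digit numerals reverse the element order.

```python
UNITS = {
    "en": {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"},
    "de": {1: "ein", 2: "zwei", 3: "drei", 4: "vier", 5: "fünf"},
}
TENS = {
    "en": {20: "twenty", 30: "thirty", 40: "forty"},
    "de": {20: "zwanzig", 30: "dreißig", 40: "vierzig"},
}
# Per-language template for two-digit numbers: German reverses the order,
# so 21 -> "einundzwanzig", literally "one-and-twenty".
TEMPLATE = {
    "en": lambda tens, unit: f"{tens} {unit}",
    "de": lambda tens, unit: f"{unit}und{tens}",
}

def expand(n, lang):
    """Expand a two-digit numeral using the language's declarative tables."""
    tens, unit = n - n % 10, n % 10
    if unit == 0:
        return TENS[lang][tens]
    return TEMPLATE[lang](TENS[lang][tens], UNITS[lang][unit])
```

Keeping the language-specific facts in data and the mechanism in one engine is what makes such a toolkit scale to eight languages.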


Author(s):  
Mahbubur R. Syed ◽  
Shuvro Chakrobartty ◽  
Robert J. Bignall

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech simulated by a machine in such a way that it sounds as if it were produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.


2009 ◽  
Vol 20 (04) ◽  
pp. 613-627 ◽  
Author(s):  
CYRIL ALLAUZEN ◽  
MEHRYAR MOHRI

Composition of weighted transducers is a fundamental algorithm used in many applications, including for computing complex edit-distances between automata, or string kernels in machine learning, or to combine different components of a speech recognition, speech synthesis, or information extraction system. We present a generalization of the composition of weighted transducers, n-way composition, which is dramatically faster in practice than the standard composition algorithm when combining more than two transducers. The worst-case complexity of our algorithm for composing three transducers T_1, T_2, and T_3, resulting in T, is O(|T|_Q min(d(T_1) d(T_3), d(T_2)) + |T|_E), where |·|_Q denotes the number of states, |·|_E the number of transitions, and d(·) the maximum out-degree. As in regular composition, the use of perfect hashing requires a pre-processing step with linear-time expected complexity in the size of the input transducers. In many cases, this approach significantly improves on the complexity of standard composition. Our algorithm also leads to a dramatically faster composition in practice. Furthermore, standard composition can be obtained as a special case of our algorithm. We report the results of several experiments demonstrating this improvement. These theoretical and empirical improvements significantly enhance performance in the applications already mentioned.
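For orientation, here is a sketch of the standard pairwise composition that the paper generalizes, not the paper's n-way algorithm itself. It assumes epsilon-free transducers over the tropical semiring (weights add along paths); the tuple representation is an assumption for illustration. States of T1 ∘ T2 are pairs (q1, q2), and a transition exists when T1's output symbol matches T2's input symbol. Composing three transducers by calling this twice materializes the large intermediate T1 ∘ T2, which is precisely the cost n-way composition avoids.

```python
from collections import defaultdict

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers.

    Each transducer is (initial_state, final_states, arcs) with arcs as
    (src, in_sym, out_sym, dst, weight).  Weights add (tropical semiring).
    Only states reachable from the pair of initials are constructed.
    """
    init = (t1[0], t2[0])
    finals, arcs = set(), []
    # Index t2's arcs by (source state, input symbol) for matching.
    by_state_in = defaultdict(list)
    for (s, i, o, d, w) in t2[2]:
        by_state_in[(s, i)].append((o, d, w))
    stack, seen = [init], {init}
    while stack:
        q1, q2 = stack.pop()
        if q1 in t1[1] and q2 in t2[1]:
            finals.add((q1, q2))
        for (s, i, o, d, w) in t1[2]:
            if s != q1:
                continue
            # Match T1's output symbol against T2's input symbol.
            for (o2, d2, w2) in by_state_in[(q2, o)]:
                dst = (d, d2)
                arcs.append(((q1, q2), i, o2, dst, w + w2))
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    return (init, finals, arcs)
```

A transducer mapping "a" to "b" composed with one mapping "b" to "c" yields one mapping "a" to "c", with the path weights summed.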


2011 ◽  
Vol 18 (1) ◽  
pp. 81-89 ◽  
Author(s):  
M. Beyreuther ◽  
J. Wassermann

Abstract. Automatic earthquake detection and classification is required for efficient analysis of large seismic datasets. Such techniques are particularly important now because access to measures of ground motion is nearly unlimited and the target waveforms (earthquakes) are often hard to detect and classify. Here, we propose to use models from speech synthesis which extend the double stochastic models from speech recognition by integrating a more realistic duration of the target waveforms. The method, which has general applicability, is applied to earthquake detection and classification. First, we generate characteristic functions from the time-series. The Hidden semi-Markov Models are estimated from the characteristic functions and Weighted Finite-State Transducers are constructed for the classification. We test our scheme on one month of continuous seismic data, which corresponds to 370 151 classifications, showing that incorporating the time dependency explicitly in the models significantly improves the results compared to Hidden Markov Models.
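As a hedged illustration of the first step above (turning a raw trace into a characteristic function), here is the classic short-term-average / long-term-average (STA/LTA) ratio widely used in seismic event detection; it rises sharply at the onset of a transient. The paper's characteristic functions may differ, and the window lengths here are arbitrary assumptions.

```python
def sta_lta(trace, n_sta, n_lta):
    """STA/LTA ratio of squared amplitudes over trailing sliding windows.

    Returns one value per sample from index n_lta onward; values well
    above 1 indicate a sudden energy increase (a candidate event onset).
    """
    energy = [x * x for x in trace]
    out = []
    for i in range(n_lta, len(trace)):
        sta = sum(energy[i - n_sta:i]) / n_sta    # short-term average
        lta = sum(energy[i - n_lta:i]) / n_lta    # long-term average
        out.append(sta / lta if lta > 0 else 0.0)
    return out
```

In the scheme described above, such per-sample features, rather than the raw waveform, are what the hidden (semi-)Markov models are estimated from.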


2019 ◽  
Vol 45 (2) ◽  
pp. 293-337 ◽  
Author(s):  
Hao Zhang ◽  
Richard Sproat ◽  
Axel H. Ng ◽  
Felix Stahlberg ◽  
Xiaochang Peng ◽  
...  

Machine learning techniques, including neural networks, have been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that 123 is verbalized as one hundred twenty three in 123 pages but as one twenty three in 123 King Ave. For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars. We propose neural network models that treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model, in accuracy and efficiency, is one where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process. These models perform very well overall, but occasionally they will predict wildly inappropriate verbalizations, such as reading 3 cm as three kilometers. Although rare, such verbalizations are a major issue for TTS applications. We thus use finite-state covering grammars to guide the neural models, either during both training and decoding or only during decoding, away from such “unrecoverable” errors. Such grammars can largely be learned from data.
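The guarding idea above can be sketched as follows. This is an illustrative assumption, not the paper's mechanism: real covering grammars are finite-state and constrain decoding directly, whereas here a small function enumerating acceptable readings of a "<digit> <unit>" token stands in for the grammar, and a candidate list stands in for the neural model's output.

```python
# Toy "covering grammar" data: acceptable words per unit and digit.
UNIT_WORDS = {"cm": ["centimeter", "centimeters"],
              "km": ["kilometer", "kilometers"]}
DIGITS = {"1": "one", "2": "two", "3": "three"}

def covering_set(token):
    """All verbalizations the covering grammar licenses for a measure token."""
    num, unit = token.split()
    return {f"{DIGITS[num]} {w}" for w in UNIT_WORDS[unit]}

def guarded_verbalize(token, neural_candidates):
    """Pick the best neural candidate licensed by the grammar.

    neural_candidates is assumed sorted best-first.  Anything outside the
    covering set -- e.g. reading "3 cm" as "three kilometers" -- is vetoed,
    eliminating the "unrecoverable" errors discussed above.
    """
    allowed = covering_set(token)
    for cand in neural_candidates:
        if cand in allowed:
            return cand
    return min(allowed)  # fall back to some grammar-licensed reading
```

The neural model still chooses among grammatical readings, so its contextual strengths are kept while its rare catastrophic outputs are blocked.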


Author(s):  
Beiming Cao ◽  
Myungjong Kim ◽  
Jan van Santen ◽  
Ted Mau ◽  
Jun Wang
