An analysis of machine translation and speech synthesis in speech-to-speech translation system

Author(s):  
Kei Hashimoto ◽  
Junichi Yamagishi ◽  
William Byrne ◽  
Simon King ◽  
Keiichi Tokuda
2012 ◽  
Vol 54 (7) ◽  
pp. 857-866 ◽  

2017 ◽  
Vol 11 (4) ◽  
pp. 55
Author(s):  
Parnyan Bahrami Dashtaki

Speech-to-speech translation is a challenging problem, due to the poor sentence planning typically associated with spontaneous speech as well as errors introduced by automatic speech recognition. Building on a statistically trained speech translation system, this study investigates the methodologies and metrics used to assess speech-to-speech translation systems. Translation is performed incrementally, based on partial hypotheses generated by the speech recognizer. Speech-input translation can be approached as a pattern recognition problem by means of statistical alignment models and stochastic finite-state transducers, and several specific models are presented under this general framework. One feature of such models is their ability to learn automatically from training examples. The speech translation system consists of three modules: automatic speech recognition, machine translation, and text-to-speech synthesis. Many procedures for combining speech recognition and machine translation have been proposed, and this research examines the methodologies and metrics used to evaluate them.
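A minimal Python sketch of the cascaded pipeline described in the abstract: ASR, MT, and TTS modules chained together, with translation triggered incrementally on partial recognition hypotheses. All class and function names here are hypothetical placeholders for illustration, not the system evaluated in the paper.

```python
# Sketch of a cascaded speech-to-speech translation pipeline: ASR -> MT -> TTS,
# with translation run incrementally on partial ASR hypotheses.
# Every component below is a hypothetical stand-in, not a real system.

from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class PartialHypothesis:
    text: str        # recognized source-language words so far
    is_final: bool   # True once the recognizer commits the segment


class SpeechRecognizer:
    """Placeholder ASR module emitting growing partial hypotheses."""
    def stream(self, audio_chunks: Iterator[bytes]) -> Iterator[PartialHypothesis]:
        words: List[str] = []
        for i, _chunk in enumerate(audio_chunks):
            words.append(f"word{i}")                       # stand-in decoding
            yield PartialHypothesis(" ".join(words), False)
        yield PartialHypothesis(" ".join(words), True)


class Translator:
    """Placeholder statistical MT module (e.g. alignment / FST based)."""
    def translate(self, source_text: str) -> str:
        return f"<translation of: {source_text}>"


class Synthesizer:
    """Placeholder TTS module."""
    def synthesize(self, target_text: str) -> bytes:
        return target_text.encode("utf-8")                 # stand-in waveform


def speech_to_speech(audio_chunks: Iterator[bytes]) -> bytes:
    asr, mt, tts = SpeechRecognizer(), Translator(), Synthesizer()
    translation = ""
    for hyp in asr.stream(audio_chunks):
        # Translate each partial hypothesis so output can start early;
        # only the final hypothesis is passed on to synthesis here.
        translation = mt.translate(hyp.text)
        if hyp.is_final:
            return tts.synthesize(translation)
    return b""


if __name__ == "__main__":
    dummy_audio = iter([b"\x00" * 160] * 3)                # three fake frames
    print(speech_to_speech(dummy_audio))
```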


Author(s):  
Ryo Fukuda ◽  
Sashi Novitasari ◽  
Yui Oka ◽  
Yasumasa Kano ◽  
Yuki Yano ◽  
...  

2011 ◽  
Author(s):  
Jian Xue ◽  
Xiaodong Cui ◽  
Gregg Daggett ◽  
Etienne Marcheret ◽  
Bowen Zhou

2019 ◽  
Vol 7 ◽  
pp. 313-325 ◽  
Author(s):  
Matthias Sperber ◽  
Graham Neubig ◽  
Jan Niehues ◽  
Alex Waibel

Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. However, experiments are inconclusive on whether the cascade or the direct model is stronger, and have only been conducted under the unrealistic assumption that both are trained on equal amounts of data, ignoring other available speech recognition and machine translation corpora. In this paper, we demonstrate that direct speech translation models require more data to perform well than cascaded models, and although they allow including auxiliary data through multi-task training, they are poor at exploiting such data, putting them at a severe disadvantage. As a remedy, we propose the use of end-to-end trainable models with two attention mechanisms, the first establishing source speech to source text alignments, the second modeling source to target text alignment. We show that such models naturally decompose into multi-task–trainable recognition and translation tasks and propose an attention-passing technique that alleviates error propagation issues in a previous formulation of a model with two attention stages. Our proposed model outperforms all examined baselines and is able to exploit auxiliary training data much more effectively than direct attentional models.
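The PyTorch sketch below illustrates the two-attention idea from the abstract under stated assumptions: a first attention stage aligns source speech to source text, and a second stage attends over the first stage's attentional context vectors (rather than a discrete transcript) to produce the target text. Layer sizes, the dot-product attention, and all names are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal two-stage attention sketch: speech encoder -> transcript decoder with
# attention over speech frames -> translation decoder with attention over the
# first decoder's context vectors ("attention passing"). Illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F


def dot_attention(query, keys):
    """query: (B, H); keys: (B, T, H) -> context vector (B, H)."""
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (B, T)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)   # (B, H)


class TwoStageAttentionST(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Stage 1: speech -> source text (auxiliary recognition task).
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.dec1 = nn.LSTMCell(2 * hidden, hidden)
        self.src_out = nn.Linear(hidden, src_vocab)
        # Stage 2: passed contexts -> target text (translation task).
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.dec2 = nn.LSTMCell(2 * hidden, hidden)
        self.tgt_out = nn.Linear(hidden, tgt_vocab)

    def forward(self, speech, src_tokens, tgt_tokens):
        enc, _ = self.encoder(speech)                          # (B, T, H)
        B, H = enc.size(0), enc.size(2)

        # Stage 1: decode the transcript with attention over speech frames,
        # keeping the context vectors to pass on to stage 2.
        h = c = enc.new_zeros(B, H)
        contexts, src_logits = [], []
        for t in range(src_tokens.size(1)):
            ctx = dot_attention(h, enc)
            contexts.append(ctx)
            step_in = torch.cat([self.src_emb(src_tokens[:, t]), ctx], dim=-1)
            h, c = self.dec1(step_in, (h, c))
            src_logits.append(self.src_out(h))
        passed = torch.stack(contexts, dim=1)                  # (B, S, H)

        # Stage 2: decode the translation with attention over the passed
        # contexts instead of a discrete source transcript.
        h = c = enc.new_zeros(B, H)
        tgt_logits = []
        for t in range(tgt_tokens.size(1)):
            ctx = dot_attention(h, passed)
            step_in = torch.cat([self.tgt_emb(tgt_tokens[:, t]), ctx], dim=-1)
            h, c = self.dec2(step_in, (h, c))
            tgt_logits.append(self.tgt_out(h))

        # Both logit stacks can be trained jointly (multi-task); the
        # recognition stack can also be trained alone on ASR-only corpora.
        return torch.stack(src_logits, 1), torch.stack(tgt_logits, 1)


if __name__ == "__main__":
    model = TwoStageAttentionST()
    speech = torch.randn(2, 50, 80)                            # (B, frames, feats)
    src = torch.randint(0, 1000, (2, 12))                      # transcript ids
    tgt = torch.randint(0, 1000, (2, 15))                      # translation ids
    asr_logits, mt_logits = model(speech, src, tgt)
    print(asr_logits.shape, mt_logits.shape)                   # sanity check
```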

