Speech Recognition, Machine Translation, and Speech Translation—A Unified Discriminative Learning Paradigm [Lecture Notes]

2011 ◽  
Vol 28 (5) ◽  
pp. 126-133 ◽  
Author(s):  
Xiaodong He ◽  
Li Deng

2017 ◽  
Vol 11 (4) ◽  
pp. 55
Author(s):  
Parnyan Bahrami Dashtaki

Speech-to-speech translation is a challenging problem, due to the poor sentence planning typically associated with spontaneous speech as well as errors introduced by automatic speech recognition. Based on a statistically trained speech translation system, this study investigates the methodologies and metrics used to evaluate speech-to-speech translation systems. Translation is performed incrementally, based on partial hypotheses generated by speech recognition. Speech-input translation can be approached as a pattern recognition problem by means of statistical alignment models and stochastic finite-state transducers, and several specific models are presented under this general framework. One feature of such models is that they can be learned automatically from training examples. The speech translation system consists of three modules: automatic speech recognition, machine translation, and text-to-speech synthesis. Many procedures for combining speech recognition and machine translation have been proposed; this research explores the methodologies and metrics used to assess them.
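
As a rough illustration of the cascade described above, the sketch below chains three placeholder modules (ASR, MT, TTS) and translates incrementally from partial recognition hypotheses. All class and function names are hypothetical stand-ins under that assumption, not the system described in the paper.

```python
# Minimal sketch of a cascaded speech-to-speech translation pipeline
# (ASR -> MT -> TTS) with incremental translation of partial hypotheses.
# The three classes are hypothetical stubs, not a real toolkit API.

class SpeechRecognizer:
    def partial_hypotheses(self, audio_chunks):
        """Yield growing partial transcripts as audio arrives (stub)."""
        text = ""
        for chunk in audio_chunks:
            text += chunk          # stand-in for incremental decoding
            yield text

class Translator:
    def translate(self, source_text):
        """Translate a (possibly partial) source hypothesis (stub)."""
        return f"<translation of: {source_text}>"

class Synthesizer:
    def synthesize(self, target_text):
        """Return synthetic speech for the target text (stub)."""
        return f"<audio for: {target_text}>"

def speech_to_speech(audio_chunks):
    asr, mt, tts = SpeechRecognizer(), Translator(), Synthesizer()
    # Translation starts from partial ASR hypotheses, so output can begin
    # before the full utterance has been recognized.
    for hypothesis in asr.partial_hypotheses(audio_chunks):
        yield tts.synthesize(mt.translate(hypothesis))

for audio_out in speech_to_speech(["hello ", "world"]):
    print(audio_out)
```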


2020 ◽  
Vol 10 (20) ◽  
pp. 7263
Author(s):  
Yong-Hyeok Lee ◽  
Dong-Won Jang ◽  
Jae-Bin Kim ◽  
Rae-Hong Park ◽  
Hyung-Min Park

Since the attention mechanism was introduced to neural machine translation, attention has been combined with the long short-term memory (LSTM), or has replaced the LSTM in the transformer model, to overcome the limitations of LSTM-based sequence-to-sequence (seq2seq) approaches. In contrast to neural machine translation, audio–visual speech recognition (AVSR) can improve performance by learning the correlation between the audio and visual modalities. Because the audio signal carries richer information than the lip video, however, it is difficult for AVSR to learn attention with balanced modalities. To raise the role of the visual modality to the level of the audio modality by fully exploiting the input information when learning attention, we propose a dual cross-modality (DCM) attention scheme that uses both an audio context vector computed with a video query and a video context vector computed with an audio query. Furthermore, we introduce a connectionist temporal classification (CTC) loss in combination with our attention-based model to enforce the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets show that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieves at least a 7.3% relative improvement on average in word error rate (WER) over competing transformer-based methods.
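
The following PyTorch sketch illustrates the dual cross-modality idea as read from the abstract: video features act as the query over audio keys/values and vice versa, and the two context streams are fused. The dimensions, module names, and fusion by concatenation are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of dual cross-modality (DCM) attention: an audio context
# vector attended with a video query, and a video context vector attended
# with an audio query.  Sizes and fusion choice are illustrative assumptions.
import torch
import torch.nn as nn

class DualCrossModalityAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # video queries attend over audio keys/values, and vice versa
        self.audio_from_video_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_from_audio_query = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, video_feats):
        # audio context vector using the video query
        audio_ctx, _ = self.audio_from_video_query(video_feats, audio_feats, audio_feats)
        # video context vector using the audio query
        video_ctx, _ = self.video_from_audio_query(audio_feats, video_feats, video_feats)
        # fuse both context streams for a downstream decoder
        return torch.cat([audio_ctx, video_ctx], dim=-1)

dcm = DualCrossModalityAttention()
audio = torch.randn(2, 120, 256)   # (batch, audio frames, feature dim)
video = torch.randn(2, 120, 256)   # (batch, lip-video frames, feature dim)
print(dcm(audio, video).shape)     # torch.Size([2, 120, 512])
```

In the hybrid CTC/attention architecture mentioned in the abstract, such attention-based features would be trained jointly with a CTC objective to encourage monotonic audio–text alignments.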


Author(s):  
Jun Rokui

This paper presents MCE/GPD, which uses generalized probabilistic descent (GPD) and is known as a highly effective discriminative learning method. MCE/GPD is particularly well suited to speech recognition, since it achieves strong recognition performance and can handle variable-length vectors. However, MCE/GPD incurs a heavy computational cost due to its complicated algorithms, which can make it impractical. In this paper, we propose a learning method that speeds up training based on a hierarchical model, and we evaluate its performance using a hierarchical neural network.
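
For readers unfamiliar with MCE/GPD, the sketch below implements the standard minimum-classification-error criterion (a sigmoid-smoothed misclassification measure) with a per-sample generalized probabilistic descent update for a toy linear classifier. It only illustrates the criterion under these simplifying assumptions; the hierarchical speed-up proposed in the paper is not reproduced here.

```python
# Compact sketch of MCE/GPD training for a linear classifier: a smoothed
# 0/1 loss over the misclassification measure d(x), minimized by per-sample
# gradient (generalized probabilistic descent) updates.
import numpy as np

def discriminants(W, x):
    return W @ x                                  # g_j(x) for every class j

def mce_loss_and_grad(W, x, label, eta=2.0, alpha=1.0):
    g = discriminants(W, x)
    others = np.delete(g, label)
    # misclassification measure: correct-class score vs. soft max of the rest
    d = -g[label] + np.log(np.mean(np.exp(eta * others))) / eta
    loss = 1.0 / (1.0 + np.exp(-alpha * d))       # sigmoid-smoothed 0/1 loss
    # chain rule: d(loss)/d(d), then d(d)/d(g_j)
    dl_dd = alpha * loss * (1.0 - loss)
    dg = np.zeros_like(g)
    dg[label] = -1.0
    dg[np.arange(len(g)) != label] = np.exp(eta * others) / np.sum(np.exp(eta * others))
    grad_W = np.outer(dl_dd * dg, x)              # since g = W x
    return loss, grad_W

def gpd_step(W, x, label, lr=0.1):
    # generalized probabilistic descent: one gradient step per training sample
    loss, grad_W = mce_loss_and_grad(W, x, label)
    return W - lr * grad_W, loss

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                       # 3 classes, 5-dim features
x, y = rng.normal(size=5), 1
W, loss = gpd_step(W, x, y)
print(f"MCE loss for this sample: {loss:.3f}")
```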

