ConWST: Non-native Multi-source Knowledge Distillation for Low Resource Speech Translation

2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Wenbo Zhu ◽  
Hao Jin ◽  
WeiChang Yeh ◽  
Jianwen Chen ◽  
Lufeng Luo ◽  
...  

Speech translation (ST) is a bimodal conversion task from source speech to target text. Deep learning-based ST systems generally require sufficient training data to achieve competitive results, even with a state-of-the-art model. In low-resource settings, however, the available training data rarely satisfies this completeness condition because of small sample sizes. Most low-resource ST approaches improve data integrity with a single model, but such optimization acts along only one dimension and has limited effectiveness. In contrast, multimodality leverages different dimensions of the data features for modeling from multiple perspectives: the modalities compensate for each other's gaps, enriching the representation of the data and improving the utilization of the training samples. Leveraging the enormous amount of multimodal out-of-domain information to improve low-resource tasks is therefore a new challenge. This paper describes how to use multimodal out-of-domain information to improve low-resource models. First, we propose a low-resource ST framework that reconstructs large-scale unlabeled audio through self-supervised learning. We also introduce a machine translation (MT) pretraining model to complement the text embeddings and to fine-tune decoding. In addition, we analyze layer similarity on the decoder side: we reduce invalid multimodal pseudolabels by applying random depth pruning to the similarity layers, minimizing error propagation, and we apply an additional CTC loss to the non-similarity layers to optimize the ensemble loss. Finally, we study the weighting ratio of the fusion technique in the multimodal decoder. Our experimental results show that the proposed method is promising for low-resource ST, with improvements of up to +3.6 BLEU points over baseline low-resource ST models.
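To make the ensemble objective concrete, below is a minimal PyTorch sketch, not the authors' implementation: it assumes a primary cross-entropy term on the decoder output, an auxiliary CTC term computed on a non-similarity encoder layer, and a hypothetical interpolation weight `lam`. The function name `ensemble_loss` and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch of an ensemble loss mixing decoder cross-entropy with an
# auxiliary CTC loss, as described in the abstract. Names and shapes are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def ensemble_loss(decoder_logits,   # (batch, tgt_len, vocab) raw decoder scores
                  target_ids,       # (batch, tgt_len) target token ids
                  ctc_log_probs,    # (src_len, batch, vocab) log-softmax output
                                    # of a non-similarity encoder layer
                  src_lengths,      # (batch,) valid encoder frame counts
                  tgt_lengths,      # (batch,) valid target token counts
                  lam=0.3,          # assumed weighting ratio for the CTC term
                  pad_id=0,
                  blank_id=0):      # targets must not contain the blank index
    # Primary translation objective: token-level cross-entropy, ignoring padding.
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), target_ids,
                         ignore_index=pad_id)
    # Auxiliary CTC objective on the intermediate layer's frame-level outputs.
    ctc = F.ctc_loss(ctc_log_probs, target_ids, src_lengths, tgt_lengths,
                     blank=blank_id, zero_infinity=True)
    # Interpolate the two objectives with the weighting ratio.
    return (1.0 - lam) * ce + lam * ctc
```

In this reading, sweeping `lam` controls how strongly the auxiliary CTC term regularizes the non-similarity layers relative to the primary translation loss; the fusion weighting studied in the multimodal decoder plays an analogous balancing role.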

