A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

Abstract A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model could force extra restrictions on alignments. To explore better the end-to-end models, we propose improvements to the feature extraction and attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attentions and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) that is only 0.2% worse in absolute value than the best referenced method, which is trained on a much larger dataset, and it beats all present end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable to the state-of-the-art end-to-end system in WER.

Download Full-text

Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition

Applied Sciences ◽

10.3390/app9214639 ◽

2019 ◽

Vol 9 (21) ◽

pp. 4639 ◽

Cited By ~ 3

Author(s):

Long Wu ◽

Ta Li ◽

Li Wang ◽

Yonghong Yan

Keyword(s):

Speech Recognition ◽

Proper Time ◽

Window Size ◽

Attention Mechanism ◽

Wall Street ◽

Word Error Rate ◽

Perceptual Evaluation ◽

Left And Right ◽

End To End ◽

Multiparty Interaction

As demonstrated in hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective to solve the misalignment problem existing in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to the hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism with the CTC branch. “Time-restricted” means that the attention mechanism is conducted on a limited window of frames to the left and right. In this study, we first explore time-restricted location-aware attention CTC/Attention, establishing the proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce the time-restricted self-attention CTC/Attention that can better model the long-range dependencies among the frames. Experiments with wall street journal (WSJ), augmented multiparty interaction (AMI), and switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we join a train neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend in the CHIME-4 dataset. The reduction of word error rate (WER) and the increase of perceptual evaluation of speech quality (PESQ) approve the effectiveness of this framework.

Download Full-text

Efficient End-to-End Sentence-Level Lipreading with Temporal Convolutional Networks

Applied Sciences ◽

10.3390/app11156975 ◽

2021 ◽

Vol 11 (15) ◽

pp. 6975

Author(s):

Tao Zhang ◽

Lun He ◽

Xudong Li ◽

Guoqing Feng

Keyword(s):

Performance Improvement ◽

State Of The Art ◽

Error Rates ◽

Convolutional Network ◽

Convolutional Networks ◽

Sentence Level ◽

End To End ◽

High Level ◽

Improved Accuracy ◽

Talking Face

Lipreading aims to recognize sentences being spoken by a talking face. In recent years, the lipreading method has achieved a high level of accuracy on large datasets and made breakthrough progress. However, lipreading is still far from being solved, and existing methods tend to have high error rates on the wild data and have the defects of disappearing training gradient and slow convergence. To overcome these problems, we proposed an efficient end-to-end sentence-level lipreading model, using an encoder based on a 3D convolutional network, ResNet50, Temporal Convolutional Network (TCN), and a CTC objective function as the decoder. More importantly, the proposed architecture incorporates TCN as a feature learner to decode feature. It can partly eliminate the defects of RNN (LSTM, GRU) gradient disappearance and insufficient performance, and this yields notable performance improvement as well as faster convergence. Experiments show that the training and convergence speed are 50% faster than the state-of-the-art method, and improved accuracy by 2.4% on the GRID dataset.

Download Full-text

Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Sensors ◽

10.3390/s21093063 ◽

2021 ◽

Vol 21 (9) ◽

pp. 3063

Author(s):

Aleksandr Laptev ◽

Andrei Andrusenko ◽

Ivan Podluzhny ◽

Anton Mitrofanov ◽

Ivan Medennikov ◽

...

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Rapid Development ◽

Computational Cost ◽

Vocabulary Size ◽

Word Error Rate ◽

Low Resource ◽

Steady Improvement ◽

End To End ◽

Asr System

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.

Download Full-text

Multi-level region-of-interest CNNs for end to end speech recognition

Journal of Ambient Intelligence and Humanized Computing ◽

10.1007/s12652-018-1146-z ◽

2018 ◽

Vol 10 (11) ◽

pp. 4615-4624 ◽

Cited By ~ 6

Author(s):

Shubhanshi Singhal ◽

Vishal Passricha ◽

Pooja Sharma ◽

Rajesh Kumar Aggarwal

Keyword(s):

Speech Recognition ◽

Region Of Interest ◽

Multi Level ◽

End To End

Download Full-text

Unsupervised Beamforming Based on Multichannel Nonnegative Matrix Factorization for Noisy Speech Recognition

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2018.8462642 ◽

2018 ◽

Cited By ~ 4

Author(s):

Kazuki Shimada ◽

Yoshiaki Bando ◽

Masato Mimura ◽

Katsutoshi Itoyama ◽

Kazuyoshi Yoshii ◽

...

Keyword(s):

Speech Recognition ◽

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

Nonnegative Matrix ◽

Noisy Speech ◽

Noisy Speech Recognition

Download Full-text

Adaptive Structure Concept Factorization for Multiview Clustering

Neural Computation ◽

10.1162/neco_a_01055 ◽

2018 ◽

Vol 30 (4) ◽

pp. 1080-1103 ◽

Cited By ~ 10

Author(s):

Kun Zhan ◽

Jinhui Shi ◽

Jing Wang ◽

Haibo Wang ◽

Yuange Xie

Keyword(s):

Nonnegative Matrix Factorization ◽

State Of The Art ◽

Nonnegative Matrix ◽

Adaptive Method ◽

Data Sets ◽

Clustering Methods ◽

Normalized Mutual Information ◽

Adaptive Structure ◽

Concept Factorization ◽

Multiview Clustering

Most existing multiview clustering methods require that graph matrices in different views are computed beforehand and that each graph is obtained independently. However, this requirement ignores the correlation between multiple views. In this letter, we tackle the problem of multiview clustering by jointly optimizing the graph matrix to make full use of the data correlation between views. With the interview correlation, a concept factorization–based multiview clustering method is developed for data integration, and the adaptive method correlates the affinity weights of all views. This method differs from nonnegative matrix factorization–based clustering methods in that it can be applicable to data sets containing negative values. Experiments are conducted to demonstrate the effectiveness of the proposed method in comparison with state-of-the-art approaches in terms of accuracy, normalized mutual information, and purity.

Download Full-text

Towards High-Level Intrinsic Exploration in Reinforcement Learning

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/733 ◽

2020 ◽

Author(s):

Nicolas Bougie ◽

Ryutaro Ichise

Keyword(s):

Reinforcement Learning ◽

Time Horizon ◽

State Of The Art ◽

Experimental Results ◽

Prior Work ◽

Extrinsic Rewards ◽

Intrinsic Reward ◽

Long Time ◽

End To End ◽

High Level

Deep reinforcement learning (DRL) methods traditionally struggle with tasks where environment rewards are sparse or delayed, which entails that exploration remains one of the key challenges of DRL. Instead of solely relying on extrinsic rewards, many state-of-the-art methods use intrinsic curiosity as exploration signal. While they hold promise of better local exploration, discovering global exploration strategies is beyond the reach of current methods. We propose a novel end-to-end intrinsic reward formulation that introduces high-level exploration in reinforcement learning. Our curiosity signal is driven by a fast reward that deals with local exploration and a slow reward that incentivizes long-time horizon exploration strategies. We formulate curiosity as the error in an agent’s ability to reconstruct the observations given their contexts. Experimental results show that this high-level exploration enables our agents to outperform prior work in several Atari games.

Download Full-text

Y-Net: Dual-branch Joint Network for Semantic Segmentation

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3460940 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-22

Author(s):

Yizhen Chen ◽

Haifeng Hu

Keyword(s):

Feature Vector ◽

State Of The Art ◽

Computational Cost ◽

Receptive Fields ◽

Semantic Segmentation ◽

Global Context ◽

Multi Level ◽

The One ◽

Public Datasets ◽

High Level

Most existing segmentation networks are built upon a “ U -shaped” encoder–decoder structure, where the multi-level features extracted by the encoder are gradually aggregated by the decoder. Although this structure has been proven to be effective in improving segmentation performance, there are two main drawbacks. On the one hand, the introduction of low-level features brings a significant increase in calculations without an obvious performance gain. On the other hand, general strategies of feature aggregation such as addition and concatenation fuse features without considering the usefulness of each feature vector, which mixes the useful information with massive noises. In this article, we abandon the traditional “ U -shaped” architecture and propose Y-Net, a dual-branch joint network for accurate semantic segmentation. Specifically, it only aggregates the high-level features with low-resolution and utilizes the global context guidance generated by the first branch to refine the second branch. The dual branches are effectively connected through a Semantic Enhancing Module, which can be regarded as the combination of spatial attention and channel attention. We also design a novel Channel-Selective Decoder (CSD) to adaptively integrate features from different receptive fields by assigning specific channelwise weights, where the weights are input-dependent. Our Y-Net is capable of breaking through the limit of singe-branch network and attaining higher performance with less computational cost than “ U -shaped” structure. The proposed CSD can better integrate useful information and suppress interference noises. Comprehensive experiments are carried out on three public datasets to evaluate the effectiveness of our method. Eventually, our Y-Net achieves state-of-the-art performance on PASCAL VOC 2012, PASCAL Person-Part, and ADE20K dataset without pre-training on extra datasets.

Download Full-text

Parameterized Nonlinear Least Squares for Unsupervised Nonlinear Spectral Unmixing

Remote Sensing ◽

10.3390/rs11020148 ◽

2019 ◽

Vol 11 (2) ◽

pp. 148 ◽

Cited By ~ 3

Author(s):

Risheng Huang ◽

Xiaorun Li ◽

Haiqiang Lu ◽

Jing Li ◽

Liaoying Zhao

Keyword(s):

Least Squares ◽

Optimization Problem ◽

Nonnegative Matrix Factorization ◽

Optimization Problems ◽

State Of The Art ◽

Synthetic Data ◽

Nonnegative Matrix ◽

Nonlinear Least Squares ◽

Hyperspectral Data ◽

Spectral Unmixing

This paper presents a new parameterized nonlinear least squares (PNLS) algorithm for unsupervised nonlinear spectral unmixing (UNSU). The PNLS-based algorithms transform the original optimization problem with respect to the endmembers, abundances, and nonlinearity coefficients estimation into separate alternate parameterized nonlinear least squares problems. Owing to the Sigmoid parameterization, the PNLS-based algorithms are able to thoroughly relax the additional nonnegative constraint and the nonnegative constraint in the original optimization problems, which facilitates finding a solution to the optimization problems . Subsequently, we propose to solve the PNLS problems based on the Gauss–Newton method. Compared to the existing nonnegative matrix factorization (NMF)-based algorithms for UNSU, the well-designed PNLS-based algorithms have faster convergence speed and better unmixing accuracy. To verify the performance of the proposed algorithms, the PNLS-based algorithms and other state-of-the-art algorithms are applied to synthetic data generated by the Fan model and the generalized bilinear model (GBM), as well as real hyperspectral data. The results demonstrate the superiority of the PNLS-based algorithms.

Download Full-text

Graph Regularized Nonnegative Matrix Factorization with Sparse Coding

Mathematical Problems in Engineering ◽

10.1155/2015/239589 ◽

2015 ◽

Vol 2015 ◽

pp. 1-11 ◽

Cited By ~ 5

Author(s):

Chuang Lin ◽

Meng Pang

Keyword(s):

Sparse Coding ◽

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

Geometrical Structure ◽

State Of The Art ◽

Nonnegative Matrix ◽

Target Function ◽

Original Data ◽

Data Space ◽

Detailed Derivation

In this paper, we propose a sparseness constraint NMF method, named graph regularized matrix factorization with sparse coding (GRNMF_SC). By combining manifold learning and sparse coding techniques together, GRNMF_SC can efficiently extract the basic vectors from the data space, which preserves the intrinsic manifold structure and also the local features of original data. The target function of our method is easy to propose, while the solving procedures are really nontrivial; in the paper we gave the detailed derivation of solving the target function and also a strict proof of its convergence, which is a key contribution of the paper. Compared with sparseness constrained NMF and GNMF algorithms, GRNMF_SC can learn much sparser representation of the data and can also preserve the geometrical structure of the data, which endow it with powerful discriminating ability. Furthermore, the GRNMF_SC is generalized as supervised and unsupervised models to meet different demands. Experimental results demonstrate encouraging results of GRNMF_SC on image recognition and clustering when comparing with the other state-of-the-art NMF methods.

Download Full-text