Incorporating a Generative Front-End Layer to Deep Neural Network for Noise Robust Automatic Speech Recognition

Author(s): Souvik Kundu, Khe Chai Sim, Mark J.F. Gales
2018, Vol 1 (3), pp. 28
Author(s): Jeih-weih Hung, Jung-Shan Lin, Po-Jen Wu

In recent decades, researchers have focused on developing noise-robust methods to compensate for noise effects in automatic speech recognition (ASR) systems and enhance their performance. In this paper, we propose a feature-based noise-robust method that employs a novel data analysis technique, robust principal component analysis (RPCA). In the proposed scenario, RPCA is applied to a noise-corrupted speech feature matrix, and the resulting sparse partition is shown to capture speech-dominant characteristics. One clear advantage of using RPCA to enhance noise robustness is that no prior knowledge about the noise is required. The proposed RPCA-based method is evaluated on the Aurora-4 database and task, using a state-of-the-art deep neural network (DNN) architecture for the acoustic models. The evaluation results indicate that the newly proposed method significantly improves recognition accuracy over the original speech features, and that it can be cascaded with mean normalization (MN), mean and variance normalization (MVN), and relative spectral (RASTA) filtering, three well-known and widely used feature robustness algorithms, to achieve better performance than each individual component method.
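To make the core decomposition concrete, the following is a minimal NumPy sketch of robust PCA via principal component pursuit, solved with a standard inexact augmented Lagrangian scheme. This is an illustration of the general technique, not the authors' implementation; the toy feature matrix, iteration limits, and parameter heuristics are assumptions.

```python
import numpy as np

def shrink(X, tau):
    # Element-wise soft thresholding (promotes sparsity in S).
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    # Singular-value thresholding (promotes low rank in L).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, max_iter=500, tol=1e-7):
    """Split M into low-rank L plus sparse S (principal component pursuit)."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))           # standard PCP sparsity weight
    mu = m * n / (4.0 * np.sum(np.abs(M)))   # common step-size heuristic
    Y = np.zeros_like(M)                     # Lagrange multipliers
    S = np.zeros_like(M)
    norm_M = np.linalg.norm(M, 'fro')
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y += mu * residual
        if np.linalg.norm(residual, 'fro') / norm_M < tol:
            break
    return L, S

# Toy stand-in for a noisy feature matrix: rows = feature dimensions,
# columns = frames. A rank-1 "noise floor" plus sparse "speech" bursts.
rng = np.random.default_rng(0)
M = np.outer(rng.standard_normal(24), np.ones(200))
M[:, 50:60] += 3.0 * rng.standard_normal((24, 10))
L, S = rpca(M)
print(np.abs(S[:, 50:60]).mean(), np.abs(S[:, :50]).mean())  # bursts land in S
```

In the paper's setting, the sparse part S (taken as the speech-dominant component) would replace or refine the original features before normalization steps such as MN or MVN.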


Author(s): Mohit Dua, Pawandeep Singh Sethi, Vinam Agrawal, Raghav Chawla

Introduction: An Automatic Speech Recognition (ASR) system recognizes speech utterances and can thus be used to convert speech into text for various purposes. These systems are deployed in different environments, clean or noisy, and are used by people of all ages and types, which presents some of the major difficulties faced in developing an ASR system. An ASR system therefore needs to be efficient while also being accurate and robust. Our main goal is to minimize the error rate during both the training and testing phases while implementing an ASR system. The performance of ASR depends on the combination of feature extraction techniques and back-end techniques. In this paper, a performance comparison of different combinations of feature extraction techniques and back-end techniques is presented using a continuous speech recognition system.

Methods: Hidden Markov Models (HMMs), Subspace Gaussian Mixture Models (SGMMs), and Deep Neural Networks (DNNs) with DNN-HMM architectures, namely Karel's, Dan's, and a hybrid DNN-SGMM architecture, are used at the back end of the implemented system. Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Gammatone Frequency Cepstral Coefficients (GFCC) are used as feature extraction techniques at the front end of the proposed system. The Kaldi toolkit has been used for the implementation of the proposed work. The system is trained on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) speech corpus for the English language.

Results: The experimental results show that MFCC outperforms GFCC and PLP in noiseless conditions, while PLP tends to outperform MFCC and GFCC in noisy conditions. Furthermore, the hybrid of Dan's DNN implementation with SGMM performs best for back-end acoustic modeling. The proposed architecture, with PLP feature extraction at the front end and the hybrid of Dan's DNN implementation with SGMM at the back end, outperforms the other combinations in a noisy environment.

Conclusion: Automatic speech recognition has numerous applications in our lives, such as home automation, personal assistants, and robotics, so it is highly desirable to build an ASR system with good performance. The performance of automatic speech recognition is affected by various factors, including vocabulary size, whether the system is speaker dependent or independent, whether speech is isolated, discontinuous, or continuous, and adverse conditions such as noise. The paper presents an ensemble architecture that uses PLP for feature extraction at the front end and a hybrid of SGMM + Dan's DNN at the back end to build a noise-robust ASR system.

Discussion: The work presented in this paper compares the performance of continuous ASR systems developed using different combinations of front-end feature extraction (MFCC, PLP, and GFCC) and back-end acoustic modeling (mono-phone, tri-phone, SGMM, DNN, and hybrid DNN-SGMM) techniques. Each front-end technique is tested in combination with each back-end technique. Finally, the results of the combinations thus formed are compared to find the best performing combination in noisy and clean conditions.
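As a small illustration of the front-end stage described above, here is a minimal Python sketch of MFCC extraction followed by per-utterance cepstral mean and variance normalization. It uses librosa on a synthetic tone rather than the Kaldi pipeline and TIMIT audio the paper actually used, and the 25 ms window / 10 ms hop framing is a conventional assumption; PLP and GFCC would follow the same pattern with different spectral front ends.

```python
import numpy as np
import librosa

# Synthetic 1-second tone standing in for a TIMIT utterance (16 kHz).
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)

# 13 MFCCs with a 25 ms window and 10 ms hop, as in typical Kaldi recipes.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

# Per-utterance mean and variance normalization (CMVN) over the time axis,
# a standard robustness step applied before acoustic modeling.
mean = mfcc.mean(axis=1, keepdims=True)
std = mfcc.std(axis=1, keepdims=True)
mfcc_cmvn = (mfcc - mean) / (std + 1e-8)
print(mfcc_cmvn.shape)  # (13, number_of_frames)
```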


2019, Vol 20 (11), pp. 686-695
Author(s): Yin Shuai, A. S. Yuschenko

The article discusses a system for dialogue-based control of manipulation robots. The basic methods of automatic speech recognition, speech understanding, dialogue management, and voice response synthesis in dialogue systems are analyzed. Three types of dialogue management are considered: "system initiative", "user initiative", and "combined initiative". A system for object-oriented dialogue control of a robot, based on finite state machine theory and using a deep neural network, is proposed. The main distinction of the proposed system lies in the separate implementation of the dialogue process and the robot's actions, which keeps control close to the pace of natural dialogue. This way of constructing dialogue control allows the system to automatically correct the result of speech recognition and the robot's actions based on the tasks at hand. Correction of the speech recognition result and the robot's actions may be needed because of users' accents, noise in the working environment, or incorrect voice commands. The process of correcting speech recognition results and robot actions consists of three stages, carried out in a special mode and a general mode. The special mode allows users to control the manipulator directly with voice commands. The general mode extends the capabilities of users, allowing them to obtain additional information in real time. At the first stage, continuous speech recognition, i.e., real-time conversion of voice to text, is performed by a deep neural network that accounts for the accents and speaking rates of various users. At the second stage, the speech recognition result is corrected by managing the dialogue based on finite automata theory. At the third stage, the robot's actions are corrected depending on the operating state of the robot and the dialogue management process. To realize a natural dialogue between users and robots, a small database of possible dialogues is created and various training data are used. In the experiments, the dialogue system is used to control a KUKA manipulator (KRC4 controller) to put a desired block in a specified location, implemented in the Python environment using the RoboDK software. The processes and results of experiments confirming the operability of the interactive robot control system are given. A fairly high accuracy (92%) and an automatic speech recognition rate close to the rate of natural speech were obtained.
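The following is a minimal Python sketch of the finite-state dialogue-management idea described above. The states, commands, and actions are illustrative placeholders, not the authors' KUKA/RoboDK implementation, and the speech recognizer is abstracted away as already-recognized text.

```python
# Transition table: (current state, recognized command) -> (next state, action).
TRANSITIONS = {
    ("idle",    "pick block"): ("holding", "close_gripper"),
    ("holding", "place"):      ("idle",    "open_gripper"),
    ("idle",    "status"):     ("idle",    "report_state"),
    ("holding", "status"):     ("holding", "report_state"),
}

class DialogueFSM:
    def __init__(self):
        self.state = "idle"

    def handle(self, command: str) -> str:
        """Map a recognized voice command to a robot action. Commands that
        are invalid in the current state trigger a clarification request
        instead of an action, which is how misrecognitions or incorrect
        commands get corrected through dialogue."""
        key = (self.state, command)
        if key not in TRANSITIONS:
            return f"clarify: '{command}' is not valid in state '{self.state}'"
        self.state, action = TRANSITIONS[key]
        return action

fsm = DialogueFSM()
for utterance in ["pick block", "status", "pick block", "place"]:
    print(f"{utterance!r} -> {fsm.handle(utterance)}")
```

Separating this dialogue automaton from the action-execution layer mirrors the article's design: recognition errors are caught and resolved in the dialogue loop before any robot motion is committed.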

