Segmental Recurrent Neural Networks for End-to-End Speech Recognition

This paper addresses speech recognition with the aim of identifying various modes of speech data. Speaker sounds are the acoustic sounds of speech. Statistical models of speech have been widely used for speech recognition with neural networks. In this paper, we propose and try to justify a new model in which speech coarticulation, the effect of phonetic context on speech sounds, is modeled explicitly within a statistical framework. We study phone recognition with recurrent neural networks and SOUL Neural Networks. A general framework for recurrent neural networks and considerations for network training are discussed in detail. The SOUL NN clusters the large vocabulary, which compresses huge speech data sets. This project also covers different Indian languages uttered by different speakers in different modes such as aggressive, happy, sad, and angry. Many alternative energy measures and training methods are proposed and implemented. A speaker-independent phone recognition rate of 82% with a 25% frame error rate has been achieved on the neural database. Neural speech recognition experiments on the NTIMIT database yield a phone recognition rate of 68%. The research results in this thesis are competitive with the best results reported in the literature.
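The abstract reports a frame error rate without defining it. As a minimal sketch, assuming the usual definition (the fraction of frames whose predicted label differs from the reference label), it could be computed as follows. The function name `frame_error_rate` and the toy label sequences are hypothetical and do not come from the paper.

```python
from typing import Sequence


def frame_error_rate(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of frames whose predicted label differs from the reference.

    Assumes both sequences are aligned frame by frame and have equal length.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same number of frames")
    if not reference:
        raise ValueError("at least one frame is required")
    errors = sum(r != p for r, p in zip(reference, predicted))
    return errors / len(reference)


# Hypothetical frame-level phone labels for a short utterance (illustration only).
reference = ["sil", "k", "k", "ae", "ae", "ae", "t", "sil"]
predicted = ["sil", "k", "ae", "ae", "ae", "ae", "t", "t"]

# Frames 2 and 7 (zero-based) differ, so 2 of 8 frames are wrong.
fer = frame_error_rate(reference, predicted)
print(f"frame error rate: {fer:.2%}")
```

Note that a frame error rate is measured per frame, while a phone recognition rate is measured per phone segment, so the two figures in the abstract are not expected to sum to 100%.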

Advanced Recurrent Neural Networks for Automatic Speech Recognition

New Era for Robust Speech Recognition ◽ 10.1007/978-3-319-64680-0_11 ◽ 2017 ◽ pp. 261-279

Author(s): Yu Zhang ◽ Dong Yu ◽ Guoguo Chen

Keyword(s): Neural Networks ◽ Speech Recognition ◽ Automatic Speech Recognition ◽ Recurrent Neural Networks

Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling

10.21437/interspeech.2014-151 ◽ 2014

Author(s): Jürgen T. Geiger ◽ Zixing Zhang ◽ Felix Weninger ◽ Björn Schuller ◽ Gerhard Rigoll

Keyword(s): Neural Networks ◽ Speech Recognition ◽ Recurrent Neural Networks ◽ Short Term Memory ◽ Robust Speech Recognition ◽ Short Term ◽ Term Memory ◽ Acoustic Modelling ◽ Long Short Term Memory

Convolutional Nonlinear Differential Recurrent Neural Networks for Crowd Scene Understanding

International Journal of Semantic Computing ◽ 10.1142/s1793351x18400196 ◽ 2018 ◽ Vol 12 (04) ◽ pp. 481-500 ◽ Cited By ~ 1

Author(s): Naifan Zhuang ◽ The Duc Kieu ◽ Jun Ye ◽ Kien A. Hua

Keyword(s): Neural Networks ◽ Recurrent Neural Networks ◽ Short Term Memory ◽ Scene Understanding ◽ Image Data ◽ High Density ◽ Temporal Information ◽ Deep Model ◽ End To End ◽ The Individual

With the growth of crowd phenomena in the real world, crowd scene understanding is becoming an important task in anomaly detection and public security. Visual ambiguities and occlusions, high density, low mobility, and scene semantics, however, make this problem a great challenge. In this paper, we propose an end-to-end deep architecture, convolutional nonlinear differential recurrent neural networks (CNDRNNs), for crowd scene understanding. CNDRNNs consist of GoogleNet Inception V3 convolutional neural networks (CNNs) and nonlinear differential recurrent neural networks (RNNs). Different from traditional non-end-to-end solutions which separate the steps of feature extraction and parameter learning, CNDRNN utilizes a unified deep model to optimize the parameters of CNN and RNN hand in hand. It thus has the potential of generating a more harmonious model. The proposed architecture takes sequential raw image data as input, and does not rely on tracklet or trajectory detection. It thus has clear advantages over the traditional flow-based and trajectory-based methods, especially in challenging crowd scenarios of high density and low mobility. Taking advantage of CNN and RNN, CNDRNN can effectively analyze the crowd semantics. Specifically, CNN is good at modeling the semantic crowd scene information. On the other hand, nonlinear differential RNN models the motion information. The individual and increasing orders of derivative of states (DoS) in differential RNN can progressively build up the ability of the long short-term memory (LSTM) gates to detect different levels of salient dynamical patterns in deeper stacked layers modeling higher orders of DoS. 
Lastly, existing LSTM-based crowd scene solutions explore deep temporal information and are claimed to be “deep in time.” Our proposed method CNDRNN, however, models the spatial and temporal information in a unified architecture and achieves “deep in space and time.” Extensive performance studies on the Violent-Flows, CUHK Crowd, and NUS-HGA datasets show that the proposed technique significantly outperforms state-of-the-art methods.
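The abstract refers to increasing orders of derivative of states (DoS) without showing how such a quantity is formed. As a rough sketch, assuming the derivative is approximated in discrete time by backward differences of consecutive states, the first and second orders could be computed as below. This is an illustration of that assumption only, not the authors' implementation; the function name `derivative_of_states` and the toy state values are hypothetical.

```python
from typing import List

Vector = List[float]


def derivative_of_states(states: List[Vector], order: int = 1) -> List[Vector]:
    """Approximate the order-n derivative of a state sequence.

    Each pass replaces the sequence with its backward differences,
    d_t = s_t - s_{t-1}, so every pass shortens the sequence by one step.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    current = [list(s) for s in states]
    for _ in range(order):
        current = [
            [later_v - earlier_v for earlier_v, later_v in zip(earlier, later)]
            for earlier, later in zip(current, current[1:])
        ]
    return current


# Hypothetical 2-dimensional internal states over four time steps.
states = [[0.0, 1.0], [1.0, 1.0], [3.0, 2.0], [6.0, 4.0]]

first = derivative_of_states(states, order=1)   # change between consecutive states
second = derivative_of_states(states, order=2)  # change of that change
print(first)
print(second)
```

In this reading, the first-order differences capture how fast the state is changing, and the second-order differences capture how that rate itself changes, which is one way to interpret "different levels of salient dynamical patterns" in stacked layers.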
