Optimal generative and discriminative acoustic model training for speech recognition

2021 ◽  
Author(s):  
Neil Joshi

The focus of this dissertation is to derive and demonstrate effective stochastic models for the speech recognition problem. Acoustic modeling for speech recognition typically involves representing the speech process within stochastic models, and modeling this high-frequency time series effectively is a fundamental problem. The dissertation devises an objective function that relates the true speech distribution to its estimate, and shows that by optimizing this function the speech process time series can be modeled without loss of information. Two models are developed to optimize this objective function. The first is an acoustic model formulated for the speech-in-noise problem: a combination of recognizers that, through a simple system fusion, combines multiple speech processes at the decision level. It is a stochastic modeling method that combines a parameterized spectral process based on missing data (MD) theory with a cepstral-based speech process, using a coupled hidden variable topology. Using a fused coupled hidden Markov model (HMM) topology, an optimal acoustic model is proposed that is inherently more robust than single-process models under noisy conditions. The model is tested under both stationary and non-stationary noise conditions, where the fused model achieves higher recognition accuracy than single-process models. The second is a discriminatively trained model consisting of optimal discriminant ML estimators. It is formulated with a methodology that segments the acoustic space appropriately for discriminatively trained models that optimize the devised objective function. This acoustic space is modeled with discriminant ML estimators whose optimal decision boundaries are formed using the large-margin support vector machine (SVM) learning method. These discriminatively trained models maximize the entropy of the observation space and are thereby capable of modeling the speech process without loss. This is demonstrated experimentally with frame-level classification error rates of roughly 3% or less.
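As a rough illustration of what combining recognizers "at the decision level" can mean, the sketch below fuses per-class log-likelihoods from two streams with a weighted sum and picks the best-scoring class. It is not the coupled HMM described in the abstract; the function name `fuse_decisions`, the stream weight and the toy scores are all assumptions made for this example.

```python
import math


def fuse_decisions(spectral_logp, cepstral_logp, weight=0.5):
    """Fuse per-class log-likelihoods from two recognizers and pick a class.

    `weight` is a hypothetical stream weight in [0, 1] (not from the thesis):
    1.0 trusts only the spectral stream, 0.0 only the cepstral stream.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    fused = {}
    # Only classes scored by both recognizers can be fused.
    for label in spectral_logp.keys() & cepstral_logp.keys():
        fused[label] = (weight * spectral_logp[label]
                        + (1.0 - weight) * cepstral_logp[label])
    best = max(fused, key=fused.get)
    return best, fused


# Toy scores (illustrative only): the two streams disagree on the class.
spectral = {"yes": math.log(0.6), "no": math.log(0.4)}
cepstral = {"yes": math.log(0.2), "no": math.log(0.8)}

label, scores = fuse_decisions(spectral, cepstral, weight=0.5)
print(label, scores)
```

With equal weights the more confident cepstral stream dominates the decision; moving the weight toward 1.0 hands the decision back to the spectral stream.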


2021 ◽  
Author(s):  
Yashodhan Rajiv Athavale

The objective of this study is to assess the performance and capability of a kernel-based machine learning method for time-series signal classification. Through successive stages of dimension transformation, training, testing and cross-validation, we perform binary classification on the time-series signals from each category. The study covers two domains: financial and biomedical. The financial study involves identifying the possibility of collapse or survival of a company trading in the stock market. To assess the fate of each company, we collect its real stock market data, a financial time series composed of weekly closing stock prices over a common time interval. This study has been applied to various economic sectors such as Pharmaceuticals and Biotechnology, Automobiles, Oil & Gas, and Water Supply. The data were collected using Thomson’s Datastream software. The biomedical study deals with knee signals collected using the vibration arthrometry technique. It uses the severity of cartilage degeneration to assess the possibility of a subject being affected by osteoarthritis or undergoing knee replacement surgery at a later stage. This non-invasive diagnostic method may also prove to be an alternative to various invasive procedures used for detecting osteoarthritis. For this analysis we used the vibroarthro-signals from about 38 abnormal and 51 normal knee joint case studies. In both studies we apply Fisher kernels combined with a Gaussian mixture model (GMM) for dimension transformation into a feature space, rendered as a three-dimensional plot for visualization. The transformed data are then used to train and test support vector machines for binary classification. From our experiments we observe that the method fits both studies well, with classification error rates between 10% and 15%.
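The dimension transformation described above can be sketched as follows: a variable-length signal is mapped to a fixed-length Fisher score, the gradient of a GMM log-likelihood with respect to the model parameters. This is a minimal sketch under stated assumptions, not the authors' pipeline: it uses a one-dimensional GMM with three components and differentiates only with respect to the means (which happens to give a three-dimensional vector), it omits the Fisher information normalization of the full Fisher kernel, and the GMM parameters and signal values are made up. In practice the GMM would be fitted to training data and the resulting vectors passed to an SVM.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(signal, weights, means, variances):
    """Log-likelihood of a signal under a 1-D GMM (samples treated as i.i.d.)."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in signal
    )


def fisher_score_means(signal, weights, means, variances):
    """Gradient of the GMM log-likelihood with respect to each component mean."""
    score = [0.0] * len(means)
    for x in signal:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        for k, (m, v) in enumerate(zip(means, variances)):
            responsibility = joint[k] / total  # posterior of component k given x
            score[k] += responsibility * (x - m) / v
    return score


# Hypothetical GMM parameters and a short toy signal (illustrative only).
weights = [0.3, 0.4, 0.3]
means = [-1.0, 0.0, 1.0]
variances = [1.0, 1.0, 1.0]
signal = [0.5, -0.2, 1.3, 0.1]

feature = fisher_score_means(signal, weights, means, variances)
print(feature)  # one fixed-length vector per signal, whatever its length
```

Because every signal yields a vector of the same length regardless of how many samples it contains, signals of different durations become directly comparable by a standard classifier.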


2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Yashodhan Athavale ◽  
Sridhar Krishnan ◽  
Aziz Guergachi

The intention of this study is to gauge the performance of Fisher kernels for dimension simplification and classification of time-series signals. Our research indicates that Fisher kernels substantially improve signal classification by enabling clearer pattern visualization in three-dimensional space. In this paper, we demonstrate the performance of Fisher kernels in two domains: financial and biomedical. The financial study involves identifying the possibility of collapse or survival of a company trading in the stock market. To assess the fate of each company, we collected financial time series composed of weekly closing stock prices over a common time frame, using Thomson Datastream software. The biomedical study involves knee signals collected using the vibration arthrometry technique, and uses the severity of cartilage degeneration to classify normal and abnormal knee joints. In both studies, we apply Fisher kernels combined with a Gaussian mixture model (GMM) for dimension transformation into a feature space, which is rendered as a three-dimensional plot for visualization and used for further classification with support vector machines. From our experiments we observe that Fisher kernels fit both kinds of signals well, with low classification error rates.


Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 634
Author(s):  
Alakbar Valizada ◽  
Natavan Akhundova ◽  
Samir Rustamov

In this paper, various acoustic and language modeling methodologies, as well as labeling methods, for automatic speech recognition of spoken dialogues in emergency call centers were investigated and comparatively analyzed. Because dialogue speech in call centers has a specific context and occurs in noisy, emotional environments, available speech recognition systems perform poorly on it. Therefore, in order to recognize dialogue speech accurately, the main modules of speech recognition systems (language models and acoustic training methodologies), as well as symmetric data labeling approaches, were investigated and analyzed. To find an effective acoustic model for dialogue data, different types of Gaussian Mixture Model/Hidden Markov Model (GMM/HMM) and Deep Neural Network/Hidden Markov Model (DNN/HMM) methodologies were trained and compared. Additionally, effective language models for dialogue systems were identified based on extrinsic and intrinsic evaluation methods. Lastly, our suggested data labeling approaches with spelling correction were compared with common labeling methods and outperformed them by a notable margin. Based on the results of the experiments, we determined that DNN/HMM for the acoustic model, a trigram with Kneser–Ney discounting for the language model, and spelling correction of the training data before training for the labeling method are effective configurations for dialogue speech recognition in emergency call centers. This research was conducted with two different types of datasets collected from emergency calls: the Dialogue dataset (27 h), which encapsulates call agents’ speech, and the Summary dataset (53 h), which contains voiced summaries of those dialogues describing emergency cases. Even though the speech taken from the emergency call center is in the Azerbaijani language, which belongs to the Turkic group of languages, our approaches are not tightly connected to specific language features. Hence, it is anticipated that the suggested approaches can be applied to other languages of the same group.
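To make the Kneser–Ney discounting mentioned above more concrete, here is a minimal sketch of an interpolated Kneser–Ney model. For brevity it is a bigram model rather than the trigram the paper selects, it uses a fixed discount of 0.75 (a common default, assumed here), and the three-sentence corpus is invented for illustration. The function name `train_kn_bigram` is likewise hypothetical.

```python
from collections import Counter, defaultdict


def train_kn_bigram(sentences, discount=0.75):
    """Return prob(word, history) for an interpolated Kneser-Ney bigram model."""
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    history_count = Counter()        # how often each history is followed by a word
    followers = defaultdict(set)     # distinct words seen after a history
    predecessors = defaultdict(set)  # distinct histories seen before a word
    for (history, word), count in bigrams.items():
        history_count[history] += count
        followers[history].add(word)
        predecessors[word].add(history)
    n_bigram_types = len(bigrams)

    def prob(word, history):
        # Continuation probability: in how many distinct contexts does `word` appear?
        p_cont = len(predecessors.get(word, ())) / n_bigram_types
        c_hist = history_count.get(history, 0)
        if c_hist == 0:
            return p_cont  # unseen history: fall back to the lower-order estimate
        discounted = max(bigrams.get((history, word), 0) - discount, 0.0) / c_hist
        # Probability mass freed by discounting, redistributed via p_cont.
        backoff_mass = discount * len(followers[history]) / c_hist
        return discounted + backoff_mass * p_cont

    return prob


# Invented toy corpus (illustrative only).
corpus = ["call the ambulance", "call the police", "send the ambulance"]
prob = train_kn_bigram(corpus)
print(prob("ambulance", "the"), prob("police", "the"))
```

The discount subtracted from every observed bigram count is exactly the mass handed to the continuation distribution, so the probabilities for a given history still sum to one over the words seen in the corpus.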


2007 ◽  
Vol 22 (2) ◽  
pp. 113-126 ◽  
Author(s):  
V. Monbet ◽  
P. Ailliot ◽  
M. Prevosto
