Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech

2021 ◽  
Vol 11 (6) ◽  
pp. 2642
Author(s):  
Yun Kyung Lee ◽  
Jeon Gue Park

This paper addresses automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes the speech uttered by the L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker’s spoken English. Stress and rhythm scores are among the important factors used to evaluate fluency in spoken English and are computed by comparing the stress patterns and the rhythm distributions to those of native speakers. In order to compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker’s English sentence differs from that of the native speaker, we align the phonemic sequences based on a dynamic time-warping approach. We also improve the performance of the speech recognition system for non-native speakers and compute fluency features more accurately by augmenting the non-native training dataset and training an acoustic model with the augmented dataset. In this work, we augment the non-native speech by converting some speech signal characteristics (style) while preserving its linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. Also, in the proficiency evaluation and speech recognition tests, the proposed method improves the proficiency score performance and speech recognition accuracy in all proficiency areas compared to a method employing conventional acoustic models.
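The phoneme-sequence alignment described above can be sketched with a minimal dynamic time-warping implementation. This is a generic illustration of the technique, not the paper's code; the unit substitution cost and the example phoneme labels are illustrative assumptions.

```python
# Minimal DTW alignment of two phoneme sequences (generic sketch).
# Cost model (0 for a match, 1 otherwise) is an illustrative assumption.

def dtw_align(ref, hyp):
    """Return the DTW distance and an alignment path between two sequences."""
    n, m = len(ref), len(hyp)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0.0 if ref[i - 1] == hyp[j - 1] else 1.0
            cost[i][j] = d + min(cost[i - 1][j],      # deletion
                                 cost[i][j - 1],      # insertion
                                 cost[i - 1][j - 1])  # match / substitution
    # Backtrack to recover the aligned index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if step == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return cost[n][m], path[::-1]

# Hypothetical native vs. L2 phoneme sequences ("dh" vs. "d" differ):
dist, path = dtw_align(["dh", "ax", "k", "ae", "t"], ["d", "ax", "k", "ae", "t"])
print(dist)  # 1.0 (one substitution)
```

Once the two sequences are aligned, per-phoneme stress and duration measures can be compared position by position despite recognition differences.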

Speech recognition is widely used in computer science to enable well-organized communication between humans and computers. This paper addresses the problem of speech recognition for Varhadi, a regional language of the state of Maharashtra in India. Varhadi is widely spoken in Maharashtra, especially in the Vidarbha region. The Viterbi algorithm is used with hidden Markov models (HMMs) to recognize unknown words. The dataset developed to train the system consists of 83 isolated Varhadi words. Mel-frequency cepstral coefficients (MFCCs) are used for feature extraction in the acoustic analysis of the speech signal. A word model is implemented in speaker-independent mode for the proposed Varhadi automatic speech recognition system (V-ASR). The training and test datasets consist of isolated words uttered by 8 native speakers of the Varhadi language. The V-ASR system recognized the Varhadi words satisfactorily, with 92.77% recognition performance.
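The Viterbi decoding step used above can be illustrated on a toy HMM. This is a generic sketch of the algorithm, not the V-ASR implementation; the two-state model and all probability values are made-up assumptions.

```python
import math

# Toy Viterbi decoder for an HMM (generic sketch; model values are invented).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for an observation sequence."""
    # Log-probabilities avoid underflow on long observation sequences.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] + math.log(trans_p[p][s]) + math.log(emit_p[s][obs[t]]), p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ("s1", "s2")
start = {"s1": 0.8, "s2": 0.2}
trans = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.3, "s2": 0.7}}
emit = {"s1": {"a": 0.7, "b": 0.3}, "s2": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "a", "b"], states, start, trans, emit))  # ['s1', 's1', 's2']
```

In an isolated-word recognizer, one such HMM is trained per word and the word whose model yields the highest decoding score is returned.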


Author(s):  
Na Wang ◽  
Xiaohong Zhang ◽  
Ashutosh Sharma

The computer-assisted speech recognition system, enabling voice recognition that understands spoken words through sound digitization, is extensively used in education, scientific research, industry, etc. This article presents the technological perspective of an automated speech recognition system in order to realize a spoken English speech recognition system based on MATLAB. A speech recognition technology has been designed and implemented in this work that can collect the speech signals of the spoken English learning system and then filter those speech signals. This paper mainly adopts a preprocessing module for the processing of the raw speech data collected, utilizing MATLAB commands. The method of feature extraction is based on an HMM, codebook generation, and template training. The research results show that a recognition accuracy of 98% is achieved by the spoken English speech recognition system studied in this paper. It can be seen that the spoken English speech recognition system based on MATLAB has high recognition accuracy and fast speed. This work addresses current research issues that need to be tackled in the speech recognition field. The approach is able to provide technical support and an interface for the spoken English learning system.
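A typical first stage of the preprocessing module mentioned above is a pre-emphasis filter, y[n] = x[n] − a·x[n−1], which boosts high frequencies before feature extraction. The sketch below shows the general idea in Python rather than MATLAB; the coefficient 0.97 is a conventional choice, not a value from the paper.

```python
# Pre-emphasis: a common first filtering step in speech preprocessing
# (generic sketch; coefficient 0.97 is a conventional default, assumed here).

def pre_emphasis(signal, coeff=0.97):
    """Apply a first-order high-pass (pre-emphasis) filter to a sample list."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

# A constant (DC) signal is almost entirely suppressed after the first sample:
x = [1.0, 1.0, 1.0, 1.0]
print(pre_emphasis(x))
```

After filtering, the signal would normally be split into short overlapping frames and windowed before cepstral features are computed.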


Author(s):  
Keshav Sinha ◽  
Rasha Subhi Hameed ◽  
Partha Paul ◽  
Karan Pratap Singh

In recent years, advances in voice-based authentication have led to numerous forensic voice authentication technologies. For verification, the speech reference model is collected from various open-source clusters. In this chapter, the primary focus is on an automatic speech recognition (ASR) technique that stores, retrieves, and processes the data in a scalable manner. There are various conventional techniques for speech recognition, such as BWT, SVD, and MFCC, but for automatic speech recognition the efficiency of these conventional techniques degrades. To overcome this problem, the authors propose a speech recognition system using E-SVD, D3-MFCC, and dynamic time warping (DTW). Using D3-MFCC, the important qualities of the speech signal are captured while unimportant and distracting features are discarded.


1970 ◽  
Vol 110 (4) ◽  
pp. 113-116 ◽  
Author(s):  
R. Lileikyte ◽  
L. Telksnys

Selecting the best feature set is the key to a successful speech recognition system, and a quality measure is needed to characterize the chosen feature set. A variety of feature quality metrics have been proposed by other authors; however, no guidance is given on choosing the appropriate metric, and no investigations of such metrics for speech features have been made. In this paper, a methodology for the quality estimation of speech features is presented. Metrics are chosen on the basis of their correlation with classification results. Linear Frequency Cepstrum (LFCC), Mel Frequency Cepstrum (MFCC), and Perceptual Linear Prediction (PLP) analyses were selected for the experiment. The most suitable metric was chosen in combination with a Dynamic Time Warping (DTW) classifier. Experimental investigation results are presented. Ill. 5, bibl. 18, tabl. 3 (in English; abstracts in English and Lithuanian). http://dx.doi.org/10.5755/j01.eee.110.4.302
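The selection criterion above — rank candidate quality metrics by how well they correlate with classification results — can be sketched with a Pearson correlation. The metric scores and classifier accuracies below are made-up illustration data, not values from the paper.

```python
# Pearson correlation between a feature-quality metric and classification
# accuracy (generic sketch; the data points below are invented).

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: each entry is one candidate feature set (e.g. LFCC, MFCC, PLP).
metric_scores = [0.61, 0.74, 0.70, 0.55, 0.80]   # quality-metric values
accuracies    = [0.82, 0.91, 0.88, 0.79, 0.93]   # DTW classifier accuracy
print(round(pearson(metric_scores, accuracies), 3))
```

A metric whose scores track classifier accuracy this closely would be preferred; one with near-zero correlation tells us nothing about which feature set to choose.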


2011 ◽  
Vol 33 (2) ◽  
pp. 419-456 ◽  
Author(s):  
JEFFREY WITZEL ◽  
NAOKO WITZEL ◽  
JANET NICOL

This study examines the reading patterns of native speakers (NSs) and high-level (Chinese) nonnative speakers (NNSs) on three English sentence types involving temporarily ambiguous structural configurations. The reading patterns on each sentence type indicate that both NSs and NNSs were biased toward specific structural interpretations. These results are interpreted as evidence that both first-language and second-language (L2) sentence comprehension is guided (at least in part) by structure-based parsing strategies and, thus, as counterevidence to the claim that NNSs are largely limited to rudimentary (or “shallow”) syntactic computation during online L2 sentence processing.


2014 ◽  
Vol 543-547 ◽  
pp. 2337-2340 ◽  
Author(s):  
Yi Zhang ◽  
Xiao Song Li ◽  
Yang Song

Isolated-word speech recognition systems have conventionally adopted the shortest Dynamic Time Warping (DTW) distance for the recognition judgment, which has the disadvantages of a high False Accept Rate (FAR) and poor noise robustness. This paper proposes a new method based on a DTW distance threshold estimate for the recognition judgment. The method multiplies the maximum distance between the template speech and the training input speech by an adjusting coefficient, then adds the noise DTW distance, and takes the result as the distance threshold estimate. During recognition, if the distance between the test speech and the template speech exceeds the threshold estimate, the system does not recognize the speech. Experiments show that this method greatly improves the noise robustness of the isolated-word speech recognition system and solves the problem of high FAR.
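The rejection rule described above can be sketched directly: the threshold is the maximum template-to-training DTW distance scaled by an adjusting coefficient, plus the noise DTW distance. The function names, the distance values, and the coefficient 1.2 below are illustrative assumptions, not the paper's parameters.

```python
# Sketch of the DTW distance threshold estimation and rejection rule
# (names, distances, and the coefficient value are illustrative assumptions).

def estimate_threshold(train_distances, noise_distance, adjust_coeff=1.2):
    """Threshold = (max template-to-training DTW distance) * coeff + noise distance."""
    return max(train_distances) * adjust_coeff + noise_distance

def accept(test_distance, threshold):
    """Accept the utterance only if its DTW distance is within the threshold."""
    return test_distance <= threshold

# Hypothetical DTW distances between one word template and its training utterances:
threshold = estimate_threshold([3.1, 2.8, 3.4], noise_distance=0.5)
print(accept(2.9, threshold), accept(6.0, threshold))  # True False
```

The effect is that out-of-vocabulary or noise inputs, which tend to have large DTW distances to every template, are rejected instead of being falsely accepted as the nearest word.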


2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
Ing-Jr Ding ◽  
Yen-Ming Hsu

In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), a feature-based template-matching technique belonging to the category of dynamic programming (DP). Although DTW is an early ASR technique, it has remained popular in many applications and now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model (HMM)-like method in which the concept of the typical HMM statistical model is brought into the design of DTW. By transforming feature-based DTW recognition into model-based DTW recognition, the developed HMM-like DTW method can behave like the HMM recognition technique and therefore gains the capability to further perform model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system using HMM-like DTW.

