Enhanced Speech Recognition for Indonesian Geographic Dictionary Using Deep Learning

Speech recognition technology has been developing very fast lately. One of its applications is looking up the meaning of terms in a geographic dictionary: when a subject speaks a word to the system, it outputs the word together with its meaning and explanation. Many methods have been applied to speech recognition. One method that can improve its accuracy is deep learning, specifically the Convolutional Neural Network (CNN). In this research, the speech recognition accuracy of a CNN for an Indonesian geographic dictionary is analyzed to show that a CNN can improve accuracy compared to speech recognition with a Gaussian mixture model and hidden Markov model (GMM-HMM). The CNN analyzes and finds similarity in Mel-frequency cepstral coefficients (MFCC) extracted from the sound waves. The research is performed by building models of the spoken words with a CNN under Python and TensorFlow. The CNN is trained on speech data collected and prepared from 20 students, consisting of 19 men and one woman aged 19 to 23 years. The vocabulary of the database consists of 50 words. The result of this research is a desktop application with the trained models implemented, which recognizes the spoken words from the subjects well. Testing of the trained models was performed to examine the accuracy of the built speech recognition system. In our experiments, the CNN speech recognition method achieves 80% accuracy for isolated words and 72.67% for continuous words from the Indonesian geographic dictionary.
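
The abstract does not give the network architecture, so the following is only a minimal sketch of how an MFCC-input CNN over a 50-word vocabulary might be set up in TensorFlow/Keras; the input shape, filter counts, and layer sizes are assumptions, not the authors' configuration:

```python
# Minimal sketch of an MFCC + CNN keyword classifier in TensorFlow/Keras.
# The 50-word vocabulary follows the abstract; all layer sizes are assumed.
import tensorflow as tf

NUM_WORDS = 50              # vocabulary size from the paper
N_FRAMES, N_MFCC = 98, 13   # assumed: 13 coefficients over ~1 s of audio

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FRAMES, N_MFCC, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_WORDS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(mfcc_train, labels_train, epochs=30, validation_split=0.1)
```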

2019 · Vol 8 (3) · pp. 7827-7831

Kannada is a regional language of India, spoken in Karnataka. This paper presents the development of a continuous Kannada speech recognition system using monophone and triphone modelling in HTK. Mel-Frequency Cepstral Coefficients (MFCC) are used for feature extraction; exploiting the cepstral and perceptual frequency scales leads to good recognition accuracy. The Hidden Markov Model is used as the classifier. Gaussian mixture splitting is performed to capture the variations of the phones. The paper presents the performance of the continuous Kannada Automatic Speech Recognition (ASR) system with 2, 4, 8, 16, and 32 Gaussian mixtures under monophone and context-dependent triphone modelling. The experimental results show that better recognition accuracy is achieved with context-dependent triphone modelling than with monophone modelling as the number of Gaussian mixtures increases.
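
The paper's mixture splitting is done with HTK's tools; as a rough Python analogue (using hmmlearn, not HTK), phone-level GMM-HMMs can be trained with the mixture counts evaluated in the paper. A sketch, assuming stacked MFCC frames and per-utterance lengths as inputs:

```python
# Rough hmmlearn analogue of growing the Gaussian mixtures of a GMM-HMM.
from hmmlearn.hmm import GMMHMM

def train_mixture_series(mfcc_frames, lengths, n_states=3):
    """mfcc_frames: (total_frames, 13) stacked MFCC vectors for one phone
    (assumed input); lengths: number of frames per training utterance."""
    models = {}
    for n_mix in (2, 4, 8, 16, 32):   # mixture counts evaluated in the paper
        m = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20, random_state=0)
        m.fit(mfcc_frames, lengths)
        models[n_mix] = m             # evaluate with m.score(test, test_lengths)
    return models
```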


2011 · Vol 268-270 · pp. 82-87
Author(s): Zhi Peng Zhao, Yi Gang Cen, Xiao Fang Chen

In this paper, we propose a new noisy-speech recognition method based on compressive sensing theory. Through compressive sensing, our method greatly increases the noise robustness of the speech recognition system, which improves recognition accuracy. In our experiments, the proposed method achieved better recognition performance than the traditional isolated-word recognition method based on the DTW algorithm.
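
For context, the DTW baseline referenced above matches the MFCC sequence of a test utterance against stored word templates and picks the closest one. A minimal illustrative sketch, not the paper's implementation:

```python
# Classical DTW isolated-word recognition: the word whose template has the
# smallest warped distance to the test utterance wins. Illustrative only.
import numpy as np

def dtw_distance(a, b):
    """a, b: (frames, n_mfcc) MFCC matrices."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(test_mfcc, templates):
    """templates: dict mapping word -> template MFCC matrix."""
    return min(templates, key=lambda w: dtw_distance(test_mfcc, templates[w]))
```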


Author(s): Eslam E. El Maghraby, Amr M. Gody, M. Hesham Farouk

Background: Multimodal speech recognition has proved to be one of the most promising approaches to robust speech recognition, especially when the audio signal is corrupted by noise. Since the visual speech signal is not affected by acoustic noise, it provides additional information that can enhance recognition accuracy in noisy conditions. A critical stage in designing a robust speech recognition system is choosing a reliable classification method from the large variety of available techniques. Deep learning is well known for its ability to classify nonlinear problems while taking into account the sequential character of the speech signal. Numerous studies have applied deep learning to Audio-Visual Speech Recognition (AVSR) because of its achievements in both speech and image recognition. Although these studies have produced optimistic results, research on enhancing accuracy in noisy conditions and on selecting the best classification technique continues to attract attention.

Objective: This paper aims to build an AVSR system that combines acoustic and visual speech information and uses a deep-learning-based classifier to improve recognition performance in clean and noisy environments.

Method: Mel-frequency cepstral coefficients (MFCC) and the Discrete Cosine Transform (DCT) are used to extract effective features from the audio and visual speech signals, respectively. Because the audio feature rate is higher than the visual feature rate, linear interpolation is applied to obtain feature vectors of equal rate, which are then early-integrated into a combined feature vector. Bidirectional Long Short-Term Memory (BiLSTM), one of the deep learning techniques, is used for classification, and the results are compared with other classifiers, namely the Convolutional Neural Network (CNN) and the traditional Hidden Markov Model (HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AVSR datasets, AVletters and GRID.

Results: The proposed model gives promising results. On GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.07% on clean data and 98.47% on noisy data, improvements of up to 9.28% and 12.05% over audio-only features, respectively. On AVletters, the highest recognition accuracy is 93.33%, an improvement of up to 8.33% over audio-only features.

Conclusion: Based on the results, increasing the audio feature vector size from 13 to 39 does not effectively enhance recognition accuracy in a clean environment, but it gives better performance in a noisy one. BiLSTM is the best of the compared classifiers for a robust speech recognition system because it takes into account the sequential character of the speech signal (audio and visual). The proposed model greatly improves recognition accuracy and decreases the loss value in both clean and noisy environments compared with audio-only features. Compared with previously reported results on the same datasets, our model achieves higher recognition accuracy, confirming its robustness.
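
As a rough illustration of the early-integration BiLSTM classifier described in the Method, the sketch below concatenates per-frame audio and visual features and feeds them to stacked bidirectional LSTMs; the sequence length, visual feature size, and layer widths are assumptions rather than the authors' settings:

```python
# Sketch of an early-fusion BiLSTM classifier over audio-visual features.
# 39 = MFCCs with deltas (from the abstract); other sizes are assumed.
import tensorflow as tf

N_FRAMES = 75                     # assumed sequence length after interpolation
AUDIO_DIM, VISUAL_DIM = 39, 40    # visual DCT feature size is an assumption
NUM_CLASSES = 26                  # e.g. the 26 letter classes of AVletters

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FRAMES, AUDIO_DIM + VISUAL_DIM)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```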


Author(s): Lery Sakti Ramba

The purpose of this research is to design a home automation system that can be controlled using voice commands. The research was conducted by studying related work, discussing with competent parties, designing the system, testing it, and analyzing the test results. The voice recognition system was designed using a deep learning Convolutional Neural Network (DL-CNN), and the CNN model was trained to recognize several kinds of voice commands. The result is a speech recognition system that can be used to control several electronic devices connected to it. The system achieves a 100% success rate in a room with a background noise intensity of 24 dB (silent), 67.67% at 42 dB, and only 51.67% at 52 dB (noisy). The success rate is therefore strongly influenced by the background noise intensity of the room, so to obtain optimal results the system is best suited to rooms with low background noise.
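
Since accuracy degrades sharply with background noise, one plausible (hypothetical, not the author's) safeguard is to act on a recognized command only when the CNN's softmax confidence is high:

```python
# Hypothetical sketch: dispatch a CNN-recognized voice command to a device
# only above a confidence threshold. Command names are illustrative.
import numpy as np

COMMANDS = ["light_on", "light_off", "fan_on", "fan_off"]

def handle_prediction(softmax_probs, threshold=0.8):
    idx = int(np.argmax(softmax_probs))
    if softmax_probs[idx] < threshold:
        return None                    # too uncertain, likely noise
    command = COMMANDS[idx]
    # device_driver.execute(command)   # hypothetical device-control call
    return command
```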


2020 · Vol 9 (1) · pp. 1022-1027

Driving has become a tedious job nowadays due to heavy traffic, so focus on driving is of utmost importance. This creates scope for automation in automobiles to minimize human intervention in controlling dashboard functions such as headlamps, indicators, power windows, and the wiper system; this paper is a small effort toward making driving distraction-free using a voice-controlled dashboard. The system proposed in this paper works on speech commands from the user (driver or passenger). Since the speech recognition system acts as the human-machine interface (HMI), the system uses speaker recognition and speech recognition to recognize the command and to verify that it comes from an authenticated user. The system performs feature extraction, extracting speech features such as Mel-Frequency Cepstral Coefficients (MFCC), Power Spectral Density (PSD), pitch, and the spectrogram. For feature matching, the system uses the Vector Quantization Linde-Buzo-Gray (VQ-LBG) algorithm, which uses the Euclidean distance between the test features and the codebook features. Based on the recognized speech command, the controller (Raspberry Pi 3B) activates the device driver for the motor or solenoid valve, depending on the function. The system is mainly aimed at low-noise environments, as most speech recognition systems suffer when noise is introduced; room acoustics also matter greatly, as the recognition rate differs with the acoustics. Across several testing and simulation trials, the system achieved a speech recognition rate of 76.13%. This system encourages automation of the vehicle dashboard and hence makes driving distraction-free.
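
The VQ-LBG matching step can be pictured as scoring a test utterance's MFCC frames against each enrolled codebook by Euclidean distance and picking the closest codebook; a minimal sketch under assumed data layouts:

```python
# Sketch of VQ codebook matching: score an utterance by the average
# Euclidean distance of its frames to the nearest codeword, then pick the
# enrolled identity/command with the lowest distortion. Illustrative only.
import numpy as np

def vq_distortion(test_mfcc, codebook):
    """test_mfcc: (frames, dim); codebook: (codewords, dim)."""
    # distance from every frame to every codeword
    d = np.linalg.norm(test_mfcc[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()   # average nearest-codeword distance

def match(test_mfcc, codebooks):
    """codebooks: dict mapping identity -> LBG-trained codebook array."""
    return min(codebooks, key=lambda k: vq_distortion(test_mfcc, codebooks[k]))
```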


Geofluids · 2020 · Vol 2020 · pp. 1-11
Author(s): Dongsheng Wang, Jun Feng, Xinpeng Zhao, Yeping Bai, Yujie Wang, ...

It is difficult to recognize the degree of water infiltration of a tunnel lining. To solve this problem, we propose a recognition method based on a deep convolutional neural network. We carry out laboratory tests, preparing cement mortar specimens with different saturation levels to simulate different degrees of infiltration of tunnel concrete linings, and establish an infrared thermal image data set covering these degrees of infiltration. Then, using a deep learning method, the data set is trained with the Faster R-CNN+ResNet101 network to establish a recognition model. The experiments show that the model can detect cement mortar specimens with different degrees of infiltration, localizing them with tight rectangular bounding boxes. This demonstrates that the classification and recognition model for tunnel concrete lining infiltration established by this laboratory method has high recognition accuracy.
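
The abstract names Faster R-CNN with a ResNet101 backbone; as a hedged sketch of an analogous setup, torchvision's stock Faster R-CNN (which ships with a ResNet-50 FPN backbone, not ResNet101) can be fine-tuned for assumed infiltration-degree classes:

```python
# Sketch of fine-tuning a torchvision Faster R-CNN detection head.
# Analogous setup only: the paper used ResNet101, torchvision's stock
# model uses ResNet-50 FPN; the class count below is an assumption.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 1 + 3   # background + assumed infiltration-degree classes

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
# Train with batches of (image, {"boxes": ..., "labels": ...}) targets.
```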


Author(s): Mona Nagy ElBedwehy, G. M. Behery, Reda Elbarougy

Emotion plays a major role in how humans express their feelings through speech. Emotional speech recognition is an important research field in human-computer interaction; ultimately, endowing machines with the ability to perceive users' emotions will enable more intuitive and reliable interaction. Researchers have presented many models for recognizing human emotion from speech. One of the best known is the Gaussian mixture model (GMM). Nevertheless, a GMM may have one or more ill-conditioned or singular covariance matrices among its components when the number of features is high and some features are correlated. In this research, a new system based on weighted distance optimization (WDO) has been developed for recognizing emotional speech. The main purpose of the WDO system (WDOS) is to address these GMM shortcomings and increase recognition accuracy. A comparative study over all emotional states and the characteristics of individual emotional states shows that WDOS achieves considerable success, with a superior accuracy of 86.03% for the Japanese language, improving Japanese emotion recognition accuracy by 18.43% compared with GMM and k-means.
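
The abstract does not spell out the WDO formulation, so the following is only a generic illustration of weighted-distance classification against per-emotion centroids; the centroids, weights, and helper names are all assumptions:

```python
# Generic weighted-distance classification sketch (not the paper's WDO):
# score a feature vector against each emotion's centroid with per-feature
# weights, and return the closest emotion.
import numpy as np

def weighted_distance(x, centroid, w):
    return np.sqrt(np.sum(w * (x - centroid) ** 2))

def classify(features, centroids, weights):
    """centroids: dict emotion -> mean feature vector; weights: per-feature
    importance (assumed to come from some optimization step)."""
    return min(centroids,
               key=lambda e: weighted_distance(features, centroids[e], weights))
```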


2018 · Vol 15 (1) · pp. 172988141774948
Author(s): Zhiqiang Liu, Jianqin Yin, Jinping Li, Jun Wei, Zhiquan Feng

One of the most important aspects of making home service robots more intelligent is reliably recognizing human actions and accurately understanding human behaviors and intentions. In action recognition there are many common ambiguous postures, which hurt recognition accuracy. To improve the reliability of the service provided by home service robots, this article presents a probabilistic soft-assignment recognition scheme based on Gaussian mixture models for recognizing similar actions. First, we generate a representative posture dictionary based on the standard bag-of-words model; then, a Gaussian mixture model is introduced for the similar poses. Finally, combined with the naive Bayes principle, weighted voting is used to recognize the action. The proposed scheme is verified on four types of daily actions, and the experimental results show its effectiveness.
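
A sketch of the soft-assignment idea under stated assumptions: posture features are assigned probabilistically to dictionary words via a GMM rather than to a single nearest word, and the per-frame posteriors drive weighted voting over actions (the data layout and helper names here are hypothetical):

```python
# Soft-assignment sketch: k-means builds the posture dictionary; a GMM with
# one component per dictionary word gives probabilistic (soft) assignments,
# which weight the votes for each action. Details are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def build_models(pose_feats, n_words=64):
    """pose_feats: (n_frames, feat_dim) posture features (assumed input)."""
    dictionary = KMeans(n_clusters=n_words, random_state=0).fit(pose_feats)
    gmm = GaussianMixture(n_components=n_words, covariance_type="diag",
                          means_init=dictionary.cluster_centers_,
                          random_state=0).fit(pose_feats)
    return dictionary, gmm

def action_scores(sequence, gmm, word_action_probs):
    """word_action_probs: (n_words, n_actions) P(action | posture word)."""
    posteriors = gmm.predict_proba(sequence)            # soft assignment per frame
    return posteriors.mean(axis=0) @ word_action_probs  # weighted voting
```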

