Augmented Latent Features of Deep Neural Network-Based Automatic Speech Recognition for Motor-Driven Robots

2020 ◽  
Vol 10 (13) ◽  
pp. 4602
Author(s):  
Moa Lee ◽  
Joon-Hyuk Chang

Speech recognition for intelligent robots suffers from performance degradation due to ego-noise. Ego-noise is generated by the motors, fans, and mechanical parts inside the robot, especially when the robot moves or shakes its body. To overcome the problems caused by ego-noise, we propose a robust speech recognition algorithm that uses the robot's motor-state information as an auxiliary feature. For this, we use two deep neural networks (DNNs). First, we design latent features using a bottleneck layer, an internal layer with fewer hidden units than the other layers, to represent whether the motor is operating or not. The latent features that best capture the motor-state information are generated by feeding the motor data and acoustic features into the first DNN. Second, once the motor-state-dependent latent features have been produced by the first DNN, the second DNN, which handles acoustic modeling, receives these latent features as input along with the acoustic features. We evaluated the proposed system on the LibriSpeech database. The proposed network enables efficient compression of the acoustic and motor-state information, and the resulting word error rate (WER) is lower than that of a conventional speech recognition system.
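A minimal sketch of the two-network arrangement described above, assuming PyTorch; the layer sizes, feature dimensions, and class names (MotorStateBottleneck, AcousticModel) are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MotorStateBottleneck(nn.Module):
    """First network: compresses acoustic + motor-state input into a small
    bottleneck layer whose activations serve as auxiliary latent features."""
    def __init__(self, acoustic_dim=40, motor_dim=4, bottleneck_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(acoustic_dim + motor_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim), nn.ReLU(),   # bottleneck layer
        )
        self.classifier = nn.Linear(bottleneck_dim, 2)   # motor on / off

    def forward(self, acoustic, motor):
        z = self.encoder(torch.cat([acoustic, motor], dim=-1))
        return z, self.classifier(z)

class AcousticModel(nn.Module):
    """Second network: receives acoustic features concatenated with the
    bottleneck features and predicts senone (HMM state) posteriors."""
    def __init__(self, acoustic_dim=40, bottleneck_dim=16, num_senones=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, num_senones),
        )

    def forward(self, acoustic, latent):
        return self.net(torch.cat([acoustic, latent], dim=-1))

# usage sketch: latent features from the first DNN feed the second DNN
frontend, am = MotorStateBottleneck(), AcousticModel()
acoustic = torch.randn(8, 40)   # one batch of acoustic feature frames
motor = torch.randn(8, 4)       # synchronized motor-state readings
latent, _ = frontend(acoustic, motor)
senone_logits = am(acoustic, latent.detach())
```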

2010 ◽  
Vol 44-47 ◽  
pp. 1422-1426
Author(s):  
Mei Juan Gao ◽  
Zhi Xin Yang

In this paper, based on a study of two speech recognition algorithms, two designs of a speech recognition system are presented to realize an isolated-word speech recognition mobile robot control system based on an ARM9 processor. The speech recognition process includes speech signal pre-processing, feature extraction, pattern matching, and post-processing. Mel-frequency cepstral coefficients (MFCC) and linear prediction cepstral coefficients (LPCC) are the two most common feature parameters. Analysis and comparison of the two show that MFCC is more robust to noise than LPCC, so MFCC is selected as the feature parameter. Dynamic time warping (DTW) and the hidden Markov model (HMM) are both commonly used recognition algorithms. Given the different characteristics of DTW and HMM, two different programs were designed for the mobile robot control system, and the recognition accuracy and speed of the two systems were analyzed and compared.
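A minimal sketch of the MFCC-plus-DTW branch, assuming librosa for feature extraction; the frame settings and template-matching setup are illustrative, not the paper's implementation.

```python
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=13):
    """Load a waveform and return its MFCC sequence (frames x coefficients)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two MFCC sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# usage sketch: match an utterance against per-command templates
# templates = {"forward": mfcc_features("forward.wav"), ...}
# query = mfcc_features("utterance.wav")
# command = min(templates, key=lambda w: dtw_distance(query, templates[w]))
```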


2009 ◽  
Vol 2 (4) ◽  
pp. 67-80 ◽  
Author(s):  
Mohamed Ali ◽  
Moustafa Elshafei ◽  
Mansour Al-Ghamdi ◽  
Husni Al-Muhtaseb

Phonetic dictionaries are essential components of large-vocabulary speaker-independent speech recognition systems. This paper presents a rule-based technique for generating phonetic dictionaries for a large-vocabulary Arabic speech recognition system. The system uses conventional Arabic pronunciation rules and common pronunciation rules of Modern Standard Arabic, as well as some common dialectal cases. The paper explains these rules in detail and gives their formal mathematical formulation. The rules were used to generate a dictionary for a 5.4-hour corpus of broadcast news. The rules and the phone set were tested and evaluated on an Arabic speech recognition system that was trained on 4.3 hours of the 5.4-hour Arabic broadcast news corpus and tested on the remaining 1.1 hours. The phonetic dictionary contains 23,841 pronunciation entries corresponding to about 14,232 words. The language model contains both bigrams and trigrams. The word error rate (WER) was 9.0%.
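A minimal sketch of the rule-based idea in the abstract above: each rule maps a grapheme pattern to phones, and rules are applied greedily left to right. The toy symbols and rules below are illustrative only, not the paper's Arabic rule set.

```python
# Toy grapheme-to-phoneme rules in the spirit of a rule-based dictionary
# generator. The symbols and rules are illustrative, not the paper's set.
RULES = [
    ("sh", ["SH"]),          # digraph handled before single letters
    ("aa", ["AE", "AE"]),    # long vowel
    ("a",  ["AE"]),
    ("b",  ["B"]),
    ("s",  ["S"]),
    ("l",  ["L"]),
    ("m",  ["M"]),
]

def word_to_phones(word):
    """Greedy left-to-right rule application: the longest matching rule wins."""
    phones, i = [], 0
    while i < len(word):
        for pattern, mapped in sorted(RULES, key=lambda r: -len(r[0])):
            if word.startswith(pattern, i):
                phones.extend(mapped)
                i += len(pattern)
                break
        else:
            i += 1  # skip characters no rule covers
    return phones

def build_dictionary(words):
    """One pronunciation entry per word; real systems emit variants too."""
    return {w: word_to_phones(w) for w in words}

print(build_dictionary(["salaam", "shams"]))
```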


2013 ◽  
Vol 416-417 ◽  
pp. 1156-1159
Author(s):  
Bo Nian Yi

Speech recognition technology is one of the hottest and most promising new information technologies in the world. This paper studies speech pre-processing and the extraction of MFCC feature parameters, constructs a speech keyword recognition algorithm built around the VQ and HMM models, uses MATLAB to complete the training and simulation of the algorithm, and presents an FPGA-based voice recognition design together with the simulation and implementation of its hardware and software. This work lays the foundation for FPGA-based speech recognition and control.
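A minimal sketch of the VQ side of such a VQ/HMM keyword recognizer, assuming scikit-learn's KMeans as the codebook trainer; in this setup the codeword index of each frame becomes the discrete observation symbol fed to a per-keyword HMM. The dimensions and data below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(mfcc_frames, codebook_size=64):
    """Learn a VQ codebook over pooled MFCC frames from the training set."""
    return KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(mfcc_frames)

def quantize(codebook, mfcc_sequence):
    """Map each frame to its nearest codeword index (discrete HMM symbol)."""
    return codebook.predict(mfcc_sequence)

# usage sketch: the symbol sequences would train one discrete HMM per keyword
train_frames = np.random.randn(5000, 13)     # pooled MFCC frames (placeholder)
utterance = np.random.randn(120, 13)         # one utterance's MFCC sequence
codebook = train_codebook(train_frames)
symbols = quantize(codebook, utterance)      # e.g. array([17, 3, 3, 42, ...])
```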


2019 ◽  
Vol 55 (2) ◽  
pp. 183-209
Author(s):  
Danijel Koržinek ◽  
Krzysztof Wołk ◽  
Łukasz Brocki ◽  
Krzysztof Marasek

Abstract This paper describes an automatic transcription system for the Polish Newsreel, a collection of mid-to-late 20th-century news segments presented in audio and video form. They are characterized by archaic language and poor audio quality, which makes them a demanding problem for speech recognition systems. Acoustic and language models had to be retrained using data from in-domain corpora, and during adaptation experiments were carried out to select the optimal adaptation parameters. The experiments showed that adapting the speech recognition system to a narrow and clearly defined domain significantly improves its performance. The final word error rate obtained for this domain was 10.97%.
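One common form of the language-model adaptation described above is interpolating an in-domain model with a general one and tuning the interpolation weight on held-out data; a minimal unigram-level sketch, with the probability tables and weight grid purely illustrative.

```python
import math

def interpolate(p_indomain, p_general, lam):
    """Linear interpolation of two LM probabilities with weight lam."""
    return lam * p_indomain + (1.0 - lam) * p_general

def dev_perplexity(dev_words, lm_in, lm_gen, lam, floor=1e-8):
    """Perplexity of the interpolated model on a held-out word list."""
    logp = sum(math.log(max(interpolate(lm_in.get(w, 0.0), lm_gen.get(w, 0.0), lam), floor))
               for w in dev_words)
    return math.exp(-logp / len(dev_words))

# usage sketch: pick the interpolation weight that minimizes dev perplexity
lm_in = {"kronika": 0.01, "filmowa": 0.008, "the": 0.0001}     # toy in-domain unigrams
lm_gen = {"kronika": 0.0001, "filmowa": 0.0001, "the": 0.05}   # toy general unigrams
dev = ["kronika", "filmowa", "the"]
best = min((dev_perplexity(dev, lm_in, lm_gen, lam), lam) for lam in (0.1, 0.3, 0.5, 0.7, 0.9))
print("best interpolation weight:", best[1])
```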


Author(s):  
Xiaoli Lu ◽  
Mohd Asif Shah

Background: Human-computer interaction through natural-language conversational interfaces plays a vital role in improving the usability of computers. Speech recognition technology allows a machine to understand human language, and a speech recognition algorithm is used to achieve this function. Methodology: Based on fundamental theoretical research on speech signals, this paper establishes an HMM model, applies speech collection, recognition, and related methods, simulates the system in MATLAB, and then ports the recognition system to ARM for debugging and running, realizing HMM-based embedded speech recognition on the ARM platform. Conclusion: The HMM-based embedded speaker-independent continuous English speech recognition system achieves high recognition accuracy and fast recognition speed.
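A minimal sketch of the HMM evaluation step that such a recognizer relies on: the scaled forward algorithm scoring an observation sequence against a word model. The toy parameters below are illustrative, not the paper's models.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs: sequence of observation symbol indices
    pi:  initial state distribution, shape (N,)
    A:   state transition matrix, shape (N, N)
    B:   emission matrix, shape (N, M)
    """
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

# usage sketch: recognition picks the word model with the highest score
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.0, 1.0]])           # left-to-right toy model
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]) # 3 discrete symbols
print(forward_log_likelihood([0, 1, 2, 2], pi, A, B))
```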


2021 ◽  
Vol 2066 (1) ◽  
pp. 012046
Author(s):  
Yuanyi Chen

Abstract As one of the core algorithms of machine vision, the moving-image multi-label recognition algorithm has received extensive attention from researchers in recent years and has been widely used in cutting-edge fields such as the PaddlePaddle deep learning framework, video surveillance, intelligent robots, and unmanned aerial vehicles. However, existing recognition algorithms do not fully satisfy practical applications in everyday life and production: because of the complexity of the deployment environment, they often only offer specific solutions to specific problems, and there is no universal algorithm suitable for all kinds of complex environments. The purpose of this paper is to study a multi-label recognition algorithm for moving images based on the PaddlePaddle platform. The research analyzes the spatial deployment of multiple labels for moving images and the multi-label recognition algorithm, with the aim of further improving the label reading rate and recognition reliability of moving images on the PaddlePaddle platform. It first analyzes several key factors that affect the performance of a UHF recognition system, considers improvements to the platform's moving-image multi-label recognition algorithm from the two aspects of space diversity and frequency diversity, finally settles on a multi-label space diversity scheme, and introduces an optimized multi-label recognition algorithm to improve recognition efficiency. Experimental data show that the reading rate reaches 0.907 when 300 labels are identified, and that the reading rate approaches 1 when the number of labels exceeds 300, which verifies the suitability of the proposed algorithm for multi-label recognition of moving images on the PaddlePaddle platform.


2021 ◽  
Author(s):  
Yuji Miao ◽  
Yanan Huang ◽  
Zhenjing Da

Abstract In order to improve the effectiveness of English speech recognition, this paper uses digital methods and combines them with the practical requirements of English speech feature recognition to improve the underlying digital algorithm. It applies a fuzzy recognition algorithm to analyze English speech features, examines the shortcomings of traditional algorithms, proposes a fuzzy digitized English speech recognition algorithm, and builds an English speech feature recognition model on this basis. In addition, the paper performs time-frequency analysis of chaotic signals and speech signals, removes noise from the English speech features to improve their recognition, and builds an English speech feature recognition system based on digital methods. Finally, grouping experiments are conducted using students' English pronunciation as input, and the experimental results are collected to test the performance of the system. The results show that the proposed method has a certain effect.
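The time-frequency noise-removal step mentioned above is commonly realized with a short-time Fourier transform followed by spectral subtraction; a minimal sketch assuming SciPy, with the noise-estimation window and spectral floor chosen purely for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs=16000, noise_seconds=0.25):
    """Estimate the noise spectrum from the leading frames and subtract it."""
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise_frames = int(noise_seconds * fs / 256)          # 256 = hop (nperseg // 2)
    noise_mag = mag[:, :max(noise_frames, 1)].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)   # keep a spectral floor
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return y

# usage sketch on a synthetic noisy tone
fs = 16000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(fs)
enhanced = spectral_subtraction(noisy, fs)
```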


2019 ◽  
Vol 2 (2) ◽  
pp. 149-153
Author(s):  
Zulkarnaen Hatala

This paper presents a procedure for developing an automatic speech recognition (ASR) system for the online recognition case. The procedure builds an ASR system quickly and efficiently using the Hidden Markov Toolkit (HTK). The practical steps are laid out clearly for implementing a small-vocabulary ASR system, using Indonesian digit recognition as the example case. Several techniques for improving performance are also explained, such as handling noise, handling multiple pronunciations, and applying Principal Component Analysis. The final result is reported as a word error rate
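The word error rate reported by such a system is the word-level edit distance between hypothesis and reference divided by the reference length; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# usage sketch with Indonesian digits
print(word_error_rate("satu dua tiga empat", "satu dua tika empat"))  # 0.25
```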


2019 ◽  
Vol 8 (2S11) ◽  
pp. 2350-2352

In automatic speech recognition, the dissimilarity between the recognized word sequence and its ground truth across different channels is measured with the word error rate (WER), the standard evaluation metric. In the single-channel (1ch) track, the model is trained without any preprocessing, while work on multichannel end-to-end automatic speech recognition suggests that the multichannel front end can be integrated into a deep neural network based system, leading to a range of experimental results. Because WER is not directly differentiable, it is pertinent to adopt a gradient-based objective function for the encoder-decoder model, as demonstrated in the CHiME-4 system. In this study, we examine whether a sequence-level evaluation metric is a fair choice for optimizing encoder-decoder models, for which many training algorithms are designed to reduce sequence-level error. The study incorporates the scoring of multiple hypotheses at the decoding stage to push the decoding result toward the optimum, so that the mismatch between the training and evaluation objectives is reduced as far as feasible. The study concludes that this form of voice recognition is the most effective for adaptation.
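A minimal sketch of the sequence-level idea above: score an N-best list, turn the scores into a posterior with a softmax, and compute the expected word-error count, which is differentiable in the scores even though WER itself is not. The hypotheses and scores are illustrative; this is not the CHiME-4 recipe itself.

```python
import numpy as np

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)]

def expected_error(reference, hypotheses, scores):
    """Posterior-weighted error count over an N-best list."""
    scores = np.asarray(scores, dtype=float)
    posterior = np.exp(scores - scores.max())
    posterior /= posterior.sum()
    errors = np.array([edit_distance(reference.split(), h.split()) for h in hypotheses])
    return float(posterior @ errors)

# usage sketch: the loss rewards moving probability mass toward low-error hypotheses
ref = "turn on the kitchen light"
nbest = ["turn on the kitchen light", "turn on a kitchen light", "turn of the kitchen light"]
print(expected_error(ref, nbest, scores=[2.1, 1.4, 0.3]))
```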

