speech recognizer
Recently Published Documents


TOTAL DOCUMENTS: 276 (FIVE YEARS: 19)

H-INDEX: 21 (FIVE YEARS: 1)

2022 ◽  
Vol 15 ◽  
Author(s):  
Enrico Varano ◽  
Konstantinos Vougioukas ◽  
Pingchuan Ma ◽  
Stavros Petridis ◽  
Maja Pantic ◽  
...  

Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN) and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield an even higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.



Author(s):  
Sohan Singh ◽  
Anupam Lakhanpal ◽  
Shashwat Shukla ◽  
Srishti Sinha

“Jarvis” was Tony Stark’s life assistant in the Iron Man movies. Unlike the original comic, in which Jarvis was Stark’s human butler, the movie version of Jarvis is an intelligent computer that converses with Stark, monitors his household, and helps to build and program his superhero suit. In this project, Jarvis is a digital life assistant that mainly uses human communication channels such as Twitter, instant messaging, and voice to create a two-way connection between a human and his apartment: controlling lights and appliances, assisting in cooking, notifying him of breaking news, Facebook notifications, and much more. In our project we mainly use voice as the communication means, so Jarvis is essentially a speech recognition application. The concept of speech technology really encompasses two technologies: synthesizers and recognizers. A speech synthesizer takes text as input and produces an audio stream as output. A speech recognizer, on the other hand, does the opposite: it takes an audio stream as input and turns it into a text transcription. The voice is a signal carrying a great deal of information, and directly analyzing and synthesizing the complex voice signal is difficult because of how much information the signal contains. Therefore, digital signal processing steps such as feature extraction and feature matching are introduced to represent the voice signal. In this project we directly use a speech engine whose feature extraction technique is based on mel-scaled frequency cepstral analysis. The mel-scaled frequency cepstral coefficients (MFCCs), derived from Fourier transform and filter-bank analysis, are perhaps the most widely used front ends in state-of-the-art speech recognition systems. Our aim is to create more and more functionalities that can assist humans in their daily lives and reduce their effort. In our tests we verified that all of this functionality works properly, testing with two speakers (one female and one male) for accuracy.
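As a brief, hedged illustration of the MFCC front end mentioned in the abstract, the sketch below extracts mel-scaled frequency cepstral coefficients from a recording with the librosa library; the file name voice_command.wav, the 16 kHz sampling rate, and the 13-coefficient / 25 ms window / 10 ms hop settings are illustrative assumptions, not the project’s exact configuration.

```python
# Minimal MFCC front-end sketch (illustrative, not the project's exact pipeline).
import librosa

# Load an utterance; "voice_command.wav" is a hypothetical file name.
signal, sample_rate = librosa.load("voice_command.wav", sr=16000)

# Compute 13 mel-scaled frequency cepstral coefficients per frame,
# using 25 ms analysis windows with a 10 ms hop (common ASR defaults).
mfccs = librosa.feature.mfcc(
    y=signal,
    sr=sample_rate,
    n_mfcc=13,
    n_fft=int(0.025 * sample_rate),
    hop_length=int(0.010 * sample_rate),
)

print(mfccs.shape)  # (13, number_of_frames)
```

The resulting frame-by-frame coefficient matrix is what a recognizer's feature-matching stage would consume in place of the raw waveform.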


2021 ◽  
Vol 38 (2) ◽  
pp. 349-358
Author(s):  
Bilal Dendani ◽  
Halima Bahi ◽  
Toufik Sari

Mobile speech recognition attracts much attention in the ubiquitous context; however, background noise, speech coding, and transmission errors are prone to corrupt the incoming speech. Hence, building a robust speech recognizer requires a large number of real-world speech samples. The Arabic language, like many others, lacks such resources; to overcome this limitation, we propose a speech enhancement step before recognition begins. For speech enhancement, we suggest the use of a deep autoencoder (DAE). A two-step procedure is proposed: in the first step, an overcomplete DAE is trained in an unsupervised way, and in the second, a denoising DAE is trained in a supervised way, leveraging the clean speech produced in the previous step. Experimental results on a real-life mobile database confirm the potential of the proposed approach and show a reduction of the word error rate (WER) of a ubiquitous Arabic speech recognizer. Further experiments show improvements in the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility (STOI) as well.
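As a rough, hedged sketch of the denoising step described in the abstract, the PyTorch snippet below trains a small autoencoder to map noisy feature frames to clean ones; the feature dimension, layer width, and the random training pairs are illustrative assumptions, not the authors’ architecture or data.

```python
# Sketch of a denoising autoencoder (DAE) trained to map noisy speech
# features to clean ones. Sizes and data here are placeholders.
import torch
import torch.nn as nn

class DenoisingDAE(nn.Module):
    def __init__(self, n_features=257, hidden=512):
        super().__init__()
        # An "overcomplete" hidden layer is wider than the input dimension.
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, noisy):
        return self.decoder(self.encoder(noisy))

model = DenoisingDAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder training pairs: in practice these would be feature frames of
# noisy mobile speech and of the corresponding clean (enhanced) speech.
noisy_frames = torch.rand(1024, 257)
clean_frames = torch.rand(1024, 257)

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy_frames), clean_frames)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: reconstruction loss {loss.item():.4f}")
```

At inference time the trained model would enhance incoming frames before they reach the Arabic recognizer, which is where the reported WER reduction comes from.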


2021 ◽  
Vol 3 (1) ◽  
pp. 68-83
Author(s):  
Wiqas Ghai ◽  
Navdeep Singh

Punjabi is a tonal language belonging to the Indo-Aryan language family and has a large number of speakers around the world. Punjabi has gained acceptance in media and communication and therefore deserves a place in the growing field of automatic speech recognition, which has already been explored successfully for a number of other Indian and foreign languages. Some work has been done on isolated-word speech recognition for Punjabi, but only using whole-word acoustic models; a phone-based approach has yet to be applied to Punjabi speech recognition. This paper describes an automatic speech recognizer that recognizes isolated-word and connected-word speech using a triphone-based acoustic model on the HTK 3.4.1 speech engine, and compares its performance with an ASR system based on whole-word acoustic models. Word recognition accuracy on isolated-word speech was 92.05% for the whole-word model and 97.14% for the triphone model, whereas word recognition accuracy on connected-word speech was 87.75% for the whole-word model and 91.62% for the triphone model.
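To make the triphone idea concrete, here is a small Python sketch, not taken from the paper, that expands a word’s phone sequence into HTK-style context-dependent triphone labels of the form left-phone+right; the example phone sequence is hypothetical.

```python
# Illustrative sketch of the triphone idea: each phone is modeled in the
# context of its left and right neighbours ("l-p+r" in HTK notation).
def to_triphones(phones):
    """Expand a phone sequence into within-word triphones (HTK-style labels)."""
    triphones = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        label = phone
        if left is not None:
            label = f"{left}-{label}"
        if right is not None:
            label = f"{label}+{right}"
        triphones.append(label)
    return triphones

# Hypothetical pronunciation of a word as a phone sequence.
print(to_triphones(["p", "a", "nj", "aa", "b"]))
# ['p+a', 'p-a+nj', 'a-nj+aa', 'nj-aa+b', 'aa-b']
```

Because each phone gets a separate model per left/right context, triphone systems capture coarticulation that whole-word models miss, which is consistent with the accuracy gains reported above.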


2020 ◽  
Author(s):  
Yuan Shangguan ◽  
Kate Knister ◽  
Yanzhang He ◽  
Ian McGraw ◽  
Françoise Beaufays

2020 ◽  
Vol 25 (3) ◽  
pp. 93-98
Author(s):  
Kyu-Seok Kim

Real-time voice translation systems receive a speaker's voice and translate the speech into another language. However, the meaning of a whole Korean sentence can be unintentionally changed because Korean words and syllables can be merged or divided by spaces. The spaces the speaker intends are therefore occasionally not identified by the speech recognition system, so the translated sentences are sometimes incorrect. This paper presents a methodology to enhance the accuracy of voice translation by adding intentional spaces. An Android application was implemented using the Google speech recognizer for Android and Google Translator for the Web. The Google speech recognizer receives the speaker's Korean sentences as voice and returns the text results. Next, the proposed Android application adds spaces whenever the speaker says a dedicated word for the space. Finally, the modified Korean sentences are translated into English by Google Translator for the Web. This method can enhance interpretation accuracy for translation systems.
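The post-processing idea can be sketched in a few lines of Python: a dedicated spoken keyword in the recognizer output is replaced by an explicit space before the sentence is sent for translation. The keyword “띄어” and the example sentence are illustrative assumptions, not the paper’s exact implementation.

```python
# Sketch of the "intentional space" post-processing step: wherever the speaker
# said the dedicated keyword, insert an explicit word boundary.
def insert_intentional_spaces(recognized_text, space_keyword="띄어"):
    """Replace every occurrence of the dedicated keyword with a single space."""
    parts = recognized_text.split(space_keyword)
    # Join the fragments with spaces, dropping any empty fragments.
    return " ".join(part.strip() for part in parts if part.strip())

# Hypothetical recognizer output where the speaker spoke the keyword
# to force word boundaries the recognizer would otherwise drop.
print(insert_intentional_spaces("아버지가띄어방에띄어들어가신다"))
# -> "아버지가 방에 들어가신다"
```

The corrected, explicitly spaced sentence is then what gets passed on to the translation service, which is where the accuracy gain comes from.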

