Speech recognition engine: Recently Published Documents

TOTAL DOCUMENTS: 31 (five years: 11)
H-INDEX: 4 (five years: 1)

Author(s): Tristan J. Mahr, Visar Berisha, Kan Kawabata, Julie Liss, Katherine C. Hustad

Purpose. Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time-consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers. Method. The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals. Results. The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all aligners, and for fricatives, alignment accuracy increased with age across all aligners. Conclusion. The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors. Supplemental Material: https://doi.org/10.23641/asha.14167058
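As a rough illustration of the two evaluation criteria described above (midpoint-coverage accuracy and phone-onset timing differences), the following sketch assumes phone intervals are available as (label, onset, offset) tuples in corresponding order; the function names, data layout, and example times are hypothetical and not taken from the paper.

```python
# Minimal sketch of the two alignment metrics described in the abstract.
# Assumes the aligner output and the manual gold standard are lists of
# (phone_label, onset_sec, offset_sec) tuples for the same phone sequence.

def midpoint_accuracy(auto_phones, manual_phones):
    """Proportion of automatic intervals that cover the midpoint
    of the corresponding manually aligned interval."""
    hits = 0
    for (_, a_on, a_off), (_, m_on, m_off) in zip(auto_phones, manual_phones):
        midpoint = (m_on + m_off) / 2
        if a_on <= midpoint <= a_off:
            hits += 1
    return hits / len(manual_phones)

def onset_differences(auto_phones, manual_phones):
    """Absolute differences (seconds) between automatic and manual phone onsets."""
    return [round(abs(a_on - m_on), 3)
            for (_, a_on, _), (_, m_on, _) in zip(auto_phones, manual_phones)]

# Toy usage with two phones and made-up times.
auto = [("AH", 0.10, 0.24), ("T", 0.24, 0.31)]
gold = [("AH", 0.12, 0.25), ("T", 0.25, 0.33)]
print(midpoint_accuracy(auto, gold))   # 1.0
print(onset_differences(auto, gold))   # [0.02, 0.01]
```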


2021, Vol. 343, pp. 04003
Author(s): Valentin-Cătălin Govoreanu, Adrian-Nicolae Ţocu, Alin-Marius Cruceat, Dragoş Circa

This paper presents a speech recognition service used for commanding and guiding the activities around an industrial training station. The entire concept is built on a decentralized microservice architecture, and one of its many hardware and software components is the speech recognition engine. This engine lets users interact seamlessly with the other components in order to ensure a gradual and productive learning process. By working with different APIs for both English and Romanian, the presented approach achieves good recognition of the task-defining phrases that support the training procedure and reduces the required recognition time by almost 50%.
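As a minimal sketch of how such a recognition microservice might expose English and Romanian transcription, the snippet below uses the open-source `speech_recognition` Python package with Google's free web recognizer behind a small Flask endpoint; the paper does not name the specific APIs or service layout it uses, so the library choice, route name, and language codes here are assumptions for illustration only.

```python
# Hypothetical sketch of a tiny recognition microservice (not the authors' implementation).
# Assumes the `speech_recognition` and `flask` packages are installed.
import io
import time

import speech_recognition as sr
from flask import Flask, request, jsonify

app = Flask(__name__)
recognizer = sr.Recognizer()

@app.route("/recognize", methods=["POST"])
def recognize():
    # The client uploads a WAV file and selects "en-US" or "ro-RO".
    language = request.args.get("lang", "en-US")
    wav_bytes = io.BytesIO(request.files["audio"].read())
    with sr.AudioFile(wav_bytes) as source:
        audio = recognizer.record(source)
    start = time.perf_counter()
    try:
        text = recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        text = ""
    elapsed = time.perf_counter() - start
    return jsonify({"text": text, "seconds": round(elapsed, 2)})

if __name__ == "__main__":
    app.run(port=5005)
```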


2020
Author(s): Tristan Mahr, Visar Berisha, Kan Kawabata, Julie Liss, Katherine Hustad

Aim. We compared the performance of five forced-alignment algorithms on a corpus of child speech. Method. The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals. Results. The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all aligners, and for fricatives, alignment accuracy increased with age across all aligners. Interpretation. The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors.


Author(s): Sriraksha Nayak, Chandrakala C B

According to World Health Organization estimates, about 285 million people worldwide have some visual impairment, of whom 39 million are blind. The inability to use features such as sending and reading email, schedule management, pathfinding or outdoor navigation, and reading SMS is a disadvantage for blind people in many professional and educational situations. Speech and text analysis can help improve support for visually impaired people. Users can speak a command to perform a task; the spoken command is interpreted by the Speech Recognition Engine (SRE) and either converted into text or mapped to a suitable action. In this paper, an application that allows schedule management, emailing, and SMS reading entirely by voice command is proposed, implemented, and validated. The system lets blind users simply speak the desired functionality and be guided by the system's audio instructions. The proposed app supports three languages: English, Hindi, and Kannada.
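A minimal sketch of the command-dispatch idea described above follows; the abstract does not specify how recognized text is mapped to actions, so the keyword list and handler names below are hypothetical illustrations only.

```python
# Hypothetical sketch: route recognized command text to an action handler.
# Handlers and keywords are illustrative, not taken from the paper.

def send_email():
    print("Opening the email composer...")

def read_sms():
    print("Reading the latest SMS aloud...")

def manage_schedule():
    print("Reading today's schedule aloud...")

# Keyword-based dispatch table; a real system would use intent classification
# and would also support Hindi and Kannada command phrases.
COMMANDS = {
    "email": send_email,
    "sms": read_sms,
    "schedule": manage_schedule,
}

def dispatch(recognized_text: str) -> None:
    text = recognized_text.lower()
    for keyword, handler in COMMANDS.items():
        if keyword in text:
            handler()
            return
    print("Sorry, I did not understand that command.")

dispatch("please read my SMS")   # -> Reading the latest SMS aloud...
```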


Author(s): Tianyun Li, Bicheng Fan

This study sets out to describe simultaneous interpreters' attention-sharing initiatives when exposed to input from both a videotaped speech recording and real-time transcriptions. Dividing mental energy across visual inputs accords with the human brain's statistical optimization principle, whereby the same property of an object is presented in diverse ways. To examine professional interpreters' initiatives, the authors invited five professional English-Chinese conference interpreters to simultaneously interpret a videotaped speech with real-time captions generated by a speech recognition engine while their eye movements were monitored. The results indicate that the professional interpreters preferred to refer to the visually presented captions along with the speaker's facial expressions, with low-frequency words, proper names, and numbers receiving greater attention than higher-frequency words. This phenomenon might be explained by working memory theory, in which the central executive enables redundancy gains from dual-channel information.


Cooperative manipulators have been a subject of interest in the scientific community for the last few years. Here, an overview of the design and control of such cooperative manipulators using speech commands in English, Hindi, and Tamil is discussed. Two identical robot arms from Lynxmotion are used, and both manipulators move in conjunction with one another to carry a greater payload while grasping or handling an object with the end effector. The end user controls both identical manipulators simultaneously by pronouncing simple speech commands into a smartphone; the commands are converted into text by a speech recognition engine, and this text is fed to a servo controller that actuates the joints of both robot arms. Cooperative manipulators are used for handling radioactive elements and, in medicine, as rehabilitation aids and in surgery. An Android app built specifically for this purpose communicates over Bluetooth and gives the end user a simple interface for controlling both robot arms simultaneously.
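A rough sketch of the text-to-servo step is given below; it assumes the servo controllers accept Lynxmotion SSC-32-style serial commands ("#<channel>P<pulse-width>") over Bluetooth serial ports, and the port names, command vocabulary, channel numbers, and pulse widths are hypothetical rather than taken from the paper.

```python
# Hypothetical sketch: turn a recognized speech command into servo motions
# for two identical arms over a Bluetooth serial link (SSC-32-style syntax).
# Port names, channels, and pulse widths are illustrative only.
import serial

# Bluetooth serial ports for the two arms' servo controllers (assumed names).
ARM_PORTS = ["/dev/rfcomm0", "/dev/rfcomm1"]

# Map a recognized command phrase to (servo channel, pulse width in microseconds).
COMMAND_TABLE = {
    "open gripper":  (4, 1800),
    "close gripper": (4, 1200),
    "raise elbow":   (2, 1700),
    "lower elbow":   (2, 1300),
}

def execute(recognized_text: str) -> None:
    """Send the same servo command to both arms so they move in conjunction."""
    action = COMMAND_TABLE.get(recognized_text.lower().strip())
    if action is None:
        print("Unknown command:", recognized_text)
        return
    channel, pulse = action
    frame = f"#{channel}P{pulse}T1000\r".encode()   # move over 1000 ms
    for port in ARM_PORTS:
        with serial.Serial(port, baudrate=9600, timeout=1) as link:
            link.write(frame)

execute("close gripper")   # both grippers close together
```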


Author(s): Basanta Kumar Swain, Sanghamitra Mohanty, Chiranji Lal Chowdhary

In this research paper, we developed a spoken dialogue system using the Odia phone set. We also added a security feature to the spoken dialogue system by integrating a speaker verification module, which grants services only to genuine users. The spoken dialogue system offers a bouquet of services for opening frequently used applications, files, and folders that are installed or stored on the user's computer. The system also responds to users with synthesized speech related to the requested service, and it can be used to keep the computer desktop free of clutter. We used an HMM-based Odia isolated-word speech recognition engine and a fuzzy c-means-based speaker verification module in developing the spoken dialogue system. The accuracy of the Odia speech recognition engine is 78.22% for seen users and 62.31% for unseen users, and the average accuracy of the speaker verification module is 66.2%.
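The abstract does not detail the verification procedure, so the following is only a rough sketch of one way fuzzy c-means could be used for it: cluster an enrolled speaker's MFCC frames into fuzzy centres, then accept a test utterance if its frames sit close enough, on average, to those centres. The feature dimensions, threshold, and all parameter values are assumptions, not the authors' settings.

```python
# Hypothetical sketch of fuzzy c-means (FCM) based speaker verification.
# Enrollment: cluster the claimed speaker's MFCC frames into fuzzy centres.
# Test: accept if the test frames' average distance to their nearest centre
# is below a tuned threshold. All numbers here are illustrative.
import numpy as np

def fuzzy_c_means(X, c=4, m=2.0, iters=50, seed=0):
    """Minimal FCM: X is (n_frames, n_features); returns (c, n_features) centres."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # fuzzy memberships sum to 1 per frame
    for _ in range(iters):
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return centres

def verify(test_mfcc, centres, threshold=25.0):
    """Accept the claimed identity if test frames lie close to the enrolled centres."""
    d = np.linalg.norm(test_mfcc[:, None, :] - centres[None, :, :], axis=2)
    return d.min(axis=1).mean() < threshold

# Toy usage with random stand-ins for 13-dimensional MFCC frames.
enroll = np.random.randn(200, 13)
centres = fuzzy_c_means(enroll)
print(verify(np.random.randn(50, 13), centres))
```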


This paper discusses the challenges of, and proposes recommendations for, using a standard speech recognition engine in a small-vocabulary Air Traffic Controller-Pilot communication domain. Given the difficulty of transcribing air traffic communication, due to inherent radio issues in the cockpit and the controller room, gathering a corpus for training the speech recognition model is another important problem. Taking advantage of the maturity of today's speech recognition systems for the standard English words used in the communication, this paper focuses on the challenges of decoding the domain-specific named-entity words used in the communication.
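As one hedged illustration of the named-entity problem, callsigns in controller-pilot phraseology are typically spelled out with the ICAO phonetic alphabet and digit words; a simple post-processing pass over a recognizer's standard-English output could rebuild them, as sketched below. The phrase format, mapping coverage, and function name are assumptions for illustration, not the paper's method.

```python
# Hypothetical post-processing sketch: rebuild a spelled-out callsign from
# standard-English recognizer output ("delta lima four two" -> "DL42").
# Only a fragment of the ICAO alphabet and digit words is mapped here.
ICAO = {
    "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D",
    "echo": "E", "lima": "L", "mike": "M", "tango": "T",
}
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "niner": "9",
}

def rebuild_callsign(recognized_text: str) -> str:
    """Collapse phonetic-alphabet and digit words into a compact entity string."""
    parts = []
    for word in recognized_text.lower().split():
        if word in ICAO:
            parts.append(ICAO[word])
        elif word in DIGITS:
            parts.append(DIGITS[word])
    return "".join(parts)

print(rebuild_callsign("delta lima four two climb flight level three four zero"))
# -> "DL42340" (a real system would also segment the callsign from the level)
```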

