Speech transformation solutions

2006 ◽  
Vol 14 (2) ◽  
pp. 411-442
Author(s):  
Dimitri Kanevsky ◽  
Sara Basson ◽  
Alexander Faisman ◽  
Leonid Rachevsky ◽  
Alex Zlatsin ◽  
...  

This paper outlines the background and development of “intelligent” technologies such as speech recognition. Despite significant progress in the development of these technologies, they still fall short in many areas, and advances in areas such as dictation have actually stalled. In this paper we propose semi-automatic solutions: smart integration of human and machine intelligence. One such technique improves the speech recognition editing interface, thereby reducing the viewer's perception of errors. Other techniques described in the paper are batch enrollment, which reduces the time a user must spend on enrollment, and content spotting, which can be used for applications with repeated content flow, such as movies or museum tours. The paper also suggests a general concept of distributive training of speech recognition systems based on data collection across a network.

Author(s):  
Roger B. Garberg

Phoneme-based automatic speech recognition (ASR) technology enables designers to easily create custom command words or phrases that users can employ to request service operations. In this paper, I report results from two experiments concerning important dimensions of these ASR command vocabularies, including command naturalness/appropriateness and command recallability. Ease of recall is a critical dimension for assessing ASR commands used in multi-step applications, since service subscribers may be engaged in several different cognitive activities that divide attention. Yet techniques for measuring command recallability can be difficult to implement owing to the time required for data collection and analysis. Results of these studies indicate that the dimensions of command “naturalness” and memorability are closely related: under appropriate conditions, the simple procedures associated with measuring command naturalness or appropriateness can predict retrievability of command expressions.


2008 ◽  
Author(s):  
Kristie Nemeth ◽  
Nicole Arbuckle ◽  
Andrea Snead ◽  
Drew Bowers ◽  
Christopher Burneka ◽  
...  

Author(s):  
Conrad Bernath ◽  
Aitor Alvarez ◽  
Haritz Arzelus ◽  
Carlos David Martínez

Author(s):  
Sheng Li ◽  
Dabre Raj ◽  
Xugang Lu ◽  
Peng Shen ◽  
Tatsuya Kawahara ◽  
...  

Procedia CIRP ◽  
2021 ◽  
Vol 97 ◽  
pp. 130-135
Author(s):  
Christian Deuerlein ◽  
Moritz Langer ◽  
Julian Seßner ◽  
Peter Heß ◽  
Jörg Franke

Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 634
Author(s):  
Alakbar Valizada ◽  
Natavan Akhundova ◽  
Samir Rustamov

In this paper, various methodologies of acoustic and language modeling, as well as labeling methods for automatic speech recognition of spoken dialogues in emergency call centers, were investigated and comparatively analyzed. Because dialogue speech in call centers involves specific contexts and noisy, emotional environments, available speech recognition systems show poor performance. Therefore, in order to accurately recognize dialogue speech, the main modules of speech recognition systems (language models and acoustic training methodologies), as well as symmetric data labeling approaches, were investigated and analyzed. To find an effective acoustic model for dialogue data, different types of Gaussian Mixture Model/Hidden Markov Model (GMM/HMM) and Deep Neural Network/Hidden Markov Model (DNN/HMM) methodologies were trained and compared. Additionally, effective language models for dialogue systems were identified using extrinsic and intrinsic evaluation methods. Lastly, our suggested data labeling approaches with spelling correction were compared with common labeling methods and outperformed them by a notable margin. Based on the experimental results, we determined that a DNN/HMM acoustic model, a trigram language model with Kneser–Ney discounting, and spelling correction applied to the data before training (as the labeling method) form an effective configuration for dialogue speech recognition in emergency call centers. It should be noted that this research was conducted with two different types of datasets collected from emergency calls: the Dialogue dataset (27 h), which encapsulates call agents' speech, and the Summary dataset (53 h), which contains voiced summaries of those dialogues describing emergency cases. Although the speech taken from the emergency call center is in the Azerbaijani language, which belongs to the Turkic group of languages, our approaches are not tightly tied to specific language features.
Hence, it is anticipated that the suggested approaches can be applied to other languages of the same group.
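The language model selected in the abstract above, an n-gram model with Kneser–Ney discounting, can be illustrated with a minimal bigram sketch (the trigram version used in the paper adds one more back-off level). The function below is a hypothetical, from-scratch illustration of interpolated Kneser–Ney, not the authors' implementation; the discount value of 0.75 is a common default assumption.

```python
from collections import Counter

def kneser_ney_bigram(tokens, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model from a token list.

    Returns a function prob(u, w) giving P(w | u).
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])                 # c(u), counted as contexts
    continuation = Counter(w for (_, w) in bigrams) # distinct contexts w follows
    followers = Counter(u for (u, _) in bigrams)    # distinct words following u
    total_bigram_types = len(bigrams)

    def prob(u, w):
        # Continuation probability: how "novel-context-friendly" is w?
        p_cont = continuation[w] / total_bigram_types
        if contexts[u] == 0:
            return p_cont  # unseen context: back off entirely
        # Discounted bigram estimate plus back-off mass redistributed via p_cont.
        lam = discount * followers[u] / contexts[u]
        return max(bigrams[(u, w)] - discount, 0) / contexts[u] + lam * p_cont

    return prob
```

For a valid model, the probabilities of all continuations of a seen context sum to one; that property (rather than a match to the paper's numbers, which are not reproduced here) is what a quick check should confirm.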

