Expanding a Large Inclusive Study of Human Listening Rates

2021, Vol 14 (3), pp. 1-26
Author(s):  
Danielle Bragg ◽  
Katharina Reinecke ◽  
Richard E. Ladner

As conversational agents and digital assistants become increasingly pervasive, understanding their synthetic speech becomes increasingly important. Simultaneously, speech synthesis is becoming more sophisticated and manipulable, providing the opportunity to optimize speech rate to save users time. However, little is known about people’s abilities to understand fast speech. In this work, we extend the first large-scale study on human listening rates, enlarging the prior study of 453 participants to 1,409 participants and adding new analyses on this larger group. The study, run on LabintheWild with volunteer participants, was screen-reader accessible and measured listening rate as accuracy in answering questions spoken by a screen reader at various rates. Our results show that people who are visually impaired, who often rely on audio cues and access text aurally, generally have higher listening rates than sighted people. The findings also suggest a need to expand the range of rates available on personal devices. These results demonstrate the potential for users to learn to listen to faster rates, expanding the possibilities for human-conversational agent interaction.
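A minimal sketch of the core measurement loop follows, assuming the offline pyttsx3 engine rather than the authors' LabintheWild setup; the sentence, question, and rate sweep are illustrative, not the study's stimuli:

import pyttsx3

def play_at_rate(text: str, wpm: int) -> None:
    """Speak `text` at roughly `wpm` words per minute."""
    engine = pyttsx3.init()
    engine.setProperty('rate', wpm)  # pyttsx3 rate is in words per minute
    engine.say(text)
    engine.runAndWait()

def trial(sentence: str, question: str, correct: str, wpm: int) -> bool:
    """One listening trial: present audio, collect an answer, score it."""
    play_at_rate(sentence, wpm)
    answer = input(question + ' ')
    return answer.strip().lower() == correct.lower()

if __name__ == '__main__':
    # Sweep a range of rates; a listener's listening rate is the fastest
    # rate at which they still answer accurately.
    for wpm in (200, 300, 400, 500):
        ok = trial('The train leaves at seven.',
                   'When does the train leave?', 'seven', wpm)
        print(f'{wpm} wpm: {"correct" if ok else "incorrect"}')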

Author(s):  
Berthold Crysmann ◽  
Philipp Von Böselager

In this paper, we report on an experiment showing how introducing prosodic information derived from detailed syntactic structures into synthetic speech leads to better disambiguation of structurally ambiguous sentences. Using modifier attachment (MA) ambiguities and subject/object fronting (OF) in German as test cases, we show that prosody automatically generated from deep syntactic information provided by an HPSG generator can produce considerable disambiguation effects, and can even override a strong semantics-driven bias. The architecture used in the experiment, consisting of the LKB generator running a large-scale grammar for German, a syntax-prosody interface module, and the speech synthesis system MARY, is shown to be a valuable platform for testing hypotheses in intonation studies.
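A hedged illustration of the idea, not the paper's actual interface module: a syntactic boundary is encoded as an SSML break and sent to a locally running MARY TTS server (the /process endpoint and its parameters reflect MARY's standard HTTP interface; the German sentence and break placement are invented for demonstration):

import requests

MARY_URL = 'http://localhost:59125/process'  # assumes a local MARY server

def synthesize_ssml(ssml: str, out_path: str, locale: str = 'de') -> None:
    """Request a WAVE rendering of SSML input from MARY TTS."""
    params = {
        'INPUT_TEXT': ssml,
        'INPUT_TYPE': 'SSML',
        'OUTPUT_TYPE': 'AUDIO',
        'AUDIO': 'WAVE',
        'LOCALE': locale,
    }
    resp = requests.get(MARY_URL, params=params)
    resp.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(resp.content)

# High attachment: a boundary before the PP groups it with the verb phrase.
high = ('<speak xml:lang="de">Der Mann sah die Frau'
        '<break strength="medium"/> mit dem Fernglas.</speak>')
# Low attachment: no boundary, so the PP attaches to "die Frau".
low = '<speak xml:lang="de">Der Mann sah die Frau mit dem Fernglas.</speak>'

synthesize_ssml(high, 'high_attachment.wav')
synthesize_ssml(low, 'low_attachment.wav')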


2019, Vol 28 (2S), pp. 875-886
Author(s):  
Jennifer M. Vojtech ◽  
Jacob P. Noordzij ◽  
Gabriel J. Cler ◽  
Cara E. Stepp

Purpose: This study investigated how modulating fundamental frequency (f0) and speech rate differentially impact the naturalness, intelligibility, and communication efficiency of synthetic speech.
Method: Sixteen sentences of varying prosodic content were developed via a speech synthesizer. The f0 contour and speech rate of these sentences were altered to produce 4 stimulus sets: (a) normal rate with a fixed f0 level, (b) slow rate with a fixed f0 level, (c) normal rate with prosodically natural f0 variation, and (d) normal rate with prosodically unnatural f0 variation. Sixteen listeners provided orthographic transcriptions and judgments of naturalness for these stimuli.
Results: Sentences with f0 variation were rated as more natural than those with a fixed f0 level. Conversely, sentences with a fixed f0 level demonstrated higher intelligibility than those with f0 variation. Speech rate did not affect the intelligibility of stimuli with a fixed f0 level. Communication efficiency was highest for sentences produced at a normal rate and a fixed f0 level.
Conclusions: Sentence-level f0 variation increased naturalness ratings of synthesized speech, whether the variation was prosodically natural or not. However, these f0 variations reduced intelligibility. There is evidence of a trade-off between naturalness and intelligibility of synthesized speech, which may impact future speech synthesis designs.
Supplemental Material: https://doi.org/10.23641/asha.8847833
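One plausible operationalization of communication efficiency, sketched below as intelligible words transmitted per second; this combines transcription accuracy and utterance duration, and is an assumption for illustration rather than the authors' exact formula:

def intelligibility(reference: str, transcript: str) -> float:
    """Proportion of reference words reproduced in the transcription."""
    ref = reference.lower().split()
    hyp = set(transcript.lower().split())
    return sum(w in hyp for w in ref) / len(ref)

def communication_efficiency(reference: str, transcript: str,
                             duration_s: float) -> float:
    """Intelligible words per second for one synthesized sentence."""
    n_words = len(reference.split())
    return intelligibility(reference, transcript) * n_words / duration_s

# Example: a slowed sentence may be fully intelligible yet less efficient.
ref = 'the boat drifted slowly past the old harbor wall'
print(communication_efficiency(ref, ref, duration_s=3.0))  # normal rate
print(communication_efficiency(ref, ref, duration_s=4.5))  # slow rate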


2007, Vol 8 (3), pp. 391-410
Author(s):  
Justine Cassell ◽  
Andrea Tartaro

What is the hallmark of success in human–agent interaction? In animation and robotics, many have concentrated on the looks of the agent — whether the appearance is realistic or lifelike. We present an alternative benchmark that lies in the dyad and not the agent alone: Does the agent’s behavior evoke intersubjectivity from the user? That is, in both conscious and unconscious communication, do users react to behaviorally realistic agents in the same way they react to other humans? Do users appear to attribute similar thoughts and actions? We discuss why we distinguish between appearance and behavior, why we use the benchmark of intersubjectivity, our methodology for applying this benchmark to embodied conversational agents (ECAs), and why we believe this benchmark should be applied to human–robot interaction.


2011
Author(s):  
Tsuneo Kato ◽  
Makoto Yamada ◽  
Nobuyuki Nishizawa ◽  
Keiichiro Oura ◽  
Keiichi Tokuda

Author(s):  
Silvio Barra ◽  
Maria De Marsico ◽  
Chiara Galdi

In this chapter, the authors present issues related to automatic face image tagging techniques. The main purpose of these techniques in user applications is to support the organization (indexing) and retrieval (or easy browsing) of images or videos in large collections. Their core modules include algorithms and strategies for handling very large face databases, mostly acquired in real conditions. As background for understanding how automatic face tagging works, an overview of face recognition techniques is given, covering both traditional approaches and recently proposed techniques for face recognition in uncontrolled settings. Some applications and the way they work are also summarized, in order to depict the state of the art in this area of face recognition research. Many of these applications are used to tag faces and to organize photo albums according to the person(s) appearing in annotated photos. This kind of activity has recently expanded from personal devices to social networks, and can also significantly support more demanding tasks, such as the automatic handling of large editorial collections for magazine publishing and archiving. Finally, a number of approaches to large-scale face datasets, as well as some automatic face image tagging techniques, are presented and compared. The authors show that many approaches, in both commercial and research applications, still provide only a semi-automatic solution to this problem.
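A minimal sketch of the semi-automatic tagging loop described above, using the open-source face_recognition library; the file names and the labeled-examples dictionary are illustrative, not from the chapter:

import face_recognition

# Known identities: one encoding per labeled example photo.
labeled = {
    'alice': face_recognition.face_encodings(
        face_recognition.load_image_file('alice_example.jpg'))[0],
    'bob': face_recognition.face_encodings(
        face_recognition.load_image_file('bob_example.jpg'))[0],
}

def tag_photo(path: str, tolerance: float = 0.6) -> list[str]:
    """Propose a name for each detected face, or defer to the user."""
    image = face_recognition.load_image_file(path)
    tags = []
    for encoding in face_recognition.face_encodings(image):
        names = list(labeled)
        matches = face_recognition.compare_faces(
            [labeled[n] for n in names], encoding, tolerance=tolerance)
        matched = [n for n, m in zip(names, matches) if m]
        # Semi-automatic: ambiguous or unknown faces go back to the user.
        tags.append(matched[0] if len(matched) == 1 else 'ask user')
    return tags

print(tag_photo('group_photo.jpg'))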


Author(s):  
Andrej Zgank ◽  
Izidor Mlakar ◽  
Uros Berglez ◽  
Danilo Zimsek ◽  
Matej Borko ◽  
...  

The chapter presents an overview of human-computer interfaces, which are a crucial element of an ambient intelligence solution. The focus is on embodied conversational agents, which are needed to communicate with users in the most natural way possible. Different input and output modalities, together with supporting methods for processing the captured information (e.g., automatic speech recognition, gesture recognition, natural language processing, dialog processing, text-to-speech synthesis), play a crucial role in providing a high quality of experience to the user. As an example, the use of an embodied conversational agent in the e-Health domain is proposed.
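A schematic of the processing chain listed above, with stub components standing in for real ASR, language-understanding, dialog, and TTS engines; all class and method names here are illustrative, not from the chapter:

from dataclasses import dataclass

@dataclass
class DialogState:
    history: list

class SpeechRecognizer:
    def transcribe(self, audio: bytes) -> str:
        return 'how do I take my medication'  # stub for a real ASR engine

class NaturalLanguageUnderstanding:
    def parse(self, text: str) -> dict:
        return {'intent': 'medication_info', 'text': text}

class DialogManager:
    def respond(self, frame: dict, state: DialogState) -> str:
        state.history.append(frame)
        return 'Take one tablet with water in the morning.'

class SpeechSynthesizer:
    def speak(self, text: str) -> None:
        print(f'[ECA says] {text}')  # stub for a real TTS voice and avatar

def handle_turn(audio: bytes, state: DialogState) -> None:
    """One interaction turn: audio in, spoken (and embodied) answer out."""
    asr, nlu = SpeechRecognizer(), NaturalLanguageUnderstanding()
    dm, tts = DialogManager(), SpeechSynthesizer()
    tts.speak(dm.respond(nlu.parse(asr.transcribe(audio)), state))

handle_turn(b'...', DialogState(history=[]))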


2013, Vol 1 (1), pp. 54-67
Author(s):  
Kanu Boku ◽  
Taro Asada ◽  
Yasunari Yoshitomi ◽  
Masayoshi Tabuse

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech synthesis research. Generating emotional synthetic speech requires controlling the prosodic features of the utterances. The authors propose a case-based method for generating emotional synthetic speech by exploiting the characteristics of the maximum amplitude and the utterance time of vowels, and the fundamental frequency, of emotional speech. As an initial investigation, they adopted the utterance of Japanese names, which are semantically neutral. Using the proposed method, emotional synthetic speech derived from the emotional speech of one male subject was discriminated with a mean accuracy of 70% when ten subjects listened to “angry,” “happy,” “neutral,” “sad,” and “surprised” synthetic utterances of the Japanese name “Taro.”
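An illustrative sketch of the control idea follows; the scaling values below are invented for demonstration, whereas the paper derives its targets case by case from recorded emotional speech of one speaker:

# Per-emotion scaling of the three controlled prosodic features:
# peak amplitude, vowel duration, and fundamental frequency (f0).
EMOTION_PROFILES = {
    'angry':     {'amplitude': 1.4, 'duration': 0.8, 'f0': 1.3},
    'happy':     {'amplitude': 1.2, 'duration': 0.9, 'f0': 1.2},
    'neutral':   {'amplitude': 1.0, 'duration': 1.0, 'f0': 1.0},
    'sad':       {'amplitude': 0.7, 'duration': 1.2, 'f0': 0.8},
    'surprised': {'amplitude': 1.3, 'duration': 0.9, 'f0': 1.4},
}

def emotional_targets(neutral_vowels: list[dict], emotion: str) -> list[dict]:
    """Scale neutral per-vowel features toward an emotional rendition."""
    p = EMOTION_PROFILES[emotion]
    return [{'amplitude': v['amplitude'] * p['amplitude'],
             'duration_s': v['duration_s'] * p['duration'],
             'f0_hz': v['f0_hz'] * p['f0']} for v in neutral_vowels]

# Neutral measurements for the two vowels of "Taro" (values illustrative).
taro = [{'amplitude': 0.5, 'duration_s': 0.12, 'f0_hz': 120.0},
        {'amplitude': 0.4, 'duration_s': 0.15, 'f0_hz': 110.0}]
print(emotional_targets(taro, 'angry'))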


Author(s):  
Mahbubur R. Syed ◽  
Shuvro Chakrobartty ◽  
Robert J. Bignall

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech, simulated by a machine in such a way that it sounds as if it were produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system whose input is text and whose output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and proved unable to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis far more feasible.
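A minimal text-to-speech invocation illustrating this input/output contract, with pyttsx3 shown as one offline example engine (the output file name is illustrative):

import pyttsx3

engine = pyttsx3.init()          # pick the platform's default voice
engine.save_to_file('Speech synthesis turns text into audio.',
                    'example.wav')  # text in, synthesized waveform out
engine.runAndWait()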

