Expanding a Large Inclusive Study of Human Listening Rates

2021, Vol 14 (3), pp. 1-26
Author(s):  
Danielle Bragg ◽  
Katharina Reinecke ◽  
Richard E. Ladner

As conversational agents and digital assistants become increasingly pervasive, understanding their synthetic speech becomes increasingly important. Simultaneously, speech synthesis is becoming more sophisticated and manipulable, providing the opportunity to optimize speech rate to save users time. However, little is known about people’s abilities to understand fast speech. In this work, we extend the first large-scale study on human listening rates, enlarging the prior study of 453 participants to 1,409 participants and adding new analyses on this larger group. The study, run on LabintheWild with volunteer participants, was screen-reader accessible and measured listening rate as accuracy in answering questions spoken by a screen reader at various rates. Our results show that people who are visually impaired, who often rely on audio cues and access text aurally, generally have higher listening rates than sighted people. The findings also suggest a need to expand the range of rates available on personal devices. These results demonstrate the potential for users to learn to listen to faster rates, expanding the possibilities for human-conversational agent interaction.
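A minimal sketch of the core measurement loop follows, assuming the offline pyttsx3 engine rather than the authors' LabintheWild setup; the sentence, question, and rate sweep are illustrative, not the study's stimuli:

import pyttsx3

def play_at_rate(text: str, wpm: int) -> None:
    """Speak `text` at roughly `wpm` words per minute."""
    engine = pyttsx3.init()
    engine.setProperty('rate', wpm)  # pyttsx3 rate is in words per minute
    engine.say(text)
    engine.runAndWait()

def trial(sentence: str, question: str, correct: str, wpm: int) -> bool:
    """One listening trial: present audio, collect an answer, score it."""
    play_at_rate(sentence, wpm)
    answer = input(question + ' ')
    return answer.strip().lower() == correct.lower()

if __name__ == '__main__':
    # Sweep a range of rates; a listener's listening rate is the fastest
    # rate at which they still answer accurately.
    for wpm in (200, 300, 400, 500):
        ok = trial('The train leaves at seven.',
                   'When does the train leave?', 'seven', wpm)
        print(f'{wpm} wpm: {"correct" if ok else "incorrect"}')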

Author(s):  
Berthold Crysmann ◽  
Philipp Von Böselager

In this paper, we report on an experiment showing how introducing prosodic information derived from detailed syntactic structures into synthetic speech leads to better disambiguation of structurally ambiguous sentences. Using modifier attachment (MA) ambiguities and subject/object fronting (OF) in German as test cases, we show that prosody automatically generated from deep syntactic information provided by an HPSG generator can produce considerable disambiguation effects, and can even override a strong semantics-driven bias. The architecture used in the experiment, consisting of the LKB generator running a large-scale grammar for German, a syntax-prosody interface module, and the speech synthesis system MARY, is shown to be a valuable platform for testing hypotheses in intonation studies.
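A hedged illustration of the idea, not the paper's actual interface module: a syntactic boundary is encoded as an SSML break and sent to a locally running MARY TTS server (the /process endpoint and its parameters reflect MARY's standard HTTP interface; the German sentence and break placement are invented for demonstration):

import requests

MARY_URL = 'http://localhost:59125/process'  # assumes a local MARY server

def synthesize_ssml(ssml: str, out_path: str, locale: str = 'de') -> None:
    """Request a WAVE rendering of SSML input from MARY TTS."""
    params = {
        'INPUT_TEXT': ssml,
        'INPUT_TYPE': 'SSML',
        'OUTPUT_TYPE': 'AUDIO',
        'AUDIO': 'WAVE',
        'LOCALE': locale,
    }
    resp = requests.get(MARY_URL, params=params)
    resp.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(resp.content)

# High attachment: a boundary before the PP groups it with the verb phrase.
high = ('<speak xml:lang="de">Der Mann sah die Frau'
        '<break strength="medium"/> mit dem Fernglas.</speak>')
# Low attachment: no boundary, so the PP attaches to "die Frau".
low = '<speak xml:lang="de">Der Mann sah die Frau mit dem Fernglas.</speak>'

synthesize_ssml(high, 'high_attachment.wav')
synthesize_ssml(low, 'low_attachment.wav')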


2019, Vol 28 (2S), pp. 875-886
Author(s):  
Jennifer M. Vojtech ◽  
Jacob P. Noordzij ◽  
Gabriel J. Cler ◽  
Cara E. Stepp

Purpose: This study investigated how modulating fundamental frequency (f0) and speech rate differentially impact the naturalness, intelligibility, and communication efficiency of synthetic speech.
Method: Sixteen sentences of varying prosodic content were developed via a speech synthesizer. The f0 contour and speech rate of these sentences were altered to produce 4 stimulus sets: (a) normal rate with a fixed f0 level, (b) slow rate with a fixed f0 level, (c) normal rate with prosodically natural f0 variation, and (d) normal rate with prosodically unnatural f0 variation. Sixteen listeners provided orthographic transcriptions and judgments of naturalness for these stimuli.
Results: Sentences with f0 variation were rated as more natural than those with a fixed f0 level. Conversely, sentences with a fixed f0 level demonstrated higher intelligibility than those with f0 variation. Speech rate did not affect the intelligibility of stimuli with a fixed f0 level. Communication efficiency was highest for sentences produced at a normal rate and a fixed f0 level.
Conclusions: Sentence-level f0 variation increased naturalness ratings of synthesized speech, whether the variation was prosodically natural or not. However, these f0 variations reduced intelligibility. There is evidence of a trade-off between naturalness and intelligibility of synthesized speech, which may impact future speech synthesis designs.
Supplemental Material: https://doi.org/10.23641/asha.8847833
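One plausible operationalization of communication efficiency, sketched below as intelligible words transmitted per second; this combines transcription accuracy and utterance duration, and is an assumption for illustration rather than the authors' exact formula:

def intelligibility(reference: str, transcript: str) -> float:
    """Proportion of reference words reproduced in the transcription."""
    ref = reference.lower().split()
    hyp = set(transcript.lower().split())
    return sum(w in hyp for w in ref) / len(ref)

def communication_efficiency(reference: str, transcript: str,
                             duration_s: float) -> float:
    """Intelligible words per second for one synthesized sentence."""
    n_words = len(reference.split())
    return intelligibility(reference, transcript) * n_words / duration_s

# Example: a slowed sentence may be fully intelligible yet less efficient.
ref = 'the boat drifted slowly past the old harbor wall'
print(communication_efficiency(ref, ref, duration_s=3.0))  # normal rate
print(communication_efficiency(ref, ref, duration_s=4.5))  # slow rate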


2007, Vol 8 (3), pp. 391-410
Author(s):  
Justine Cassell ◽  
Andrea Tartaro

What is the hallmark of success in human–agent interaction? In animation and robotics, many have concentrated on the looks of the agent — whether the appearance is realistic or lifelike. We present an alternative benchmark that lies in the dyad and not the agent alone: Does the agent’s behavior evoke intersubjectivity from the user? That is, in both conscious and unconscious communication, do users react to behaviorally realistic agents in the same way they react to other humans? Do users appear to attribute similar thoughts and actions? We discuss why we distinguish between appearance and behavior, why we use the benchmark of intersubjectivity, our methodology for applying this benchmark to embodied conversational agents (ECAs), and why we believe this benchmark should be applied to human–robot interaction.


2011
Author(s):  
Tsuneo Kato ◽  
Makoto Yamada ◽  
Nobuyuki Nishizawa ◽  
Keiichiro Oura ◽  
Keiichi Tokuda

Author(s):  
Silvio Barra ◽  
Maria De Marsico ◽  
Chiara Galdi

In this chapter, the authors present issues related to automatic face image tagging techniques. The main purpose of these techniques in user applications is to support the organization (indexing) and retrieval (or easy browsing) of images or videos in large collections. Their core modules include algorithms and strategies for handling very large face databases, mostly acquired in real conditions. As background for understanding how automatic face tagging works, an overview of face recognition techniques is given, covering both traditional approaches and recently proposed techniques for face recognition in uncontrolled settings. Some applications and the way they work are also summarized, in order to depict the state of the art in this area of face recognition research. Many of these applications are used to tag faces and to organize photo albums according to the person(s) appearing in annotated photos. This kind of activity has recently expanded from personal devices to social networks, and can also significantly support more demanding tasks, such as the automatic handling of large editorial collections for magazine publishing and archiving. Finally, a number of approaches to large-scale face datasets, as well as some automatic face image tagging techniques, are presented and compared. The authors show that many approaches, in both commercial and research applications, still provide only a semi-automatic solution to this problem.
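A minimal sketch of the semi-automatic tagging loop described above, using the open-source face_recognition library; the file names and the labeled-examples dictionary are illustrative, not from the chapter:

import face_recognition

# Known identities: one encoding per labeled example photo.
labeled = {
    'alice': face_recognition.face_encodings(
        face_recognition.load_image_file('alice_example.jpg'))[0],
    'bob': face_recognition.face_encodings(
        face_recognition.load_image_file('bob_example.jpg'))[0],
}

def tag_photo(path: str, tolerance: float = 0.6) -> list[str]:
    """Propose a name for each detected face, or defer to the user."""
    image = face_recognition.load_image_file(path)
    tags = []
    for encoding in face_recognition.face_encodings(image):
        names = list(labeled)
        matches = face_recognition.compare_faces(
            [labeled[n] for n in names], encoding, tolerance=tolerance)
        matched = [n for n, m in zip(names, matches) if m]
        # Semi-automatic: ambiguous or unknown faces go back to the user.
        tags.append(matched[0] if len(matched) == 1 else 'ask user')
    return tags

print(tag_photo('group_photo.jpg'))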


Author(s):  
Andrej Zgank ◽  
Izidor Mlakar ◽  
Uros Berglez ◽  
Danilo Zimsek ◽  
Matej Borko ◽  
...  

The chapter presents an overview of human-computer interfaces, which are a crucial element of an ambient intelligence solution. The focus is on embodied conversational agents, which are needed to communicate with users in the most natural way possible. Different input and output modalities, together with supporting methods for processing the captured information (e.g., automatic speech recognition, gesture recognition, natural language processing, dialog processing, text-to-speech synthesis), play a crucial role in providing a high quality of experience to the user. As an example, the use of an embodied conversational agent in the e-Health domain is proposed.
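A schematic of the processing chain listed above, with stub components standing in for real ASR, language-understanding, dialog, and TTS engines; all class and method names here are illustrative, not from the chapter:

from dataclasses import dataclass

@dataclass
class DialogState:
    history: list

class SpeechRecognizer:
    def transcribe(self, audio: bytes) -> str:
        return 'how do I take my medication'  # stub for a real ASR engine

class NaturalLanguageUnderstanding:
    def parse(self, text: str) -> dict:
        return {'intent': 'medication_info', 'text': text}

class DialogManager:
    def respond(self, frame: dict, state: DialogState) -> str:
        state.history.append(frame)
        return 'Take one tablet with water in the morning.'

class SpeechSynthesizer:
    def speak(self, text: str) -> None:
        print(f'[ECA says] {text}')  # stub for a real TTS voice and avatar

def handle_turn(audio: bytes, state: DialogState) -> None:
    """One interaction turn: audio in, spoken (and embodied) answer out."""
    asr, nlu = SpeechRecognizer(), NaturalLanguageUnderstanding()
    dm, tts = DialogManager(), SpeechSynthesizer()
    tts.speak(dm.respond(nlu.parse(asr.transcribe(audio)), state))

handle_turn(b'...', DialogState(history=[]))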


2013, Vol 1 (1), pp. 54-67
Author(s):  
Kanu Boku ◽  
Taro Asada ◽  
Yasunari Yoshitomi ◽  
Masayoshi Tabuse

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech synthesis research. Generating emotional synthetic speech requires controlling the prosodic features of the utterances. The authors propose a case-based method for generating emotional synthetic speech by exploiting the characteristics of the maximum amplitude and the utterance time of vowels, and the fundamental frequency, of emotional speech. As an initial investigation, they adopted the utterance of Japanese names, which are semantically neutral. Using the proposed method, emotional synthetic speech derived from the emotional speech of one male subject was discriminated with a mean accuracy of 70% when ten subjects listened to “angry,” “happy,” “neutral,” “sad,” and “surprised” synthetic utterances of the Japanese name “Taro.”
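An illustrative sketch of the control idea follows; the scaling values below are invented for demonstration, whereas the paper derives its targets case by case from recorded emotional speech of one speaker:

# Per-emotion scaling of the three controlled prosodic features:
# peak amplitude, vowel duration, and fundamental frequency (f0).
EMOTION_PROFILES = {
    'angry':     {'amplitude': 1.4, 'duration': 0.8, 'f0': 1.3},
    'happy':     {'amplitude': 1.2, 'duration': 0.9, 'f0': 1.2},
    'neutral':   {'amplitude': 1.0, 'duration': 1.0, 'f0': 1.0},
    'sad':       {'amplitude': 0.7, 'duration': 1.2, 'f0': 0.8},
    'surprised': {'amplitude': 1.3, 'duration': 0.9, 'f0': 1.4},
}

def emotional_targets(neutral_vowels: list[dict], emotion: str) -> list[dict]:
    """Scale neutral per-vowel features toward an emotional rendition."""
    p = EMOTION_PROFILES[emotion]
    return [{'amplitude': v['amplitude'] * p['amplitude'],
             'duration_s': v['duration_s'] * p['duration'],
             'f0_hz': v['f0_hz'] * p['f0']} for v in neutral_vowels]

# Neutral measurements for the two vowels of "Taro" (values illustrative).
taro = [{'amplitude': 0.5, 'duration_s': 0.12, 'f0_hz': 120.0},
        {'amplitude': 0.4, 'duration_s': 0.15, 'f0_hz': 110.0}]
print(emotional_targets(taro, 'angry'))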


Author(s):  
Mahbubur R. Syed ◽  
Shuvro Chakrobartty ◽  
Robert J. Bignall

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech, simulated by a machine in such a way that it sounds as if it were produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system whose input is text and whose output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and proved unable to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis far more feasible.
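A minimal text-to-speech invocation illustrating this input/output contract, with pyttsx3 shown as one offline example engine (the output file name is illustrative):

import pyttsx3

engine = pyttsx3.init()          # pick the platform's default voice
engine.save_to_file('Speech synthesis turns text into audio.',
                    'example.wav')  # text in, synthesized waveform out
engine.runAndWait()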

