Synthetic versus human voices in audiobooks: The human emotional intimacy effect

2021, pp. 146144482110241
Author(s): Emma Rodero, Ignacio Lucas

Human voices narrate most audiobooks, but the rapid development of speech synthesis technology has made artificial voices a viable alternative. This raises the question of whether listeners' cognitive processing is the same when a synthetic voice tells a story as when a human voice does. This research compares listeners' perception, creation of mental images, narrative engagement, physiological response, and recognition of information when listening to stories conveyed by human and synthetic voices. The results showed that listeners enjoyed stories narrated by a human voice more than by a synthetic one. They also created more mental images, were more engaged, paid more attention, had a more positive emotional response, and remembered more information. Although speech synthesis has made considerable progress, significant differences from human voices remain, which makes synthetic voices difficult to use for narrating long stories such as audiobooks.

2021, pp. 030573562110316
Author(s): Elena Saiz-Clar, Miguel Ángel Serrano, José Manuel Reales

The relationship between parameters extracted from musical stimuli and emotional response has traditionally been approached using physical measures extracted from the time or frequency domains. Among time-domain measures, a musical onset is defined as the moment at which any musical instrument or human voice issues a musical note. The sequence of onsets in the performance of a specific musical score creates what is known as the onset curve (OC). How the structure of the OC influences listeners' emotional judgments is not known. To investigate this, we applied principal component analysis to a complete set of variables extracted from the OC in order to capture their statistical structure. We found a three-factor structure related to the activation and valence dimensions of emotional judgment, and this structure was cross-validated using different participants and stimuli. We therefore propose the factorial scores of the OC as a reliable and relevant piece of information for predicting the emotional judgment of music.
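A minimal sketch of this kind of pipeline appears below: summary variables are computed from each excerpt's onset curve, standardized, and reduced with PCA to obtain factor scores. The feature names and toy onset data are illustrative assumptions; the abstract does not list the paper's actual variable set.

```python
# Sketch: extract descriptive variables from an onset curve (OC), then
# reduce them with PCA. Feature choices here are illustrative assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def oc_features(onset_times):
    """Summary statistics of one onset curve (onset times in seconds)."""
    ioi = np.diff(onset_times)               # inter-onset intervals
    return [
        len(onset_times) / onset_times[-1],  # onset density (onsets/s)
        ioi.mean(),                          # mean inter-onset interval
        ioi.std(),                           # temporal irregularity
        np.percentile(ioi, 90) / np.percentile(ioi, 10),  # IOI spread ratio
    ]

# One feature row per musical excerpt (here: random toy onset sequences).
rng = np.random.default_rng(0)
excerpts = [np.cumsum(rng.exponential(0.3, size=rng.integers(50, 200)))
            for _ in range(40)]
X = np.array([oc_features(t) for t in excerpts])

# Standardize, then keep three components, matching the three-factor
# structure reported in the abstract.
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
print(scores.shape)  # (40, 3) -> factor scores usable as predictors
```

In the study's terms, these per-excerpt factor scores would then be entered as predictors of participants' activation and valence ratings.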


Author(s): Scotty D. Craig, Erin K. Chiou, Noah L. Schroeder

The current study investigates whether a virtual human's voice can affect the user's trust when interacting with the virtual human in a learning setting. It was hypothesized that trust is a malleable factor affected by the quality of the virtual human's voice. A randomized alternative-treatments design with a pretest placed participants in one of three conditions: a low-quality text-to-speech (TTS) female voice (Microsoft speech engine), a high-quality TTS female voice (Neospeech voice engine), or a human voice (native female English speaker). All three treatments were paired with the same female virtual human. Assessments included a self-report pretest on knowledge of meteorology, administered before viewing the instructional video, and a measure of system trust. The study found that voice type affects a user's trust ratings, with the human voice yielding higher ratings than the two synthetic voices.


Author(s): Marvin Coto-Jiménez, John Goddard-Close

Recent developments in speech synthesis have produced systems whose output closely resembles natural speech, and researchers now strive to create models that mimic human voices even more accurately. One such development is the incorporation of multiple linguistic styles in various languages and accents. Speech synthesis based on hidden Markov models (HMMs) is of great interest to researchers because it can produce sophisticated features with a small footprint. Despite some progress, its quality has not yet reached the level of the currently predominant unit-selection approaches, which select and concatenate recordings of real speech, and work continues on improving HMM-based systems. In this paper, we present an application of long short-term memory (LSTM) deep neural networks as a postfiltering step in HMM-based speech synthesis. Our motivation stems from the same desire to obtain characteristics closer to those of natural speech. The paper analyzes four types of postfilters obtained using five voices, ranging from a single postfilter that enhances all parameters to a multi-stream proposal that enhances groups of parameters separately. The proposals are evaluated using three objective measures and compared statistically to determine whether the differences between them are significant. The results indicate that HMM-based voices can be enhanced with this approach, especially for the multi-stream postfilters on the considered objective measures.
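The core idea of an LSTM postfilter can be sketched as follows: a sequence model is trained to map HMM-generated acoustic parameter trajectories to the corresponding natural-speech trajectories, so that at synthesis time the HMM output is "cleaned up" frame by frame. This is not the paper's implementation; feature dimensions, network size, and the toy training step are assumptions.

```python
# Illustrative LSTM postfilter: maps HMM-generated acoustic parameter
# frames to aligned natural-speech frames. Dimensions are assumed.
import torch
import torch.nn as nn

class LSTMPostfilter(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.proj(h)                # enhanced parameter trajectories

model = LSTMPostfilter()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy training step: hmm_params would come from the HMM synthesizer,
# natural_params from time-aligned recordings of the same utterances.
hmm_params = torch.randn(8, 200, 40)       # 8 utterances, 200 frames each
natural_params = torch.randn(8, 200, 40)
opt.zero_grad()
loss = loss_fn(model(hmm_params), natural_params)
loss.backward()
opt.step()
```

A multi-stream variant of the kind the abstract describes would train a separate postfilter of this shape per parameter group (e.g., spectral envelope, F0, aperiodicity) rather than one network over the concatenated features.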


2006, Vol 03 (03), pp. 371-391
Author(s): Helmut Prendinger, Christian Becker, Mitsuru Ishizuka

This paper presents a novel method for evaluating the impact of animated interface agents with affective and empathic behavior. While previous studies relied on questionnaires to assess the user's overall experience with the interface agent, we analyze users' physiological responses (skin conductance and electromyography), which allows us to estimate affect-related user experience on a moment-by-moment basis without interfering with the primary interaction task. As an interaction scenario, a card game has been implemented in which the user plays against a virtual opponent. The findings of our study indicate that within a competitive gaming scenario, (i) the absence of any display of negative emotion by the agent is perceived as arousing or stress-inducing, and (ii) the valence of users' emotional response is congruent with the valence of the emotion expressed by the agent. Our results for skin conductance could also be reproduced by assuming a local rather than a global baseline.
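The local-versus-global baseline distinction mentioned at the end can be made concrete with a small sketch: an event-related skin-conductance response is referenced either to the mean of the whole session (global) or to a short window just before each game event (local). The sampling rate, window lengths, and event times below are assumptions for illustration only.

```python
# Sketch: referencing skin-conductance (SC) responses to a global
# (whole-session) vs. a local (pre-event) baseline. Values are toy data.
import numpy as np

fs = 32                                    # SC sampling rate in Hz (assumed)
sc = np.random.default_rng(1).normal(5.0, 0.4, size=fs * 600)  # 10-min signal
event_onsets = [fs * 60, fs * 180, fs * 420]   # game-event samples (assumed)

global_baseline = sc.mean()                # one baseline for the session

for onset in event_onsets:
    response = sc[onset:onset + 5 * fs].mean()        # 5 s post-event window
    local_baseline = sc[onset - 5 * fs:onset].mean()  # 5 s pre-event window
    print(f"event @ {onset / fs:5.0f}s  "
          f"global-ref: {response - global_baseline:+.3f}  "
          f"local-ref: {response - local_baseline:+.3f}")
```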


2018, Vol 48 (2), pp. 297-314
Author(s): Scott Bannister

Musically induced chills, an emotional response accompanied by gooseflesh, shivers and tingling sensations, are an intriguing aesthetic phenomenon. Although chills have been linked to musical features, personality traits and listening contexts, there exists no comprehensive study that surveys the general characteristics of chills, such as emotional qualities. Thus, the present research aimed to develop a broad understanding of the musical chills response, in terms of emotional characteristics, types of music and chill-inducing features, and listening contexts. Participants (N = 375) completed a survey collecting qualitative responses regarding a specific experience of musical chills, with accompanying quantitative ratings of music qualia and underlying mechanisms. Participants could also describe two more "chills pieces". Results indicate that chills are often experienced as a mixed and moving emotional state, and commonly occur in isolated listening contexts. Recurring musical features linked to chills include crescendos, the human voice, lyrics, and concepts such as unity and communion in the music. Findings are discussed in terms of theories regarding musical chills, and implications for future empirical testing of the response.


Author(s): H. Timothy Bunnell, Christopher A. Pennington

The authors review developments in Computer Speech Synthesis (CSS) over the past two decades, focusing on the relative advantages and disadvantages of the two dominant technologies: rule-based synthesis and data-based synthesis. Based on this discussion, they conclude that data-based synthesis is presently the best technology for use in Speech Generating Devices (SGDs) used as communication aids. They examine the benefits associated with data-based synthesis, such as personal voices, greater intelligibility, and improved naturalness; discuss problems that are unique to data-based synthesis systems; and highlight areas where all types of CSS need to be improved for use in assistive devices. Much of this discussion is from the perspective of the ModelTalker project, a data-based CSS system for voice banking that provides practical, affordable personal synthetic voices for people using SGDs to communicate. The authors conclude with consideration of some emerging technologies that may prove promising in future SGDs.


Author(s): Elissa Moses, Kimberly Rose Clark, Norman J. Jacknis

This chapter summarizes the role that artificial intelligence and machine learning (AI/ML) are expected to play at every stage of advertising development, assessment, and execution. Together with advances in neuroscience for measuring attention, cognitive processing, emotional response, and memory, AI/ML have advanced to a point where analytics can be used to identify variables that drive more effective advertising and predict enhanced performance. In addition, the cost of computation has declined, making the platforms needed to apply these tools much less expensive and more accessible. The authors then offer recommendations for 1) understanding the clients/customers and users of the products and services to be advertised, 2) aiding creativity in the process of designing advertisements, 3) testing the impact of advertisements, and 4) identifying the optimum placement of advertisements.


2018, Vol 18 (1), pp. 86-110
Author(s): Brett Maiden

This paper examines the demons Pazuzu and Lamaštu from a cognitive science perspective. As hybrid creatures, these demons have an iconography that combines an array of anthropomorphic and zoomorphic properties and is therefore marked by a high degree of conceptual complexity. In a technical sense, they are what cognitive researchers refer to as radically "counterintuitive" representations. However, highly complex religious concepts are difficult in terms of cognitive processing, memory, and transmission, and, as a result, are prone to being spontaneously simplified in structure. Accordingly, there is reason to expect that the material images of Pazuzu and Lamaštu differed from the corresponding mental images of these demons. Specifically, it is argued here that in ancient cognition and memory, the demons would have been represented in a more cognitively optimal manner. This hypothesis is further supported by a detailed consideration of the full repertoire of iconographic and textual sources.


Author(s): Neasa Ní Chiaráin

Text-to-speech synthesis systems have been under development for several years as part of the ABAIR initiative (www.abair.ie) in the Phonetics and Speech Laboratory at Trinity College Dublin. Synthetic voices are now available in the three major dialects: Munster (female and male), Connacht (male), and Ulster (female). This paper gives an overview of the Irish synthetic voices and focuses on their use in the context of intelligent computer-assisted language learning (iCALL), in particular their use in the development of interactive language-learning platforms that allow the learner to interact personally with the computer, supporting the self-directed learning of Irish. The potential of this technology is demonstrated in the context of a new iCALL platform, An Scéalaí ('the Storyteller'), currently under development.


2014, Vol 536-537, pp. 105-110
Author(s): Ran Ran Chang, Xiao Qing Yu, Ying Ying Yuan, Wang Gen Wan

Speech synthesis is an active research topic in artificial intelligence today, and an urgent difficulty to overcome is how to give machines more "emotional intelligence" for human-computer interaction. Using the STRAIGHT algorithm, this paper extracts the acoustic feature parameters of speech signals, performs statistical analysis, and modifies the characteristic parameters of neutral speech to synthesize emotional speech, including happy, angry, and frustrated renditions. A frame of the spectrum of the synthetic emotional speech is then analyzed for three conditions: standard voices, voices with added noise, and de-noised voices. The experimental results show that the method is feasible and that emotional speech synthesized from de-noised voices is better than that synthesized from voices with added noise.
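The analysis-modify-resynthesize idea can be sketched with the openly available WORLD vocoder (pyworld), which decomposes speech in a way analogous to STRAIGHT. This is a stand-in for the paper's pipeline, not its implementation: the input file name and the F0 scaling factors for a "happy" rendition are illustrative assumptions, not the paper's fitted statistics.

```python
# Sketch: analyze neutral speech, modify prosodic parameters, resynthesize.
# Uses the WORLD vocoder (pyworld) as a stand-in for STRAIGHT; the input
# file and "happy" scaling factors are illustrative assumptions.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("neutral.wav")             # hypothetical mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs)                  # F0 contour
sp = pw.cheaptrick(x, f0, t, fs)           # spectral envelope
ap = pw.d4c(x, f0, t, fs)                  # aperiodicity

# Happy speech tends to have a raised, more variable F0: expand the contour
# around its mean, then raise the overall pitch (factors are assumptions).
voiced = f0 > 0
f0_happy = f0.copy()
f0_happy[voiced] = f0[voiced].mean() + 1.3 * (f0[voiced] - f0[voiced].mean())
f0_happy[voiced] *= 1.15

y = pw.synthesize(f0_happy, sp, ap, fs)
sf.write("happy_synthetic.wav", y, fs)
```

Analogous adjustments to duration and spectral tilt, driven by the statistics of each target emotion, would follow the same pattern.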

