TTS-driven Embodied Conversation Avatar for UMB-SmartTV

Author(s):  
Matej Rojc ◽  
Zdravko Kačič ◽  
Marko Presker ◽  
Izidor Mlakar

When human-TV interaction is performed only by remote controller and mobile devices, it tends to be mechanical, dreary and uninformative. To achieve more advanced interaction that is closer to human-human interaction, we introduce virtual agent technology as a feedback interface. Verbal and co-verbal gestures are linked through complex mental processes, and although they represent different sides of the same mental process, the formulations of both are quite different. Namely, verbal information is bound by rules and grammar, whereas gestures are influenced by emotions, personality etc. In this paper a TTS-driven behavior generation system is proposed as a more advanced interface for smart IPTV platforms. The system is implemented as a distributive non-IPTV service and integrated into UMB-SmartTV in a service-oriented fashion. The behavior generation system fuses speech and gesture production models by using FSMs and HRG structures. Features for selecting the shape and alignment of co-verbal movement are based on linguistic features (which can be extracted from arbitrary input text) and prosodic features (as predicted within several processing steps in the TTS engine). Finally, the generated speech and co-verbal behavior are animated by an embodied conversational agent (ECA) engine and presented to the user within the UMB-SmartTV user interface.

2018 ◽  
Vol 2018 (69) ◽  
pp. 97-128
Author(s):  
Hanna Jaeger ◽  
Anita Junghanns

Abstract. Deaf sign language users often claim to be able to recognise straight away whether their interlocutors are native signers. To date it is unclear, however, what exactly such judgement calls might be based on. The aim of the research presented was to explore whether specific articulatory features are associated with signers who have (allegedly) acquired German Sign Language (Deutsche Gebärdensprache, DGS) as their first language. The study is based on the analysis of qualitative and quantitative data. Qualitative data were generated in ten focus group settings. Each group was made up of three participants and one facilitator. Deaf participants’ meta-linguistic claims concerning linguistic features of ‘native signing’ (i.e. what native signing looks like) were qualitatively analysed using grounded theory methods. Quantitative data were generated via a language assessment experiment designed around stimulus material extracted from DGS corpus data. Participants were asked to judge whether or not individual clips extracted from a DGS corpus had been produced by a native signer. Against the backdrop of the findings identified in the focus group data, the stimulus material was subsequently linguistically analysed in order to identify specific linguistic features that might account for some clips being judged as ‘produced by a native signer’ as opposed to others that were claimed to have been ‘articulated by a non-native signer’. By juxtaposing meta-linguistic perspectives, the results of a language perception experiment and the linguistic analysis of the stimulus material, the study brings to the fore specific crystallisation points of linguistic and social features indexing linguistic authenticity.
The findings break new ground in that they suggest that the face as articulator in general, and micro-prosodic features expressed in the movement of eyes, eyebrows and mouth in particular, play a significant role in the perception of others as (non-)native signers.


2017 ◽  
Vol 3 (1) ◽  
pp. 67-74 ◽  
Author(s):  
Elena Velikaya

‘Discourse is the way that language – either spoken or written – is used for communicative effect in a real-world situation’ (Thornbury, 2005, p. 7). Thornbury considers the text as the product and the discourse as a communicative process that involves ‘language and the record of the language that is used in this discourse, which is “text”’ (ibid.). Although presentations are generally categorized as spoken text types, an academic presentation is a compromise between spoken and written text types: on the one hand, it is given in a classroom as an oral text; on the other hand, it is thoroughly prepared as a home assignment in the form of a written text. This article focuses on the analysis of such linguistic features of students’ presentations as cohesion, coherence, and prosody. For this analysis, data were collected from presentations on various economic topics given by 60 second-year students of the International College of Economics and Finance (ICEF); the presentations were recorded and examined (the time limit for each presentation was 10 minutes). Of the 60, 10 presentation texts were selected for auditory analysis, and thematic centers (TCs) were examined using acoustic analysis. Measurements of prosodic parameters such as pitch, intensity, and duration (rate of utterance) were obtained using the computer programs Speech Analyzer 3.0.1 and Praat (v.4.0.53). The results of these analyses show that students’ presentations are cohesive, coherent and contain TCs, which are characterized by specific prosodic parameters that have a certain effect on the comprehension of these texts, their expressiveness and pragmatic value.


2015 ◽  
Vol 12 (2) ◽  
pp. 29-52
Author(s):  
Smiljana Komar

Direct response television commercials (DRTV) exhibit a very specific style of speech and delivery whose main function is to boost the product’s value and sales. This paper presents the findings of the structural and the linguistic analyses of three English DRTV short form spots as seen on Highstreet TV. The emphasis is on the verbal strategies used by advertisers to get the consumers’ attention, develop their interest and desire to own the product and to convince them to purchase it. These strategies include different lexical, syntactic and prosodic features. The structural analysis focuses mainly on non-verbal strategies of broadcasting advertisements whose purpose is to inspire interest and credibility in potential consumers.


2020 ◽  
Vol 34 (05) ◽  
pp. 8228-8235
Author(s):  
Naihan Li ◽  
Yanqing Liu ◽  
Yu Wu ◽  
Shujie Liu ◽  
Sheng Zhao ◽  
...  

Recently, neural network based speech synthesis has achieved outstanding results, producing synthesized audio of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue, which results in abnormal audio (bad cases), especially for unusual text (unseen context). To build a neural model that can synthesize both natural and stable audio, in this paper we make a deep analysis of why previous neural TTS models are not robust, based on which we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared to TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. In addition, the position embedding is replaced with a 1-D CNN, since the position embedding constrains the maximum length of synthesized audio. With these modifications, our model not only fixes the robustness problem, but also achieves a MOS (4.36) on par with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.
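
The abstract does not spell out how the duration-based hard attention works, so the following is only a minimal sketch of the general idea, assuming each output frame attends to exactly one input position and that the number of frames per position is given by a predicted duration. The function names `hard_alignment` and `attend` are hypothetical and are not taken from RobuTrans.

```python
from typing import List, Sequence


def hard_alignment(durations: Sequence[int]) -> List[int]:
    """Map every output frame to the single input position it attends to.

    Assumption: durations[i] is the number of output frames assigned to
    input position i (e.g. a phoneme), as predicted by some duration model.
    """
    alignment: List[int] = []
    for position, frames in enumerate(durations):
        if frames < 0:
            raise ValueError("durations must be non-negative")
        alignment.extend([position] * frames)
    return alignment


def attend(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Return one context vector per output frame.

    With hard attention the context for a frame is simply the encoder
    state of the position that frame is aligned to (a one-hot weighting),
    rather than a learned soft mixture over all positions.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")
    return [list(encoder_states[i]) for i in hard_alignment(durations)]


if __name__ == "__main__":
    # Three toy encoder states with illustrative durations of 2, 1 and 3 frames.
    states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(hard_alignment([2, 1, 3]))   # [0, 0, 1, 2, 2, 2]
    print(len(attend(states, [2, 1, 3])))  # 6
```

Because the alignment is fixed by the durations, every input position is covered exactly once and in order, which is one plausible reason such a scheme avoids the skipped or repeated words that soft attention can produce.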


2015 ◽  
Vol 3 (2) ◽  
pp. 13-23 ◽  
Author(s):  
Hiroyasu Horiuchi ◽  
Sachio Saiki ◽  
Shinsuke Matsumoto ◽  
Masahide Namamura

In order to achieve intuitive and easy operation of a home network system (HNS), the authors have previously proposed a user interface with a virtual agent (called the HNS virtual agent user interface, HNS-VAUI). The HNS-VAUI was implemented with the MMDAgent toolkit. A user can operate appliances and services interactively through dialog with a virtual agent on a screen. However, the previous prototype depends heavily on MMDAgent, which causes tight coupling between HNS operations and agent behaviors, and limits its ability to use external information. To cope with this problem, this paper proposes a service-oriented framework that allows the HNS-VAUI to provide richer interaction. Specifically, the authors decompose the tightly coupled system into two separate services: the MMC service and the MSM service. The MMC service concentrates on controlling detailed behaviors of a virtual agent, whereas the MSM service defines the logic of HNS operations and dialog with the agent using richer state machines. The two services are loosely coupled to enable more flexible and sophisticated dialog in the HNS-VAUI. The proposed framework is implemented in a real HNS environment. The authors also conduct a case study with practical service scenarios to demonstrate the effectiveness of the proposed framework.


2009 ◽  
Vol 364 (1535) ◽  
pp. 3539-3548 ◽  
Author(s):  
Catherine Pelachaud

Over the past few years we have been developing an expressive embodied conversational agent system. In particular, we have developed a model of multimodal behaviours that includes dynamism and complex facial expressions. The first feature refers to the qualitative execution of behaviours. Our model is based on perceptual studies and encompasses several parameters that modulate multimodal behaviours. The second feature, the model of complex expressions, follows a componential approach where a new expression is obtained by combining facial areas of other expressions. Lately we have been working on adding temporal dynamism to expressions. So far they have been designed statically, typically at their apex. Only full-blown expressions could be modelled. To overcome this limitation, we have defined a representation scheme that describes the temporal evolution of the expression of an emotion. It is no longer represented by a static definition but by a temporally ordered sequence of multimodal signals.


2014 ◽  
Vol 7 ◽  
pp. 15-45
Author(s):  
Kathryn Brenner ◽  
Kerry Burns ◽  
Jennifer D Ewald

Underrepresented in sport discourse literature, the usually private interactions among television viewers provided the context for this research. The present study built directly on previous findings regarding TV viewer interaction, sport discourse, and speakers’ multiple identities by analyzing the linguistic features of interactions among four male family members while watching televised football in their home. Participants used prosodic features to frame utterances while taking on the voice of fan, coach, or commentator and talking to, for, or about the TV. In general, these viewers talked ‘to’ the TV as fans and coaches, ‘for’ the TV as commentators, and ‘about’ the TV in all three roles. The findings are of potential interest to researchers as well as marketing and advertising companies.


Author(s):  
Yusuke Takahashi ◽  
Ichiro Kobayashi ◽  

We present a text generator that uses resources based on Systemic Functional Linguistics (SFL). The resources are compiled in a database called the Semiotic Base, which deals with language in context. In contrast to previous SFL-based text generation systems, our comprehensive proposal contains the Context Base, which deals with the context surrounding text, and covers all strata, from context to expression. Its text generation process maximizes the use of the Semiotic Base resources, i.e., system networks dealing with linguistic features. Our text generation system is resource-driven, draws heavily on information provided by the Semiotic Base, and minimizes information input.


2018 ◽  
Vol 8 (2) ◽  
pp. 100-111 ◽  
Author(s):  
Maik Friedrich ◽  
Christoph Möhlenbrink

Abstract. Owing to the different approaches for remote tower operation, a standardized set of indicators is needed to evaluate the technical implementations at a task performance level. One of the most influential factors for air traffic control is weather. This article describes the influence of weather metrics on remote tower operations and how to validate them against each other. Weather metrics are essential to the evaluation of different remote controller working positions. Therefore, weather metrics were identified as part of a validation at the Erfurt-Weimar Airport. Air traffic control officers observed weather events at the tower control working position and the remote control working position. The eight participating air traffic control officers answered time-synchronized questionnaires at both workplaces. The questionnaires addressed operationally relevant weather events in the aerodrome. The validation experiment targeted the air traffic control officer’s ability to categorize and judge the same weather event at different workplaces. The results show the potential of standardized indicators for the evaluation of performance and the importance of weather metrics in relation to other evaluation metrics.

