Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis

Action localization in untrimmed videos is an important topic in the field of video understanding. However, existing action localization methods are restricted to a pre-defined set of actions and cannot localize unseen activities. Thus, we consider a new task to localize unseen activities in videos via image queries, named Image-Based Activity Localization. This task faces three inherent challenges: (1) how to eliminate the influence of semantically inessential contents in image queries; (2) how to deal with the fuzzy localization of inaccurate image queries; (3) how to determine the precise boundaries of target segments. We then propose a novel self-attention interaction localizer to retrieve unseen activities in an end-to-end fashion. Specifically, we first devise a region self-attention method with relative position encoding to learn fine-grained image region representations. Then, we employ a local transformer encoder to build multi-step fusion and reasoning of image and video contents. We next adopt an order-sensitive localizer to directly retrieve the target segment. Furthermore, we construct a new dataset ActivityIBAL by reorganizing the ActivityNet dataset. The extensive experiments show the effectiveness of our method.
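The region self-attention step with relative position encoding can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: regions are treated as a 1-D ordered sequence, queries, keys, and values use identity projections, and the relative position encoding is reduced to a scalar bias per offset (the hypothetical `rel_bias` table).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def region_self_attention(regions, rel_bias):
    """Scaled dot-product self-attention over image-region features.

    regions:  list of n feature vectors (lists of floats), one per region.
    rel_bias: dict mapping a relative offset (j - i) to a scalar added to
              the attention score. This is an assumed, simplified stand-in
              for the relative position encoding named in the abstract.
    """
    n = len(regions)
    d = len(regions[0])
    out = []
    for i in range(n):
        # Content score (dot product scaled by sqrt(d)) plus position bias.
        scores = [
            sum(a * b for a, b in zip(regions[i], regions[j])) / math.sqrt(d)
            + rel_bias.get(j - i, 0.0)
            for j in range(n)
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all region features.
        out.append(
            [sum(weights[j] * regions[j][k] for j in range(n)) for k in range(d)]
        )
    return out


# Toy usage: two regions with 2-D features and no position bias.
attended = region_self_attention([[1.0, 0.0], [0.0, 1.0]], rel_bias={})
```

A learned model would replace the identity projections with trained weight matrices and learn the bias table; the sketch only shows where a relative position term enters the attention score.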

End-to-End Speech Synthesis for Tibetan Multidialect

Complexity ◽

10.1155/2021/6682871 ◽

2021 ◽

Vol 2021 ◽

pp. 1-8

Author(s):

Xiaona Xu ◽

Li Yang ◽

Yue Zhao ◽

Hui Wang

Keyword(s):

Speech Synthesis ◽

Text Processing ◽

Synthesis System ◽

Sequence Structure ◽

Training Corpus ◽

Text Annotation ◽

Front End ◽

Cyclic Sequence ◽

End To End ◽

Tibetan Dialects

Research on Tibetan speech synthesis has mainly focused on a single dialect, so multidialect Tibetan speech synthesis remains understudied. This paper presents an end-to-end Tibetan multidialect speech synthesis model that can synthesize different Tibetan dialects. First, the Wylie transliteration scheme is used to convert Tibetan text into the corresponding Latin letters, which effectively reduces the size of the training corpus and the workload of front-end text processing. Second, a shared feature prediction network with a cyclic sequence-to-sequence structure is built; it maps the Latin transliteration vectors of Tibetan characters to Mel spectrograms and learns the relevant features of multidialect speech data. Third, two dialect-specific WaveNet vocoders are combined with the feature prediction network to synthesize time-domain waveforms from the Mel spectrograms of the Lhasa-Ü-Tsang and Amdo pastoral dialects, respectively. The model avoids relying on extensive Tibetan dialect expertise for time-consuming tasks such as phonetic analysis and phonological annotation. Additionally, it can directly synthesize Lhasa-Ü-Tsang and Amdo pastoral speech from existing text annotations. Experimental results show that speech synthesized by the proposed method for the Lhasa-Ü-Tsang and Amdo pastoral dialects has better clarity and naturalness than that of the Tibetan monolingual model.
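The three-stage pipeline described above (Latin transliteration, shared feature prediction, dialect-specific vocoder) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `<pad>`/`<eos>` tokens, the dialect keys, and the stub model callables are hypothetical and do not come from the paper, and the input is assumed to be text that has already been converted to Wylie transliteration.

```python
def build_vocab(transliterated_lines):
    """Map every Latin character seen in the corpus to an integer ID."""
    chars = sorted({ch for line in transliterated_lines for ch in line})
    # IDs 0 and 1 are reserved (assumed convention) for padding and
    # end-of-sequence markers.
    vocab = {"<pad>": 0, "<eos>": 1}
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab


def encode(line, vocab):
    """Turn a Wylie-transliterated line into a list of character IDs."""
    return [vocab[ch] for ch in line] + [vocab["<eos>"]]


def synthesize(line, dialect, vocab, predict_mel, vocoders):
    """Run the shared predictor, then the vocoder for the chosen dialect.

    predict_mel: callable, character IDs -> Mel frames (shared by dialects).
    vocoders:    dict, dialect name -> callable, Mel frames -> waveform.
    """
    ids = encode(line, vocab)
    mel = predict_mel(ids)
    return vocoders[dialect](mel)


# Toy usage with stub models standing in for the trained networks.
vocab = build_vocab(["bod skad"])  # Wylie transliteration of a Tibetan phrase
stub_predictor = lambda ids: [[float(i)] for i in ids]
stub_vocoders = {
    "lhasa": lambda mel: ("lhasa", len(mel)),
    "amdo": lambda mel: ("amdo", len(mel)),
}
result = synthesize("bod", "amdo", vocab, stub_predictor, stub_vocoders)
```

Because the transliteration uses a small Latin alphabet, the character vocabulary stays compact, which is consistent with the abstract's point about a reduced front-end workload. Only the final vocoder lookup differs between dialects in this sketch.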

End-to-End Network Slices: From Network Function Profiles to Fine-Grained SLAs

10.5753/sbrc_estendido.2019.7790 ◽

2019 ◽

Author(s):

Raphael V. Rosa ◽

Christian Esteve Rothenberg

Keyword(s):

Augmented Reality ◽

State Of The Art ◽

Vehicular Communications ◽

Automated Extraction ◽

Network Slicing ◽

Fine Grained ◽

Network Function ◽

End To End ◽

Future Work ◽

Administrative Domains

Towards end-to-end network slicing, diverse envisioned 5G services (e.g., augmented reality, vehicular communications, IoT) call for advanced multi-administrative domain service deployments, raising open challenges from vertical Service Level Agreement (SLA)-based orchestration hazards. Through different proposed methodologies and demonstrated prototypes, this work showcases: the automated extraction of network function profiles; ways to analyze how such profiles compose programmable network slice footprints; and the means to perform fine-grained auditable SLAs for end-to-end network slicing among multiple administrative domains. Grounded in state-of-the-art networking concepts, this work roots its contributions in standardization efforts and best-of-breed open source embodiments, each of which points to prominent future work topics through its shortcomings.
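As a rough illustration of how network function profiles might compose into a slice footprint that is then audited against an SLA, consider the sketch below. The abstract does not specify metrics or composition rules, so the `FunctionProfile` fields, the additive latency, the bottleneck throughput, and the SLA keys are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionProfile:
    """Hypothetical profile extracted for one network function."""
    name: str
    latency_ms: float
    throughput_mbps: float


def slice_footprint(chain):
    """Compose per-function profiles into an end-to-end footprint.

    Assumed rules: latencies add up along the chain, and throughput is
    limited by the slowest function.
    """
    return {
        "latency_ms": sum(p.latency_ms for p in chain),
        "throughput_mbps": min(p.throughput_mbps for p in chain),
    }


def audit_sla(footprint, sla):
    """Return the names of the metrics that violate the SLA bounds."""
    violations = []
    if footprint["latency_ms"] > sla["max_latency_ms"]:
        violations.append("latency_ms")
    if footprint["throughput_mbps"] < sla["min_throughput_mbps"]:
        violations.append("throughput_mbps")
    return violations


# Toy usage: a two-function chain checked against an illustrative SLA.
chain = [
    FunctionProfile("firewall", latency_ms=2.0, throughput_mbps=900.0),
    FunctionProfile("nat", latency_ms=3.0, throughput_mbps=500.0),
]
footprint = slice_footprint(chain)
violations = audit_sla(
    footprint, {"max_latency_ms": 4.0, "min_throughput_mbps": 400.0}
)
```

Reporting violations per metric, rather than a single pass/fail flag, is one simple way to make the check fine-grained; in a multi-domain setting each administrative domain could contribute its own segment of the chain.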