Public Perceptions Towards Synthetic Voice Technology

Author(s):  
Ben Noah ◽  
Arathi Sethumadhavan ◽  
Josh Lovejoy ◽  
David Mondello

Text-to-Speech (TTS) technologies have provided ways to produce acoustic approximations of human voices. However, recent advancements in machine learning (i.e., neural network TTS) have helped move beyond coarse mimicry and towards more natural-sounding speech. With only a small collection of recorded utterances, it is now possible to generate wholly synthetic voices indistinguishable from those of human speakers. While these new approaches to speech synthesis can help facilitate more seamless experiences with artificial agents, they also lower the barrier to entry for those seeking to perpetrate deception. As such, in the development of these technologies, it is important to anticipate potential harms and devise strategies to help mitigate misuse. This paper presents findings from a 360-person survey that assessed public perceptions of synthetic voices, with a particular focus on how voice type and social scenarios impact ratings of trust. Findings have implications for the responsible deployment of synthetic speech technologies.

Author(s):  
Mahbubur R. Syed ◽  
Shuvro Chakrobartty ◽  
Robert J. Bignall

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech simulated by a machine in such a way that it sounds as if it was produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.


Author(s):  
Vo Quang Dieu Ha ◽  
Nguyen Manh Tuan ◽  
Cao Xuan Nam ◽  
Pham Minh Nhut ◽  
Vu Hai Quan

This paper presents a complete specification of the Vietnamese speech synthesis system named VOS (Voice of Southern Vietnam). Because current Vietnamese text-to-speech systems lack naturalness in their synthetic speech output, VOS is based on the unit selection approach, which aims to achieve maximum naturalness. VOS consists of three main parts: a corpus manager, a synthesizer, and a transliteration model. The corpus manager handles automated speech indexing and segmentation for the unit selection executed by the synthesizer, while the transliteration model deals with the pronunciation of words in foreign languages. A comparative experimental evaluation of VnSpeech, VietVoice, and VOS is conducted using the ITU-T P.85 standard. Results show that VOS outperforms the other two TTS systems.
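
The unit selection approach named in the abstract above is commonly framed as a search for the sequence of corpus units that minimizes a sum of target costs (how well a unit fits its position) and join costs (how smoothly adjacent units concatenate). The following is a minimal sketch of that general idea, not the VOS implementation: the `select_units` function, the unit fields (`id`, `pitch`), both cost functions, and the toy corpus are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Unit = Dict[str, float]  # hypothetical unit record, e.g. {"id": 0, "pitch": 120.0}


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Viterbi search: pick one candidate per position, minimizing total cost."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate unit")

    # best[j] = (cost of the cheapest path ending in candidate j, that path's indices)
    best = [(target_cost(0, u), [j]) for j, u in enumerate(candidates[0])]

    for pos in range(1, len(candidates)):
        new_best = []
        for j, unit in enumerate(candidates[pos]):
            # Cheapest predecessor once the join cost to this unit is included.
            cost, path = min(
                (
                    (prev_cost + join_cost(candidates[pos - 1][prev_path[-1]], unit), prev_path)
                    for prev_cost, prev_path in best
                ),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(pos, unit), path + [j]))
        best = new_best

    total, path = min(best, key=lambda item: item[0])
    return [candidates[i][j] for i, j in enumerate(path)], total


# Toy data (all values hypothetical): two positions with pitch targets 100 and 110.
targets = [100.0, 110.0]
corpus = [
    [{"id": 0, "pitch": 100.0}, {"id": 1, "pitch": 130.0}],
    [{"id": 2, "pitch": 150.0}, {"id": 3, "pitch": 112.0}],
]
units, cost = select_units(
    corpus,
    target_cost=lambda pos, u: abs(u["pitch"] - targets[pos]),
    join_cost=lambda a, b: 0.1 * abs(a["pitch"] - b["pitch"]),
)
print([u["id"] for u in units], round(cost, 2))
```

In this toy case the search keeps units 0 and 3, since they match the pitch targets closely and join with only a small pitch jump. A real system would use richer acoustic and linguistic features in both costs.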


2018 ◽  
Vol 4 (59) ◽  
pp. 192
Author(s):  
Milan Sečujski ◽  
Stevan Ostrogonac ◽  
Siniša Suzić ◽  
Darko Pekar

2016 ◽  
Vol 13 (10) ◽  
pp. 7538-7544
Author(s):  
T Jayasankar ◽  
J. Arputha Vijayaselvi

This paper focuses on a Feed Forward Neural Network (FFNN) model for unrestricted prosody prediction, specifically of syllable duration patterns, in a speech synthesis system. Estimating the prosodic parameter of segmental duration plays a vital part in unrestricted concatenative Text-To-Speech (TTS) synthesis that is capable of producing natural-sounding speech with improved quality. Common features used to train the neural network, including syllable position in the phrase, syllable context, syllable position in the word, syllable nucleus, and syllable identity, are extracted from the text. The Back-Propagation Neural Network (BPNN) algorithm is one of the most widely used and popular techniques for optimizing feed-forward neural network training in prosody prediction. To enhance the accuracy of syllable duration prediction with the back-propagation network, a Cuckoo Search algorithm is proposed to find the neural network structure with the fewest weights without compromising on prediction error. Speech data is used to test the duration prediction performance of the proposed SOCNN, and the results demonstrate a marked improvement over basic BP. System performance is shown by synthesizing natural-sounding speech for Tamil, a language of India.


2020 ◽  
Vol 34 (05) ◽  
pp. 8228-8235
Author(s):  
Naihan Li ◽  
Yanqing Liu ◽  
Yu Wu ◽  
Shujie Liu ◽  
Sheng Zhao ◽  
...  

Recently, neural network based speech synthesis has achieved outstanding results, producing synthesized audio of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue, which results in abnormal audio (bad cases), especially for unusual text (unseen context). To build a neural model that can synthesize audio that is both natural and stable, in this paper we analyze in depth why previous neural TTS models are not robust, and on that basis we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared with TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, and then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. In addition, the position embedding, which constrains the maximum length of synthesized audio, is replaced with a 1-D CNN. With these modifications, our model not only fixes the robustness problem but also achieves a MOS (4.36) on par with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.
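
The duration-based hard attention mentioned in the abstract above can be illustrated in a simplified form: if each input position has a predicted duration in output frames, every output frame attends to exactly one input position, so the alignment cannot skip or repeat text. The sketch below shows only that general idea and is not the RobuTrans mechanism itself; the function names, the encoder states, and the durations are hypothetical.

```python
from typing import List, Sequence


def hard_alignment(durations: Sequence[int]) -> List[int]:
    """Map each output frame to the index of the single input it attends to."""
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    alignment: List[int] = []
    for index, frames in enumerate(durations):
        # Input `index` is attended to for `frames` consecutive output frames.
        alignment.extend([index] * frames)
    return alignment


def attend(
    encoder_states: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Hard attention: each frame's context is a copy of one encoder state."""
    if len(encoder_states) != len(durations):
        raise ValueError("need one duration per encoder state")
    return [list(encoder_states[i]) for i in hard_alignment(durations)]


states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical encoder outputs
frames_per_input = [2, 1, 3]                    # hypothetical predicted durations
contexts = attend(states, frames_per_input)
print(hard_alignment(frames_per_input))
print(len(contexts))
```

Because the alignment is monotonic and covers every input exactly as long as its duration says, the number of output frames equals the sum of the durations, which is one intuition for why duration-driven alignment tends to be more stable than learned soft attention.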


Author(s):  
Prof. Mrunalinee Patole ◽  
Akhilesh Pandey ◽  
Kaustubh Bhagwat ◽  
Mukesh Vaishnav ◽  
Salikram Chadar

Text to Speech (TTS) is a form of speech synthesis in which text is converted into spoken, human-like voice output. State-of-the-art methods for TTS employ a neural network based approach. This work aims to examine some of the problems and limitations present in current works, especially Tacotron-2, and attempts to further improve its performance by modifying its architecture. Many papers published on these topics to date present various TTS systems by developing new TTS products. The aim is to examine different text-to-speech systems. Compared to other text-to-speech systems, Tacotron-2 has multiple advantages. In alternative algorithms such as CNN and Fast-CNN, the algorithm may not look at the image fully, but in YOLO the algorithm looks at the image completely by predicting bounding boxes using a convolutional network, along with the probabilities for those boxes, and detects the image faster than alternative algorithms.


Author(s):  
Amelia J. Gully ◽  
Takenori Yoshimura ◽  
Damian T. Murphy ◽  
Kei Hashimoto ◽  
Yoshihiko Nankaku ◽  
...  
