Public Perceptions Towards Synthetic Voice Technology

Text-to-Speech (TTS) technologies have provided ways to produce acoustic approximations of human voices. However, recent advancements in machine learning (i.e., neural network TTS) have helped move beyond coarse mimicry and towards more natural-sounding speech. With only a small collection of recorded utterances, it is now possible to generate wholly synthetic voices indistinguishable from those of human speakers. While these new approaches to speech synthesis can help facilitate more seamless experiences with artificial agents, they also lower the barrier to entry for those seeking to perpetrate deception. As such, in the development of these technologies, it is important to anticipate potential harms and devise strategies to mitigate misuse. This paper presents findings from a 360-person survey that assessed public perceptions of synthetic voices, with a particular focus on how voice type and social scenarios impact ratings of trust. Findings have implications for the responsible deployment of synthetic speech technologies.

Text-to-Speech Synthesis

Encyclopedia of Multimedia Technology and Networking ◽

10.4018/978-1-59140-561-0.ch135 ◽

2011 ◽

pp. 957-963

Author(s):

Mahbubur R. Syed ◽

Shuvro Chakrobartty ◽

Robert J. Bignall

Keyword(s):

Speech Production ◽

Speech Synthesis ◽

Synthetic Speech ◽

Practical Application ◽

Text To Speech ◽

Synthesis System ◽

System A ◽

Vocal System ◽

Text To Speech Synthesis ◽

Computer Based

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech simulated by a machine in such a way that it sounds as if it were produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.

Syllable based text to speech synthesis system using auto associative neural network prosody prediction

International Journal of Speech Technology ◽

10.1007/s10772-013-9210-8 ◽

2013 ◽

Vol 17 (2) ◽

pp. 91-98 ◽

Cited By ~ 1

Author(s):

Sudhakar Sangeetha ◽

Sekar Jothilakshmi

Keyword(s):

Neural Network ◽

Speech Synthesis ◽

Text To Speech ◽

Synthesis System ◽

Text To Speech Synthesis ◽

Auto Associative Neural Network ◽

Prosody Prediction

VOS: the Corpus-Based Vietnamese Text-to-Speech System

Research and Development on Information and Communication Technology ◽

10.32913/mic-ict-research.v3.n7.285 ◽

2010 ◽

Author(s):

Vo Quang Dieu Ha ◽

Nguyen Manh Tuan ◽

Cao Xuan Nam ◽

Pham Minh Nhut ◽

Vu Hai Quan

Keyword(s):

Experimental Evaluation ◽

Speech Synthesis ◽

Foreign Languages ◽

Synthetic Speech ◽

Text To Speech ◽

Synthesis System ◽

Unit Selection ◽

Southern Vietnam ◽

Selection Approach ◽

Complete Specification

This paper presents a complete specification of the Vietnamese speech synthesis system named VOS (Voice of Southern Vietnam). Because current Vietnamese text-to-speech systems produce synthetic speech that lacks naturalness, VOS is based on the unit selection approach, which aims to achieve maximum naturalness. VOS consists of three main parts: a corpus manager, a synthesizer, and a transliteration model. The corpus manager handles automated speech indexing and segmentation for the unit selection executed by the synthesizer, while the transliteration model deals with the pronunciation of words in foreign languages. A comparative experimental evaluation of VnSpeech, VietVoice, and VOS is conducted using the ITU-T P.85 standard. Results show that VOS outperforms the other two TTS systems.
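
The unit selection approach named in this abstract is commonly framed as a shortest-path search over candidate speech units. The following is a minimal sketch of that general idea, not the VOS implementation: the function name `select_units` and the `target_cost` and `join_cost` callables are hypothetical, and the abstract does not say which costs VOS actually uses.

```python
from typing import Callable, List, Sequence


def select_units(
    candidates: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    join_cost: Callable[[str, str], float],
) -> List[str]:
    """Pick one unit per position, minimising total target + join cost.

    candidates[i] lists the (non-empty, distinct) unit ids available for
    position i. Both cost functions are assumptions made for illustration.
    """
    if not candidates:
        return []
    # best[i][u] = (cheapest cost of any path ending in unit u, predecessor)
    best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            # Cheapest way to reach u from any unit at the previous position.
            prev, cost = min(
                ((p, c + join_cost(p, u)) for p, (c, _) in best[-1].items()),
                key=lambda pc: pc[1],
            )
            layer[u] = (cost + target_cost(i, u), prev)
        best.append(layer)
    # Backtrack from the cheapest final unit.
    unit = min(best[-1], key=lambda u: best[-1][u][0])
    path = [unit]
    for i in range(len(candidates) - 1, 0, -1):
        unit = best[i][unit][1]
        path.append(unit)
    return path[::-1]
```

In a real system the target cost would compare a unit's context with the requested one, and the join cost would measure the acoustic mismatch at the concatenation point; both are left abstract here.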

Learning Prosodic Stress from Data in Neural Network based Text-to-Speech Synthesis

SPIIRAS Proceedings ◽

10.15622/sp.59.8 ◽

2018 ◽

Vol 4 (59) ◽

pp. 192

Author(s):

Milan Sečujski ◽

Stevan Ostrogonac ◽

Siniša Suzić ◽

Darko Pekar

Keyword(s):

Neural Network ◽

Speech Synthesis ◽

Text To Speech ◽

Text To Speech Synthesis

Prediction of Syllable Duration Using Structure Optimised Cuckoo Search Neural Network (SOCNN) for Text-To-Speech

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2016.5750 ◽

2016 ◽

Vol 13 (10) ◽

pp. 7538-7544

Author(s):

T Jayasankar ◽

J. Arputha Vijayaselvi

Keyword(s):

Neural Network ◽

Speech Synthesis ◽

Unit Length ◽

Back Propagation ◽

Cuckoo Search ◽

National Language ◽

Text To Speech ◽

Feed Forward Neural Network ◽

Feed Forward ◽

The Neural Network

This paper focuses on a Feed Forward Neural Network (FFNN) model for predicting syllable duration patterns, a prosodic parameter, in an unrestricted speech synthesis system. Estimating the prosodic parameter of segmental duration plays a vital part in an unrestricted concatenative Text-To-Speech (TTS) system capable of synthesizing natural-sounding speech with improved quality. Common features used to train the neural network, including syllable position in the phrase, syllable context, syllable position in the word, syllable nucleus, and syllable identity, are extracted from the text. The Back-Propagation Neural Network (BPNN) algorithm is one of the most widely used and popular techniques for optimizing feed forward neural network training in prosody prediction. To enhance the accuracy of syllable duration prediction with the BP network, a Cuckoo Search algorithm is proposed to find the neural network structure with the fewest weights without compromising prediction error. Speech data is used to test the duration prediction performance of the proposed SOCNN, and the results demonstrate a marked improvement over basic BP. System performance is shown by synthesizing natural-sounding speech for Tamil, a national language of India.
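
The structure-selection goal stated in this abstract (fewest weights without compromising prediction error) can be illustrated with a small sketch. This is not the Cuckoo Search procedure itself, which the abstract does not detail; it only shows the selection criterion applied to already-evaluated structures. The helper names, the error threshold, and the error values in the usage note are hypothetical.

```python
def count_weights(layer_sizes):
    """Number of weights plus biases in a fully connected feed-forward net."""
    return sum(
        (n_in + 1) * n_out  # +1 accounts for the bias of each output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def pick_structure(candidates, max_error):
    """Return the smallest structure whose error stays within max_error.

    candidates is a list of (layer_sizes, validation_error) pairs. Returns
    None when no candidate meets the error bound. Ties on weight count are
    broken in favour of the lower error.
    """
    acceptable = [(sizes, err) for sizes, err in candidates if err <= max_error]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: (count_weights(c[0]), c[1]))[0]
```

For instance, with five inputs (one per feature listed in the abstract) and made-up validation errors, `pick_structure([([5, 8, 1], 0.10), ([5, 4, 1], 0.11), ([5, 2, 1], 0.30)], max_error=0.12)` selects `[5, 4, 1]`: it has fewer weights than `[5, 8, 1]`, while `[5, 2, 1]` exceeds the error bound.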

RobuTrans: A Robust Transformer-Based Text-to-Speech Model

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6337 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8228-8235

Author(s):

Naihan Li ◽

Yanqing Liu ◽

Yu Wu ◽

Shujie Liu ◽

Sheng Zhao ◽

...

Keyword(s):

Neural Network ◽

Speech Synthesis ◽

Neural Model ◽

Attention Mechanism ◽

Maximum Length ◽

Prosodic Features ◽

Text To Speech ◽

Linguistic Features ◽

Excellent Quality ◽

Speech Model

Recently, neural network based speech synthesis has achieved outstanding results, producing synthesized audio of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue, which results in abnormal audio (bad cases), especially for unusual text (unseen context). To build a neural model that can synthesize audio that is both natural and stable, in this paper we make a deep analysis of why previous neural TTS models are not robust, based on which we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared to TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. Besides, the position embedding is replaced with a 1-D CNN, since it constrains the maximum length of synthesized audio. With these modifications, our model not only fixes the robustness problem, but also achieves a MOS (4.36) on par with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.
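
One way to picture a duration-based hard attention is as a one-hot alignment in which each output frame attends to exactly one input token, with the token chosen by per-token durations. The sketch below illustrates that general idea under stated assumptions; it is not the RobuTrans code, the function names are hypothetical, and the abstract does not say how durations are obtained.

```python
def hard_alignment(durations):
    """Build a (frames x tokens) one-hot alignment from integer durations.

    durations[i] is the assumed number of output frames for input token i.
    Each frame's row has a single 1.0 at the token it attends to.
    """
    align = []
    for token, frames in enumerate(durations):
        row = [0.0] * len(durations)
        row[token] = 1.0
        # Repeat this token's one-hot row once per frame it spans.
        align.extend(list(row) for _ in range(frames))
    return align


def attend(encoder_states, durations):
    """Context vector per frame: the encoder state of the attended token."""
    return [
        list(encoder_states[row.index(1.0)])
        for row in hard_alignment(durations)
    ]
```

Because every frame is tied to exactly one token and tokens are visited in order, an alignment of this kind cannot skip or repeat input text, which is the sort of failure a soft attention can exhibit on unusual inputs.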

Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis

IEICE Transactions on Information and Systems ◽

10.1587/transinf.2016slp0011 ◽

2016 ◽

Vol E99.D (10) ◽

pp. 2471-2480

Author(s):

Xin WANG ◽

Shinji TAKAKI ◽

Junichi YAMAGISHI

Keyword(s):

Neural Network ◽

Speech Synthesis ◽

Continuous Representation ◽

Text To Speech ◽

Text To Speech Synthesis ◽

Linguistic Units

Optimisation of Artificial Neural Network Topology Applied in the Prosody Control in Text-to-Speech Synthesis

SOFSEM 2000: Theory and Practice of Informatics - Lecture Notes in Computer Science ◽

10.1007/3-540-44411-4_31 ◽

2000 ◽

pp. 420-430

Author(s):

Václav Šebesta ◽

Jana Tučková

Keyword(s):

Neural Network ◽

Artificial Neural Network ◽

Network Topology ◽

Speech Synthesis ◽

Text To Speech ◽

Text To Speech Synthesis ◽

Artificial Neural

A Survey on “Text-to-Speech Systems for Real-Time Audio Synthesis”

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-1400 ◽

2021 ◽

pp. 375-379

Author(s):

Prof. Mrunalinee Patole ◽

Akhilesh Pandey ◽

Kaustubh Bhagwat ◽

Mukesh Vaishnav ◽

Salikram Chadar

Keyword(s):

Neural Network ◽

Real Time ◽

Speech Synthesis ◽

State Of The Art ◽

The State ◽

Text To Speech ◽

Convolutional Network ◽

Speech Structures ◽

Bounding Boxes ◽

Textual Content

Text to Speech (TTS) is a form of speech synthesis in which text is converted into spoken, human-like voice output. State-of-the-art TTS methods employ a neural network based approach. This work aims to study some of the issues and limitations present in current works, especially Tacotron-2, and attempts to further improve its performance by modifying its architecture. Many papers published on these topics to date present various TTS systems by developing new TTS products. The aim is to survey different text-to-speech systems. Compared to other text-to-speech systems, Tacotron2 has multiple advantages. In alternative algorithms such as CNN and Fast-CNN, the algorithm may not look at the image completely, but in YOLO the algorithm looks at the image completely by predicting bounding boxes using a convolutional network, along with probabilities for those boxes, and detects the image faster than alternative algorithms.