Perception of concatenative vs. neural text-to-speech (TTS): Differences in intelligibility in noise and language attitudes
In this study, we address two questions about how users perceive neural vs. concatenative text-to-speech (TTS): 1) does the TTS method influence speech intelligibility in adverse listening conditions? and 2) do a user's ratings of a voice's social attributes shape intelligibility? We used identical speaker training datasets for a set of four speakers (via AWS Polly TTS). In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences generated with concatenative and neural TTS at two noise levels (-3 dB and -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, at the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on four social attributes: sentences generated with neural TTS were rated as more human-like, natural, likeable, and familiar than concatenative TTS utterances. Furthermore, we observed individual variation in listeners' speech-in-noise (SPIN) accuracy: the more human-like/natural a listener rated the neural TTS voice, the higher their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these factors are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.