Perception of concatenative vs. neural text-to-speech (TTS): Differences in intelligibility in noise and language attitudes

2020
Author(s):
Michelle Cohn
Georgia Zellou

In this study, we test two questions about how users perceive neural vs. concatenative text-to-speech (TTS): 1) does the TTS method influence speech intelligibility in adverse listening conditions? and 2) do a user’s ratings of the voice’s social attributes shape intelligibility? We used identical speaker training datasets for a set of 4 speakers (using AWS Polly TTS). In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences generated with concatenative and neural TTS at two noise levels (-3 dB and -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, at the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on 4 social attributes: sentences generated with neural TTS were rated as more human-like, natural, likeable, and familiar than concatenative TTS utterances. Furthermore, we observed individual variation in listeners' speech-in-noise (SPIN) accuracy: how human-like/natural a listener rated the neural TTS voice was positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these factors are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.
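
For readers who want to see how such matched stimuli can be produced, below is a minimal sketch using the AWS Polly API the study relied on, together with mixing noise at a target SNR such as the -3 and -6 dB levels tested. The voice, sentence, sample rate, and file names are placeholder assumptions, not the study's materials:

```python
# Sketch: render one sentence with Polly's concatenative ("standard") and
# neural engines, then mix speech with noise at a chosen SNR.
import boto3
import numpy as np

polly = boto3.client("polly", region_name="us-east-1")

def synthesize(text: str, engine: str, out_path: str) -> None:
    """Render one sentence with the requested Polly engine ('standard' or 'neural')."""
    response = polly.synthesize_speech(
        Text=text,
        VoiceId="Joanna",      # placeholder voice; the study used 4 Polly speakers
        Engine=engine,
        OutputFormat="pcm",    # raw audio, convenient for adding calibrated noise
        SampleRate="16000",
    )
    with open(out_path, "wb") as f:
        f.write(response["AudioStream"].read())

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

for engine in ("standard", "neural"):
    synthesize("The boy dropped the ball.", engine, f"stimulus_{engine}.pcm")
```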

2021
pp. 136700692110286
Author(s):
Giovanna Morini
Rochelle S. Newman

Aims and objectives: The purpose of this study was to examine whether differences in language exposure (i.e., being raised in a bilingual versus a monolingual environment) influence young children’s ability to comprehend words when speech is heard in the presence of background noise. Methodology: Forty-four children (22 monolinguals and 22 bilinguals) between the ages of 29 and 31 months completed a preferential looking task in which they saw picture pairs of familiar objects (e.g., a balloon and an apple) on a screen and simultaneously heard sentences instructing them to locate one of the objects (e.g., “Look at the apple!”). Speech was heard in quiet and in the presence of competing white noise. Data and analyses: Children’s eye movements were coded off-line to identify the proportion of time they fixated on the correct object on the screen, and performance across groups was compared using a 2 × 3 mixed analysis of variance. Findings: Bilingual toddlers performed worse than monolinguals during the task. This group difference in performance was particularly clear when the listening condition contained background noise. Originality: There are clear differences in how infants and adults process speech in noise. To date, developmental work on this topic has mainly been carried out with monolingual infants. This study is one of the first to examine how background noise might influence word identification in young bilingual children who are just starting to acquire their languages. Significance: High noise levels are often reported in daycares and classrooms where bilingual children are present. Therefore, this work has important implications for learning and education practices with young bilinguals.
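
A minimal sketch of the reported 2 × 3 mixed analysis of variance, assuming the `pingouin` library and hypothetical column names and condition labels (group as the between-subjects factor, listening condition as the within-subjects factor):

```python
# Sketch of the mixed ANOVA on proportion-of-fixation data.
import pandas as pd
import pingouin as pg

# Hypothetical file: one row per child x listening condition.
df = pd.read_csv("looking_data.csv")

aov = pg.mixed_anova(
    data=df,
    dv="prop_correct_fixation",   # proportion of time fixating the target object
    within="condition",           # assumed levels, e.g. quiet / low noise / high noise
    between="group",              # monolingual vs. bilingual
    subject="child_id",
)
print(aov.round(3))
```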


2018
Vol 27 (1)
pp. 222-236
Author(s):
Alyssa Wild
Houri K. Vorperian
Ray D. Kent
Daniel M. Bolt
Diane Austin

Purpose A single-word identification test was used to study speech production in children and adults with Down syndrome (DS) to determine the developmental pattern of speech intelligibility, with an emphasis on vowels. Method Speech recordings were collected from 62 participants with DS aged 4–40 years and 25 typically developing participants aged 4–7 years. Panels of 5 adult lay listeners transcribed the speech recordings orthographically, and their responses were scored against the speakers' target words. Results Speech intelligibility in persons with DS improved with age, especially between the ages of 4 and 16 years. Whereas consonants contributed to intelligibility, vowels also played an important role in reduced intelligibility, with an apparent developmental difference between low and high vowels: the vowels /æ/ and /ɑ/ developed at a later age than /i/ and /u/. Interspeaker variability was large, with male individuals being generally less intelligible than female individuals and some adult men having very low intelligibility. Conclusion Results show age-related patterns in speech intelligibility in persons with DS and identify the contribution of dimensions of vowel production to intelligibility. The methods used clarify the phonetic basis of reduced intelligibility, with implications for assessment and treatment.
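
A minimal sketch of the transcription-based scoring described in the Method, assuming a simple exact-match criterion and a made-up data layout; the study's actual scoring rules may differ:

```python
# Sketch: percent-words-correct intelligibility from pooled listener transcriptions.
def normalize(word: str) -> str:
    """Case- and whitespace-insensitive comparison of transcriptions to targets."""
    return word.strip().lower()

def intelligibility_scores(responses: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """responses maps a speaker ID to (target, transcription) pairs pooled
    across the listener panel; returns percent words correctly identified."""
    scores = {}
    for speaker, pairs in responses.items():
        correct = sum(normalize(t) == normalize(r) for t, r in pairs)
        scores[speaker] = 100 * correct / len(pairs)
    return scores

# Toy panel: one speaker, three transcribed words, one listener error.
panel = {"S01": [("sheep", "sheep"), ("cat", "hat"), ("boot", "boot")]}
print(intelligibility_scores(panel))  # {'S01': 66.66...}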


2019
Author(s):
Mark D. Fletcher
Amatullah Hadeedi
Tobias Goehring
Sean R Mills

Cochlear implant (CI) users receive only limited sound information through their implant, which means that they struggle to understand speech in noisy environments. Recent work has suggested that combining the electrical signal from the CI with a haptic signal that provides crucial missing sound information (“electro-haptic stimulation”; EHS) could improve speech-in-noise performance. The aim of the current study was to test whether EHS could enhance speech-in-noise performance in CI users using: (1) a tactile signal derived using an algorithm that could be applied in real time, (2) a stimulation site appropriate for a real-world application, and (3) a tactile signal that could readily be produced by a compact, portable device. We measured speech intelligibility in multi-talker noise with and without vibro-tactile stimulation of the wrist in CI users, before and after a short training regime. No effect of EHS was found before training, but after training EHS improved the number of words correctly identified by an average of 8.3 percentage points, with some users improving by more than 20 percentage points. Our approach could offer an inexpensive and non-invasive means of improving speech-in-noise performance in CI users.
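
One common way to derive such a real-time-capable tactile signal is to extract the speech amplitude envelope and use it to modulate a carrier in the vibro-tactile range. The sketch below illustrates this generic approach; it is not necessarily the algorithm used in the paper, and the carrier frequency and filter settings are assumptions:

```python
# Sketch: envelope-based tactile drive signal for a wrist-worn actuator.
import numpy as np
from scipy.signal import butter, sosfilt

def tactile_signal(audio: np.ndarray, fs: int, carrier_hz: float = 230.0) -> np.ndarray:
    """Rectify the audio, low-pass the envelope, and amplitude-modulate
    a carrier in the vibro-tactile frequency range."""
    sos = butter(4, 30.0, btype="low", fs=fs, output="sos")  # envelope below ~30 Hz
    envelope = sosfilt(sos, np.abs(audio))
    t = np.arange(len(audio)) / fs
    return envelope * np.sin(2 * np.pi * carrier_hz * t)     # drive for the actuator
```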


2019
Author(s):
Stefanie Schelinski
Katharina von Kriegstein

We tested the ability to recognise speech in noise and its relation to the ability to discriminate vocal pitch in adults with high-functioning autism spectrum disorder (ASD) and typically developed adults (matched pairwise on age, sex, and IQ). Typically developed individuals understood speech at higher noise levels than the ASD group. Within the control group, but not within the ASD group, better speech-in-noise recognition abilities were significantly correlated with better vocal pitch discrimination abilities. Our results show that speech-in-noise recognition is restricted in people with ASD. We speculate that perceptual impairments such as difficulties in vocal pitch perception might be relevant in explaining these difficulties in ASD.
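
The within-group analysis described here is essentially a correlation between two per-participant measures. A minimal sketch with made-up values, assuming a Pearson correlation via `scipy`:

```python
# Sketch: correlating speech-in-noise accuracy with pitch discrimination.
from scipy.stats import pearsonr

spin_scores = [72.0, 65.5, 80.1, 58.9, 74.3]   # % words correct in noise (made up)
pitch_thresholds = [1.2, 2.5, 0.8, 3.1, 1.0]   # discrimination thresholds (made up)

r, p = pearsonr(spin_scores, pitch_thresholds)
print(f"r = {r:.2f}, p = {p:.3f}")  # a negative r is expected if lower (better)
                                    # pitch thresholds go with higher SPIN scores
```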


2012
Author(s):
Ann K. Syrdal
H. Timothy Bunnell
Susan R. Hertz
Taniya Mishra
Murray Spiegel
...

2021
Vol 69 (1)
pp. 77-85
Author(s):
Cheol-Ho Jeong
Wan-Ho Cho
Ji-Ho Chang
Sung-Hyun Lee
Chang-Wook Kang
...

Hearing-impaired people require more stringent acoustic and noise conditions than normal-hearing people in terms of speech intelligibility and listening effort. Multiple guidelines recommend a maximum reverberation time of 0.4 s in classrooms, signal-to-noise ratios (SNRs) greater than 15 dB, and ambient noise levels below 35 dBA. We measured noise levels and room acoustic parameters of 12 classrooms in two schools for hearing-impaired pupils, a dormitory apartment for the hearing-impaired, and a church serving mainly the hearing-impaired in the Republic of Korea. Additionally, subjective speech clarity and quality of verbal communication were evaluated through questionnaires and interviews with hearing-impaired students in one school. Large differences in subjective speech perception were found between younger primary school pupils and older pupils. Subjective data from the questionnaire and the interview were inconsistent; major challenges in obtaining reliable subjective speech perception data and limitations of the results are discussed.
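
A trivial sketch of checking measured room values against the guideline limits quoted above; the measured values below are placeholders, not the paper's data:

```python
# Sketch: compare one room's measurements with the cited guideline limits
# (RT <= 0.4 s, SNR >= 15 dB, ambient noise <= 35 dBA).
def meets_guidelines(rt_s: float, snr_db: float, noise_dba: float) -> dict[str, bool]:
    return {
        "reverberation_time": rt_s <= 0.4,
        "signal_to_noise": snr_db >= 15.0,
        "ambient_noise": noise_dba <= 35.0,
    }

print(meets_guidelines(rt_s=0.55, snr_db=12.0, noise_dba=41.0))
# {'reverberation_time': False, 'signal_to_noise': False, 'ambient_noise': False}
```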


Author(s):
Michael Ben-Avie
Régine Randall
Diane Weaver Dunne
Chris Kelly

Conventional methods of addressing the needs of students with print disabilities include text-to-speech services. One major drawback of text-to-speech technologies is that computerized speech simply articulates the words in a text, whereas a human voice can convey emotions such as excitement, sadness, fear, or joy. Audiobooks have human narration, but they are designed for entertainment, not for teaching word identification, fluency, vocabulary, and comprehension. This chapter focuses on the 3-year pilot of CRISKids; all CRIS recordings feature human narration. The pilot demonstrated that students who feel competent in their reading and class work tend to be more engaged in classroom routines, spend more time on task, and demonstrate greater comprehension of written materials. When more students demonstrate these behaviors and skills, teachers are better able to provide meaningful instruction, since less time is spent on classroom management and redirection. Thus, CRISKids impacts not only the students with print disabilities, but all of the students in the classroom.


2020
Vol 24
pp. 233121652097563
Author(s):
Christopher F. Hauth
Simon C. Berning
Birger Kollmeier
Thomas Brand

The equalization-cancellation model is often used to predict the binaural masking level difference. Previously, its application to speech in noise has required separate knowledge of the speech and noise signals in order to maximize the signal-to-noise ratio (SNR). Here, a novel, blind equalization-cancellation model is introduced that can operate on the mixed signals. This approach does not require any assumptions about particular sound source directions. It uses different strategies for positive and negative SNRs, with the switching between the two steered by a blind decision stage utilizing modulation cues. The output of the model is a single-channel signal with enhanced SNR, which we analyzed using the speech intelligibility index to compare speech intelligibility predictions. In a first experiment, the model was tested on experimental data obtained in a scenario with spatially separated target and masker signals. Predicted speech recognition thresholds were in good agreement with measured speech recognition thresholds, with a root mean square error of less than 1 dB. A second experiment investigated signals at positive SNRs, achieved using time-compressed and low-pass-filtered speech. The results demonstrated that binaural unmasking of speech occurs at positive SNRs and that the modulation-based switching strategy can predict the experimental results.
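
To make the equalization-cancellation idea concrete, the sketch below applies a brute-force gain-and-delay search to one ear's mixed signal and subtracts it from the other, attenuating the dominant (masker) component at negative SNRs. This is a simplified illustration of the EC principle only, not the paper's blind model, which additionally switches strategies by SNR using modulation cues:

```python
# Sketch: elementary equalization-cancellation over two mixed ear signals.
import numpy as np

def ec_output(left: np.ndarray, right: np.ndarray, fs: int,
              max_delay_ms: float = 0.7) -> np.ndarray:
    """Equalize the right ear's signal in gain and delay, subtract it from
    the left, and keep the residual with minimum power (best cancellation)."""
    best, best_power = left, np.inf
    max_lag = int(fs * max_delay_ms / 1000)
    for lag in range(-max_lag, max_lag + 1):
        shifted = np.roll(right, lag)  # circular shift: a simplification
        # least-squares gain aligning the dominant (masker) components
        gain = np.dot(left, shifted) / (np.dot(shifted, shifted) + 1e-12)
        residual = left - gain * shifted
        power = np.mean(residual ** 2)
        if power < best_power:
            best, best_power = residual, power
    return best
```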


2020
Vol 10 (13)
pp. 4552
Author(s):
Ryoko Nojima
Natsuko Sugie
Akira Taguchi
Jun Kokubo

The main lobby of Hotel Okura Tokyo has a good reputation for its sound environment, which supports the conversations of its users. We hypothesized that the lobby’s reputation was related to its speech intelligibility. In this study, the sound during hotel operations was first measured to determine whether the sound environment of the lobby differed from that of the entrance hall. The results showed that the variation in noise levels with the degree of crowdedness was smaller in the lobby than in the other rooms. Subsequently, the indoor noise and speech intelligibility were measured to relate intelligibility to the lobby’s reputation. The indoor noise was found to be at a level suitable for hotel lobbies, and the intelligibility was good. A comprehensive evaluation that included the results of other acoustical surveys revealed that the Okura lobby is a space well suited for conversation, consistent with the opinions of its users.


2012
Vol 132 (3)
pp. 2080-2080
Author(s):
Jasmine Beitz
Kristin Van Engen
Rajka Smiljanic
Bharath Chandrasekaran
