Perception of concatenative vs. neural text-to-speech (TTS): Differences in intelligibility in noise and language attitudes
In this study, we address two questions about how users perceive neural vs. concatenative text-to-speech (TTS): 1) does the TTS method influence speech intelligibility in adverse listening conditions? and 2) do a user's ratings of a voice's social attributes shape intelligibility? We used identical speaker training datasets for a set of four speakers (via AWS Polly TTS). In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences generated with concatenative and neural TTS at two noise levels (-3 dB and -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, at the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on four social attributes: sentences generated with neural TTS were rated as more human-like, natural, likeable, and familiar than concatenative TTS utterances. Furthermore, we observed individual variation in listeners' speech-in-noise (SPIN) accuracy: the more human-like/natural a listener rated the neural TTS voice, the higher their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these factors are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.