Suitability of self-recordings and video calls: Vowel formants and nasal spectra

In recent decades, computational approaches to sociophonetic vowel analysis have steadily gained ground, and sociolinguists now frequently use semi-automated systems for phonetic alignment and vowel formant extraction, including FAVE (Forced Alignment and Vowel Extraction; Rosenfelder et al., 2011; Evanini et al., Proceedings of Interspeech, 2009), the Penn Aligner (Yuan and Liberman, J. Acoust. Soc. America, 2008, 123, 3878), and DARLA (Dartmouth Linguistic Automation; Reddy and Stanford, DARLA Dartmouth Linguistic Automation: Online Tools for Linguistic Research, 2015a). Yet these systems still share a major bottleneck: manual transcription. For most modern sociolinguistic vowel alignment and formant extraction, researchers must first create transcriptions by hand. This human step is painstaking, time-consuming, and resource-intensive. If it could be replaced with completely automated methods, sociolinguists could potentially tap into vast datasets that have previously been unexplored, including legacy recordings that are underutilized for lack of transcriptions. Moreover, if sociolinguists could quickly and accurately extract phonetic information from the millions of hours of new audio content posted on the Internet every day, a virtual ocean of speech from newly created podcasts, videos, live-streams, and other audio content could inform research. How close are current technological tools to achieving such groundbreaking changes for sociolinguistics? Prior work (Reddy et al., Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75) showed that a hidden Markov model (HMM)-based automatic speech recognition (ASR) system, trained with CMU Sphinx (Lamere et al., 2003), was accurate enough for DARLA to uncover evidence of the US Southern Vowel Shift without any human transcription. Even so, because that ASR system relied on a small training set, it produced numerous transcription errors.
Six years have passed since that study, and numerous end-to-end ASR algorithms have since shown considerable improvement in transcription quality. One example is the RNN/CTC-based DeepSpeech from Mozilla (Hannun et al., 2014). (RNN stands for recurrent neural network, the learning mechanism for DeepSpeech; CTC stands for connectionist temporal classification, the mechanism that merges phones into words.) The present paper combines DeepSpeech with DARLA to push the technological envelope and determine how well contemporary ASR systems can perform in completely automated vowel analyses with sociolinguistic goals. Specifically, we applied these techniques to audio recordings from 352 North American English speakers in the International Dialects of English Archive (IDEA), extracting 88,500 tokens of vowels in stressed position from spontaneous, free-speech passages. With this large dataset we conducted acoustic sociophonetic analyses of the Southern Vowel Shift and the Northern Cities Chain Shift in the North American IDEA speakers. We compared the results across three sources of transcriptions: 1) IDEA’s manual transcriptions as the baseline “ground truth”, 2) the ASR built on CMU Sphinx used by Reddy et al. (Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75), and 3) the latest publicly available Mozilla DeepSpeech system. We input these three transcriptions to DARLA, which automatically aligned and extracted the vowel formants for the 352 IDEA speakers. Our quantitative results show that newer ASR systems like DeepSpeech hold considerable promise for sociolinguistic applications like DARLA. DeepSpeech’s automated transcriptions had a significantly lower character error rate than those from the prior Sphinx system (a drop from 46% to 35%).
When we performed the sociolinguistic analysis of the vowel formants extracted by DARLA, we found that the automated transcriptions from DeepSpeech matched the ground-truth results for the Southern Vowel Shift (SVS): five vowels showed a shift in both transcriptions, and two vowels showed no shift in either. The Northern Cities Shift (NCS) was more difficult to detect, but ground truth and DeepSpeech matched for four vowels: one showed a clear shift, and three showed no shift in either transcription. Our study therefore shows how far technology has progressed toward greater automation in vowel sociophonetics, while also showing what remains to be done. Our statistical modeling provides a quantified view of both the abilities and the limitations of a completely “hands-free” analysis of vowel shifts in a large dataset. Naturally, when comparing a completely automated system against a semi-automated system involving manual human work, there will always be a tradeoff between accuracy on the one hand and speed and replicability on the other [Kendall and Joseph, Towards best practices in sociophonetics (with Marianna DiPaolo), 2014]. The amount of “noise” that can be tolerated in a given study will depend on the particular research goals and the researchers’ preferences. Nonetheless, our study shows that, for certain large-scale applications and research goals, a completely automated approach using publicly available ASR can produce meaningful sociolinguistic results across large datasets, and that these results can be generated quickly, efficiently, and with full replicability.
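The character error rate quoted in the abstract is conventionally defined as the character-level edit distance between the ASR output and the reference transcription, divided by the length of the reference. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the example strings are hypothetical, and any text normalization (case, punctuation) applied in the study is not reproduced here.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `reference` into `hypothesis`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (hypothetical helper)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: a manual transcription versus an ASR output.
manual = "the cat sat on the mat"
automatic = "the cat sat on a mat"
print(f"CER = {character_error_rate(manual, automatic):.1%}")
```

Because the denominator is the reference length, a hypothesis with many inserted characters can yield a rate above 100%.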

Cross-Gender Differences in English/French Bilingual Speakers: A Multiparametric Study

Perceptual and Motor Skills, 2020, pp. 003151252097351. DOI: 10.1177/0031512520973514

Author(s): Erwan Pépiot, Aron Arnold

Keyword(s): Gender Difference, Voice Onset Time, Second Harmonic, Onset Time, Intensity Difference, Bilingual Speakers, Language And Gender, Vowel Formants, Back Vowel, And Gender

The present study concerns the speech productions of female and male English/French bilingual speakers in both reading and semi-spontaneous speech tasks. We investigated various acoustic parameters in both languages: average fundamental frequency (F0), F0 range, F0 variance (SD), vowel formants (F1, F2, and F3), voice onset time (VOT), and H1-H2 (the intensity difference between the first and second harmonics, used to measure phonation type). Our results revealed a significant effect of gender and language on all parameters. Overall, average F0 was higher in French, while F0 modulation was stronger in English. Regardless of language, female speakers exhibited higher F0 than male speakers. Moreover, the increase in average F0 in French was larger in female speakers, whereas the reduction in F0 modulation in French was stronger in male speakers. The analysis of vowel formants showed that, overall, female speakers exhibited higher values than males. However, we found a significant cross-gender difference in F2 of the back vowel [u:] in English, but not of the vowel [u] in French. VOT of voiceless stops was longer in female speakers in both languages, with a greater difference in English. The VOT contrast between voiceless stops and their voiced counterparts was also significantly larger in female speakers in both languages, and the magnitude of this cross-gender difference was greater in English. H1-H2 was higher in female speakers in both languages, indicating a breathier phonation type. Furthermore, female speakers tended to exhibit smaller H1-H2 in French, while the opposite was true in males, resulting in a smaller cross-gender difference in French for this parameter. All these data support the idea of language- and gender-specific vocal norms, to which bilingual speakers seem to adapt. This constitutes a further argument for giving social factors, such as gender dynamics, more consideration in phonetic studies.
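The H1-H2 measure described in the abstract can be illustrated with a minimal sketch that estimates the amplitudes of the first two harmonics of a voiced frame and reports their difference in decibels. It rests on several assumptions that are not from the study: F0 is already known, the frame spans a whole number of pitch periods, and no correction for formant influence is applied. The function names and the synthetic signal are hypothetical.

```python
import math


def harmonic_amplitude(samples, sample_rate, frequency):
    """Amplitude of the sinusoidal component at `frequency`,
    estimated with a single-bin discrete Fourier transform."""
    real = 0.0
    imag = 0.0
    for n, x in enumerate(samples):
        angle = 2.0 * math.pi * frequency * n / sample_rate
        real += x * math.cos(angle)
        imag -= x * math.sin(angle)
    # Scale so that a pure sinusoid of amplitude A returns A.
    return 2.0 * math.hypot(real, imag) / len(samples)


def h1_minus_h2(samples, sample_rate, f0):
    """Level difference in dB between the first harmonic (at f0)
    and the second harmonic (at 2 * f0)."""
    h1 = harmonic_amplitude(samples, sample_rate, f0)
    h2 = harmonic_amplitude(samples, sample_rate, 2.0 * f0)
    return 20.0 * math.log10(h1 / h2)


# Synthetic "voiced" frame (assumed values): 0.1 s at 16 kHz, F0 = 200 Hz,
# with the second harmonic at half the amplitude of the first.
SAMPLE_RATE = 16000
F0 = 200.0
frame = [
    1.0 * math.sin(2.0 * math.pi * F0 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2.0 * math.pi * 2.0 * F0 * n / SAMPLE_RATE)
    for n in range(1600)
]
print(f"H1-H2 = {h1_minus_h2(frame, SAMPLE_RATE, F0):.2f} dB")  # about 6.02 dB
```

A larger positive value means the first harmonic dominates the second, the pattern the abstract associates with breathier phonation. On real recordings F0 would have to be estimated first, and dedicated phonetics software would normally be used.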

Investigating potential interactions between envelope following responses elicited simultaneously by different vowel formants

Hearing Research, 2019, Vol 380, pp. 35-45. DOI: 10.1016/j.heares.2019.05.005. Cited by ~2

Author(s): Vijayalakshmi Easwar, Susan Scollie, David Purcell

Keyword(s): Vowel Formants, Potential Interactions

A multivariate spatial analysis of vowel formants in American English

Journal of Linguistic Geography, 2013, Vol 1 (1), pp. 31-51. DOI: 10.1017/jlg.2013.3. Cited by ~14

Author(s): Jack Grieve, Dirk Speelman, Dirk Geeraerts

Keyword(s): United States, Spatial Analysis, American English, Western United States, The West, Acoustic Data, Vowel Formant, Multivariate Spatial Analysis, Vowel Formants, Vowel Shift

This paper presents the results of a multivariate spatial analysis of thirty-eight vowel formant variables measured in 236 cities across the contiguous United States, based on the acoustic data from the Atlas of North American English. The results both confirm and challenge those of the Atlas. Most notably, while the analysis identifies patterns similar to those of the Atlas in the West and the Southeast, it finds that the Midwest and the Northeast are distinct dialect regions that are considerably stronger than the traditional Midland dialect region identified in the Atlas. The analysis also finds evidence that a vowel shift is actively shaping the language of the Western United States.

Lip Movements for an Unfamiliar Vowel: Mandarin Front Rounded Vowel Produced by Japanese Speakers

Journal of Speech, Language, and Hearing Research, 2016, Vol 59 (6). DOI: 10.1044/2015_jslhr-s-15-0033. Cited by ~1

Author(s): Haruka Saito

Keyword(s): Acoustic Analysis, Japanese Adult, Japanese Speakers, Adult Participants, Vowel Formants, Post Hoc, Repetition Task

Purpose: The study investigated what kinds of lip positions Japanese adult participants select for an unfamiliar Mandarin rounded vowel, /y/, and whether those lip positions are related to and/or differentiated from the lip positions for their native vowels.

Method: Videotaping and post hoc tracking measurements of lip positions, namely protrusion and vertical aperture, and acoustic analysis of vowel formants were conducted on participants' productions in a repetition task.

Results: First, 31.2% of all productions of /y/ were produced with either protruded or compressed rounding. Second, the lip positions for /y/ were differentiated from those for the perceived nearest native vowel; although they correlated with them in terms of vertical aperture, they did not in terms of protrusion/retraction.

Conclusions: Lip positions for a novel rounded vowel seemed to be produced as a modification of existing lip positions from the native repertoire. Moreover, the degree of vertical aperture might be easily transferred, whereas the degree of protrusion is less likely to be retained in the new lip positions.
