Acoustic Similarity
Recently Published Documents

TOTAL DOCUMENTS: 135 (five years: 26)
H-INDEX: 21 (five years: 2)

Sensors ◽ 2021 ◽ Vol 21 (24) ◽ pp. 8313
Author(s): Łukasz Lepak, Kacper Radzikowski, Robert Nowak, Karol J. Piczak

Models for keyword spotting in continuous recordings can significantly improve the experience of navigating vast libraries of audio recordings. In this paper, we describe the development of such a keyword spotting system for detecting regions of interest in Polish call centre conversations. Unfortunately, in spite of recent advancements in automatic speech recognition systems, the human-level transcription accuracy reported on English benchmarks does not reflect the performance achievable in low-resource languages such as Polish. Therefore, in this work, we shift our focus from complete speech-to-text conversion to acoustic similarity matching, in the hope of reducing the demand for data annotation. As our primary approach, we evaluate Siamese and prototypical neural networks trained on several datasets of English and Polish recordings. While we obtain usable results in English, our models’ performance remains unsatisfactory when applied to Polish speech, after both mono- and cross-lingual training. This performance gap shows that generalisation with limited training resources is a significant obstacle for actual deployments in low-resource languages. As a potential countermeasure, we implement a detector using audio embeddings generated with a generic pre-trained model provided by Google. It has a much more favourable profile when applied in a cross-lingual setup to detect Polish audio patterns. Nevertheless, despite these promising results, its performance on out-of-distribution data is still far from stellar. This indicates that, in spite of the richness of the internal representations created by more generic models, such speech embeddings are not readily amenable to cross-language transfer.
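For illustration, a minimal sketch of the embedding-based detector described above: the keyword template and sliding windows of the recording are embedded with a generic pre-trained model, and a window is flagged when its cosine similarity to the template exceeds a threshold. The abstract does not name the Google model; YAMNet is assumed here purely as a plausible stand-in, and the window length and threshold are illustrative choices.

```python
# Sketch of keyword detection by acoustic similarity matching over embeddings.
# YAMNet is an assumption (the paper only says "a generic pre-trained model
# provided by Google"); window length and threshold are illustrative.
import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")  # expects 16 kHz mono float32

def embed(waveform: np.ndarray) -> np.ndarray:
    """Mean-pool YAMNet's per-patch embeddings into one 1024-dim vector."""
    _scores, embeddings, _spectrogram = yamnet(waveform.astype(np.float32))
    return embeddings.numpy().mean(axis=0)

def detect_keyword(recording, template, sr=16000, win_s=1.0, hop_s=0.25, thr=0.8):
    """Return (start_s, end_s, similarity) for windows of `recording` whose
    embedding has cosine similarity >= `thr` with the keyword `template`."""
    ref = embed(template)
    ref = ref / np.linalg.norm(ref)
    win, hop = int(win_s * sr), int(hop_s * sr)
    hits = []
    for start in range(0, max(1, len(recording) - win + 1), hop):
        e = embed(recording[start:start + win])
        sim = float(e @ ref / (np.linalg.norm(e) + 1e-9))
        if sim >= thr:
            hits.append((start / sr, (start + win) / sr, sim))
    return hits
```

Mean-pooling the patch embeddings gives a fixed-length vector per window; the Siamese and prototypical networks evaluated in the paper would instead learn the embedding function directly from labelled keyword pairs.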


2021
Author(s): Francesca Ronchini, Romain Serizel, Nicolas Turpault, Samuele Cornell

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently, only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. Firstly, we investigate to what extent using non-target events during only the training phase, only the validation phase, or neither helps the system correctly detect target events. Secondly, we analyze to what extent adjusting the signal-to-noise ratio between target and non-target events during training improves sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on the non-target subset and on acoustic similarity between target and non-target events, which might confuse the system.
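To make the SNR adjustment concrete, below is a minimal sketch of mixing a non-target event into a soundscape at a chosen target-to-non-target SNR. The function and parameter names are illustrative, not taken from the challenge's synthesis tooling.

```python
import numpy as np

def mix_at_snr(target: np.ndarray, non_target: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `non_target` into `target` after scaling it so that the
    target-to-non-target power ratio equals `snr_db`. Both signals are
    assumed mono, equal length, and at the same sample rate."""
    p_target = float(np.mean(target ** 2))
    p_non_target = float(np.mean(non_target ** 2)) + 1e-12
    # Want: 10 * log10(p_target / (scale**2 * p_non_target)) == snr_db
    scale = np.sqrt(p_target / (p_non_target * 10.0 ** (snr_db / 10.0)))
    return target + scale * non_target
```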


2021 ◽ Vol 176 ◽ pp. 77-86
Author(s): Ian P. Thomas, Stéphanie M. Doucet, D. Ryan Norris, Amy E.M. Newman, Heather Williams, ...

2021 ◽ Vol 175 ◽ pp. 111-121
Author(s): Yingtong Wu, Anna L. Petrosky, Nicolas A. Hazzi, Rebecca Lynn Woodward, Luis Sandoval

2021 ◽ pp. 026765832110089
Author(s): Daniel J Olson

Featural approaches to second language phonetic acquisition posit that the development of new phonetic norms relies on sub-phonemic features, expressed through a constellation of articulatory gestures and their corresponding acoustic cues, which may be shared across multiple phonemes. Within featural approaches, largely supported by research in speech perception, debate remains as to the fundamental scope or ‘size’ of featural units. The current study examines potential featural relationships between voiceless and voiced stop consonants, as expressed through the voice onset time (VOT) cue. Native English-speaking learners of Spanish received targeted training on Spanish voiceless stop consonant production through a visual feedback paradigm. Analysis focused on the change in VOT, for both voiceless (i.e. trained) and voiced (i.e. non-trained) phonemes, across the pretest, posttest, and delayed posttest. The results demonstrated a significant improvement (i.e. reduction) in VOT for the voiceless stops, which were subject to the training paradigm. In contrast, there was no significant change in the non-trained voiced stop consonants. These results suggest a limited featural relationship, with independent VOT cues for voiceless and voiced phonemes. Possible underlying mechanisms that limit feature generalization in second language (L2) phonetic production, including gestural considerations and acoustic similarity, are discussed.
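As a rough illustration of the cue in question, the sketch below estimates VOT from a waveform as the interval between the stop burst (a sharp energy rise) and voicing onset (high energy combined with a low zero-crossing rate). This heuristic is purely illustrative, with made-up thresholds; a study like this one would rely on standard acoustic annotation rather than such an automatic measure.

```python
import numpy as np

def estimate_vot(x: np.ndarray, sr: int, win_s=0.010, hop_s=0.002):
    """Crude VOT estimate: seconds from the stop burst (first sharp energy
    rise) to voicing onset (first later frame combining substantial energy
    with a low zero-crossing rate, a rough signature of periodic voicing).
    All thresholds here are illustrative assumptions."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    starts = range(0, len(x) - win, hop)
    energy = np.array([np.mean(x[s:s + win] ** 2) for s in starts])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(x[s:s + win])))) / 2.0
                    for s in starts])
    burst = int(np.argmax(energy > 0.1 * energy.max()))   # first energy spike
    voiced = np.where((energy > 0.3 * energy.max()) & (zcr < 0.1))[0]
    voiced = voiced[voiced > burst]                        # voicing after burst
    return float((voiced[0] - burst) * hop_s) if voiced.size else None
```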


2021 ◽ Vol 1 (4) ◽ pp. 040802
Author(s): David J. Forman, Tracianne B. Neilsen, David F. Van Komen, David P. Knobles

Languages ◽ 2021 ◽ Vol 6 (1) ◽ pp. 44
Author(s): Jaydene Elvin, Daniel Williams, Jason A. Shaw, Catherine T. Best, Paola Escudero

This study tests whether Australian English (AusE) and European Spanish (ES) listeners differ in their categorisation and discrimination of Brazilian Portuguese (BP) vowels. In particular, we investigate two theoretically relevant measures of vowel category overlap (acoustic vs. perceptual categorisation) as predictors of non-native discrimination difficulty. We also investigate whether the individual listener’s own native vowel productions predict non-native vowel perception better than group averages. The results showed comparable performance for AusE and ES participants in their perception of the BP vowels. In particular, discrimination patterns were largely dependent on contrast-specific learning scenarios, which were similar across AusE and ES. We also found that the acoustic similarity between individuals’ own native productions and the BP stimuli was largely consistent with the participants’ patterns of non-native categorisation. Furthermore, the results indicated that both acoustic and perceptual overlap successfully predict discrimination performance. However, accuracy in discrimination was better explained by perceptual similarity for ES listeners and by acoustic similarity for AusE listeners. Interestingly, we also found that for ES listeners, group averages explained discrimination accuracy better than predictions based on individual production data, whereas the AusE group showed no such difference.
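As an illustration of an acoustic-categorisation measure of this kind, the sketch below assigns a non-native stimulus to the acoustically closest native vowel category using Euclidean distance in a z-scored F1/F2 space. The metric and features are assumptions; the study may well have used additional cues such as duration or F3.

```python
import numpy as np

def acoustic_categorisation(native_tokens: dict, stimulus) -> str:
    """Assign a non-native vowel `stimulus` (F1, F2 in Hz) to the
    acoustically closest native category, measured as Euclidean distance
    to each category's mean in a z-scored formant space.

    native_tokens maps vowel label -> (n, 2) array of (F1, F2) values
    measured from the listener's own productions."""
    pooled = np.vstack(list(native_tokens.values()))
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
    z = lambda v: (np.asarray(v, dtype=float) - mu) / sd
    dist = {label: float(np.linalg.norm(z(stimulus) - z(tokens).mean(axis=0)))
            for label, tokens in native_tokens.items()}
    return min(dist, key=dist.get)

# Hypothetical usage with two native categories (formant values invented):
# tokens = {"i": np.array([[300, 2300], [310, 2250]]),
#           "e": np.array([[450, 2000], [470, 1950]])}
# acoustic_categorisation(tokens, (430, 2050))  # -> "e"
```

Applied per listener, this yields the individual-production-based predictions that the abstract compares against group-average predictions.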

