voiced speech
Recently Published Documents


TOTAL DOCUMENTS: 225 (FIVE YEARS 13)

H-INDEX: 20 (FIVE YEARS 1)

2021 ◽  
Vol 32 (07) ◽  
pp. 445-463
Author(s):  
Richard H. Wilson ◽  
Nancy J. Scherer

Abstract
Background: The amplitude and temporal asymmetry of the speech waveform are mostly associated with voiced speech utterances and are obvious in recent graphic depictions in the literature. The asymmetries are attributed to the presence and interactions of the major formants characteristic of voicing, with possible contributions from the unidirectional air flow that accompanies speaking.
Purpose: This study investigated the amplitude symmetry/asymmetry characteristics (polarity) of speech waveforms, which to our knowledge have not been quantified.
Study Sample: Thirty-six spondaic words spoken by two male speakers and two female speakers were selected because they are multisyllabic words that provide a reasonable sampling of speech sounds and because four recordings, made for purposes unrelated to the topic under study, were available.
Research Design: Collectively, the words were segmented into phonemes (vowels [130], diphthongs [77], voiced consonants [258], voiceless consonants [219]), syllables (82), and blends (6). For each segment, the following were analyzed separately for the positive and negative datum points: peak amplitude, the percent of the total segment datum points, the root-mean-square (rms) amplitude, and the crest factor.
Data Collection and Analyses: The digitized words (44,100 samples/s; 16-bit) were parsed into 144 files (36 words × 4 speakers), edited, converted to numeric values (±1), and stored in a spreadsheet in which all analyses were performed with in-house routines. Overall, approximately 85% of each waveform was analyzed; the excluded portions were silent intervals, transitions, and diminished waveform endings.
Results: The vowel, diphthong, and syllable segments had durations (180–220 ms) that were about twice as long as the consonant durations (∼90 ms), and peak and rms amplitudes that were 6 to 12 dB higher than the consonant peak and rms amplitudes. Vowel, diphthong, and syllable segments had 10 percentage points more positive datum points (55%) than negative points (45%), which suggested temporal asymmetries within the segments. With voiced consonants, the distribution of positive and negative datum points dropped to 52 and 48%, and it was essentially equal for the voiceless consonants (50.3 and 49.6%). The mean rms amplitudes of the negative datum points were higher than the rms amplitudes of the positive points by 2 dB (vowels, diphthongs, and syllables), 1 dB (voiced consonants), and 0.1 dB (voiceless consonants). The 144 waveforms and segmentations are illustrated in the Supplementary Material along with the tabulated positive and negative segment characteristics.
Conclusions: The temporal and amplitude waveform asymmetries were by far most notable in segments that had a voicing component, which included the voiced consonants. These asymmetries were characterized by larger envelopes and more energy in the negative side of the waveform segment than in the positive side. Interestingly, these segments had more positive datum points than negative points, which indicated temporal asymmetry. All aspects of the voiceless consonants were equally divided between the positive and negative domains. There were female/male differences, but with these limited samples such differences should not be generalized beyond the speakers in this study. The influence of the temporal and amplitude asymmetries on monaural word-recognition performance is thought to be negligible.
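The per-polarity measures named in the abstract (peak amplitude, percent of datum points, rms amplitude, crest factor) can be illustrated with a minimal sketch. This is not the authors' in-house spreadsheet routine. The function names are hypothetical, and two details are assumptions that the abstract does not state: exact zeros are counted in neither polarity, and the crest factor is expressed in dB as 20·log10(peak/rms).

```python
import math


def polarity_stats(samples):
    """Summarize the positive and negative samples of one segment separately.

    `samples` is a sequence of floats scaled to the range -1..1.
    Assumption: samples that are exactly zero belong to neither polarity.
    """
    pos = [s for s in samples if s > 0]
    neg = [-s for s in samples if s < 0]  # magnitudes of the negative side
    total = len(samples)

    def describe(values):
        if not values:
            return None
        peak = max(values)
        rms = math.sqrt(sum(v * v for v in values) / len(values))
        return {
            "peak": peak,
            "percent_of_points": 100.0 * len(values) / total,
            "rms": rms,
            # Assumption: crest factor reported in dB (peak relative to rms).
            "crest_factor_db": 20.0 * math.log10(peak / rms),
        }

    return {"positive": describe(pos), "negative": describe(neg)}


def rms_difference_db(stats):
    """Negative-side rms relative to positive-side rms, in dB."""
    return 20.0 * math.log10(stats["negative"]["rms"] / stats["positive"]["rms"])
```

Applied to a voiced segment, a result such as 55% positive points together with a positive value from `rms_difference_db` would correspond to the pattern the abstract reports: more datum points on the positive side, but more energy on the negative side.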


2019 ◽  
Vol 48 (3) ◽  
pp. 446-453
Author(s):  
Milan Sigmund

In this article, we investigate a specific long-term speech spectrum with respect to its use for speaker recognition. The long-term effect was obtained by averaging short-term autocorrelation coefficients over the whole utterance. The long-term spectrum was calculated by means of second-order linear prediction using the average autocorrelation coefficients. First, the speaker discriminability of 32 individual parameters was evaluated by combining spectral energy and spectral slope in eight different frequency bands covering the range 0–4 kHz (seven narrow nonoverlapping subbands and one band spanning the full range). Then, the four subbands with the most discriminative capability were selected for speaker recognition. Together, these subbands cover the frequencies 0–1.2 kHz. In the main experiments, text-independent speaker recognition based on relative Euclidean distance was performed in each single subband as well as in combinations of two to four subbands, using two types of speech data: complete continuous speech and the voiced part of the same speech. Voiced speech appears to be generally more effective for speaker recognition using the long-term speech spectrum. The best recognition rates, i.e., 91.7% on complete speech and 100% on voiced speech, were achieved in optimal paired subbands. The long-term speech spectrum can complement traditional voice features.
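The long-term spectrum described above (short-term autocorrelation averaged over the utterance, then second-order linear prediction) can be sketched as follows. This is an illustration, not the author's implementation. The function names are hypothetical. The frame length, hop size, and the absence of a window function are assumptions, since the abstract does not specify them, and the subband parameters and distance measure are not reproduced.

```python
import cmath
import math


def frame_autocorr(frame, max_lag=2):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def average_autocorr(signal, frame_len=256, hop=128):
    """Average the short-term autocorrelation (lags 0..2) over all frames.

    Assumption: rectangular frames of `frame_len` samples, advanced by `hop`.
    """
    starts = range(0, len(signal) - frame_len + 1, hop)
    acs = [frame_autocorr(signal[s:s + frame_len]) for s in starts]
    if not acs:
        raise ValueError("signal is shorter than one frame")
    return [sum(ac[k] for ac in acs) / len(acs) for k in range(3)]


def lpc2(r):
    """Solve the order-2 normal equations for x[n] ~ a1*x[n-1] + a2*x[n-2].

    Returns (a1, a2, err), where err is the prediction error power.
    """
    r0, r1, r2 = r
    det = r0 * r0 - r1 * r1
    a1 = r1 * (r0 - r2) / det
    a2 = (r0 * r2 - r1 * r1) / det
    err = r0 - a1 * r1 - a2 * r2
    return a1, a2, err


def lp_spectrum_db(a1, a2, err, freq_hz, fs):
    """All-pole model spectrum in dB at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs
    a = 1.0 - a1 * cmath.exp(-1j * w) - a2 * cmath.exp(-2j * w)
    return 10.0 * math.log10(err / abs(a) ** 2)
```

A second-order model has a single pole pair, so the resulting spectrum is a smooth curve that captures mainly the overall level and tilt. Quantities such as band energy or slope could then be read from `lp_spectrum_db` evaluated at the band edges.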


2019 ◽  
Vol 9 (17) ◽  
pp. 3562
Author(s):  
Judith Probst ◽  
Alexander Lodermeyer ◽  
Sahar Fattoum ◽  
Stefan Becker ◽  
Matthias Echternach ◽  
...  

Voiced speech is the result of a fluid-structure-acoustic interaction in the larynx and vocal tract (VT). Previous studies show a strong influence of the VT on this interaction process, but are limited to individually obtained VT geometries. To overcome this restriction and to provide a more general VT replica, we computed a simplified, averaged VT geometry for the vowel /a/, based on MRI-derived cross-sections along the straightened VT centerline of six professional tenors. The resulting mean VT replica, as well as realistic and simplified VT replicas of each tenor, were 3D-printed for experiments with silicone vocal folds that show flow-induced oscillations. Our results reveal that all replicas, including the mean VT, reproduce the characteristic formants with mean deviations of 12% when compared with the subjects’ audio recordings. The overall formant structure is impaired neither by the averaging process nor by the simplified geometry. Nonetheless, alterations in the broadband, non-harmonic portions of the sound spectrum indicate changed aerodynamic characteristics within the simplified VT. In conclusion, our mean VT replica shows formant properties similar to those found in vivo. This indicates that the mean VT geometry is suitable for further investigations of the fluid-structure-acoustic interaction during phonation.


2019 ◽  
Vol 1 (8) ◽  
Author(s):  
Jihen Zeremdini ◽  
Mohamed Anouar Ben Messaoud ◽  
Aicha Bouzid
