acoustic speech signal
Recently Published Documents


TOTAL DOCUMENTS

26
(FIVE YEARS 7)

H-INDEX

7
(FIVE YEARS 1)

2021 ◽  
Author(s):  
Christoph Wagner ◽  
Petr Schaffer ◽  
Pouriya Amini Digehsara ◽  
Michael Bärhold ◽  
Dirk Plettemeier ◽  
...  

Abstract Recovering speech in the absence of the acoustic speech signal itself, i.e., silent speech, holds great potential for restoring or enhancing oral communication in those who have lost it. Radar is a relatively unexplored silent-speech sensing modality, even though it has the advantage of being fully non-invasive. We therefore built custom stepped-frequency continuous-wave radar hardware to measure the changes in the transmission spectra during speech between three antennas, located on both cheeks and the chin, at a measurement rate of 100 Hz. We then recorded a command-word corpus of 40 phonetically balanced, two-syllable German words and the German digits zero to nine for two individual speakers and evaluated both the speaker-dependent multi-session and inter-session recognition accuracies on this 50-word corpus using a bidirectional long short-term memory (BiLSTM) network. We obtained recognition accuracies of 99.17% and 88.87% for the speaker-dependent multi-session and inter-session cases, respectively. These results show that the transmission spectra are very well suited to discriminating individual words from one another, even across different sessions, which is one of the key challenges for fully non-invasive silent speech interfaces.
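
A minimal sketch of a BiLSTM word classifier of the kind described above, assuming each utterance arrives as a sequence of radar transmission-spectrum frames sampled at 100 Hz; the class name, feature dimension, and sequence length are illustrative, not taken from the paper.

```python
# Minimal BiLSTM word-classifier sketch over radar transmission-spectrum frames
# (assumption: each utterance is a [n_frames, n_features] array; names and sizes
# here are illustrative, not values from the paper).
import torch
import torch.nn as nn

class RadarWordClassifier(nn.Module):
    def __init__(self, n_features: int, n_words: int = 50, hidden: int = 128):
        super().__init__()
        # Bidirectional LSTM reads the spectral frame sequence in both directions.
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_words)

    def forward(self, x):                    # x: [batch, n_frames, n_features]
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])      # classify from the final time step

# Example forward pass on dummy data: 8 utterances, 120 frames, 64 spectral bins.
model = RadarWordClassifier(n_features=64)
logits = model(torch.randn(8, 120, 64))      # -> [8, 50] word scores
```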


2021 ◽  
Author(s):  
Seung-Goo Kim ◽  
Federico De Martino ◽  
Tobias Overath

Speech comprehension entails the neural mapping of the acoustic speech signal onto learned linguistic units. This acousto-linguistic transformation is bi-directional, whereby higher-level linguistic processes (e.g., semantics) modulate the acoustic analysis of individual linguistic units. Here, we investigated the cortical topography and linguistic modulation of the most fundamental linguistic unit, the phoneme. We presented natural speech and 'phoneme quilts' (pseudo-randomly shuffled phonemes) in either a familiar (English) or unfamiliar (Korean) language to native English speakers while recording fMRI. This design dissociates the contribution of acoustic and linguistic processes towards phoneme analysis. We show that (1) the four main phoneme classes (vowels, nasals, plosives, fricatives) are differentially and topographically encoded in human auditory cortex, and that (2) their acoustic analysis is modulated by linguistic analysis. These results suggest that the linguistic modulation of cortical sensitivity to phoneme classes minimizes prediction error during natural speech perception, thereby aiding speech comprehension in challenging listening situations.
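
As a rough illustration of the "phoneme quilt" stimuli mentioned above, the sketch below cuts a waveform at known phoneme boundaries and concatenates the segments in pseudo-random order; the published stimuli involve additional steps (e.g., smoothing segment joins) that are omitted here, and all data are placeholders.

```python
# Rough sketch of a "phoneme quilt": cut a waveform at known phoneme boundaries
# and concatenate the segments in pseudo-random order (the real stimuli also
# smooth the joins between segments; that step is omitted here).
import numpy as np

def phoneme_quilt(signal, boundaries_s, sr, seed=0):
    """signal: 1-D waveform; boundaries_s: phoneme boundary times in seconds."""
    idx = (np.asarray(boundaries_s) * sr).astype(int)
    segments = np.split(signal, idx)                 # segments between boundaries
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(segments))           # pseudo-random segment order
    return np.concatenate([segments[i] for i in order])

# Example: 1 s of noise at 16 kHz with a boundary every 80 ms.
sr = 16000
quilt = phoneme_quilt(np.random.randn(sr), np.arange(0.08, 1.0, 0.08), sr)
```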


Author(s):  
Dea Sifana Ramadhina ◽  
Rita Magdalena ◽  
Sofia Saidah

Voice is one of the parameters that can be used to identify a person. From the voice, information such as gender, age, and even the identity of the speaker can be obtained. Speaker recognition can therefore help to narrow down crimes and frauds committed by voice, minimizing the faking of identities. The Mel Frequency Cepstrum Coefficient (MFCC) method can be used for feature extraction in the recognition system: it converts the acoustic speech signal into a sequence of acoustic feature vectors. For classification, Hidden Markov Models (HMM) are used to match an unidentified speaker's voice against the voices in the database. In this research, the system verifies speakers using 15 text-dependent utterances in Indonesian. When test utterances match the conditions of the database recordings, the highest accuracy is 99.16%.
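
A minimal sketch of this kind of MFCC-plus-HMM verification pipeline, assuming librosa and hmmlearn are available; file names, the number of HMM states, the number of coefficients, and the decision threshold are placeholders, not values from the paper.

```python
# Minimal MFCC + HMM speaker-verification sketch (assumptions: librosa and
# hmmlearn; file paths, n_mfcc, n_components, and the threshold are placeholders).
import librosa
import numpy as np
from hmmlearn import hmm

def mfcc_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    # librosa returns [n_mfcc, n_frames]; the HMM expects one feature vector per row.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Enroll: train one HMM per speaker on that speaker's training utterances.
train = [mfcc_features(p) for p in ["spk1_utt1.wav", "spk1_utt2.wav"]]
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=100)
model.fit(np.vstack(train), lengths=[len(x) for x in train])

# Verify: accept the claimed identity if the log-likelihood exceeds a threshold.
score = model.score(mfcc_features("unknown_utt.wav"))
print("accept" if score > -1e4 else "reject")   # threshold is illustrative only
```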


2020 ◽  
Vol 117 (51) ◽  
pp. 32791-32798
Author(s):  
Chris Scholes ◽  
Jeremy I. Skipper ◽  
Alan Johnston

It is well established that speech perception is improved when we are able to see the speaker talking along with hearing their voice, especially when the speech is noisy. While we have a good understanding of where speech integration occurs in the brain, it is unclear how visual and auditory cues are combined to improve speech perception. One suggestion is that integration can occur as both visual and auditory cues arise from a common generator: the vocal tract. Here, we investigate whether facial and vocal tract movements are linked during speech production by comparing videos of the face and fast magnetic resonance (MR) image sequences of the vocal tract. The joint variation in the face and vocal tract was extracted using an application of principal components analysis (PCA), and we demonstrate that MR image sequences can be reconstructed with high fidelity using only the facial video and PCA. Reconstruction fidelity was significantly higher when images from the two sequences corresponded in time, and including implicit temporal information by combining contiguous frames also led to a significant increase in fidelity. A “Bubbles” technique was used to identify which areas of the face were important for recovering information about the vocal tract, and vice versa, on a frame-by-frame basis. Our data reveal that there is sufficient information in the face to recover vocal tract shape during speech. In addition, the facial and vocal tract regions that are important for reconstruction are those that are used to generate the acoustic speech signal.
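
A rough sketch of the general idea of reconstructing one modality from the other through a joint PCA, assuming face and MR frames have been vectorized and temporally aligned; this illustrates the approach in spirit only and is not the authors' exact procedure, and all arrays are stand-ins.

```python
# Rough sketch of cross-modal reconstruction via a joint PCA (assumption: face
# and MR frames are vectorized and frame-aligned; stand-in data, not the
# authors' pipeline).
import numpy as np
from sklearn.decomposition import PCA

n_frames, face_dim, mr_dim = 200, 500, 400
face = np.random.randn(n_frames, face_dim)          # stand-in facial-video frames
mr = np.random.randn(n_frames, mr_dim)              # stand-in vocal-tract MR frames

joint = np.hstack([face, mr])                       # one joint vector per frame
pca = PCA(n_components=20).fit(joint)

# To recover the MR part from the face alone, fill the MR block with its mean
# (so it contributes nothing after centering), project into the joint PCA space,
# and reconstruct.
probe = np.hstack([face, np.tile(mr.mean(0), (n_frames, 1))])
recon_mr = pca.inverse_transform(pca.transform(probe))[:, face_dim:]

rmse = np.sqrt(((recon_mr - mr) ** 2).mean())       # crude reconstruction fidelity
```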


2020 ◽  
Vol 2020 (5) ◽  
pp. 49-55
Author(s):  
Natalya Kholkina

The paper presents an approach to assessing effectiveness parameters in telecommunication systems for operational and command communication, public-address warning, and audio exchange. It investigates how the acoustic speech signal-to-noise ratio affects the achievement of the required syllabic intelligibility, with the aim of improving the performance of telecommunication and information-exchange systems operating in complex noise environments. The dependence of formant intelligibility on the geometric mean frequency of each i-th band of the frequency spectrum of acoustic speech signals is shown, as is the degree to which the acoustic speech signal-to-noise ratio affects syllabic intelligibility. The paper shows that to obtain speech with syllabic intelligibility above 93%, the level required for complete perception by a subscriber, the acoustic signal-to-noise ratio must be at least 20 dB. Problems of approximating the probability density of acoustic signals using generalized polynomials over basis function systems are also presented.
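
A small worked example of the signal-to-noise figure the abstract refers to, computed in decibels on illustrative arrays; the 20 dB threshold comes from the abstract, everything else is a placeholder.

```python
# Small worked example of the acoustic signal-to-noise ratio in dB and the
# 20 dB level cited in the abstract (the waveforms here are placeholders).
import numpy as np

sr = 8000
t = np.arange(sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 200 * t)            # stand-in speech signal
noise = 0.03 * np.random.randn(sr)                     # stand-in background noise

snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))
print(f"SNR = {snr_db:.1f} dB:",
      "syllabic intelligibility > 93% expected" if snr_db >= 20 else
      "below the 20 dB level cited in the paper")
```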


2019 ◽  
Vol 30 (2) ◽  
pp. 618-627 ◽  
Author(s):  
Deborah F Levy ◽  
Stephen M Wilson

Abstract Speech perception involves mapping from a continuous and variable acoustic speech signal to discrete, linguistically meaningful units. However, it is unclear where in the auditory processing stream speech sound representations cease to be veridical (faithfully encoding precise acoustic properties) and become categorical (encoding sounds as linguistic categories). In this study, we used functional magnetic resonance imaging and multivariate pattern analysis to determine whether tonotopic primary auditory cortex (PAC), defined as tonotopic voxels falling within Heschl’s gyrus, represents one class of speech sounds—vowels—veridically or categorically. For each of 15 participants, 4 individualized synthetic vowel stimuli were generated such that the vowels were equidistant in acoustic space, yet straddled a categorical boundary (with the first 2 vowels perceived as [i] and the last 2 perceived as [ɪ]). Each participant’s 4 vowels were then presented in a block design with an irrelevant but attention-demanding level change detection task. We found that in PAC bilaterally, neural discrimination between pairs of vowels that crossed the categorical boundary was more accurate than neural discrimination between equivalently spaced vowel pairs that fell within a category. These findings suggest that PAC does not represent vowel sounds veridically, but that encoding of vowels is shaped by linguistically relevant phonemic categories.
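
A toy sketch of the multivariate pattern analysis comparison described above: pairwise classification of voxel patterns for a vowel pair that crosses the category boundary versus a pair that falls within a category, using a generic linear classifier on simulated data; this is not the authors' analysis code.

```python
# Toy sketch of the MVPA comparison: classify voxel patterns for a vowel pair
# that crosses the category boundary vs. one within a category (generic linear
# classifier on simulated data, not the authors' analysis).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def pairwise_accuracy(patterns_a, patterns_b):
    X = np.vstack([patterns_a, patterns_b])
    y = np.r_[np.zeros(len(patterns_a)), np.ones(len(patterns_b))]
    return cross_val_score(LinearSVC(), X, y, cv=5).mean()

rng = np.random.default_rng(0)
# 4 simulated "vowels": 40 stimulus blocks x 300 voxels each (placeholder data,
# so no genuine category effect is built in here).
vowels = [rng.normal(loc=i * 0.1, size=(40, 300)) for i in range(4)]

across = pairwise_accuracy(vowels[1], vowels[2])   # pair straddling the boundary
within = pairwise_accuracy(vowels[0], vowels[1])   # pair within one category
print(f"across-boundary: {across:.2f}, within-category: {within:.2f}")
```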


2019 ◽  
Vol 16 (01) ◽  
pp. 1950013
Author(s):  
Oliver Niebuhr ◽  
Anush Norika Nazaryan

Our study is a first step toward the innovative further development of mobile phones with special emphasis on optimizing them for business communication. Traditional landline phones and mobile phones up to 3G technology are known to trigger the so-called “telephone voice”. The phonetic changes induced by the telephone voice (louder speech at a higher pitch level) are suitable for undermining the perceived competence, trustworthiness and charisma of a speaker and can, thus, negatively influence business actions over the mobile phone. In a speech production experiment with 20 speakers and a subsequent acoustic speech-signal analysis of almost 15 000 utterances, we tested in comparison to a baseline face-to-face dialog condition, whether the telephone voice still exists in a technological setting of VoLTE 4G mobile-phone communication. In fact, we found that the typical characteristics of the telephone voice persist even under the currently best technological 4G standards and under silent communication conditions. Moreover, we identified further acoustic-phonetic parameters of the telephone voice, some of which (like a more monotonous intonation) further compound the problem of business communication over the mobile phone. In combination, the extended parametric picture and the persistent occurrence of the “telephone voice” even under quiet 4G conditions suggest that a speech-in-noise-like (i.e. Lombard) adaption is not the only and perhaps not even the primary cause behind the telephone voice. Based on this, we propose a number of innovations and R&D activities for making mobile-phone technology more suitable for business communication.
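
A minimal sketch of the kind of acoustic measures such a speech-signal analysis relies on (pitch level, pitch variability as a proxy for monotony, and an RMS loudness proxy), computed with librosa on a single illustrative file; it is not the authors' analysis pipeline.

```python
# Minimal sketch of acoustic measures behind a "telephone voice" analysis:
# pitch level, pitch variability (monotony), and loudness for one utterance
# (librosa-based illustration, not the authors' pipeline; file name is a placeholder).
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)       # illustrative file name

f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
f0 = f0[voiced & np.isfinite(f0)]                      # keep voiced frames only

mean_f0 = f0.mean()                                    # pitch level (Hz)
f0_sd = f0.std()                                       # low SD ~ monotonous intonation
rms_db = 20 * np.log10(librosa.feature.rms(y=y).mean() + 1e-12)  # loudness proxy

print(f"mean F0 {mean_f0:.0f} Hz, F0 SD {f0_sd:.0f} Hz, level {rms_db:.1f} dBFS")
```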


2016 ◽  
Vol 1 (1) ◽  
pp. 139-150
Author(s):  
Robert Wielgat ◽  
Anita Lorenc

Electromagnetic articulography (EMA) is a precise method for assessing the speech articulators using sensors placed mainly on the tongue. Various methods are being developed to avoid the need for EMA sensors; one of them is speech inversion. Here, preliminary research on speech inversion based on the dynamic time warping (DTW) method is described. Mel-frequency cepstral coefficients (MFCC) were chosen as the parametrization of the acoustic speech signal. Root mean square errors (RMSE) of the evaluation are presented and discussed.
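
A compact sketch of DTW-based speech inversion in the spirit described above: align the MFCCs of a test utterance to a reference utterance with known EMA trajectories, transfer the trajectories along the warping path, and score with RMSE. File names and data are illustrative, and this is a generic illustration rather than the authors' method.

```python
# Compact sketch of DTW-based speech inversion: align MFCCs of a test utterance
# to a reference with known EMA trajectories, copy trajectories along the warping
# path, and score with RMSE (all file names and data here are illustrative).
import numpy as np
import librosa

def mfcc(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # [n_mfcc, n_frames]

ref_mfcc = mfcc("reference.wav")                 # utterance with known EMA data
test_mfcc = mfcc("test.wav")                     # utterance to invert
ref_ema = np.load("reference_ema.npy")           # [n_frames_ref, n_sensors] trajectories

# DTW alignment between the two MFCC sequences (librosa aligns along columns).
_, wp = librosa.sequence.dtw(X=test_mfcc, Y=ref_mfcc, metric="euclidean")
wp = wp[::-1]                                    # warping path from start to end

# Predict EMA for each test frame by copying the aligned reference frame.
pred_ema = np.zeros((test_mfcc.shape[1], ref_ema.shape[1]))
for i, j in wp:
    pred_ema[i] = ref_ema[j]

true_ema = np.load("test_ema.npy")               # held-out ground truth for scoring
rmse = np.sqrt(((pred_ema - true_ema) ** 2).mean())
```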

