Speech Enhancement Based on Linear Prediction and Correlation-Inputting Bias Free Equation Error ADF

Author(s):  
Naoto Sasaoka ◽  
Shinichi Wada ◽  
James Okello ◽  
Yoshio Itoh ◽  
Masaki Kobayashi

In this paper, a speech enhancement technique to reduce background noise in noisy speech is proposed. We investigated a noise reconstruction system (NRS) based on linear prediction and system identification for speech enhancement. Assuming that the background noise is generated by exciting a linear filter with white noise, the system identification stage estimates the background noise from the estimated white noise. However, the white noise estimated by a linear prediction error filter (LPEF) includes residual speech, which degrades the estimation accuracy of the system identification and deteriorates the quality of the enhanced speech. In order to reduce the influence of the residual speech, a lattice filter and a bias free equation error adaptive digital filter (ADF) are introduced into the LPEF and the system identification stage, respectively. The residual speech is reduced by the lattice filter, which approximates a vocal-tract filter well. The bias free equation error ADF, in turn, uses the cross-correlation between the whitened noise and a desired signal as its tap input. Since speech is uncorrelated with the whitened noise, the tap coefficients converge without being influenced by the speech.
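
As a concrete illustration of the whitening stage, the following is a minimal sketch of a linear prediction error filter in Python, assuming the autocorrelation-method LPC; the function names, model order, and test signal are illustrative, not the authors' NRS.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))  # prediction error filter A(z)

def prediction_error(x, order=12):
    """Whiten x with its own inverse LPC filter (the LPEF stage)."""
    return lfilter(lpc_coefficients(x, order), [1.0], x)

# Example: the residual of AR(2)-coloured noise is approximately white.
rng = np.random.default_rng(0)
coloured = lfilter([1.0], [1.0, -0.8, 0.2], rng.standard_normal(8000))
residual = prediction_error(coloured)
print(np.var(coloured), np.var(residual))  # residual variance is far smaller
```

In the proposed scheme, the residual of such a stage would still carry speech, which is exactly what the lattice filter and the bias free ADF are introduced to compensate for.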

Acoustics ◽  
2019 ◽  
Vol 1 (3) ◽  
pp. 711-725
Author(s):  
Nikolaos Kilis ◽  
Nikolaos Mitianoudis

This paper presents a novel scheme for speech dereverberation. The core of our method is a two-stage single-channel speech enhancement scheme. In the first stage, the linear prediction residual of the degraded speech is given a sparser representation by applying orthogonal matching pursuit over overcomplete bases trained with the K-SVD algorithm. Our method includes an estimation of the reverberation and mixing times from a recorded hand clap or a simulated room impulse response, which are used to create a time-domain envelope. In the second stage, late reverberation is suppressed by estimating its energy from this envelope and removing it with spectral subtraction. Further enhancement reduces the background noise, based on optimal smoothing and minimum statistics. Experimental results indicate favorable quality compared to two state-of-the-art methods, especially in real reverberant environments with increased reverberation and background noise.
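
For the second stage, a hedged sketch of late-reverberation suppression by spectral subtraction is shown below, using a generic exponential-decay energy model in place of the paper's hand-clap-derived envelope; all parameter names and values are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs, rt60=0.6, late_start=0.05, floor=0.05):
    """Subtract an estimated late-reverberation power envelope per frame."""
    f, t, X = stft(x, fs, nperseg=512)           # 50% overlap by default
    power = np.abs(X) ** 2
    hop = 256 / fs                               # frame hop in seconds
    delay = max(1, int(late_start / hop))        # frames until the "late" part
    decay = np.exp(-13.8 * delay * hop / rt60)   # power falls 60 dB over RT60
    late = np.zeros_like(power)
    late[:, delay:] = decay * power[:, :-delay]  # delayed, attenuated energy
    gain = np.maximum(1.0 - late / np.maximum(power, 1e-12), floor)
    _, y = istft(np.sqrt(gain) * X, fs, nperseg=512)
    return y
```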


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1888
Author(s):  
Juraj Kacur ◽  
Boris Puterka ◽  
Jarmila Pavlovicova ◽  
Milos Oravec

Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, and to what extent. This study extends the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification regions: lengths and overlaps), frequency ranges, frequency scales, processing of the whole speech (spectrograms), vocal tract (filter banks, linear prediction coefficient (LPC) modeling), and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross-validation, the paired t-test, and rank and Pearson correlations. The results revealed several settings in the 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Spectrograms, which carry both vocal tract and excitation information, also scored well. It was found that even basic processing such as pre-emphasis, segmentation, and magnitude modifications can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
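
As an illustration of the best-scoring front end, here is a short sketch of pre-emphasis followed by a psychoacoustic (mel) filter bank restricted to 0-8 kHz; the frame size, overlap, and number of bands are assumptions, not the study's exact settings.

```python
import numpy as np
import librosa

def mel_features(y, sr=16000, n_mels=40):
    """Pre-emphasis, framing, and a 0-8 kHz mel filter bank."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])      # pre-emphasis
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=256,      # 32 ms frames, 50% overlap
        n_mels=n_mels, fmin=0, fmax=8000)
    return librosa.power_to_db(S)                   # log filter-bank energies
```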


Signals ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 434-455
Author(s):  
Sujan Kumar Roy ◽  
Kuldip K. Paliwal

Inaccurate estimates of the linear prediction coefficients (LPCs) and noise variance introduce bias into the Kalman filter (KF) gain and degrade speech enhancement performance. Existing methods propose tuning the biased Kalman gain, particularly under stationary noise conditions. This paper introduces a tuning of the KF gain for speech enhancement under real-life noise conditions. First, we estimate the noise in each noisy speech frame using a speech presence probability (SPP) method and compute the noise variance from it. Then, we construct a whitening filter (with coefficients computed from the estimated noise) to pre-whiten each noisy speech frame before computing the speech LPC parameters. We then construct the KF with the estimated parameters, where a robustness metric offsets the bias in the KF gain during speech absence and a sensitivity metric does so during speech presence, achieving better noise reduction. The noise variance and the speech model parameters also serve as a speech activity detector. The reduced-bias Kalman gain enables the KF to attenuate the noise significantly, yielding the enhanced speech. Objective and subjective scores on the NOIZEUS corpus demonstrate that the enhanced speech produced by the proposed method exhibits higher quality and intelligibility than several benchmark methods.
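
A minimal sketch of the pre-whitening step is given below: an LPC model fitted to the estimated noise inverse-filters the noisy frame before the speech LPCs are computed. The SPP noise estimator and the KF gain tuning itself are outside this sketch, and the function names and model orders are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC coefficients of x."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

def prewhiten_frame(noisy_frame, noise_estimate, noise_order=10, speech_order=16):
    """Flatten the noise spectrum, then fit speech LPCs for the KF."""
    whitening = np.concatenate(([1.0], -lpc(noise_estimate, noise_order)))
    whitened = lfilter(whitening, [1.0], noisy_frame)
    return whitened, lpc(whitened, speech_order)
```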


2005 ◽  
Vol 83 (7) ◽  
pp. 721-737
Author(s):  
H Teffahi ◽  
B Guerin ◽  
A Djeradi

Knowledge of vocal tract area functions is important for understanding the phenomena occurring during speech production. We present here a new measurement method based on external excitation of the vocal tract with a known pseudo-random sequence, where the area function is obtained by linear prediction analysis applied to the cross-correlation between the sequence and the signal measured at the lips. The advantages of this method over methods based on sweep tones or white noise excitation are (1) a much shorter measurement time (about 100 ms) and (2) the possibility of producing speech sounds during the measurement. The method has been checked against classical methods through systematic comparisons on a small corpus of vowels. Moreover, it has been verified that simultaneous speech sound production does not significantly perturb the measurements. This method should thus be a very helpful tool for investigating the acoustic properties of the vocal tract for vowels in various conditions.
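
The chain from cross-correlation to area function can be sketched as follows, using the Levinson-Durbin recursion for reflection coefficients and the stepwise-tube area relation; the sign convention, model order, and lip-area normalization are assumptions rather than the authors' exact procedure.

```python
import numpy as np

def reflection_coefficients(x, order):
    """Levinson-Durbin on the autocorrelation of x; returns k_1..k_p."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        k[i - 1] = -np.dot(a[:i], r[i:0:-1]) / e
        a[1:i + 1] += k[i - 1] * a[i - 1::-1]
        e *= 1.0 - k[i - 1] ** 2
    return k

def area_function(h, order=18, lip_area=1.0):
    """Stepwise tube areas from reflection coefficients (one sign
    convention assumed; the opposite convention inverts each ratio)."""
    areas = [lip_area]
    for ki in reflection_coefficients(h, order):
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)

# h would be the impulse response recovered by cross-correlating the known
# pseudo-random excitation with the signal recorded at the lips, e.g.:
# h = np.correlate(lip_signal, excitation, mode="full")
```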


2018 ◽  
Vol 29 (1) ◽  
pp. 565-582
Author(s):  
T.R. Jayanthi Kumari ◽  
H.S. Jayanna

In many biometric applications, limited-data speaker verification plays a significant role in practically oriented systems. The performance of such systems needs to be improved by applying techniques suited to the limited-data condition, in which both the training and test data last only a few seconds. This article shows the importance of feature- and score-level fusion techniques for speaker verification under the limited-data condition. The baseline system uses vocal tract features, namely mel-frequency cepstral coefficients and linear predictive cepstral coefficients, and excitation source features, namely the linear prediction residual and the linear prediction residual phase, together with i-vector modeling on the NIST 2003 data set. In feature-level fusion, the vocal tract features are fused with the excitation source features, yielding, on average, an equal error rate (EER) of approximately 4%, an improvement over individual feature performance. Further, two types of score-level fusion are demonstrated. In the first, the scores of the vocal tract and excitation source features are fused while the modeling technique remains the same, providing an average reduction of approximately 2% EER compared to feature-level fusion. In the second, the scores of different modeling techniques are combined, resulting in an EER reduction of approximately 4.5% compared with score-level fusion of different features.
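
A hedged sketch of score-level fusion as described above: trial scores from the vocal tract and excitation source systems are normalized and combined with a weighted sum, from which the EER is read. The normalization, weight, and EER search are illustrative choices, not the article's exact procedure.

```python
import numpy as np

def z_norm(s):
    """Zero-mean, unit-variance score normalization."""
    return (s - s.mean()) / s.std()

def fuse_scores(vocal_tract_scores, excitation_scores, w=0.5):
    """Weighted-sum fusion of two systems' verification scores."""
    return w * z_norm(vocal_tract_scores) + (1 - w) * z_norm(excitation_scores)

def eer(scores, labels):
    """Equal error rate: where false accepts and false rejects meet
    (labels: 1 = target trial, 0 = impostor trial)."""
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2
```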


Author(s):  
Radhika Rani L ◽  
S. Chandra lingam ◽  
Anjaneyulu T ◽  
Satyanarayana K

Congenital heart defects (CHD) are critical heart disorders that can be observed at birth. They are classified mainly into two types: cyanotic and acyanotic. The present paper concentrates on acyanotic heart disorders. An acyanotic heart disorder cannot be observed by external checkup, whereas bluish skin indicates an infant affected by a cyanotic disorder. Acyanotic heart disorders can only be diagnosed using chest X-ray, ECG, echocardiogram, cardiac catheterization, or MRI of the heart. The present work aims at estimating the fundamental frequency (pitch) and the vocal tract resonant frequencies (formants) from the cry signal of infants. The pitch and formant frequencies are estimated using a frequency-domain (cepstrum) method and linear predictive coding (LPC). The results show that the fundamental frequency of the cry signal was between 600 Hz and 800 Hz for infants with acyanotic heart disorders. This fundamental frequency helps in identifying acyanotic heart disorders at an early stage.
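
The two estimators named above can be sketched as follows: cepstral peak picking for pitch and LPC root-solving for formant candidates. The frame length, LPC order, and the 300-1000 Hz pitch search range (wide enough for a 600-800 Hz cry) are assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def cepstral_pitch(frame, fs, fmin=300, fmax=1000):
    """Pitch from the peak of the real cepstrum within a quefrency range."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    return fs / (qmin + np.argmax(cepstrum[qmin:qmax]))

def lpc_formants(frame, fs, order=12):
    """Formant candidates from the angles of the LPC polynomial roots."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.concatenate(([1.0], -solve_toeplitz(r[:order], r[1:order + 1])))
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]         # one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))   # candidates in Hz
```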


1973 ◽  
Vol 59 (2) ◽  
pp. 415-424
Author(s):  
PER S. ENGER

1. The nervous activity of single auditory neurones in the goldfish brain has been measured. 2. Four types of acoustic stimuli were used: (1) pure tones, (2) noise of one-third-octave bandwidth, (3) noise of one-octave bandwidth with centre frequency equal to the pure tone, and (4) white noise. 3. Except for white noise, these stimuli produced the same response at equal sound pressures. The white noise response was smaller, presumably because the frequency range covered by a single neurone is far narrower than the range of white noise. 4. The conclusion has been reached that for low-frequency acoustic signals, the acoustic power over a frequency band of one to two octaves is integrated by the nervous system. 5. The masking effect of background noise on the acoustic threshold of single units to pure tones is strongest when the noise band has the same centre frequency as the test tone. In this case the tone threshold increases linearly with the background noise level. 6. When the noise band was centred at a different frequency from the tone, the masking effect decreased at a rate of 20-22 dB/octave over the first one-third octave for a tone frequency of 250 Hz. For a tone of 500 Hz the masking effect of lower frequencies was stronger and was reduced by only some 9 dB/octave over the first one-third octave.

