Voice Conversion Using a Perceptual Criterion

2020 · Vol 10 (8) · pp. 2884
Author(s): Ki-Seung Lee

In voice conversion (VC), it is highly desirable to obtain transformed speech signals that are perceptually close to the target speaker's voice. To this end, the proposed VC scheme adopts a perceptually meaningful criterion that takes the human auditory system into account when measuring the distance between the converted and target voices. The conversion rules for the features associated with the spectral envelope and the pitch modification factor were jointly constructed so that the perceptual distance measure was minimized. This minimization problem was solved within a deep neural network (DNN) framework in which the input and target features were derived from the source speech signals and time-aligned versions of the target speech signals, respectively. Validation tests were carried out on the CMU ARCTIC database to evaluate the effectiveness of the proposed method, especially in terms of perceptual quality. The experimental results showed that the proposed method yielded perceptually preferred results compared with independent conversion using the conventional mean-square error (MSE) criterion. The maximum improvement in perceptual evaluation of speech quality (PESQ) was 0.312 compared with the conventional VC method.
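As a sketch of the idea, a perceptual criterion can be approximated by weighting each spectral bin's squared error with a frequency-dependent importance factor before averaging. The weights below are hypothetical; the paper derives its weighting from an auditory model.

```python
def perceptual_distance(src_logspec, tgt_logspec, weights):
    """Weighted squared distance between two log-spectral frames.

    Unlike plain MSE, each frequency bin's error is scaled by a
    perceptual importance factor before averaging.
    """
    return sum(w * (s - t) ** 2
               for w, s, t in zip(weights, src_logspec, tgt_logspec)) / len(weights)

# Toy example: four spectral bins, mid bins weighted more heavily.
w = [0.5, 1.0, 1.0, 0.5]
d = perceptual_distance([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 4.0], w)
```

With plain MSE the same 0.5 error in any bin would cost the same; here an error in a heavily weighted bin costs more, mimicking the auditory weighting described above.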

2022 · Vol 12 (2) · pp. 827
Author(s): Ki-Seung Lee

Previously established silent speech interface (SSI) methods achieve only moderate intelligibility and naturalness. A common problem with SSI is the difficulty of estimating spectral details, which results in synthesized speech that sounds rough, harsh, and unclear. In this study, harmonic enhancement (HE) was applied during postprocessing to alleviate this problem by emphasizing the spectral fine structure of the speech signals. To improve the subjective quality of the synthesized speech, the difference between synthesized and actual speech was measured as a distance in the perceptual domain rather than by the conventional mean square error (MSE). Two deep neural networks (DNNs), connected in a cascade, were employed to separately estimate the speech spectra and the HE filter coefficients. The DNNs were trained to incrementally and iteratively minimize both the MSE and the perceptual distance (PD). A feasibility test showed that the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility measure (STOI) improved by 17.8% and 2.9%, respectively, compared with previous methods. Subjective listening tests confirmed that the proposed method yielded perceptually preferred results compared with the conventional MSE-based method.
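To illustrate what a harmonic-enhancement postfilter does, the minimal sketch below reinforces components at multiples of a known pitch period using a feed-forward comb filter. The paper instead learns the filter coefficients with a DNN, so the fixed period and gain here are illustrative stand-ins.

```python
def harmonic_enhance(signal, period, alpha=0.3):
    """Toy harmonic-enhancement postfilter.

    Feed-forward comb filter y[n] = x[n] + alpha * x[n - period],
    which boosts spectral components at multiples of the pitch
    frequency (i.e. the harmonic fine structure).
    """
    out = []
    for n, x in enumerate(signal):
        delayed = signal[n - period] if n >= period else 0.0
        out.append(x + alpha * delayed)
    return out

# A pulse train with period 4 gets its repeated pulses reinforced.
enhanced = harmonic_enhance([1, 0, 0, 0, 1, 0, 0, 0], period=4, alpha=0.5)
```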


2020 · Vol 10 (17) · pp. 6039
Author(s): María Navarro-Cáceres, Javier Félix Merchán Sánchez-Jara, Valderi Reis Quietinho Leithardt, Raúl García-Ovejero

In Western tonal music, tension in chord progressions plays an important role in defining the path a musical composition should follow. Creating chord progressions that reflect a given tension profile can be challenging for novice composers, as it depends on many subjective factors and is also governed by multiple theoretical principles. This work presents ChordAIS-Gen, a tool that assists users in generating chord progressions that comply with a given tension profile. We propose an objective measure that captures the tension profile of a chord progression according to several tonal-music parameters, namely consonance, hierarchical tension, voice leading, and perceptual distance. This measure is optimized by a genetic programming algorithm combined with an Artificial Immune System called Opt-aiNet. Opt-aiNet can find multiple optima in parallel, yielding multiple candidate solutions for the next chord in a sequence. To validate the objective function, we performed a listening test to evaluate the perceptual quality of the candidate solutions proposed by our system. Most listeners rated the chord progressions proposed by ChordAIS-Gen as better candidates than the discarded progressions. We therefore propose using the objective values as a proxy for the perceptual evaluation of chord progressions, and we compare the performance of ChordAIS-Gen with that of other chord-progression generators.
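As a rough illustration of one component of such an objective measure, the sketch below scores a chord's tension by summing pairwise interval dissonances over its pitch classes. The dissonance weights are hypothetical, and the paper's measure additionally combines hierarchical tension, voice leading, and perceptual distance.

```python
# Hypothetical dissonance weights for pitch-class intervals 0..6
# (unison/octave consonant, semitone and tritone dissonant).
DISSONANCE = {0: 0.0, 1: 1.0, 2: 0.8, 3: 0.3, 4: 0.2, 5: 0.1, 6: 0.9}

def chord_tension(chord):
    """Sum pairwise interval dissonance over a chord's pitch classes."""
    total = 0.0
    notes = sorted(set(n % 12 for n in chord))
    for i in range(len(notes)):
        for j in range(i + 1, len(notes)):
            iv = (notes[j] - notes[i]) % 12
            iv = min(iv, 12 - iv)  # fold inversions to the range 0..6
            total += DISSONANCE[iv]
    return total

# A C major triad (C E G) should score lower tension than a cluster (C C# D).
assert chord_tension([0, 4, 7]) < chord_tension([0, 1, 2])
```

A generator can then rank candidate next chords by how closely their score matches the target tension at that point in the profile.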


Author(s): Mourad Talbi, Med Salim Bouhlel

Background: In this paper, we propose a secure image watermarking technique that is applied to grayscale and color images. It consists of applying the SVD (Singular Value Decomposition) in the Lifting Wavelet Transform domain to embed a speech image (the watermark) into the host image. Methods: The technique also uses a signature in the embedding and extraction steps. Its performance is assessed by computing the PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity), SNR (Signal-to-Noise Ratio), SegSNR (Segmental SNR), and PESQ (Perceptual Evaluation of Speech Quality). Results: The PSNR and SSIM evaluate the perceptual quality of the watermarked image relative to the original image. The SNR, SegSNR, and PESQ evaluate the perceptual quality of the reconstructed (extracted) speech signal relative to the original speech signal. Conclusion: The results obtained from the PSNR, SSIM, SNR, SegSNR, and PESQ computations demonstrate the performance of the proposed technique.
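A minimal sketch of SVD-based watermark embedding, shown here in the spatial domain for brevity (the paper applies it in the Lifting Wavelet Transform domain, and the exact embedding rule and signature step are not specified in this abstract). The watermark perturbs the host's singular values with strength alpha, and the host's SVD factors act as extraction keys.

```python
import numpy as np

def embed_svd(host, watermark, alpha=0.05):
    """Embed by adding alpha-scaled watermark singular values to the host's."""
    U, S, Vt = np.linalg.svd(host, full_matrices=False)
    Sw = np.linalg.svd(watermark, compute_uv=False)
    watermarked = U @ np.diag(S + alpha * Sw) @ Vt
    return watermarked, (U, Vt, S)  # keys needed for extraction

def extract_svd(watermarked, keys, alpha=0.05):
    """Recover the watermark's singular values using the stored keys."""
    U, Vt, S = keys
    S_marked = np.diag(U.T @ watermarked @ Vt.T)
    return (S_marked - S) / alpha

rng = np.random.default_rng(0)
host = rng.random((4, 4))
wm = rng.random((4, 4))
marked, keys = embed_svd(host, wm)
recovered = extract_svd(marked, keys)
```

Since U and Vt are orthogonal, U.T @ watermarked @ Vt.T recovers the modified singular values exactly, so the recovered values match the watermark's singular values up to floating-point error.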


2019 · Vol 24 (4) · pp. 728-735
Author(s): Mourad Talbi, Med Salim Bouhlel

In this paper, a new speech compression technique is proposed. It applies a psychoacoustic model and a general filter-bank design approach based on optimization. It is evaluated against a compression technique that uses an MDCT (Modified Discrete Cosine Transform) filter bank of 32 filters together with a psychoacoustic model. The evaluation compares the number of bits before and after compression, PSNR (Peak Signal-to-Noise Ratio), NRMSE (Normalized Root Mean Square Error), SNR (Signal-to-Noise Ratio), and PESQ (Perceptual Evaluation of Speech Quality). Both techniques are tested on a number of speech signals sampled at 8 kHz. The results show that the proposed technique outperforms the second technique (based on a psychoacoustic model and an MDCT filter bank) in terms of bits after compression and compression ratio; in particular, it yields higher compression ratios. Moreover, the proposed technique produces reconstructed speech signals of acceptable perceptual quality, as confirmed by the SNR, PSNR, NRMSE, and PESQ values.
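For reference, the MDCT used by the baseline maps a frame of 2N samples to N coefficients. A naive O(N²) sketch of the forward transform (real codecs use an FFT-based fast version plus windowed 50%-overlap framing, which this sketch omits):

```python
import math

def mdct(frame):
    """Direct MDCT of one frame of length 2N -> N coefficients:
    X[k] = sum_{n=0}^{2N-1} x[n] * cos((pi/N) * (n + 0.5 + N/2) * (k + 0.5)).
    """
    N = len(frame) // 2
    return [sum(frame[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

coeffs = mdct([0.2, 0.5, -0.1, 0.3, 0.0, -0.4, 0.1, 0.2])  # 8 samples -> 4 coefficients
```

The 2-to-1 sample-to-coefficient mapping is what makes the MDCT attractive for compression: with overlapped frames it is critically sampled, so no extra coefficients are produced despite the overlap.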


2012 · Vol 2012 · pp. 1-12
Author(s): Novlene Zoghlami, Zied Lachiri

This paper describes a new speech enhancement approach using perceptually motivated noise reduction. It applies two perceptual filtering models to the noisy speech signal: the gammatone and gammachirp filter banks, with nonlinear frequency resolution on the equivalent rectangular bandwidth (ERB) scale. The perceptual filtering yields a set of subbands that are individually spectrally weighted and modified according to two different noise-suppression rules. An accurate noise estimate is important for reducing the musical-noise artifacts that appear in processed speech after classic subtractive processing; in this context, we use continuous noise-estimation algorithms. The performance of the proposed approach is evaluated on speech signals corrupted by real-world noises. Using objective tests based on the PESQ perceptual quality score and the ratings of signal distortion (SIG), noise distortion (BAK), and overall quality (OVRL), together with a subjective test based on automatic speech recognition (ASR) quality ratings, we demonstrate that our approach, using filter banks that model the human auditory system, outperforms conventional spectral-modification algorithms in improving the quality and intelligibility of the enhanced speech.
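A minimal sketch of the two ingredients named above: per-subband subtractive weighting with a spectral floor (the floor limits the musical-noise artifacts), and a continuous noise estimate via recursive averaging. The inputs stand for per-subband power estimates (e.g. gammatone-filterbank outputs), and the parameter values are illustrative.

```python
def spectral_subtract(noisy_power, noise_power, beta=0.01):
    """Power spectral subtraction per subband with a spectral floor:
    output = max(noisy - noise, beta * noise). The floor prevents
    negative/near-zero bands whose random reappearance causes
    musical noise."""
    return [max(p - n, beta * n) for p, n in zip(noisy_power, noise_power)]

def update_noise(noise_power, noisy_power, lam=0.98):
    """Continuous (recursive-averaging) noise estimate per subband:
    N[k] <- lam * N[k] + (1 - lam) * P[k]."""
    return [lam * n + (1 - lam) * p for n, p in zip(noise_power, noisy_power)]

# One frame: subtract the current noise estimate, then update it.
noise = [1.0, 0.5]
clean = spectral_subtract([2.0, 0.4], noise)
noise = update_noise(noise, [2.0, 0.4])
```

In a full system this runs frame by frame inside each filter-bank channel, and the noise update is typically gated so it only tracks during speech pauses.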


2012 · Vol 48 (16) · pp. 1019-1021
Author(s): D. Erro, E. Navas, I. Sainz, I. Hernaez
