Voice Conversion
Recently Published Documents

TOTAL DOCUMENTS: 887 (five years: 291)
H-INDEX: 36 (five years: 7)

2022 ◽  
pp. 61-77
Author(s):  
Jie Lien ◽  
Md Abdullah Al Momin ◽  
Xu Yuan

Voice assistant systems (e.g., Siri, Alexa) have attracted wide research attention. However, such systems can receive voice input from malicious sources. Recent work has demonstrated that voice authentication systems are vulnerable to several types of attacks, which fall into two main categories: spoofing attacks and hidden voice commands. This chapter explores how to launch and defend against such attacks. Spoofing attacks take four main forms: replay attacks, impersonation attacks, speech synthesis attacks, and voice conversion attacks. Although these attacks can be effective against speech recognition systems, they are easily identified by human listeners. Hidden voice commands, which are designed to evade human detection, have therefore attracted considerable research interest in recent years.
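For quick reference, the chapter's two-level categorization can be written down as a small lookup table; a minimal Python sketch (the identifier names are illustrative, not from the chapter):

```python
# Two-level taxonomy of attacks on voice authentication systems, as
# categorized above. Names are illustrative only.
VOICE_ATTACK_TAXONOMY = {
    "spoofing": [
        "replay",            # play back a recording of the genuine speaker
        "impersonation",     # a human mimics the target speaker
        "speech_synthesis",  # generate the target voice with TTS
        "voice_conversion",  # transform another voice into the target's
    ],
    # Hidden voice commands drive the recognizer while remaining hard
    # for human listeners to notice; the chapter treats them as a
    # separate top-level category rather than a kind of spoofing.
    "hidden_voice_commands": [],
}
```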


2022 ◽  
Vol 70 (2) ◽  
pp. 4027-4051
Author(s):  
Palli Padmini ◽  
C. Paramasivam ◽  
G. Jyothish Lal ◽  
Sadeen Alharbi ◽  
Kaustav Bhowmick
Keyword(s):  

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Mourad Talbi ◽  
Med Salim Bouhlel

Speech enhancement has gained considerable attention in applications such as speech transmission over communication channels, speaker identification, speech-based biometric systems, video conferencing, hearing aids, mobile phones, voice conversion, and microphones. Handling background noise is essential when designing a successful speech enhancement system. In this work, a new speech enhancement technique based on the Stationary Bionic Wavelet Transform (SBWT) and the Minimum Mean Square Error (MMSE) estimate of spectral amplitude is proposed. The technique first applies the SBWT to the noisy speech signal in order to obtain eight noisy wavelet coefficient subbands. Each of these subbands is then denoised by applying the MMSE spectral amplitude estimator. Finally, the inverse transform, SBWT⁻¹, is applied to the denoised stationary wavelet coefficients to obtain the enhanced speech signal. The proposed technique's performance is evaluated using the Signal-to-Noise Ratio (SNR), the Segmental SNR (SSNR), and the Perceptual Evaluation of Speech Quality (PESQ).
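The overall analyze-denoise-reconstruct pipeline can be sketched in a few lines. A minimal Python sketch under loud assumptions: the SBWT is not available in standard libraries, so PyWavelets' stationary wavelet transform (pywt.swt) stands in for it, and simple soft thresholding stands in for the paper's MMSE spectral amplitude estimator:

```python
import numpy as np
import pywt  # PyWavelets

def enhance(noisy, wavelet="db4", level=3):
    """Denoise a 1-D speech signal: forward SWT -> per-subband
    denoising -> inverse SWT (the SBWT^-1 step in the paper)."""
    # pywt.swt requires the length to be divisible by 2**level.
    n = len(noisy)
    pad = (-n) % (2 ** level)
    x = np.pad(noisy, (0, pad))

    # Forward stationary transform: one (approx, detail) pair per level.
    coeffs = pywt.swt(x, wavelet, level=level)

    # Noise level from the finest detail band (median rule), then soft
    # thresholding of every subband -- a stand-in for MMSE estimation.
    sigma = np.median(np.abs(coeffs[-1][1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))
    denoised = [(pywt.threshold(cA, thr, mode="soft"),
                 pywt.threshold(cD, thr, mode="soft"))
                for cA, cD in coeffs]

    # Inverse transform and trim the padding.
    return pywt.iswt(denoised, wavelet)[:n]

def snr_db(clean, estimate):
    """Global SNR in dB, one of the paper's evaluation metrics."""
    noise = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))
```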


Author(s):  
Kun Zhou ◽  
Berrak Sisman ◽  
Rui Liu ◽  
Haizhou Li
Keyword(s):  

2021 ◽  
Vol 141 (12) ◽  
pp. 1267-1268
Author(s):  
Hiroki Nonoyama ◽  
Chifumi Suzuki ◽  
Takanori Nishino

Author(s):  
Wei-Zhong Zheng ◽  
Ji-Yan Han ◽  
Chen-Kai Lee ◽  
Yu-Yi Lin ◽  
Shu-Han Chang ◽  
...  

Author(s):  
Fangkun Liu ◽  
Hui Wang ◽  
Renhua Peng ◽  
Chengshi Zheng ◽  
Xiaodong Li

Abstract
Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. Recently, one-shot voice conversion has become a hot topic because of its potentially wide range of applications: it can convert the voice of any source speaker to that of any target speaker, even when both are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve naturalness, this paper proposes a two-level nested U-structure (U2-Net) voice conversion algorithm called U2-VC. The U2-Net extracts both local and multi-scale features from the log-mel spectrogram, which helps the model learn the time-frequency structures of the source and target speech. Moreover, sandwich adaptive instance normalization (SaAdaIN) is adopted in the decoder for speaker identity transformation, retaining more content information from the source speech while maintaining speaker similarity between the converted speech and the target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art approaches, including AGAIN-VC and AdaIN-VC, in both objective and subjective measurements.
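To make the normalization step concrete, below is a minimal PyTorch sketch of plain adaptive instance normalization (AdaIN), the mechanism that SaAdaIN extends; the exact sandwich formulation is specific to the paper and is not reproduced here:

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """content, style: (batch, channels, frames) feature maps.

    Per-channel normalization strips the source speaker's statistics
    from the content features; rescaling with the style features'
    statistics injects the target speaker's identity.
    """
    c_mean = content.mean(dim=2, keepdim=True)
    c_std = content.std(dim=2, keepdim=True) + eps
    s_mean = style.mean(dim=2, keepdim=True)
    s_std = style.std(dim=2, keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

# Usage: 80-bin log-mel features; frame counts need not match because
# the style branch is reduced to per-channel statistics.
src = torch.randn(1, 80, 200)  # source-content features
tgt = torch.randn(1, 80, 150)  # target-speaker features
converted = adain(src, tgt)    # shape (1, 80, 200)
```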


Author(s):  
Aakshi Mittal ◽  
Mohit Dua

Abstract
Spoof detection is essential for improving the performance of current Automatic Speaker Verification (ASV) systems, and strengthening both the frontend and the backend helps build robust ASV systems. First, this paper compares the performance of static and static–dynamic Constant Q Cepstral Coefficient (CQCC) frontend features, using a Long Short-Term Memory (LSTM) model with Time Distributed wrappers at the backend. Second, it compares ASV systems built with three deep learning backends, LSTM with Time Distributed wrappers, plain LSTM, and a Convolutional Neural Network (CNN), all using static–dynamic CQCC frontend features. Third, it presents two spoof detection systems for ASV that use the same static–dynamic CQCC frontend features with different combinations of deep learning models at the backend. The first is a voting-protocol-based two-level spoof detection system that uses CNN and LSTM models at the first level and an LSTM with Time Distributed wrappers at the second level. The second is a two-level spoof detection system with a user identification and verification protocol, which uses an LSTM model for user identification at the first level and an LSTM with Time Distributed wrappers for verification at the second level. For the proposed work, a variation of the ASVspoof 2019 dataset is used so that all types of spoofing attacks, namely Speech Synthesis (SS), Voice Conversion (VC), and replay, appear in a single dataset. The results show that, at the frontend, static–dynamic CQCC features outperform static CQCC features and, at the backend, hybrid combinations of deep learning models increase the accuracy of the spoof detection systems.
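Below is a minimal Keras sketch of the recurring backend building block, an LSTM followed by a TimeDistributed dense layer, classifying CQCC feature sequences as genuine or spoofed. The layer sizes and the 90-dimensional static–dynamic CQCC input (30 static + 30 delta + 30 delta-delta coefficients) are assumptions for illustration, not the paper's exact configuration:

```python
from tensorflow.keras import layers, models

def build_lstm_td(feat_dim=90, num_classes=2):
    """LSTM + TimeDistributed backend over variable-length
    CQCC frame sequences."""
    model = models.Sequential([
        layers.Input(shape=(None, feat_dim)),             # (frames, CQCC)
        layers.LSTM(128, return_sequences=True),          # temporal model
        layers.TimeDistributed(layers.Dense(64, activation="relu")),
        layers.GlobalAveragePooling1D(),                  # pool over time
        layers.Dense(num_classes, activation="softmax"),  # genuine/spoof
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In the paper's two-level systems, a model like this occupies the second level; the first level is either a CNN/LSTM voting stage or an LSTM-based user identification stage.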

