Beijing Opera Synthesis Based on Straight Algorithm and Deep Learning

2018 ◽  
Vol 2018 ◽  
pp. 1-14
Author(s):  
XueTing Wang ◽  
Cong Jin ◽  
Wei Zhao

Speech synthesis is an important research topic in the field of human-computer interaction and has a wide range of applications; singing synthesis is one of its important branches. Beijing Opera is a famous traditional Chinese opera, often called the quintessence of Chinese culture. Its singing carries some features of speech, but it has its own pronunciation rules and rhythms that differ from both ordinary speech and ordinary singing. In this paper, we propose three models for the synthesis of Beijing Opera. First, the speech signals of the source speaker and the target speaker are analyzed with the STRAIGHT algorithm. Then, by training a GMM, we build a voice control model that takes the voice to be converted as input and outputs the converted voice. Finally, by modeling the fundamental frequency, duration, and frequency spectrum separately, a melodic control model built on a GAN synthesizes individual Beijing Opera fragments; we concatenate the fragments and superimpose the background music to obtain the complete piece. The experimental results show that the synthesized Beijing Opera is reasonably audible and can largely complete the composition of Beijing Opera. We also extend our models to human-AI cooperative music generation: given a target human voice, we can generate a Beijing Opera performance sung in that voice.
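
For readers unfamiliar with GMM-based conversion, the sketch below illustrates the standard joint-density GMM mapping step that abstracts like this one refer to. It is a minimal illustration, not the authors' implementation: STRAIGHT feature extraction is assumed to have been done already, and `src_feats`/`tgt_feats` are hypothetical time-aligned spectral feature matrices (frames x dims).

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_feats, tgt_feats, n_components=32):
    """Fit a GMM on concatenated [source; target] spectral feature vectors."""
    joint = np.hstack([src_feats, tgt_feats])
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(joint)
    return gmm

def convert_frame(gmm, x, dim):
    """Map one source frame x (length dim) to the target feature space
    via the conditional expectation E[y | x] under the joint GMM."""
    means_x = gmm.means_[:, :dim]             # source part of each mean
    means_y = gmm.means_[:, dim:]             # target part of each mean
    cov_xx = gmm.covariances_[:, :dim, :dim]
    cov_yx = gmm.covariances_[:, dim:, :dim]
    # Posterior responsibility of each mixture component given x
    resp = np.array([
        gmm.weights_[k] * multivariate_normal.pdf(x, means_x[k], cov_xx[k])
        for k in range(gmm.n_components)
    ])
    resp /= resp.sum()
    # Accumulate the per-component conditional means, weighted by posterior
    y = np.zeros(dim)
    for k in range(gmm.n_components):
        reg = cov_yx[k] @ np.linalg.inv(cov_xx[k])
        y += resp[k] * (means_y[k] + reg @ (x - means_x[k]))
    return y
```

Converting an utterance then amounts to applying `convert_frame` to each analysis frame before resynthesis with the vocoder.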

Author(s):  
Fangkun Liu ◽  
Hui Wang ◽  
Renhua Peng ◽  
Chengshi Zheng ◽  
Xiaodong Li

Voice conversion aims to transform a source speaker's voice into that of a target speaker while keeping the linguistic content unchanged. Recently, one-shot voice conversion has gradually become a hot topic because of its potentially wide range of applications: it can convert the voice of any source speaker to any target speaker, even when both are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve naturalness, this paper proposes a two-level nested U-structure (U2-Net) voice conversion algorithm called U2-VC. The U2-Net extracts both local and multi-scale features of the log-mel spectrogram, which helps to learn the time-frequency structures of the source and target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, retaining more content information from the source speech while maintaining speaker similarity between the converted and target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art approaches, including AGAIN-VC and AdaIN-VC, in terms of both objective and subjective measurements.
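
The core AdaIN operation can be illustrated with a short PyTorch sketch. The exact "sandwich" formulation of SaAdaIN is not reproduced here; the block below assumes it wraps an AdaIN step between plain instance normalizations, which is an interpretation, not the authors' code, and the `SaAdaINBlock` name is hypothetical.

```python
import torch
import torch.nn as nn

def adain(content, style_mean, style_std, eps=1e-5):
    """Normalize content features channel-wise over time, then re-scale
    and shift them with the target speaker's statistics.
    content: (batch, channels, time); style stats: (batch, channels, 1)."""
    c_mean = content.mean(dim=-1, keepdim=True)
    c_std = content.std(dim=-1, keepdim=True) + eps
    return (content - c_mean) / c_std * style_std + style_mean

class SaAdaINBlock(nn.Module):
    """Hypothetical sandwich AdaIN: plain instance norm before and after
    the stylized re-normalization, intended to keep more source content."""
    def __init__(self, channels):
        super().__init__()
        self.pre_norm = nn.InstanceNorm1d(channels, affine=False)
        self.post_norm = nn.InstanceNorm1d(channels, affine=True)

    def forward(self, content, style_mean, style_std):
        h = self.pre_norm(content)
        h = adain(h, style_mean, style_std)
        return self.post_norm(h)
```

In an AdaIN-style converter, `style_mean` and `style_std` would come from a speaker encoder run on the target utterance, so the same decoder can render any unseen voice.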


2022 ◽  
pp. 61-77
Author(s):  
Jie Lien ◽  
Md Abdullah Al Momin ◽  
Xu Yuan

Voice assistant systems (e.g., Siri, Alexa) have attracted wide research attention. However, such systems may receive voice input from malicious sources, and recent work has demonstrated that voice authentication systems are vulnerable to several types of attack, which fall into two main categories: spoofing attacks and hidden voice commands. In this chapter, how to launch and defend against such attacks is explored. Spoofing attacks come in four main types: replay attacks, impersonation attacks, speech synthesis attacks, and voice conversion attacks. Although such attacks can be effective against a speech recognition system, they are easily identified by humans; hence hidden voice commands have attracted considerable research interest in recent years.


2020 ◽  
Vol 5 (3) ◽  
pp. 229-233
Author(s):  
Olaide Ayodeji Agbolade

This research presents a neural network based voice conversion model. While it is well known that voiced sounds and prosody are the most important components of a voice conversion framework, their objective contributions, particularly in a noisy and uncontrolled environment, are not. This model uses a three-layer feedforward neural network to map the linear prediction analysis coefficients of a source speaker to the acoustic vector space of the target speaker, in order to objectively determine the contributions of the voiced, unvoiced, and supra-segmental components of sounds to the voice conversion model. Results showed that the vowels "a", "i", and "o" contribute most to conversion success. The voiceless sounds were also found to be the most affected by noisy training data: an average noise level of 40 dB above the noise floor was found to degrade voice conversion success by 55.14 percent relative to the voiced sounds. The results also show that for cross-gender voice conversion, prosody conversion is more significant when a female is the target speaker.
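
A three-layer feedforward mapping of the kind described can be sketched as follows. The layer sizes, LPC order, activation choice, and training loop are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LPCMapper(nn.Module):
    """Three-layer feedforward network mapping source-speaker LPC vectors
    to the target speaker's acoustic space (frame-by-frame regression)."""
    def __init__(self, lpc_order=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lpc_order, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, lpc_order),
        )

    def forward(self, src_lpc):
        return self.net(src_lpc)

# Training sketch: minimize MSE between predicted and target LPC frames.
model = LPCMapper()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
src_batch = torch.randn(128, 16)   # hypothetical aligned source frames
tgt_batch = torch.randn(128, 16)   # hypothetical aligned target frames
loss = loss_fn(model(src_batch), tgt_batch)
loss.backward()
opt.step()
```

Training separate mappers (or separate evaluations) for voiced, unvoiced, and supra-segmental streams is one way to isolate each component's contribution, as the study sets out to do.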


2020 ◽  
Vol 26 (4) ◽  
pp. 434-453
Author(s):  
Milan Sečujski ◽  
Darko Pekar ◽  
Siniša Suzić ◽  
Anton Smirnov ◽  
Tijana Nosek

The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style from a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e., the mapping of discrete variables into continuous vectors in a low-dimensional space, which has proved to be a very successful and widely applicable deep learning technique. In this case, different speaker/style combinations are mapped to different points in a low-dimensional space, which enables the network to capture the similarities and differences between speakers and speaking styles more efficiently. The initial model from which speaker/style adaptation was carried out was a multi-speaker/multi-style model trained on 8.5 hours of American English speech data corresponding to 16 different speaker/style combinations. The experimental results show that both versions of the obtained system, one using 10 minutes and the other as little as 30 seconds of target data, outperform the state of the art in parametric speaker/style-dependent speech synthesis. This opens up a wide range of applications for speaker/style-dependent speech synthesis based on small quantities of training data, in domains ranging from customer interaction in call centers to robot-assisted medical therapy.
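
The embedding-based conditioning can be illustrated with a short PyTorch sketch. The `ConditionedAcousticModel` name and all dimensions are hypothetical, and the real system is a full parametric synthesizer rather than this toy regressor.

```python
import torch
import torch.nn as nn

class ConditionedAcousticModel(nn.Module):
    """Acoustic model conditioned on a learned speaker/style embedding:
    each speaker/style combination indexes a low-dimensional vector, so
    similar voices/styles end up near each other in the embedding space."""
    def __init__(self, n_combinations=16, embed_dim=8,
                 linguistic_dim=300, acoustic_dim=80):
        super().__init__()
        self.embed = nn.Embedding(n_combinations, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(linguistic_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, acoustic_dim),
        )

    def forward(self, linguistic_feats, combo_id):
        # Broadcast the embedding across all frames of the utterance.
        e = self.embed(combo_id).expand(linguistic_feats.size(0), -1)
        return self.net(torch.cat([linguistic_feats, e], dim=-1))

# Usage sketch:
# model = ConditionedAcousticModel()
# feats = torch.randn(200, 300)          # 200 frames of linguistic features
# out = model(feats, torch.tensor(3))    # condition on combination #3
```

Adapting to a new voice with 30 seconds of data can then amount to fitting a new embedding row (optionally fine-tuning the network), which is what makes such small adaptation sets feasible.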


2014 ◽  
Vol 530-531 ◽  
pp. 1112-1118
Author(s):  
Ye Fen Yang ◽  
Jun Zhang ◽  
Dong Hai Zeng

A design for a remote voice control system for the intelligent home, based on the Android mobile phone platform, is presented. Using the voice recognition capability of the Android phone, the program gives the intelligent home a remote voice control function, which greatly improves the security of the intelligent home. The system has been tested and shown to be real-time, effective, and stable; it can also serve as a practical reference solution for human-computer interaction, with a wide range of applications.


Author(s):  
Songxiang Liu ◽  
Jinghua Zhong ◽  
Lifa Sun ◽  
Xixin Wu ◽  
Xunying Liu ◽  
...  

2007 ◽  
Vol 122 (1) ◽  
pp. 46-51 ◽  
Author(s):  
I N Steen ◽  
K MacKenzie ◽  
P N Carding ◽  
A Webb ◽  
I J Deary ◽  
...  

Objectives: A wide range of well-validated instruments is now available to assess voice quality and voice-related quality of life, but comparative studies of the responsiveness to change of these measures are lacking. The aim of this study was to assess the responsiveness to change of a range of different measures following voice therapy and surgery. Design: Longitudinal, cohort comparison study. Setting: Two UK voice clinics. Participants: One hundred and forty-four patients referred for treatment of benign voice disorders, 90 undergoing voice therapy and 54 undergoing laryngeal microsurgery. Main outcome measures: Three measures of self-reported voice quality (the vocal performance questionnaire, the voice handicap index and the voice symptom scale), plus the short form 36 (SF-36) general health status measure and the hospital anxiety and depression score. Perceptual, observer-rated analysis of voice quality was performed using the grade–roughness–breathiness–asthenia–strain scale. We compared the effect sizes (i.e., responsiveness to change) of the principal subscales of all measures before and after voice therapy or phonosurgery. Results: All three self-reported voice measures had large effect sizes following either voice therapy or surgery, and outcomes were similar in both treatment groups. The effect sizes for the observer-rated grade–roughness–breathiness–asthenia–strain scores were smaller, although still moderate; the roughness subscale in particular showed little change after therapy or surgery. Only small effects were observed in general health and mood measures. Conclusion: The results suggest that a voice-specific questionnaire is essential for assessing the effectiveness of voice interventions. All three self-reported measures tested were capable of detecting change, and their scores were highly correlated. On the basis of this evaluation of the measures' sensitivity to change, there is no strong evidence to favour the vocal performance questionnaire, the voice handicap index or the voice symptom scale over one another.
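
Responsiveness to change is quantified by an effect size. A minimal sketch of one common paired pre/post convention (standardizing the mean change by the pre-treatment standard deviation) is shown below with hypothetical data; the abstract does not specify the exact formula the study used.

```python
import numpy as np

def paired_effect_size(pre, post):
    """Mean pre/post change divided by the standard deviation of the
    pre-treatment scores (one of several effect-size conventions)."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    return (pre.mean() - post.mean()) / pre.std(ddof=1)

# Hypothetical voice handicap index scores before and after voice therapy
vhi_pre = [52, 61, 47, 70, 58]
vhi_post = [30, 41, 25, 46, 33]
print(f"effect size: {paired_effect_size(vhi_pre, vhi_post):.2f}")
```

By this kind of yardstick, an effect size around 0.2 is conventionally called small, 0.5 moderate, and 0.8 large, which is the scale on which the self-reported and observer-rated measures are being compared.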


2020 ◽  
Author(s):  
Wen-Chin Huang ◽  
Tomoki Hayashi ◽  
Shinji Watanabe ◽  
Tomoki Toda

2021 ◽  
Vol 336 ◽  
pp. 06015
Author(s):  
Guangwei Li ◽  
Shuxue Ding ◽  
Yujie Li ◽  
Kangkang Zhang

Music is closely related to human life and is an important way for people to express their feelings. Deep neural networks have come to play a significant role in music processing, and many different neural network models implement deep learning for audio; general-purpose networks, however, suffer from complex operations and slow computing speed. In this paper, we introduce the Long Short-Term Memory (LSTM) network, a recurrent neural network, to realize end-to-end training. The network structure is simple and, after training, generates better audio sequences. Beyond music generation, human voice conversion is important for music understanding and for adding lyrics to pure music. We propose an audio segmentation technique for segmenting fixed-length portions of the human voice. Different notes are classified from piano music without considering the scale and are matched with the different human voices we obtain. Finally, through this transformation, the generated piano music can be expressed through a human voice output. Experimental results demonstrate that the proposed scheme can successfully obtain a human voice from pure piano music generated by the LSTM.
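
Next-note prediction with an LSTM, the kind of end-to-end sequence model described here, can be sketched in a few lines of PyTorch. The 88-note piano vocabulary, layer widths, and sampling loop are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class NoteLSTM(nn.Module):
    """Predicts a distribution over the next note given the notes so far."""
    def __init__(self, n_notes=88, embed_dim=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_notes, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_notes)

    def forward(self, note_ids, state=None):
        h, state = self.lstm(self.embed(note_ids), state)
        return self.out(h), state

# Generation sketch: feed a seed note, sample from the softmax, feed back.
model = NoteLSTM()
seq = torch.tensor([[60]])          # hypothetical seed: middle C
state = None
for _ in range(16):
    logits, state = model(seq[:, -1:], state)
    nxt = torch.multinomial(torch.softmax(logits[:, -1], dim=-1), 1)
    seq = torch.cat([seq, nxt], dim=1)
```

The generated note sequence would then be rendered to audio and, in the proposed scheme, re-expressed through the segmented and classified human-voice snippets.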

