target speaker
Recently Published Documents


TOTAL DOCUMENTS: 87 (last five years: 57)

H-INDEX: 9 (last five years: 4)

2022 · Vol 19 (3) · pp. 2671-2699
Author(s): Huan Rong, Tinghuai Ma, Xinyu Cao, Xin Yu, et al.

With the rapid development of online social networks, text communication has become an indispensable part of daily life. Mining the emotion hidden behind conversation text is of significant practical value for government public-opinion monitoring, enterprise decision-making, and similar applications. In this paper, we therefore propose a text emotion prediction model for multi-participant text-conversation scenarios, which aims to effectively predict the emotion of the text that the target speaker will post in the future. Specifically, we first construct an affective space mapping that represents the original conversation text as an n-dimensional affective vector, yielding a text representation over the different emotion categories. Second, a similar-scene search mechanism retrieves sub-sequences whose emotion-shift tendency resembles that of the current conversation scene. Finally, the text emotion prediction model is built as a two-layer encoder-decoder structure, with emotion fusion introduced on the encoder side and a hybrid attention mechanism on the decoder side. According to the experimental results, the proposed model achieves the best overall emotion-prediction performance, owing to the auxiliary features extracted from similar scenes and the adoption of emotion fusion together with hybrid attention, while prediction efficiency remains at an acceptable level.
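
For readers who want a concrete picture of the affective space mapping step, here is a minimal, purely illustrative Python sketch: it scores a conversation turn against a tiny hand-made emotion lexicon to produce an n-dimensional affective vector. The lexicon, the emotion categories, and the scoring rule are assumptions for illustration only; the paper's actual mapping is learned from data.

```python
from collections import Counter

# Hypothetical emotion categories and lexicon; stand-ins, not the paper's.
EMOTIONS = ["joy", "anger", "sadness", "fear", "surprise", "neutral"]
LEXICON = {
    "great": "joy", "thanks": "joy",
    "hate": "anger", "annoying": "anger",
    "sorry": "sadness", "miss": "sadness",
    "worried": "fear", "afraid": "fear",
    "wow": "surprise",
}

def affective_vector(text: str) -> list:
    """Map a conversation turn to an n-dimensional affective vector,
    one dimension per emotion category, normalised to sum to 1."""
    counts = Counter(LEXICON.get(tok, "neutral") for tok in text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[e] / total for e in EMOTIONS]

print(affective_vector("wow thanks that is great"))
```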


2021 · pp. 101327
Author(s): Jiatong Shi, Chunlei Zhang, Chao Weng, Shinji Watanabe, Meng Yu, et al.

Author(s): Fangkun Liu, Hui Wang, Renhua Peng, Chengshi Zheng, Xiaodong Li

Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. Recently, one-shot voice conversion has become a hot topic because of its potentially wide range of applications: it can convert the voice of any source speaker to any target speaker, even when both are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve naturalness, this paper proposes a two-level nested U-structure (U2-Net) voice conversion algorithm called U2-VC. The U2-Net extracts both local and multi-scale features of the log-mel spectrogram, which helps to learn the time-frequency structure of the source and target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, retaining more content information of the source speech while maintaining speaker similarity between the converted speech and the target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art approaches, including AGAIN-VC and AdaIN-VC, in terms of both objective and subjective measurements.
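
The sandwich variant (SaAdaIN) is specific to this paper, but its building block, adaptive instance normalization, is standard in voice conversion (e.g. AdaIN-VC). Below is a minimal numpy sketch of plain AdaIN over channel-by-time encoder features, offered only to make the speaker-identity transfer step concrete; it is not the authors' implementation.

```python
import numpy as np

def adaptive_instance_norm(content: np.ndarray, style: np.ndarray,
                           eps: float = 1e-5) -> np.ndarray:
    """Plain AdaIN: normalise the content feature map per channel, then
    re-scale and re-shift it with the style (target-speaker) statistics.

    content, style: arrays of shape (channels, time) holding encoder features.
    """
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True) + eps
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```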


2021
Author(s): Chandra Leon Haider, Nina Suess, Anne Hauswald, Hyojin Park, Nathan Weisz

Face masks have become a prevalent measure during the Covid-19 pandemic to counteract the transmission of SARS-CoV-2. An unintended "side effect" of face masks is their adverse influence on speech perception, especially in challenging listening situations. So far, behavioural studies have not pinpointed exactly which features of speech processing face masks affect in such situations. We conducted an audiovisual (AV) multi-speaker experiment using naturalistic speech (an audiobook). In half of the trials, the target speaker wore a surgical face mask, while we measured the brain activity of normal-hearing participants via magnetoencephalography (MEG). A decoding model trained on clear AV speech (no additional speaker and the target speaker not wearing a face mask) was used to reconstruct crucial speech features in each condition. We found significant main effects of face masks on the reconstruction of acoustic features, such as the speech envelope and spectral speech features (pitch and formant frequencies), while the reconstruction of higher-level speech-segmentation features (phoneme and word onsets) was especially impaired by masks in difficult listening situations, i.e. when a distracting speaker was also presented. Our findings demonstrate the detrimental impact face masks have on listening and speech perception, extending previous behavioural results. The fact that we used surgical face masks, which have only mild effects on speech acoustics, supports the idea of visual facilitation of speech. This idea is in line with recent research, also by our group, showing that visual cortical regions track spectral modulations. Since hearing impairment usually affects higher frequencies, the detrimental effect of face masks might pose a particular challenge for individuals who need the visual information about higher frequencies (e.g. formants) to compensate.
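
Stimulus reconstruction in such MEG studies is commonly a time-lagged linear backward model fit with ridge regression. The sketch below illustrates that generic idea only; the lag range, regularisation strength, and feature choice are assumptions, not the authors' exact pipeline.

```python
import numpy as np

def lagged_design(meg, lags):
    """Stack time-lagged copies of the MEG channels (T, C) into a design
    matrix (T, C * n_lags). np.roll wraps at the edges, which is good
    enough for a sketch."""
    return np.hstack([np.roll(meg, -lag, axis=0) for lag in lags])

def train_backward_model(meg, feature, lags=range(0, 25), alpha=1.0):
    """Fit a linear stimulus-reconstruction (backward) model with ridge
    regression: predict a speech feature (e.g. the envelope) from MEG
    samples at t .. t+max(lags), since the neural response lags the
    stimulus. Generic sketch; lags are in samples."""
    X = lagged_design(meg, lags)
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ feature)

def reconstruct(meg, W, lags=range(0, 25)):
    return lagged_design(meg, lags) @ W

# Reconstruction accuracy is then typically the Pearson correlation between
# the reconstructed and the true speech feature.
```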


Algorithms · 2021 · Vol 14 (10) · pp. 287
Author(s): Ali Aroudi, Eghart Fischer, Maja Serman, Henning Puder, Simon Doclo

Recent advances have shown that it is possible to identify the target speaker a listener is attending to using single-trial EEG-based auditory attention decoding (AAD). Most AAD methods have been investigated in an open-loop scenario, where AAD is performed offline without presenting online feedback to the listener. In this work, we aim to develop a closed-loop AAD system that makes it possible to enhance a target speaker, suppress an interfering speaker, and switch attention between both speakers. To this end, we propose a cognitive-driven adaptive gain controller (AGC) based on real-time AAD. Using the EEG responses of the listener and the speech signals of both speakers, the real-time AAD generates probabilistic attention measures, based on which the attended and the unattended speaker are identified. The AGC then amplifies the identified attended speaker and attenuates the identified unattended speaker, and both are presented to the listener via loudspeakers. We investigate the performance of the proposed system in terms of decoding performance and signal-to-interference ratio (SIR) improvement. The experimental results show that, although there is a significant delay in detecting attention switches, the proposed system is able to improve the SIR between the attended and the unattended speaker. In addition, no significant difference in decoding performance is observed between closed-loop AAD and open-loop AAD. The subjective evaluation shows that, compared to open-loop AAD, the proposed closed-loop cognitive-driven system demands a similar level of cognitive effort to follow the attended speaker, to ignore the unattended speaker, and to switch attention between both speakers. Closed-loop AAD in an online fashion is thus feasible and enables the listener to interact with the AGC.
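
To make the adaptive gain controller concrete, here is a hedged sketch of one plausible gain rule: the AAD's probabilistic attention measure is mapped to a smoothed gain difference between the two speakers before mixing. The mapping, the maximum gain, and the smoothing constant are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def adaptive_gain_mix(s1, s2, p_attend_s1, prev_gain_db=0.0,
                      g_max_db=10.0, smooth=0.9):
    """Cognitive-driven AGC sketch: map the attention probability for
    speaker 1 (0..1) to a gain difference in dB, smooth it over time to
    avoid abrupt changes, then amplify/attenuate the two speech signals.

    s1, s2: equal-length 1-D arrays for the current audio block.
    Returns the mixed block and the smoothed gain for the next call."""
    target_db = (2.0 * p_attend_s1 - 1.0) * g_max_db      # -g_max .. +g_max dB
    gain_db = smooth * prev_gain_db + (1.0 - smooth) * target_db
    g1 = 10.0 ** (+gain_db / 20.0)
    g2 = 10.0 ** (-gain_db / 20.0)
    return g1 * s1 + g2 * s2, gain_db
```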


2021 · Vol 11 (16) · pp. 7489
Author(s): Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

Voice conversion (VC) transforms the speaking style of a source speaker into the speaking style of a target speaker while keeping linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences. Earlier approaches mainly learn a mapping between given source-target speaker pairs from such data, which contain pairs of matching utterances spoken by the different speakers. However, parallel data are costly and difficult to collect, and non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that enables non-parallel many-to-many voice conversion using a generative adversarial network. To the best of the authors' knowledge, our study is the first to employ a sinusoidal model with continuous parameters to generate the converted speech signals. Our method requires only a few minutes of training examples, without parallel utterances or time-alignment procedures, and the source and target speakers are entirely unseen in the training data. Moreover, an empirical study is carried out on the publicly available CSTR VCTK corpus. Our conclusions indicate that the proposed method reaches state-of-the-art results in speaker similarity to utterances produced by the target speaker, while suggesting important structural aspects for further analysis by experts.
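
The conversion model itself is a GAN, but the paper's distinguishing choice is synthesis from a sinusoidal model with continuous parameters. The toy sketch below shows the general form of harmonic sinusoidal synthesis from a continuous F0 track and per-frame harmonic amplitudes; it is illustrative only and omits noise/aperiodic components and the authors' specific parameterisation.

```python
import numpy as np

def synth_sinusoidal(f0, amps, sr=16000, frame_shift=0.005):
    """Toy harmonic sinusoidal synthesis: a sum of harmonics of a
    continuous F0 track weighted by per-frame harmonic amplitudes.

    f0: (n_frames,) fundamental frequency in Hz.
    amps: (n_frames, n_harmonics) harmonic amplitudes."""
    n_frames, n_harm = amps.shape
    hop = int(sr * frame_shift)
    f0_up = np.repeat(f0, hop)                  # sample-rate F0 track
    phase = 2 * np.pi * np.cumsum(f0_up) / sr   # running phase of the fundamental
    out = np.zeros(n_frames * hop)
    for k in range(1, n_harm + 1):
        a_up = np.repeat(amps[:, k - 1], hop)   # upsample amplitude track
        out += a_up * np.sin(k * phase)
    return out
```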


Signals · 2021 · Vol 2 (3) · pp. 456-474
Author(s): Al-Waled Al-Dulaimi, Todd K. Moon, Jacob H. Gunther

Voice transformation, for example from a male speaker to a female speaker, is achieved here using a two-level dynamic warping algorithm in conjunction with an artificial neural network. An outer warping process, which temporally aligns blocks of speech (dynamic time warp, DTW), invokes an inner warping process, which spectrally aligns frames based on their magnitude spectra (dynamic frequency warp, DFW). The mapping function produced by the inner dynamic frequency warp is used to move spectral information from the source speaker to the target speaker. Artifacts arising from this amplitude-spectral mapping are reduced by reconstructing phase information. The information obtained by this process is used to train an artificial neural network to produce spectral warping information from spectral input data. The performance of the speech mapping is compared with previous voice transformation research using Mel-Cepstral Distortion (MCD), and the method is shown to perform better than other methods, based on their reported MCD scores.
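
The outer level of the algorithm is a standard dynamic time warp over sequences of speech frames. The following generic DTW sketch (cost matrix plus backtracked path) shows that level only; in the paper, the per-frame distance itself invokes an inner dynamic frequency warp of the magnitude spectra.

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: np.linalg.norm(a - b)):
    """Classic dynamic time warping between two feature sequences
    (e.g. frames of magnitude spectra). Returns the total alignment
    cost and the warping path as a list of (i, j) frame pairs."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(x[i - 1], y[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal path from (n, m) to (1, 1).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```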


PLoS ONE · 2021 · Vol 16 (7) · pp. e0254119
Author(s): Jordan A. Drew, W. Owen Brimijoin

Those experiencing hearing loss face severe challenges in perceiving speech in noisy situations such as a busy restaurant or cafe. Many factors contribute to this deficit, including decreased audibility, reduced frequency resolution, and a decline in temporal synchrony across the auditory system. Some hearing assistive devices implement beamforming, in which multiple microphones are used in combination to attenuate surrounding noise while the target speaker is left unattenuated. In increasingly challenging auditory environments, more complex beamforming algorithms are required, which increases the processing time needed to provide a useful signal-to-noise ratio for the target speech. This study investigated whether the signal-enhancement benefits of beamforming are outweighed by the negative perceptual impact of an increased latency between the direct acoustic signal and the digitally enhanced signal. The hypothesis was that increasing the latency between the two identical speech signals would decrease speech intelligibility. Using three gain/latency pairs from a beamforming simulation previously completed in the lab, SNR perceptual thresholds for a simulated use case were obtained from normal-hearing participants. No significant differences were detected between the three conditions. When two copies of the same speech signal are presented at varying gain/latency pairs in a noisy environment, any negative intelligibility effects of latency are masked by the noise. These results allow for more lenient restrictions on processing delays in hearing assistive devices.
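
A hedged sketch of how such a stimulus could be simulated: the direct speech is mixed with a gain-scaled, latency-delayed copy of itself (standing in for the beamformer output) plus background noise. Parameter names and the mixing rule are assumptions for illustration, not the study's exact procedure.

```python
import numpy as np

def direct_plus_processed(speech, noise, gain_db, latency_ms, sr=44100):
    """Mix the direct speech with a delayed, gain-scaled copy of the same
    speech (the 'enhanced' path) and additive noise.

    speech, noise: equal-length 1-D arrays; gain_db and latency_ms describe
    the enhanced path relative to the direct path."""
    delay = int(sr * latency_ms / 1000.0)
    gain = 10.0 ** (gain_db / 20.0)
    delayed = np.concatenate([np.zeros(delay), speech])[:len(speech)]
    return speech + gain * delayed + noise
```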

