EXPRESS: Influence of emotional prosody, content and repetition on memory recognition of speaker identity

2021 ◽  
pp. 174702182199855
Author(s):  
Hanjian Xu ◽  
Jorge L. Armony

Recognizing individuals through their voice requires listeners to form an invariant representation of the speaker’s identity, immune to episodic changes that may occur between encounters. We conducted two experiments to investigate the extent to which within-speaker stimulus variability influences different behavioral indices of implicit and explicit identity recognition memory, using short sentences with semantically neutral content. In Experiment 1 we assessed how speaker recognition was affected by changes in prosody (fearful to neutral, and vice versa, in a between-group design) and speech content. Results revealed that, regardless of encoding prosody, a change in prosody (independent of content) or a change in content (with prosody kept unchanged) reduced accuracy in explicit voice recognition. In contrast, both groups exhibited the same pattern of response times (RTs) for correctly recognized speakers: faster responses to fearful than to neutral stimuli, and a facilitating effect for same-content stimuli only for neutral sentences. In Experiment 2 we investigated whether an invariant representation of a speaker’s identity benefited from exposure to different exemplars varying in emotional prosody (fearful and happy) and content (Multi condition), compared to repeated presentations of a single sentence (Uni condition). We found a significant repetition priming effect (i.e., reduced RTs over repetitions of the same voice identity) only for speakers in the Uni condition during encoding, but faster RTs when correctly recognizing old speakers from the Multi, compared to the Uni, condition. Overall, our findings confirm that changes in emotional prosody and/or speech content can affect listeners’ implicit and explicit recognition of newly familiarized speakers.

2020 ◽  
Vol 64 (4) ◽  
pp. 40404-1-40404-16
Author(s):  
I.-J. Ding ◽  
C.-M. Ruan

Abstract With rapid developments in techniques related to the internet of things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and potentially become a requirement in smart home or office environments. In such intelligent applications, identity recognition of specific members in indoor spaces is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed, in which visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations to construct speaker classification trees and thereby enhance Gaussian mixture model (GMM)-based speaker recognition. The design also considers the privacy of the monitored person by keeping the degree of surveillance low. The popular Kinect sensor, which contains a microphone array, was adopted to capture the person's voice. The proposed audio-visual identity recognition approach deploys only two cameras in a specific indoor space to conveniently perform face detection and quickly determine the total number of people present. This head count obtained from face detection is then used to regulate the design of an accurate GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively, both higher than that of the conventional GMM approach (80.5%). Moreover, because the computationally expensive face recognition step of typical audio-visual speaker recognition systems is not required, the proposed approach is fast, adding only 0.051 s to the average recognition time.
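For readers who want to see the core classification step in code, the sketch below scores frame-level acoustic features against one Gaussian mixture model per enrolled speaker and restricts the candidate pool, a rough stand-in for the way the face-detection head count regulates the speaker classification tree in the paper. It uses scikit-learn's GaussianMixture; the speaker names, the candidate_ids argument, and the synthetic features are illustrative assumptions, not the authors' GMM-BT/GMM-NBT implementation.

```python
# Minimal sketch of GMM-based speaker identification (not the paper's exact
# GMM-BT/GMM-NBT tree construction). Speaker names, the `candidate_ids`
# argument standing in for the face-detection head count, and the synthetic
# features are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(train_feats, n_components=16, seed=0):
    """Fit one GMM per enrolled speaker on frame-level acoustic features."""
    models = {}
    for spk, feats in train_feats.items():          # feats: (frames, dims)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(feats)
        models[spk] = gmm
    return models

def identify(models, test_feats, candidate_ids=None):
    """Pick the speaker whose GMM gives the highest average log-likelihood.
    `candidate_ids` narrows the search, mimicking how the face-detection
    head count restricts the speaker classification tree."""
    pool = candidate_ids if candidate_ids is not None else models.keys()
    scores = {spk: models[spk].score(test_feats) for spk in pool}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 13-dimensional "MFCC-like" features for three enrolled speakers.
    train = {f"spk{i}": rng.normal(loc=i, size=(500, 13)) for i in range(3)}
    models = train_speaker_gmms(train)
    probe = rng.normal(loc=1, size=(200, 13))        # utterance from spk1
    best, scores = identify(models, probe, candidate_ids=["spk0", "spk1"])
    print(best, {k: round(v, 2) for k, v in scores.items()})
```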


2017 ◽  
Vol 47 (2) ◽  
pp. 175-192 ◽  
Author(s):  
Stefan Stieger ◽  
Christian Kandler ◽  
Ulrich S. Tran ◽  
Jakob Pietschnig ◽  
Martin Voracek

2005 ◽  
Vol 1278 ◽  
pp. 377-380
Author(s):  
Hirokazu Bokura ◽  
Shuhei Yamaguchi ◽  
Shotai Kobayashi

2021 ◽  
pp. 97-110
Author(s):  
Aviad Shtrosberg ◽  
Jesus Villalba ◽  
Najim Dehak ◽  
Azaria Cohen ◽  
Bar Ben-Yair

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Hui Wang ◽  
Fei Gao ◽  
Yue Zhao ◽  
Li Yang ◽  
Jianjian Yue ◽  
...  

In this paper, we propose to incorporate local attention into WaveNet-CTC to improve the performance of Tibetan speech recognition in multitask learning. As the number of tasks increases (e.g., simultaneous Tibetan speech content recognition, dialect identification, and speaker recognition), the speech recognition accuracy of a single WaveNet-CTC decreases. Inspired by the attention mechanism, we introduce local attention to automatically tune the weights of feature frames within a window and thereby pay different degrees of attention to context information during multitask learning. The experimental results show that our method improves speech recognition accuracy for all Tibetan dialects in three-task learning compared with the baseline model. Furthermore, our method significantly improves accuracy for the low-resource dialect, by 5.11% over the dialect-specific model.
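As a rough illustration of how attention can re-weight feature frames within a fixed window, the following PyTorch sketch computes per-frame attention weights over a local neighbourhood and returns a weighted context vector for every frame. The window size, feature dimension, and single linear scoring layer are assumptions chosen for clarity; this is not the authors' WaveNet-CTC integration.

```python
# Minimal sketch of windowed ("local") attention over feature frames.
# Window size, feature dimension, and tensor shapes are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn.functional as F

class LocalAttention(torch.nn.Module):
    def __init__(self, feat_dim: int, window: int = 5):
        super().__init__()
        self.window = window                       # frames on each side
        self.score = torch.nn.Linear(feat_dim, 1)  # scores each frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, time, feat_dim) -> context of the same shape."""
        B, T, D = x.shape
        w = self.window
        # Pad along time so every frame has a full window of neighbours.
        padded = F.pad(x, (0, 0, w, w))                    # (B, T+2w, D)
        # unfold gathers the 2w+1 neighbouring frames for each position.
        windows = padded.unfold(1, 2 * w + 1, 1)           # (B, T, D, 2w+1)
        windows = windows.permute(0, 1, 3, 2)              # (B, T, 2w+1, D)
        weights = torch.softmax(self.score(windows).squeeze(-1), dim=-1)
        context = (weights.unsqueeze(-1) * windows).sum(dim=2)  # (B, T, D)
        return context

if __name__ == "__main__":
    attn = LocalAttention(feat_dim=40, window=5)
    feats = torch.randn(2, 100, 40)   # e.g. 100 frames of 40-dim filter banks
    out = attn(feats)
    print(out.shape)                  # torch.Size([2, 100, 40])
```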


2014 ◽  
Vol 20 (5) ◽  
pp. 525-533 ◽  
Author(s):  
Tamsyn E. Van Rheenen ◽  
Susan L. Rossell

Abstract The ability to integrate information from different sensory channels is a vital process that serves to facilitate perceptual decoding in times of unimodal ambiguity. Despite its relevance to psychosocial functioning, multimodal integration of emotional information across facial and prosodic modes has not been addressed in bipolar disorder (BD). In light of this paucity of research we investigated multimodal processing in a BD cohort using a focused attention paradigm. Fifty BD patients and 52 healthy controls completed a task assessing the cross-modal influence of emotional prosody on facial emotion recognition across congruent and incongruent facial and prosodic conditions, where attention was directed to the facial channel. There were no differences in multimodal integration between groups at the level of accuracy, but differences were evident at the level of response time; emotional prosody biased facial recognition latencies in the control group only, where a fourfold increase in response times was evident between congruent and incongruent conditions relative to patients. The results of this study indicate that the automatic process of integrating multimodal information from facial and prosodic sensory channels is delayed in BD. Given that interpersonal communication usually occurs in real time, these results have implications for social functioning in the disorder. (JINS, 2014, 20, 1–9)


2003 ◽  
Vol 56 (5) ◽  
pp. 779-802 ◽  
Author(s):  
David Vernon ◽  
Toby J. Lloyd-Jones

We present two experiments that examine the effects of colour transformation between study and test (from black and white to colour and vice versa, or from incorrectly coloured to correctly coloured and vice versa) on implicit and explicit measures of memory for diagnostically coloured natural objects (e.g., a yellow banana). For naming and coloured-object decision (i.e., deciding whether an object is correctly coloured), response times were shorter for correctly coloured objects than for black-and-white and incorrectly coloured objects. Repetition priming was equivalent for the different stimulus types. Colour transformation did not influence priming of picture naming, but for coloured-object decision, priming was evident only for objects that remained the same from study to test. This was the case for both naming and coloured-object decision as study tasks. When participants were asked to consciously recognize objects that they had previously named or made coloured-object decisions about, whilst ignoring their colour, colour transformation reduced recognition efficiency. We discuss these results in terms of the flexibility of the object representations that mediate priming and recognition.


2018 ◽  
Author(s):  
A. Pralus ◽  
L. Fornoni ◽  
R. Bouet ◽  
M. Gomot ◽  
A. Bhatara ◽  
...  

Abstract Congenital amusia is a lifelong deficit of music processing, in particular of pitch processing. Most research investigating this neurodevelopmental disorder has focused on music perception, but pitch also has a critical role for intentional and emotional prosody in speech. Two previous studies investigating amusics’ emotional prosody recognition have shown either some deficit or no deficit (compared to controls). However, these previous studies used only long sentence stimuli, which allow for limited control over acoustic content. Here, we tested amusic individuals for emotional prosody perception in sentences and vowels. For each type of material, participants performed an emotion categorization task, followed by intensity ratings of the recognized emotion. Compared to controls, amusic individuals showed similar recognition of emotion in sentences, but poorer performance for vowels, especially when distinguishing sad and neutral stimuli. This lower performance in amusics was linked with difficulties in processing pitch and spectro-temporal parameters of the vowel stimuli. For emotion intensity, neither sentence nor vowel ratings differed between participant groups, suggesting preserved implicit processing of emotional prosody in amusia. These findings can be integrated with previous data showing preserved implicit processing of pitch and emotion in amusia alongside deficits in explicit recognition tasks. They thus further support the hypothesis of impaired conscious analysis of pitch and timbre in this neurodevelopmental disorder.
Highlights:
Amusics showed preserved emotional prosody recognition in sentences.
Amusics showed a deficit for emotional prosody recognition in short voice samples.
Preserved intensity ratings of emotions in amusia suggest spared implicit processes.


2021 ◽  
Author(s):  
Qinghua Zhong ◽  
Ruining Dai ◽  
Han Zhang ◽  
YongSheng Zhu ◽  
Guofu Zhou

Abstract Text-independent speaker recognition is widely used in identity recognition. In order to improve the discriminative power of the features, a text-independent speaker recognition method based on a deep residual network model is proposed in this paper. Firstly, 64-dimensional log filter bank features were extracted from the original audio. Secondly, a deep residual network was used to process these log filter bank features. The deep residual network was composed of a residual network and a Convolutional Attention Statistics Pooling (CASP) layer; the CASP layer aggregates the frame-level features from the residual network into utterance-level features. Lastly, an Adaptive Curriculum Learning Loss (ACLL) classifier was used to optimize the abstract features output by the deep residual network and complete the text-independent speaker recognition. The proposed method was applied to the large VoxCeleb2 dataset for extensive text-independent speaker recognition experiments, achieving an average equal error rate (EER) of 1.76% on the VoxCeleb1 test set, 1.91% on the VoxCeleb1-E test set, and 3.24% on the VoxCeleb1-H test set. Compared with related speaker recognition methods, EER was improved by 1.11% on the VoxCeleb1 test set, 1.04% on the VoxCeleb1-E test set, and 1.69% on the VoxCeleb1-H test set.
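To make the frame-to-utterance aggregation concrete, the sketch below implements attention-weighted statistics pooling, the general mechanism a CASP-style layer relies on: per-frame attention weights are used to compute a weighted mean and standard deviation, which are concatenated into a single utterance-level embedding. Layer sizes and the single-branch attention network are assumptions and may differ from the paper's exact CASP design.

```python
# Minimal sketch of attention-weighted statistics pooling, the general
# mechanism behind CASP-style aggregation of frame-level features into an
# utterance-level embedding. Layer sizes are illustrative assumptions.
import torch

class AttentiveStatsPooling(torch.nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        # Small attention network scoring each frame of the feature map.
        self.attention = torch.nn.Sequential(
            torch.nn.Conv1d(feat_dim, hidden, kernel_size=1),
            torch.nn.Tanh(),
            torch.nn.Conv1d(hidden, feat_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, feat_dim, time) -> (batch, 2 * feat_dim)."""
        alpha = torch.softmax(self.attention(x), dim=-1)     # per-frame weights
        mean = (alpha * x).sum(dim=-1)                       # weighted mean
        var = (alpha * x.pow(2)).sum(dim=-1) - mean.pow(2)   # weighted variance
        std = var.clamp(min=1e-8).sqrt()
        return torch.cat([mean, std], dim=1)                 # utterance vector

if __name__ == "__main__":
    pool = AttentiveStatsPooling(feat_dim=256)
    frames = torch.randn(4, 256, 300)   # e.g. residual-network output, 300 frames
    emb = pool(frames)
    print(emb.shape)                    # torch.Size([4, 512])
```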

