EXPRESS: Influence of emotional prosody, content and repetition on memory recognition of speaker identity

2021 ◽  
pp. 174702182199855
Author(s):  
Hanjian Xu ◽  
Jorge L. Armony

Recognizing individuals through their voice requires listeners to form an invariant representation of the speaker’s identity, immune to episodic changes that may occur between encounters. We conducted two experiments to investigate the extent to which within-speaker stimulus variability influences different behavioral indices of implicit and explicit identity recognition memory, using short sentences with semantically neutral content. In Experiment 1 we assessed how speaker recognition was affected by changes in prosody (fearful to neutral, and vice versa, in a between-group design) and speech content. Results revealed that, regardless of encoding prosody, a change in prosody (independent of content) or a change in content (with prosody kept unchanged) reduced accuracy in explicit voice recognition. In contrast, both groups exhibited the same pattern of response times (RTs) for correctly recognized speakers: faster responses to fearful than to neutral stimuli, and a facilitating effect for same-content stimuli only for neutral sentences. In Experiment 2 we investigated whether an invariant representation of a speaker’s identity benefited from exposure to different exemplars varying in emotional prosody (fearful and happy) and content (Multi condition), compared to repeated presentations of a single sentence (Uni condition). We found a significant repetition priming effect (i.e., reduced RTs over repetitions of the same voice identity) only for speakers in the Uni condition during encoding, but faster RTs when correctly recognizing old speakers from the Multi, compared to the Uni, condition. Overall, our findings confirm that changes in emotional prosody and/or speech content can affect listeners’ implicit and explicit recognition of newly familiarized speakers.

2020 ◽  
Vol 64 (4) ◽  
pp. 40404-1-40404-16
Author(s):  
I.-J. Ding ◽  
C.-M. Ruan

Abstract With rapid developments in techniques related to the internet of things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and potentially become a requirement in smart home or office environments. In such intelligent applications, identity recognition of specific members in indoor spaces is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed, in which visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations to construct speaker classification trees and thereby enhance Gaussian mixture model (GMM)-based speaker recognition. The design also considers the privacy of the monitored person by keeping the degree of surveillance low. The popular Kinect sensor, which contains a microphone array, was adopted to capture the person's voice. The proposed audio-visual identity recognition approach deploys only two cameras in a specific indoor space to conveniently perform face detection and quickly determine the total number of people present. This head count obtained from face detection is then used to regulate the design of an accurate GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively, both higher than that of the conventional GMM approach (80.5%). Moreover, because the computationally expensive face recognition step of typical audio-visual speaker recognition systems is not required, the proposed approach is fast, adding only 0.051 s to the average recognition time.
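For readers who want to see the core classification step in code, the sketch below scores frame-level acoustic features against one Gaussian mixture model per enrolled speaker and restricts the candidate pool, a rough stand-in for the way the face-detection head count regulates the speaker classification tree in the paper. It uses scikit-learn's GaussianMixture; the speaker names, the candidate_ids argument, and the synthetic features are illustrative assumptions, not the authors' GMM-BT/GMM-NBT implementation.

```python
# Minimal sketch of GMM-based speaker identification (not the paper's exact
# GMM-BT/GMM-NBT tree construction). Speaker names, the `candidate_ids`
# argument standing in for the face-detection head count, and the synthetic
# features are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(train_feats, n_components=16, seed=0):
    """Fit one GMM per enrolled speaker on frame-level acoustic features."""
    models = {}
    for spk, feats in train_feats.items():          # feats: (frames, dims)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        gmm.fit(feats)
        models[spk] = gmm
    return models

def identify(models, test_feats, candidate_ids=None):
    """Pick the speaker whose GMM gives the highest average log-likelihood.
    `candidate_ids` narrows the search, mimicking how the face-detection
    head count restricts the speaker classification tree."""
    pool = candidate_ids if candidate_ids is not None else models.keys()
    scores = {spk: models[spk].score(test_feats) for spk in pool}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 13-dimensional "MFCC-like" features for three enrolled speakers.
    train = {f"spk{i}": rng.normal(loc=i, size=(500, 13)) for i in range(3)}
    models = train_speaker_gmms(train)
    probe = rng.normal(loc=1, size=(200, 13))        # utterance from spk1
    best, scores = identify(models, probe, candidate_ids=["spk0", "spk1"])
    print(best, {k: round(v, 2) for k, v in scores.items()})
```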


2017 ◽  
Vol 47 (2) ◽  
pp. 175-192 ◽  
Author(s):  
Stefan Stieger ◽  
Christian Kandler ◽  
Ulrich S. Tran ◽  
Jakob Pietschnig ◽  
Martin Voracek

2005 ◽  
Vol 1278 ◽  
pp. 377-380
Author(s):  
Hirokazu Bokura ◽  
Shuhei Yamaguchi ◽  
Shotai Kobayashi

2021 ◽  
pp. 97-110
Author(s):  
Aviad Shtrosberg ◽  
Jesus Villalba ◽  
Najim Dehak ◽  
Azaria Cohen ◽  
Bar Ben-Yair

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Hui Wang ◽  
Fei Gao ◽  
Yue Zhao ◽  
Li Yang ◽  
Jianjian Yue ◽  
...  

In this paper, we propose to incorporate local attention into WaveNet-CTC to improve the performance of Tibetan speech recognition in multitask learning. As the number of tasks increases (e.g., simultaneous Tibetan speech content recognition, dialect identification, and speaker recognition), the speech recognition accuracy of a single WaveNet-CTC decreases. Inspired by the attention mechanism, we introduce local attention to automatically tune the weights of feature frames within a window and thereby pay different degrees of attention to context information during multitask learning. The experimental results show that our method improves speech recognition accuracy for all Tibetan dialects in three-task learning compared with the baseline model. Furthermore, our method significantly improves accuracy for the low-resource dialect, by 5.11% over the dialect-specific model.
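As a rough illustration of how attention can re-weight feature frames within a fixed window, the following PyTorch sketch computes per-frame attention weights over a local neighbourhood and returns a weighted context vector for every frame. The window size, feature dimension, and single linear scoring layer are assumptions chosen for clarity; this is not the authors' WaveNet-CTC integration.

```python
# Minimal sketch of windowed ("local") attention over feature frames.
# Window size, feature dimension, and tensor shapes are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn.functional as F

class LocalAttention(torch.nn.Module):
    def __init__(self, feat_dim: int, window: int = 5):
        super().__init__()
        self.window = window                       # frames on each side
        self.score = torch.nn.Linear(feat_dim, 1)  # scores each frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, time, feat_dim) -> context of the same shape."""
        B, T, D = x.shape
        w = self.window
        # Pad along time so every frame has a full window of neighbours.
        padded = F.pad(x, (0, 0, w, w))                    # (B, T+2w, D)
        # unfold gathers the 2w+1 neighbouring frames for each position.
        windows = padded.unfold(1, 2 * w + 1, 1)           # (B, T, D, 2w+1)
        windows = windows.permute(0, 1, 3, 2)              # (B, T, 2w+1, D)
        weights = torch.softmax(self.score(windows).squeeze(-1), dim=-1)
        context = (weights.unsqueeze(-1) * windows).sum(dim=2)  # (B, T, D)
        return context

if __name__ == "__main__":
    attn = LocalAttention(feat_dim=40, window=5)
    feats = torch.randn(2, 100, 40)   # e.g. 100 frames of 40-dim filter banks
    out = attn(feats)
    print(out.shape)                  # torch.Size([2, 100, 40])
```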


2014 ◽  
Vol 20 (5) ◽  
pp. 525-533 ◽  
Author(s):  
Tamsyn E. Van Rheenen ◽  
Susan L. Rossell

Abstract The ability to integrate information from different sensory channels is a vital process that serves to facilitate perceptual decoding in times of unimodal ambiguity. Despite its relevance to psychosocial functioning, multimodal integration of emotional information across facial and prosodic modes has not been addressed in bipolar disorder (BD). In light of this paucity of research we investigated multimodal processing in a BD cohort using a focused attention paradigm. Fifty BD patients and 52 healthy controls completed a task assessing the cross-modal influence of emotional prosody on facial emotion recognition across congruent and incongruent facial and prosodic conditions, where attention was directed to the facial channel. There were no differences in multimodal integration between groups at the level of accuracy, but differences were evident at the level of response time; emotional prosody biased facial recognition latencies in the control group only, where a fourfold increase in response times was evident between congruent and incongruent conditions relative to patients. The results of this study indicate that the automatic process of integrating multimodal information from facial and prosodic sensory channels is delayed in BD. Given that interpersonal communication usually occurs in real time, these results have implications for social functioning in the disorder. (JINS, 2014, 20, 1–9)


2003 ◽  
Vol 56 (5) ◽  
pp. 779-802 ◽  
Author(s):  
David Vernon ◽  
Toby J. Lloyd-Jones

We present two experiments that examine the effects of colour transformation between study and test (from black and white to colour and vice versa, or from incorrectly coloured to correctly coloured and vice versa) on implicit and explicit measures of memory for diagnostically coloured natural objects (e.g., a yellow banana). For naming and coloured-object decision (i.e., deciding whether an object is correctly coloured), response times were shorter for correctly coloured objects than for black-and-white and incorrectly coloured objects. Repetition priming was equivalent for the different stimulus types. Colour transformation did not influence priming of picture naming, but for coloured-object decision, priming was evident only for objects that remained the same from study to test. This was the case for both naming and coloured-object decision as study tasks. When participants were asked to consciously recognize objects that they had previously named or made coloured-object decisions about, whilst ignoring their colour, colour transformation reduced recognition efficiency. We discuss these results in terms of the flexibility of the object representations that mediate priming and recognition.


2018 ◽  
Author(s):  
A. Pralus ◽  
L. Fornoni ◽  
R. Bouet ◽  
M. Gomot ◽  
A. Bhatara ◽  
...  

Abstract Congenital amusia is a lifelong deficit of music processing, in particular of pitch processing. Most research investigating this neurodevelopmental disorder has focused on music perception, but pitch also has a critical role for intentional and emotional prosody in speech. Two previous studies investigating amusics’ emotional prosody recognition have shown either some deficit or no deficit (compared to controls). However, these previous studies used only long sentence stimuli, which allow for limited control over acoustic content. Here, we tested amusic individuals for emotional prosody perception in sentences and vowels. For each type of material, participants performed an emotion categorization task, followed by intensity ratings of the recognized emotion. Compared to controls, amusic individuals showed similar recognition of emotion in sentences, but poorer performance for vowels, especially when distinguishing sad and neutral stimuli. This lower performance in amusics was linked with difficulties in processing pitch and spectro-temporal parameters of the vowel stimuli. For emotion intensity, neither sentence nor vowel ratings differed between participant groups, suggesting preserved implicit processing of emotional prosody in amusia. These findings can be integrated with previous data showing preserved implicit processing of pitch and emotion in amusia alongside deficits in explicit recognition tasks. They thus further support the hypothesis of impaired conscious analysis of pitch and timbre in this neurodevelopmental disorder.
Highlights:
Amusics showed preserved emotional prosody recognition in sentences.
Amusics showed a deficit for emotional prosody recognition in short voice samples.
Preserved intensity ratings of emotions in amusia suggest spared implicit processes.


2021 ◽  
Author(s):  
Qinghua Zhong ◽  
Ruining Dai ◽  
Han Zhang ◽  
YongSheng Zhu ◽  
Guofu Zhou

Abstract Text-independent speaker recognition is widely used in identity recognition. In order to improve the discriminative power of the features, a text-independent speaker recognition method based on a deep residual network model is proposed in this paper. Firstly, 64-dimensional log filter bank features were extracted from the original audio. Secondly, a deep residual network was used to process these log filter bank features. The deep residual network was composed of a residual network and a Convolutional Attention Statistics Pooling (CASP) layer; the CASP layer aggregates the frame-level features from the residual network into utterance-level features. Lastly, an Adaptive Curriculum Learning Loss (ACLL) classifier was used to optimize the abstract features output by the deep residual network and complete the text-independent speaker recognition. The proposed method was applied to the large VoxCeleb2 dataset for extensive text-independent speaker recognition experiments, achieving an average equal error rate (EER) of 1.76% on the VoxCeleb1 test set, 1.91% on the VoxCeleb1-E test set, and 3.24% on the VoxCeleb1-H test set. Compared with related speaker recognition methods, EER was improved by 1.11% on the VoxCeleb1 test set, 1.04% on the VoxCeleb1-E test set, and 1.69% on the VoxCeleb1-H test set.
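To make the frame-to-utterance aggregation concrete, the sketch below implements attention-weighted statistics pooling, the general mechanism a CASP-style layer relies on: per-frame attention weights are used to compute a weighted mean and standard deviation, which are concatenated into a single utterance-level embedding. Layer sizes and the single-branch attention network are assumptions and may differ from the paper's exact CASP design.

```python
# Minimal sketch of attention-weighted statistics pooling, the general
# mechanism behind CASP-style aggregation of frame-level features into an
# utterance-level embedding. Layer sizes are illustrative assumptions.
import torch

class AttentiveStatsPooling(torch.nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        # Small attention network scoring each frame of the feature map.
        self.attention = torch.nn.Sequential(
            torch.nn.Conv1d(feat_dim, hidden, kernel_size=1),
            torch.nn.Tanh(),
            torch.nn.Conv1d(hidden, feat_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, feat_dim, time) -> (batch, 2 * feat_dim)."""
        alpha = torch.softmax(self.attention(x), dim=-1)     # per-frame weights
        mean = (alpha * x).sum(dim=-1)                       # weighted mean
        var = (alpha * x.pow(2)).sum(dim=-1) - mean.pow(2)   # weighted variance
        std = var.clamp(min=1e-8).sqrt()
        return torch.cat([mean, std], dim=1)                 # utterance vector

if __name__ == "__main__":
    pool = AttentiveStatsPooling(feat_dim=256)
    frames = torch.randn(4, 256, 300)   # e.g. residual-network output, 300 frames
    emb = pool(frames)
    print(emb.shape)                    # torch.Size([4, 512])
```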

