speech database
Recently Published Documents


TOTAL DOCUMENTS

265
(FIVE YEARS 50)

H-INDEX

14
(FIVE YEARS 2)

Data ◽  
2021 ◽  
Vol 6 (12) ◽  
pp. 130
Author(s):  
Mathilde Marie Duville ◽  
Luz María Alonso-Valerdi ◽  
David I. Ibarra-Zarate

In this paper, the Mexican Emotional Speech Database (MESD) is described; it contains single-word emotional utterances for anger, disgust, fear, happiness, neutral and sadness in adult (male and female) and child voices. To validate the emotional prosody of the uttered words, a cubic Support Vector Machine classifier was trained on prosodic, spectral and voice-quality features for each case study: (1) male adult, (2) female adult and (3) child. In addition, the cultural, semantic, and linguistic shaping of emotional expression was assessed by statistical analysis. This study was registered at BioMed Central and is part of the implementation of a published study protocol. Mean emotional classification accuracies were 93.3%, 89.4% and 83.3% for male, female and child utterances, respectively. Statistical analysis emphasized the shaping of emotional prosody by semantic and linguistic features. A cultural variation in emotional expression was highlighted by comparing the MESD with the INTERFACE database for Castilian Spanish. The MESD provides reliable content for linguistic emotional prosody shaped by the Mexican cultural environment. To facilitate further investigation, two subsets are provided: a corpus controlled for linguistic features and emotional semantics, and one containing words repeated across voices and emotions. The MESD is made freely available.
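For reference, the validation classifier described above can be approximated in a few lines of scikit-learn. This is a minimal sketch, assuming the prosodic, spectral and voice-quality features have already been extracted into a feature matrix; it is not the authors' exact pipeline, and the feature extraction step is omitted.

```python
# Hedged sketch of the MESD validation step: a "cubic" SVM is an SVM with a
# 3rd-degree polynomial kernel. X and y are assumed inputs: X holds
# per-utterance acoustic features, y holds the six emotion labels.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mean_emotion_accuracy(X, y):
    clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))
    # The paper trains one model per case study (male adult, female adult,
    # child); here a single stratified cross-validation gives the mean accuracy.
    return cross_val_score(clf, X, y, cv=5).mean()
```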


Author(s):  
Tiankai Zhi ◽  
Ying Shi ◽  
Wenqiang Du ◽  
Guanyu Li ◽  
Dong Wang

2021 ◽  
Vol 11 (2) ◽  
pp. 35-41
Author(s):  
Thurgeaswary Rokanatnam ◽  
Hazinah Kutty Mammi

Speaker recognition is the ability to identify a speaker's characteristics from spoken language. The purpose of this study is to identify the gender of speakers from audio recordings. Its objectives are to evaluate the accuracy of the technique in differentiating gender and to determine its performance when classifying self-acquired recordings. Audio forensics uses voice recordings as evidence to solve cases, and this study aims to provide an easier technique for identifying unknown speaker characteristics in the forensic field. The experiment is carried out by training a pattern classifier on gender-dependent data. To train the model, a speech database comprising both male and female speakers is obtained from an online speech corpus. During the testing phase, audio recordings of UTM students are used in addition to the corpus data to determine the accuracy of the speaker identification experiment. The Mel Frequency Cepstral Coefficient (MFCC) algorithm is used to extract features from the speech data, while a Gaussian Mixture Model (GMM) is used to model the gender identifier. No noise removal was applied to any speech data in this experiment. Python is used to extract the MFCC features and to model speaker behavior with the GMM technique. Experimental results show that the GMM-MFCC technique can identify gender regardless of language, but with varying accuracy.
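The GMM-MFCC pipeline described above can be sketched concisely in Python. The sketch below is an illustration under stated assumptions, not the study's code: librosa stands in for the MFCC extraction, the coefficient count and mixture size are placeholders, and the file lists are hypothetical.

```python
# Minimal sketch of a GMM-MFCC gender identifier: one GMM per gender is fit
# on frame-level MFCCs, and a test file is assigned to the gender whose GMM
# gives the higher log-likelihood.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)  # no noise removal, as in the study
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def train_gender_models(male_files, female_files, n_components=16):
    models = {}
    for label, files in {"male": male_files, "female": female_files}.items():
        feats = np.vstack([mfcc_features(f) for f in files])
        models[label] = GaussianMixture(n_components=n_components).fit(feats)
    return models

def predict_gender(models, path):
    feats = mfcc_features(path)
    # Average per-frame log-likelihood under each gender's GMM.
    return max(models, key=lambda g: models[g].score(feats))
```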


Author(s):  
Sonal Anilkumar Tiwari

Abstract: It is intriguing to think that we can command inanimate objects, and automatic speech recognition (ASR) systems make this possible. A speech recognition system lets humans talk to machines. Speech recognition has become so embedded in everyday life that many people rely on it constantly. On a mobile phone, for instance, instead of typing we can simply issue voice commands, saving both effort and time. Keywords: Speech, Speech Recognition, ASR, Corpus, PRAAT


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6503
Author(s):  
Luis Carlos Sarmiento ◽  
Sergio Villamizar ◽  
Omar López ◽  
Ana Claros Collazos ◽  
Jhon Sarmiento ◽  
...  

The use of imagined speech with electroencephalographic (EEG) signals is a promising field of brain-computer interfaces (BCI) that seeks communication between areas of the cerebral cortex related to language and devices or machines. However, the complexity of this brain process makes the analysis and classification of this type of signal a relevant research topic. The goals of this study were: to develop a new Deep Learning (DL) algorithm, referred to as CNNeeg1-1, to recognize EEG signals in imagined vowel tasks; to create an imagined speech database of 50 subjects specialized in imagined vowels of the Spanish language (/a/,/e/,/i/,/o/,/u/); and to contrast the performance of the CNNeeg1-1 algorithm with the DL benchmark algorithms Shallow CNN and EEGNet using an open-access database (BD1) and the newly developed database (BD2). A mixed-design analysis of variance was conducted to assess the intra-subject and inter-subject training of the proposed algorithms. The results show that, for intra-subject training, the best performance among the Shallow CNN, EEGNet, and CNNeeg1-1 methods in classifying imagined vowels (/a/,/e/,/i/,/o/,/u/) was exhibited by CNNeeg1-1, with an accuracy of 65.62% for the BD1 database and 85.66% for BD2.
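CNNeeg1-1 itself is specified in the paper rather than here. Purely as an illustration of the kind of shallow CNN used as a baseline in this line of work, the PyTorch sketch below classifies windows of multichannel EEG into the five imagined vowels; the channel count and window length are placeholder assumptions, not the study's recording setup.

```python
# Generic shallow CNN for 5-class imagined-vowel EEG classification:
# a temporal convolution, a spatial convolution across electrodes,
# pooling, and a linear classifier.
import torch.nn as nn

class ImaginedVowelCNN(nn.Module):
    def __init__(self, n_channels=14, n_samples=512, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 25), padding=(0, 12)),  # temporal conv
            nn.Conv2d(16, 32, kernel_size=(n_channels, 1)),          # spatial conv
            nn.BatchNorm2d(32),
            nn.ELU(),
            nn.AvgPool2d((1, 8)),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(32 * (n_samples // 8), n_classes)

    def forward(self, x):  # x: (batch, 1, n_channels, n_samples)
        return self.classifier(self.features(x))
```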


2021 ◽  
Author(s):  
Mahadeva Swamy ◽  
D J Ravi

Abstract An ASR system is built for continuous Kannada speech recognition. The acoustic and language models are created with the Kaldi toolkit. The speech database is created with native male and female Kannada speakers; 75% of the collected speech data is used for training the acoustic models and 25% for system testing. The performance of the system is reported in terms of Word Error Rate (WER). Wavelet packet decomposition combined with a Mel filter bank is used for feature extraction. The proposed features perform slightly better than conventional features such as MFCC and PLP in terms of word recognition accuracy (WRA) and WER under uncontrolled conditions. For the speech corpus collected in the Kannada language, the proposed features show an improvement in WRA of 1.79% over the baseline features.
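Word Error Rate, the metric reported above, is the word-level Levenshtein distance between the reference and hypothesis transcripts divided by the reference length. A minimal self-contained implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming:
    # d[i][j] = edits to turn the first i reference words into
    # the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```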


2021 ◽  
Author(s):  
Jia Fu ◽  
Sen Yang ◽  
Fei He ◽  
Ling He ◽  
Yuanyuan Li ◽  
...  

Abstract Background: Schizophrenia is a chronic and severe mental disease that largely affects patients' daily life and work. In the clinic, schizophrenia with negative symptoms is often misdiagnosed and difficult to treat, and diagnosis also depends on the experience of clinicians. An objective and effective method to diagnose schizophrenia with negative symptoms is urgently needed. Recent studies have shown that impaired speech can serve as an indicator for diagnosing schizophrenia. The literature on schizophrenia speech detection has mainly relied on feature engineering, in which effective feature extraction is difficult because of the variability of speech signals. Methods: A novel deep learning architecture based on a convolutional neural network, termed Sch-net, is designed in this work for end-to-end schizophrenia speech detection. It avoids manual feature extraction and combines the advantages of skip connections and an attention mechanism to discriminate between schizophrenia patients and controls. Results: We validate Sch-net through ablation experiments on a schizophrenia speech dataset containing 28 schizophrenia patients and 28 healthy controls, and compare it with models based on feature engineering and classic deep neural networks. The experimental results show that Sch-net performs well on the schizophrenia speech detection task, achieving 97.76% accuracy on the schizophrenia speech dataset. To further verify its generalization, Sch-net is tested on the open-access LANNA children's speech database for specific language impairment detection. Our code is available at https://github.com/Scu-sen/Sch-net. Conclusions: Extensive experiments show that the proposed Sch-net can provide supporting information for the diagnosis of schizophrenia speech and specific language impairment.
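Sch-net's full implementation is in the linked repository. The PyTorch block below is only a hedged illustration of the two ingredients the abstract names, a skip connection and an attention mechanism, here in a squeeze-and-excitation style that may differ from the paper's exact design.

```python
# Illustrative residual block with channel attention for 1-D speech features.
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )
        # Squeeze-and-excitation-style channel attention (an assumption,
        # not necessarily the paper's exact mechanism).
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, time)
        out = self.conv(x)
        out = out * self.attention(out)  # reweight channels
        return out + x                   # skip connection
```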


Author(s):  
Elena Lyakso ◽  
Olga Frolova ◽  
Aleksandr Nikolaev

"The study of the peculiarities of speech of children with atypical development is necessary for the development of educational programs, children’s socialization and adaptation in society. The aim of this study is to determine the acoustic features of voice and speech of children with autism spectrum disorders (ASD) as a possible additional diagnostic criterion. The multiplicity of symptomatology, different age of its manifestation, and the presence of a leading symptom complex individually for each child make it difficult to diagnose ASD. To determine the specificity of speech features of ASD, we analyzed the speech of children with developmental disabilities in which speech disorders accompany the disease - Down syndrome (DS), intellectual disabilities (ID), mixed specific developmental disorders (MDD). The features that reflect the main physiological processes occurring in the speech tract during voice and speech production are selected for analysis. The speech of 300 children aged 4-16 years was analyzed. Speech files are selected from the speech database ""AD_Child.Ru"" (Lyakso et al., 2019). Acoustic features of voice and speech, which are specific for different developmental disorders, were determined. The speech of ASD children is characterized by: high pitch values (high voice); pitch variability; high values for the third formant (emotional) and its intensity causing ""atypical"" spectrogram of the speech signal; high values of vowel articulation index (VAI). The speech of children with DS is characterized by the maximal duration of vowels in words; low pitch values (low voice); a wide range of values of the VAI depending on the difficulty of speech material; low values of the third formant; unformed most of consonant phonemes. The characteristics of speech of children with ID are: high values of vowel’s duration in words, the pitch, and the third formant, low values of the VAI; of MDD - low pitch values and high values of the VAI. Based on the identified peculiarities specific to each disease, the set of acoustic features specific to ASD can be considered as a biomarker of autism and used as an additional diagnostic criterion. This will allow a timely diagnose, appoint treatment and develop individual programs for children. Speech characteristics of children with ID, DS, and MDD can be considered to a greater extent in the training and socialization of children and used in the development of training programs taking into account individual peculiarities of children."


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1579 ◽  
Author(s):  
Kyoung Ju Noh ◽  
Chi Yoon Jeong ◽  
Jiyoun Lim ◽  
Seungeun Chung ◽  
Gague Kim ◽  
...  

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To deploy SER models in real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of SER models to unseen target domains. This study proposes a multi-path and group-loss-based network (MPGLN) for SER that supports multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a feature extractor transferred from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously from multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of MPGLN SER on multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-language Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed F1-score improvements of 3.7% and 3.5%, respectively, when comparing MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
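As a rough picture of the multi-path idea, the PyTorch sketch below fuses a BiLSTM temporal path with pre-extracted VGGish embeddings before classification. The dimensions, fusion by concatenation, and single classification head are assumptions; the actual MPGLN additionally trains with multiple group losses over discrete and dimensional emotion labels.

```python
# Two-path SER sketch: a learned temporal path plus a transferred-embedding path.
import torch
import torch.nn as nn

class MultiPathSER(nn.Module):
    def __init__(self, n_mels=64, vggish_dim=128, hidden=64, n_classes=4):
        super().__init__()
        self.temporal = nn.LSTM(n_mels, hidden, batch_first=True,
                                bidirectional=True)
        self.classifier = nn.Linear(2 * hidden + vggish_dim, n_classes)

    def forward(self, mel_frames, vggish_emb):
        # mel_frames: (batch, time, n_mels); vggish_emb: (batch, vggish_dim)
        _, (h, _) = self.temporal(mel_frames)
        temporal_feat = torch.cat([h[-2], h[-1]], dim=1)  # fwd + bwd final states
        return self.classifier(torch.cat([temporal_feat, vggish_emb], dim=1))
```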

