speech processing
Recently Published Documents





2022 ◽  
Vol 72 ◽  
pp. 101308
Aamir Wali ◽  
Zareen Alamgir ◽  
Saira Karim ◽  
Ather Fawaz ◽  
Mubariz Barkat Ali ◽  

Muneera Altayeb ◽  
Amani Al-Ghraibah

<span>Determining and classifying pathological human sounds remains an interesting area of research in the field of speech processing. This paper explores different methods of voice feature extraction, namely Mel-frequency cepstral coefficients (MFCCs), zero-crossing rate (ZCR) and the discrete wavelet transform (DWT). These methods are compared on their ability to classify an input sound as a normal or pathological voice using a support vector machine (SVM). First, the voice signal is processed and filtered; then vocal features are extracted using the proposed methods; finally, six groups of features are used in separate classification processes to label the voice data as healthy, hyperkinetic dysphonia, hypokinetic dysphonia, or reflux laryngitis. Classification accuracy reaches 100% using the MFCC and kurtosis feature group, while the other feature groups range between ~60% and ~97%. The wavelet features provide very good classification results in comparison with common voice features such as MFCC and ZCR. This work aims to improve the diagnosis of voice disorders without the need for surgical interventions and endoscopic procedures, which consume time and burden patients. The comparison between the proposed feature extraction methods also offers a good reference for further research in the voice classification area.</span>
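Two of the feature types named in the abstract, zero-crossing rate and kurtosis, are simple enough to sketch directly. The snippet below is a minimal numpy illustration, not the paper's pipeline: the pure tone and Gaussian noise stand in for voiced and dysphonic-like signals, and the classifier stage (SVM) is omitted.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive samples where the waveform changes sign."""
    signs = np.sign(signal)
    # Treat exact zeros as positive so a flat signal has no crossings.
    signs[signs == 0] = 1
    return np.mean(signs[:-1] != signs[1:])

def kurtosis(signal):
    """Excess kurtosis: peakedness of the amplitude distribution."""
    x = signal - np.mean(signal)
    var = np.mean(x**2)
    return np.mean(x**4) / var**2 - 3.0

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)                 # smooth periodic "voicing"
noise = np.random.default_rng(0).normal(size=sr)   # noisy excitation

# A 220 Hz tone crosses zero far less often than broadband noise,
# and a sine's amplitude distribution is flatter (excess kurtosis -1.5).
print(zero_crossing_rate(tone) < zero_crossing_rate(noise))  # True
print(kurtosis(tone) < kurtosis(noise))                      # True
```

In the paper these per-signal features (together with MFCC and DWT coefficients) would be stacked into feature vectors and passed to an SVM for the four-way diagnosis.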

2022 ◽  
Vol 12 (1) ◽  
pp. 107
Martin Chavant ◽  
Zoï Kapoula

Presbycusis, physiological age-related hearing loss, is a major health problem: it is the most common cause of hearing impairment, and its impact will grow in the coming years as the population ages. Beyond its auditory consequences, the literature of the last two decades has found an association between hearing loss and cognitive decline, emphasizing the importance of early detection of presbycusis. However, current hearing tests do not detect presbycusis in some cases. Furthermore, the mechanisms underlying this association are still under discussion, calling for a new field of research on the topic. In that context, this study investigates for the first time the interaction between presbycusis, eye movement latency and Stroop scores in a normally aging population. Hearing ability, eye movement latency and the Stroop Victoria test were measured for 69 elderly (mean age 66.7 ± 8.4 years) and 30 young (mean age 25.3 ± 2.7 years) participants. The results indicated a significant relationship between saccade latency and the speech-audiometry-in-silence score, independent of age. These promising results suggest common attentional mechanisms between speech processing and saccade latency. The results are discussed with regard to the relationship between hearing and cognition, and to the prospect of developing new tools for presbycusis diagnosis.

Chieh Kao ◽  
Maria D. Sera ◽  
Yang Zhang

Purpose: The aim of this study was to investigate infants' listening preference for emotional prosodies in spoken words and to identify their acoustic correlates. Method: Forty-six 3- to 12-month-old infants (M age = 7.6 months) completed a central fixation (or look-to-listen) paradigm in which four emotional prosodies (happy, sad, angry, and neutral) were presented. Infants' looking time to the string of words was recorded as a proxy of their listening attention. Five acoustic variables—mean fundamental frequency (F0), word duration, intensity variation, harmonics-to-noise ratio (HNR), and spectral centroid—were also analyzed to account for infants' attentiveness to each emotion. Results: Infants generally preferred affective over neutral prosody, with more listening attention to the happy and sad voices. Happy sounds with breathy voice quality (low HNR) and less brightness (low spectral centroid) maintained infants' attention more. Sad speech with shorter word duration (i.e., faster speech rate), less breathiness, and more brightness gained infants' attention more than happy speech did. Infants listened less to angry than to happy and sad prosodies, and none of the acoustic variables were associated with infants' listening interest in angry voices. Neutral words with a lower F0 attracted infants' attention more than those with a higher F0. Neither age nor sex effects were observed. Conclusions: This study provides evidence for infants' sensitivity to the prosodic patterns of the basic emotion categories in spoken words and for how the acoustic properties of emotional speech may guide their attention. The results point to the need to study the interplay between early socioaffective and language development.
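Of the five acoustic variables analyzed, spectral centroid ("brightness") has a particularly direct definition: the amplitude-weighted mean frequency of the spectrum. The sketch below is a minimal numpy illustration on synthetic tones, not the study's measurement pipeline; F0 tracking and HNR estimation require dedicated algorithms and are not shown.

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Amplitude-weighted mean frequency of the magnitude spectrum (Hz)."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

sr = 16000
t = np.arange(sr) / sr
dark = np.sin(2 * np.pi * 200 * t)                  # all energy at 200 Hz
bright = dark + 0.8 * np.sin(2 * np.pi * 3000 * t)  # added high-frequency energy

# Adding high-frequency energy pulls the centroid upward: the signal
# with the 3 kHz component is "brighter" in the sense used above.
print(spectral_centroid(dark, sr) < spectral_centroid(bright, sr))  # True
```

In the study, a centroid like this would be computed per spoken word and related to infants' looking time for each emotional prosody.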

Vikram Ramanarayanan ◽  
Adam C. Lammert ◽  
Hannah P. Rowe ◽  
Thomas F. Quatieri ◽  
Jordan R. Green

Purpose: Over the past decade, the signal processing and machine learning literature has demonstrated notable advancements in automated speech processing with the use of artificial intelligence for medical assessment and monitoring (e.g., depression, dementia, and Parkinson's disease, among others). Meanwhile, the clinical speech literature has identified several interpretable, theoretically motivated measures that are sensitive to abnormalities in the cognitive, linguistic, affective, motoric, and anatomical domains. Both fields have, thus, independently demonstrated the potential for speech to serve as an informative biomarker for detecting different psychiatric and physiological conditions. However, despite these parallel advancements, automated speech biomarkers have not been integrated into routine clinical practice to date. Conclusions: In this article, we present opportunities and challenges for adoption of speech as a biomarker in clinical practice and research. Toward clinical acceptance and adoption of speech-based digital biomarkers, we argue for the importance of several factors such as robustness, specificity, diversity, and physiological interpretability of speech analytics in clinical applications.

Mads Midtlyng ◽  
Yuji Sato ◽  
Hiroshi Hosobe

Abstract: Voice adaptation is an interactive speech processing technique that allows a speaker to transmit with a chosen target voice. We propose a novel method intended for dynamic scenarios, such as online video games, where the source and target speakers' data are nonaligned. This could greatly improve immersion and experience by letting players fully become a character, and it addresses privacy concerns by disguising the voice to protect against harassment. With unaligned data, traditional methods such as probabilistic models become inaccurate, while recent methods such as deep neural networks (DNNs) require substantial preparation work. Common methods require multiple subjects to be trained in parallel, which constrains practicality in production environments. Our proposal trains a subject nonparallel into a voice profile that can be used against any unknown source speaker. Prosodic data such as pitch, power and temporal structure are encoded into RGBA-colored frames and used in a multi-objective optimization problem to adjust interrelated features based on color likeness. Finally, frames are smoothed and adjusted before output. The method was evaluated using Mean Opinion Score, ABX, MUSHRA, Single Ease Question and performance benchmarks with two voice profiles of varying sizes, followed by a discussion of game implementation. Results show improved adaptation quality, especially with the larger voice profile, and audiences are positive about using such technology in future games.
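The idea of encoding prosodic features into RGBA frames can be sketched in a few lines. The channel assignment below (R = pitch, G = power, B unused, A = opacity) and the value ranges are illustrative assumptions; the paper's actual encoding, including how temporal structure is packed, is not reproduced here.

```python
import numpy as np

def prosody_to_rgba(pitch_hz, power_db,
                    pitch_range=(50.0, 400.0), power_range=(-60.0, 0.0)):
    """Pack per-frame pitch and power into 8-bit RGBA channels.

    Hypothetical layout: R = normalized pitch, G = normalized power,
    B unused, A = fully opaque. Values outside the ranges are clipped.
    """
    def to_byte(x, lo, hi):
        return np.clip((np.asarray(x) - lo) / (hi - lo) * 255, 0, 255).astype(np.uint8)

    r = to_byte(pitch_hz, *pitch_range)
    g = to_byte(power_db, *power_range)
    b = np.zeros_like(r)
    a = np.full_like(r, 255)
    return np.stack([r, g, b, a], axis=-1)

# Two analysis frames: 120 Hz at -30 dB, then 225 Hz at -6 dB.
frames = prosody_to_rgba([120.0, 225.0], [-30.0, -6.0])
print(frames.shape)  # (2, 4)
```

Once prosody lives in color space, "color likeness" between source frames and the stored voice profile becomes an ordinary distance that a multi-objective optimizer can minimize.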

2022 ◽  
Vol 26 ◽  
pp. 233121652110609
Benjamin Caswell-Midwinter ◽  
Elizabeth M. Doney ◽  
Meisam K. Arjmandi ◽  
Kelly N. Jahn ◽  
Barbara S. Herrmann ◽  

Cochlear implant programming typically involves measuring electrode impedance, selecting a speech processing strategy and fitting the dynamic range of electrical stimulation. This study retrospectively analyzed a clinical dataset of adult cochlear implant recipients to understand how these variables relate to speech recognition. Data from 425 implanted post-lingually deafened ears with Advanced Bionics devices were analyzed. A linear mixed-effects model was used to infer how impedance, programming and patient factors were associated with monosyllabic word recognition scores measured in quiet. Additional analyses were conducted on subsets of data to examine the role of speech processing strategy on scores, and the time taken for the scores of unilaterally implanted patients to plateau. Variation in basal impedance was negatively associated with word score, suggesting the importance of evaluating the impedance profile. While there were small, negative bivariate correlations between programming level metrics and word scores, these relationships were not clearly supported by the model that accounted for other factors. Age at implantation was negatively associated with word score, and duration of implant experience was positively associated with word score, which could help to inform candidature and guide expectations. Electrode array type was also associated with word score. Word scores measured with traditional continuous interleaved sampling and current steering speech processing strategies were similar. The word scores of unilaterally implanted patients largely plateaued within 6 months of activation. However, there was individual variation, which was not related to the initially measured impedance and programming levels.
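The "variation in basal impedance" predictor can be sketched as a per-ear summary statistic. The snippet below uses synthetic numbers, not the study's dataset, and computes only a bivariate correlation; the study itself fitted a linear mixed-effects model so that patient factors and repeated measures are accounted for.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ears = 8

# Hypothetical data: impedance (kOhm) on 4 basal electrodes per ear,
# plus a word recognition score (%) per ear.
basal_impedance = rng.uniform(5.0, 12.0, size=(n_ears, 4))
word_score = rng.uniform(40.0, 90.0, size=n_ears)

# One summary predictor per ear: spread of impedance across basal electrodes.
impedance_spread = basal_impedance.std(axis=1)

# Bivariate association between impedance spread and word score.
r = np.corrcoef(impedance_spread, word_score)[0, 1]
print(f"correlation: {r:.3f}")
```

In the study's framing, a negative coefficient on this spread term in the mixed model is what motivates evaluating the impedance profile, not just its mean level.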

Kartik Tiwari

Abstract: This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit called ESPnet-TTS, an open-source extension of the ESPnet speech processing toolkit. ESPnet-TTS supports various models, including Tacotron 2, Transformer TTS, and FastSpeech. It also provides recipes in the style of the Kaldi speech recognition (ASR) toolkit, designed to be consistent with the ESPnet ASR recipes, which provide high performance. The toolkit additionally ships pre-trained models and samples for all recipes that users can adopt as a baseline. It supports TTS, STT and translation features for various Indian languages, with a strong focus on English, Marathi and Hindi. This paper also shows that neural sequence-to-sequence models achieve state-of-the-art or near-state-of-the-art results on existing databases. We also analyze some of the key design challenges in developing a multilingual business translation system, including processing bilingual business datasets and evaluating multiple translation methods. Test results show that our models achieve performance comparable to the latest toolkits on the LJ Speech dataset. Index Terms: open source, end-to-end, text-to-speech
