single speaker
Recently Published Documents

TOTAL DOCUMENTS: 70 (FIVE YEARS: 27)
H-INDEX: 9 (FIVE YEARS: 2)

2021 ◽ Vol 13 (1) ◽ pp. 35-50
Author(s): Primož Vitez

The system of French accentuation is a relevant case of language change, observable over a relatively short period of time within a stable synchrony. Since the mid-twentieth century, the formation of linguistic norms has largely depended on a specific type of utterance, media discourse, continually available in spoken audio-visual media. The impact of spoken media on the development of linguistic expression in the last few decades is unprecedented in language history. It is based on a communicational model in which speech is produced by a single speaker and instantly perceived by a multitude of receivers who have no possibility of intervening in the communicational process. The receivers are thus passively exposed to an exclusive speaker and to language strategies conceived by the media and its linguistic authority. The analysis of two professional spoken interventions, uttered on French television, shows an important modification of the traditional accentual system: while conserving the final accent (FA), the speakers systematically introduce an initial accent (IA), a landmark in the evolution of the French language and its normative features. The IA affects the first syllable of a stressed lexeme or the first syllable of an extended accentual unit, regardless of the syntactic function of the stressed morpheme. The FA is realized by intonational action, while the IA appears to be realized by an increase in vocal intensity. The automatism of lexical stressing generates a systematic accentuation of the first syllable of the accentual unit. The IA mostly affects lexemes that speakers emphasize because of their informative value (numerals, adverbs, proper names), but an important share of IA concerns various proclitics, such as deictic elements, articles and determiners. Accentual delimitation of the unit on both sides is a specific feature of speech in French audio-visual media. In recent decades it has found its echo in the normative speech of French linguistic communities.
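The claim that the IA is carried chiefly by vocal intensity while the FA is carried by intonation is an acoustic one and can be checked instrumentally. Below is a minimal sketch, not from the paper, that compares mean RMS intensity and F0 over hand-labeled syllable intervals using librosa; the file name and boundary times are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's method): compare mean RMS
# intensity and F0 between an initial-accent and a final-accent syllable.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical recording
hop = 160  # 10 ms frames at 16 kHz

rms = librosa.feature.rms(y=y, frame_length=400, hop_length=hop)[0]
f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)

def mean_over(interval, values):
    """Average a frame-level track over a (start_s, end_s) interval."""
    lo, hi = (int(t * sr / hop) for t in interval)
    return float(np.nanmean(values[lo:hi]))

# Hypothetical hand-labeled syllable boundaries (seconds) for one accentual unit.
initial_syllable = (0.10, 0.24)
final_syllable = (0.62, 0.81)

print("IA syllable: RMS=%.4f  F0=%.1f Hz"
      % (mean_over(initial_syllable, rms), mean_over(initial_syllable, f0)))
print("FA syllable: RMS=%.4f  F0=%.1f Hz"
      % (mean_over(final_syllable, rms), mean_over(final_syllable, f0)))
```

Under the abstract's hypothesis, the IA syllable would show a raised RMS value relative to its neighbors, while the FA syllable would show the larger F0 movement.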


Author(s): Zolzaya Byambadorj ◽ Ryota Nishimura ◽ Altangerel Ayush ◽ Kengo Ohta ◽ Norihide Kitaoka

Abstract
Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
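The cross-lingual transfer step can be pictured as reusing all pretrained weights except the phoneme embedding, which is re-initialized for the target-language inventory before fine-tuning on the small target set. The PyTorch sketch below is a minimal illustration under that assumption; the toy SpectrogramPredictor stands in for the actual Tacotron-style network and is not the authors' code.

```python
# Minimal transfer-learning sketch (assumptions, not the authors' code):
# pretrain on a high-resource language, swap the phoneme embedding for the
# target inventory, then fine-tune on the 30-minute target set.
import torch
import torch.nn as nn

class SpectrogramPredictor(nn.Module):
    """Toy stand-in for a Tacotron-style phoneme-to-mel network."""
    def __init__(self, vocab_size, emb_dim=256, hidden=512, n_mels=80):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phonemes):
        x = self.embedding(phonemes)
        x, _ = self.encoder(x)
        return self.proj(x)

# 1) "Pretrained" source-language model (e.g., English, 24 h).
src_model = SpectrogramPredictor(vocab_size=70)
state = src_model.state_dict()

# 2) Target-language model: reuse all weights except the phoneme embedding,
#    which is re-initialized for the target phoneme inventory.
tgt_model = SpectrogramPredictor(vocab_size=55)
transferable = {k: v for k, v in state.items() if not k.startswith("embedding")}
tgt_model.load_state_dict(transferable, strict=False)

# 3) Fine-tune on the small (30 min, possibly augmented) target set,
#    typically with a reduced learning rate.
optimizer = torch.optim.Adam(tgt_model.parameters(), lr=1e-4)
phonemes = torch.randint(0, 55, (2, 40))   # dummy batch of phoneme IDs
target_mels = torch.randn(2, 40, 80)       # dummy mel targets
loss = nn.functional.mse_loss(tgt_model(phonemes), target_mels)
loss.backward()
optimizer.step()
```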


2021
Author(s): Masanari Nakamura ◽ Kento Fujimoto ◽ Hiroaki Murakami ◽ Hiromichi Hashizume ◽ Masanori Sugimoto

2021 ◽ pp. 1-13
Author(s): Hamzah A. Alsayadi ◽ Abdelaziz A. Abdelhamid ◽ Islam Hegazy ◽ Zaki T. Fayed

The Arabic language has a set of vowel marks called diacritics, which play an essential role in the meaning of words and their articulation; a change in some diacritics leads to a change in the meaning of the sentence. However, the presence of these marks in the corpus transcription affects the accuracy of speech recognition. In this paper, we investigate the effect of diacritics on Arabic speech recognition based on end-to-end deep learning. The applied end-to-end approach includes the CNN-LSTM and attention-based techniques provided by the state-of-the-art framework Espresso, built on PyTorch. In addition, to the best of our knowledge, a CNN-LSTM with attention approach has not previously been applied to Arabic automatic speech recognition (ASR). To fill this gap, this paper proposes a new approach based on CNN-LSTM with an attention-based method for Arabic ASR. The language model in this approach is trained using RNN-LM and LSTM-LM on the non-diacritized transcription of the speech corpus. The Standard Arabic Single Speaker Corpus (SASSC), after omitting the diacritics, is used to train and test the deep learning model. Experimental results show that the removal of diacritics decreases the out-of-vocabulary rate and the perplexity of the language model. In addition, the word error rate (WER) is significantly improved compared to diacritized data, with an achieved average reduction in WER of 13.52%.
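The preprocessing step of omitting diacritics amounts to stripping a small set of Unicode combining marks from the transcriptions. Below is a minimal sketch assuming the standard harakat range U+064B-U+0652 plus the superscript alef U+0670; this is an illustration, not the authors' pipeline.

```python
# Minimal sketch (assumed Unicode ranges, not the authors' code) of
# diacritic removal from Arabic transcriptions.
import re

# Arabic harakat: fathatan..sukun (U+064B-U+0652), plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def remove_diacritics(text: str) -> str:
    """Strip short-vowel and gemination marks from Arabic text."""
    return DIACRITICS.sub("", text)

print(remove_diacritics("كَتَبَ"))  # -> كتب
```

Collapsing diacritized variants onto one surface form in this way shrinks the vocabulary, which is consistent with the reported drop in out-of-vocabulary rate and language-model perplexity.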


2021 ◽ Vol 108 (Supplement_6)
Author(s): Z Shakoor ◽ C West

Abstract
Aim: (1) Assess performance in surgical handovers at Southampton General Hospital (SGH) against RCS 'Safe Handover' guidelines; (2) identify any areas for improvement to ensure safe and effective handover of surgical patients.
Method: Ten evening surgical handovers were anonymously audited in October 2019 against the RCS 'Safe Handover' guidelines. The results were subsequently analysed and circulated amongst the surgical department. Handovers were then led consistently by surgical registrars and advanced nurse practitioners (ANPs). A prompt including the RCS handover guidelines was produced, distributed to all members of the surgical team and included in departmental inductions. Following this, a further ten evening handovers were anonymously audited between July and August 2020.
Results: In the re-audit, many of the handover performance descriptors described by the RCS improved following the circulation of our prompt, which included the RCS handover guidelines and examples of minimum or good standards of practice for handover. Specifically, handover timeliness, the briefings provided (100% from 70%), the audibility of a single speaker (70% from 30%), the number of educational discussions held during handovers (100% from 50%) and awareness of the on-call overnight consultant (100% from 80%) all improved markedly.
Conclusions: Emphasis on undertaking effective handovers needs to continue, as 'safe' handovers between shifts can protect both patient and doctor safety. This is especially true following the implementation of the European Working Time Directive (EWTD) and the move to full-shift working. Handovers are also proposed as opportunities for training, which may be especially helpful in an era of reduced surgical training hours.


2021 ◽ pp. 29-54
Author(s): Eric S. Henry

This chapter explores how Mandarin, Dongbeihua, and English are constituted and enacted in everyday forms of discourse in Shenyang, and also how they interact in linguistically complex ways. Despite their presumably separate status as unique linguistic codes, they frequently manifest themselves in the voice of a single speaker, although at differing times and in differing contexts. The chapter shows how their coherence as separate codes is not given beforehand but is rather a product of metapragmatic discourse that regiments and organizes speech ideologically into different orders of indexicality. Various types of speech are tied to distinctions and categories in the sociocultural field, enregistering equivalences between hierarchically ranked linguistic categories and stratified social categories. It is therefore the comparability between linguistic codes, and the ways they point to contrasting stances and social roles, that is of interest here.


Author(s): Hodaya Hammer ◽ Shlomo E. Chazan ◽ Jacob Goldberger ◽ Sharon Gannot

Abstract
In this study, we present a deep neural network-based online multi-speaker localization algorithm based on a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is assumed to be dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by applying the obtained TF masks.
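The core pipeline, per-TF-bin spatial features feeding a per-bin DOA classifier whose outputs form separation masks, can be sketched as follows. The inter-channel phase difference used here is one common instantaneous spatial feature, and the crude thresholding stands in for the trained fully convolutional network; both are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (illustrative) of per-TF-bin DOA features and mask-based
# extraction for a two-microphone recording.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.random.randn(2, fs)               # dummy 2-mic recording (channels x samples)

_, _, X = stft(x, fs=fs, nperseg=512)     # X shape: (2, freqs, frames)
ipd = np.angle(X[1] * np.conj(X[0]))      # inter-channel phase difference per TF bin

# Stand-in for the trained fully convolutional classifier: threshold the IPD
# into two crude "DOA classes" purely for illustration.
doa_class = (ipd > 0).astype(int)         # per-bin DOA label, shape (freqs, frames)

# Binary TF mask keeping bins assigned to class 0, applied to channel 0.
mask = (doa_class == 0)
_, y0 = istft(X[0] * mask, fs=fs, nperseg=512)
```

In the paper's setting the per-bin labels come from a network trained on simulated spatial data, and the same masks that localize each speaker double as separation masks, which is the byproduct the abstract mentions.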


2021 ◽ Vol 9 (1) ◽ pp. 40-49
Author(s): Dana Osborne

Abstract
This analysis examines the ways in which a single speaker, Ana, born in mid-century East Los Angeles, organizes and reflects upon her experiences of the city through language. Ana’s story sheds light on the experiences of many Mexican Americans who came of age at a critical time in a transitioning L.A., and on the slow movement of people who, until mid-century, had been relegated largely to racially and socioeconomically segregated parts of L.A. These formative experiences are demonstrated to have informed the ways that speakers parse the social and geographical landscape along several dimensions, and this analysis interrogates the symbolic value of a special category of everyday language, deixis, to reveal the intersection between language and social experience in the cityscape of L.A. In this way, it is analytically possible not only to approach the habituation and reproduction of specific deictic fields as indexical of the ways that speakers parse the city, but also to demonstrate the ways in which key moments in the history of the city have shaped the emergence and meaning of those fields.


Author(s): Bracha Laufer-Goldshtein ◽ Ronen Talmon ◽ Sharon Gannot

Abstract
Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.
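For the second method's vertex-finding step, the successive projection algorithm (SPA) is one standard convex-geometry tool for sequentially detecting simplex vertices; whether the paper uses SPA specifically is an assumption, as the abstract only says "convex geometry tools". A minimal sketch:

```python
# Minimal SPA sketch (one standard convex-geometry tool; an assumption, not
# necessarily the paper's exact method): pick the frames whose correlation
# vectors sit at the simplex vertices, i.e., single-speaker-dominated frames.
import numpy as np

def successive_projection(V, k):
    """Select k columns of V (one correlation vector per frame) that best
    approximate the simplex vertices."""
    R = V.astype(float).copy()
    vertices = []
    for _ in range(k):
        j = np.argmax(np.linalg.norm(R, axis=0))   # farthest remaining column
        vertices.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                    # project out that direction
    return vertices

# Toy example: 3 speakers, 200 frames of 3-dim correlation vectors, most of
# them dominated by a single speaker (sparse Dirichlet weights).
rng = np.random.default_rng(0)
W = rng.dirichlet(np.ones(3) * 0.2, size=200).T    # activity weights (3 x 200)
frames = np.eye(3) @ W                             # points inside the simplex
print(successive_projection(frames, 3))            # indices of vertex frames
```

Once such single-speaker frames are identified, their correlation functions recover the per-speaker activity probabilities that drive the spatial mask, as described in the abstract.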

