LSTM Deep Neural Networks Postfiltering for Enhancing Synthetic Voices

Author(s): Marvin Coto-Jiménez, John Goddard-Close

Recent developments in speech synthesis have produced systems capable of generating speech which closely resembles natural recordings, and researchers now strive to create models that more accurately mimic human voices. One such development is the incorporation of multiple linguistic styles in various languages and accents. Speech synthesis based on Hidden Markov Models (HMM) is of great interest to researchers, due to its ability to produce sophisticated features with a small footprint. Despite some progress, its quality has not yet reached the level of the currently predominant unit-selection approaches, which select and concatenate recordings of real speech, and work has been conducted to try to improve HMM-based systems. In this paper, we present an application of long short-term memory (LSTM) deep neural networks as a postfiltering step in HMM-based speech synthesis. Our motivation stems from a similar desire to obtain characteristics which are closer to those of natural speech. The paper analyzes four types of postfilters obtained using five voices, ranging from a single postfilter that enhances all the parameters to a multi-stream proposal which enhances groups of parameters separately. The different proposals are evaluated using three objective measures and are statistically compared to determine whether the differences between them are significant. The results indicate that HMM-based voices can be enhanced using this approach, especially with the multi-stream postfilters on the considered objective measures.
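To make the postfiltering idea concrete, the following is a minimal sketch, assuming a Keras LSTM trained to map parameter trajectories generated by the HMM synthesizer toward time-aligned parameters extracted from natural recordings; the enhanced trajectories would then drive the vocoder. The frame dimension, sequence length, and layer sizes are hypothetical, not the configuration reported in the paper, and a multi-stream variant would train one such network per parameter group (e.g., spectral envelope, F0) rather than a single network for all parameters.

```python
# Sketch of an LSTM postfilter: all layer sizes, dimensions, and data
# shapes below are illustrative assumptions, not the paper's configuration.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

FRAME_DIM = 40   # assumed size of one acoustic parameter vector per frame
SEQ_LEN = 100    # assumed number of frames per training sequence

def build_postfilter():
    """LSTM network mapping HMM-generated parameter trajectories toward
    the corresponding natural-speech trajectories, frame by frame."""
    model = Sequential([
        LSTM(256, return_sequences=True, input_shape=(SEQ_LEN, FRAME_DIM)),
        LSTM(256, return_sequences=True),
        Dense(FRAME_DIM),  # linear output: one enhanced parameter vector per frame
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Training pairs: X holds parameters generated by the HMM system,
# Y the time-aligned parameters extracted from natural recordings.
X = np.random.randn(32, SEQ_LEN, FRAME_DIM).astype("float32")  # placeholder data
Y = np.random.randn(32, SEQ_LEN, FRAME_DIM).astype("float32")  # placeholder data

postfilter = build_postfilter()
postfilter.fit(X, Y, epochs=1, batch_size=8)
enhanced = postfilter.predict(X)  # enhanced trajectories go to the vocoder
```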

Sensors, 2021, Vol. 21 (3), pp. 676
Author(s): Andrej Zgank

Animal activity acoustic monitoring is becoming one of the necessary tools in agriculture, including beekeeping, where it can assist in the control of beehives in remote locations, for example by classifying bee swarm activity from audio signals. An IoT-based deep neural network (DNN) acoustic swarm classification system is proposed in this paper. Audio recordings were obtained from the Open Source Beehive project, and Mel-frequency cepstral coefficient (MFCC) features were extracted from the audio signal. The lossless WAV and lossy MP3 audio formats were compared for IoT-based solutions, and the impact of the deep neural network parameters on the classification results was analyzed. The best overall classification accuracy with uncompressed audio was 94.09%, while MP3 compression degraded the DNN accuracy by over 10%. The evaluation of the proposed IoT-based bee activity acoustic classification showed improved results compared to a previous hidden Markov model system.
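As a rough illustration of such a pipeline (not the paper's exact setup), the sketch below extracts MFCC summary statistics with librosa and feeds them to a small feed-forward classifier; the MFCC settings, network shape, and two-class labeling are assumptions.

```python
# Sketch of MFCC-based bee activity classification; the MFCC settings,
# network shape, and two-class setup are assumptions for illustration.
import numpy as np
import librosa
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load an audio file (WAV or MP3) and return a fixed-length vector:
    the mean and standard deviation of each MFCC over time."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

model = Sequential([
    Dense(128, activation="relu", input_shape=(26,)),  # 13 means + 13 stds
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),  # swarm vs. non-swarm activity
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical usage with labeled Open Source Beehive recordings:
# X = np.stack([mfcc_features(p) for p in audio_paths]); model.fit(X, y, epochs=20)
```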


2020, Vol. 49 (4), pp. 482–494
Author(s): Jurgita Kapočiūtė-Dzikienė, Senait Gebremichael Tesfagergish

Deep Neural Networks (DNNs) have proven to be especially successful in the area of Natural Language Processing (NLP), including Part-Of-Speech (POS) tagging, the process of mapping words to their corresponding POS labels depending on the context. Despite recent developments in language technologies, low-resourced languages, such as the East African language Tigrinya, have received little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language of the Northern-Ethiopic branch. We selected Tigrinya as the testbed example and tested state-of-the-art DL approaches seeking to build the most accurate POS tagger. We evaluated DNN classifiers (Feed-Forward Neural Network, FFNN; Long Short-Term Memory, LSTM; Bidirectional LSTM; and Convolutional Neural Network, CNN) on top of neural word2vec word embeddings with a small training corpus, the Nagaoka Tigrinya Corpus. To determine the best DNN classifier type, architecture, and hyper-parameter set, both manual and automatic hyper-parameter tuning were performed. The BiLSTM method proved the most suitable for our task: it achieved the highest accuracy, 92%, which is 65% above the random baseline.
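A minimal sketch of the winning configuration's general shape, assuming a Keras BiLSTM over frozen pretrained word2vec embeddings; the vocabulary size, tagset size, and all dimensions are illustrative, not the tuned values from the paper.

```python
# Sketch of a BiLSTM POS tagger over pretrained word2vec embeddings.
# Vocabulary size, tagset size, and dimensions are illustrative assumptions.
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

VOCAB_SIZE = 20000   # assumed vocabulary of the training corpus
EMB_DIM = 100        # assumed word2vec dimensionality
N_TAGS = 20          # assumed size of the POS tagset

# Placeholder for a matrix of pretrained word2vec vectors, one row per word;
# index 0 is reserved for padding.
w2v_matrix = np.random.randn(VOCAB_SIZE, EMB_DIM).astype("float32")

model = Sequential([
    Embedding(VOCAB_SIZE, EMB_DIM,
              embeddings_initializer=Constant(w2v_matrix),
              mask_zero=True, trainable=False),  # frozen word2vec lookup
    Bidirectional(LSTM(128, return_sequences=True)),
    TimeDistributed(Dense(N_TAGS, activation="softmax")),  # one POS label per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical usage: X is (n_sentences, max_len) padded word indices and
# y is (n_sentences, max_len) tag indices, e.g. derived from the Nagaoka corpus.
# model.fit(X, y, epochs=10, batch_size=32)
```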


Sensors, 2019, Vol. 19 (21), pp. 4768
Author(s): Zhaoqiong Huang, Ji Xu, Zaixiao Gong, Haibin Wang, Yonghong Yan

Deep neural networks (DNNs) have been shown to be effective for single sound source localization in shallow water environments. However, multiple source localization is a more challenging task because of the interactions among multiple acoustic signals. This paper proposes a framework for multiple source localization on underwater horizontal arrays using deep neural networks. Two-stage DNNs are adopted to determine the directions and ranges of multiple sources successively: a feed-forward neural network is trained for direction finding, while a long short-term memory (LSTM) recurrent neural network is used for source ranging. In the source ranging stage, subarray beamforming is performed to extract features of the sources detected by the direction finding stage, because subarray beamforming can enhance the mixed signal in the desired direction while preserving the horizontal-longitudinal correlations of the acoustic field. In this way, a universal model trained in the single-source scenario can be applied to multi-source scenarios with arbitrary numbers of sources. Both simulations and experiments in the range-independent shallow water environment of SWellEx-96 Event S5 demonstrate the effectiveness of the proposed method.
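The following sketch shows one plausible shape for such a two-stage pipeline; it is an assumption-laden illustration, not the authors' implementation. Stage one treats direction finding as multi-label classification over a direction grid, so several simultaneous sources can be detected at once, and stage two ranges each detected source from subarray-beamformed features; the delay-and-sum beamformer, array geometry, and all feature sizes are hypothetical.

```python
# Sketch of the two-stage structure: a feed-forward net for direction
# finding on array covariance features, then an LSTM ranging net fed with
# subarray-beamformed features. All dimensions, features, and acoustic
# constants below are assumptions for illustration.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

N_SENSORS = 32      # assumed horizontal array size
N_DOA_BINS = 180    # direction grid at 1-degree resolution
N_SUBARRAYS = 8     # subarrays along the horizontal array
N_FEATS = 64        # assumed feature size per subarray snapshot
N_RANGE_BINS = 100  # discretized source ranges

# Stage 1: direction finding as multi-label classification over DOA bins,
# so several simultaneous sources can be detected in a single pass.
doa_net = Sequential([
    Dense(512, activation="relu",
          input_shape=(2 * N_SENSORS * N_SENSORS,)),  # re/im of covariance matrix
    Dense(512, activation="relu"),
    Dense(N_DOA_BINS, activation="sigmoid"),
])
doa_net.compile(optimizer="adam", loss="binary_crossentropy")

# Stage 2: ranging from the sequence of subarray features steered toward
# one detected direction; trainable on single-source data alone.
range_net = Sequential([
    LSTM(256, input_shape=(N_SUBARRAYS, N_FEATS)),
    Dense(N_RANGE_BINS, activation="softmax"),
])
range_net.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

def beamform(subarray, direction_deg, freq=200.0, spacing=10.0, c=1500.0):
    """Delay-and-sum beamforming of one complex subarray snapshot toward a
    detected direction, enhancing that source before the ranging stage."""
    n = subarray.shape[0]
    k = 2.0 * np.pi * freq / c  # wavenumber at the assumed frequency
    steering = np.exp(-1j * k * spacing * np.arange(n)
                      * np.cos(np.radians(direction_deg)))
    return np.vdot(steering, subarray) / n
```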

