ASR System
Recently Published Documents


TOTAL DOCUMENTS: 153 (five years: 65)

H-INDEX: 10 (five years: 2)

Author(s): Deepang Raval, Vyom Pathak, Muktan Patel, Brijesh Bhatt

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach comprising Convolutional Neural Network and Bidirectional Long Short-Term Memory layers, dense layers, and Connectionist Temporal Classification as the loss function. To improve the performance of the system given the limited size of the dataset, we present a prefix decoding technique based on a combined language model (word-level and character-level) and a post-processing technique based on Bidirectional Encoder Representations from Transformers. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us understand and improve the ASR system, and provide intuition into the language used for the ASR system. We trained the model on the Microsoft Speech Corpus and observe a 5.87% decrease in Word Error Rate (WER) with respect to the base-model WER.
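The reported gain is measured in Word Error Rate. As a point of reference, a minimal sketch of the standard WER computation (word-level Levenshtein distance normalised by reference length; an illustration, not the authors' code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A character-level variant of the same recurrence yields the Character Error Rate used for languages with ambiguous word segmentation.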


2022, Vol 12 (2), pp. 804
Author(s): Pau Baquero-Arnal, Javier Jorge, Adrià Giménez, Javier Iranzo-Sánchez, Alejandro Pérez, ...

This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81 ± 0.09 s (mean ± stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.
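The trade-off quoted above (a smaller context window costing about 6% relative WER) and the latency summary (mean ± stdev) reduce to simple arithmetic; a minimal sketch, with hypothetical per-utterance latency samples chosen only for illustration:

```python
from statistics import mean, stdev

def relative_degradation(base_wer: float, other_wer: float) -> float:
    """Relative WER increase of a contrastive system over the primary one."""
    return (other_wer - base_wer) / base_wer

# Primary (1.5 s context) vs. contrastive c2 (0.6 s context) on test-2020:
print(round(100 * relative_degradation(16.0, 16.9), 1))  # 5.6 (% relative)

# Empirical latency is reported as mean ± stdev over measured samples
# (hypothetical values here):
latencies = [0.72, 0.81, 0.90]
print(round(mean(latencies), 2), round(stdev(latencies), 2))
```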


Author(s): Siqing Qin, Longbiao Wang, Sheng Li, Jianwu Dang, Lixin Pan

Abstract: Conventional automatic speech recognition (ASR) and emerging end-to-end (E2E) speech recognition have achieved promising results when provided with sufficient resources. However, for low-resource languages, ASR remains challenging. The Lhasa dialect is the most widespread Tibetan dialect, with a wealth of speakers and transcriptions. Hence, it is meaningful to apply ASR techniques to the Lhasa dialect for historical heritage protection and cultural exchange. Previous work on Tibetan speech recognition focused on selecting phone-level acoustic modeling units and incorporating tonal information but underestimated the influence of limited data. The purpose of this paper is to improve the speech recognition performance of the low-resource Lhasa dialect by adopting multilingual speech recognition technology on the E2E structure within a transfer learning framework. Using transfer learning, we first establish monolingual E2E ASR systems for the Lhasa dialect, initializing the ASR model with different source languages, to compare the positive effects of source languages on the Tibetan ASR model. We further propose a multilingual E2E ASR system, the first of its kind, that combines initialization strategies with different source languages and multilevel units. Our experiments show that the performance of the proposed ASR system exceeds that of the E2E baseline. Our proposed method effectively models the low-resource Lhasa dialect and achieves a relative 14.2% improvement in character error rate (CER) compared to DNN-HMM systems. Moreover, from the best monolingual E2E model to the best multilingual E2E model of the Lhasa dialect, the system's performance improved by 8.4% in CER.
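Transfer-learning initialization of the kind described above amounts to copying compatible pretrained source-language parameters into the target model while leaving vocabulary-specific layers freshly initialized. A minimal, framework-free sketch (parameter names, the dict-of-lists representation, and the `output` prefix are all illustrative assumptions, not the paper's implementation):

```python
def transfer_init(source_params: dict, target_params: dict,
                  skip_prefix: str = "output") -> dict:
    """Initialize a target-language E2E model from a source-language one.

    Parameters whose names and shapes match are copied from the source
    model; layers under `skip_prefix` (e.g. the softmax over the target
    character set) keep their fresh random initialization.
    """
    initialized = dict(target_params)
    for name, weights in source_params.items():
        if name.startswith(skip_prefix):
            continue  # vocabulary-specific layer: do not transfer
        if name in target_params and len(target_params[name]) == len(weights):
            initialized[name] = list(weights)  # copy pretrained weights
    return initialized
```

After this initialization, the whole target model is fine-tuned on the low-resource data, which is what lets the encoder benefit from the source language while the output layer adapts to Tibetan units.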


2021
Author(s): Puneet Bawa, Virender Kadyan, Vaibhav Kumar, Ghanshyam Raghuwanshi

Abstract: In real-life applications, noise originating from different sound sources modifies the characteristics of the input signal, which hinders the development of an enhanced ASR system. This contamination degrades the quality and comprehensibility of speech while impairing the performance of human-machine communication systems. This paper aims to minimise noise challenges through a robust feature extraction methodology built on an optimised filtering technique. Initially, input signals are enhanced using a state transformation matrix and by minimising a mean square error based upon the linear time-variant techniques of Kalman and adaptive Wiener filtering. Subsequently, Mel-Frequency Cepstral Coefficient (MFCC), Linear Predictive Cepstral Coefficient (LPCC), RelAtive SpecTrAl Perceptual Linear Prediction (RASTA-PLP), and Gammatone Frequency Cepstral Coefficient (GFCC) feature extraction methods are compared in order to derive adequate characteristics of the signal and to handle the mismatch between the training and testing datasets. The acoustic mismatch and linguistic variability arising within a small set of speakers are then handled using Vocal Tract Length Normalization (VTLN)-based warping of the test utterances. Furthermore, a spectral warping approach is applied by time-reversing the samples inside a frame and passing them through the filter network corresponding to each frame. Finally, an overall Relative Improvement (RI) of 16.13% is obtained on the 5-way perturbed, spectrally warped, noise-augmented dataset through Wiener filtering in comparison to the other systems.
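The enhancement stage above combines Kalman and adaptive Wiener filtering. As a hedged illustration of the Kalman idea only, here is a minimal scalar filter that treats the clean signal as a random walk observed through additive noise; the paper's state transformation matrix and noise variances are not reproduced here, so the model and variances below are assumptions for demonstration:

```python
def kalman_denoise(samples, process_var=1e-4, meas_var=1e-2):
    """Minimal scalar Kalman filter: random-walk signal model,
    additive measurement noise. Returns the filtered estimates."""
    estimate, error = samples[0], 1.0
    out = []
    for z in samples:
        # Predict: under the random-walk model the estimate carries over,
        # but its uncertainty grows by the process variance.
        error += process_var
        # Update: blend the prediction with the noisy measurement z,
        # weighting by the Kalman gain.
        gain = error / (error + meas_var)
        estimate += gain * (z - estimate)
        error *= (1.0 - gain)
        out.append(estimate)
    return out
```

On a constant signal the estimate stays fixed, and on noisy input the gain shrinks as confidence grows, which is the smoothing behaviour the enhancement stage relies on.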


Sensors, 2021, Vol 21 (19), pp. 6460
Author(s): Marco Marini, Nicola Vanello, Luca Fanucci

Within the field of Automatic Speech Recognition (ASR) systems, impaired speech is a major challenge because standard approaches are ineffective in the presence of dysarthria. The first aim of our work is to confirm the effectiveness of a new speech analysis technique for speakers with dysarthria. This new approach exploits the fine-tuning of the size and shift parameters of the spectral analysis window used to compute the initial short-time Fourier transform, to improve the performance of a speaker-dependent ASR system. The second aim is to determine whether there exists a correlation between a speaker's voice features and the optimal window and shift parameters that minimise the error of an ASR system for that specific speaker. For our experiments, we used both impaired and unimpaired Italian speech. Specifically, we used 30 speakers with dysarthria from the IDEA database and 10 professional speakers from the CLIPS database. Both databases are freely available. The results confirm that, if a standard ASR system performs poorly with a speaker with dysarthria, it can be improved by using the new speech analysis. Otherwise, the new approach is ineffective in cases of unimpaired and mildly impaired speech. Furthermore, there exists a correlation between some speakers' voice features and their optimal parameters.
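The per-speaker tuning described above amounts to a grid search over (window size, shift) pairs, keeping the pair that minimises that speaker's recognition error. A minimal sketch; the candidate values and the error surface are hypothetical, and `score` stands in for an evaluation of the ASR system on the speaker's held-out utterances:

```python
def best_window(candidates, score):
    """Grid-search (window_ms, shift_ms) pairs; return the pair with the
    lowest ASR error for a given speaker, as measured by `score`."""
    return min(candidates, key=lambda ws: score(*ws))

# Hypothetical per-speaker error surface: a speaker with dysarthria might
# favour a longer window than the common 25 ms / 10 ms setting.
errors = {(25, 10): 0.42, (50, 20): 0.35, (100, 40): 0.39}
print(best_window(errors.keys(), lambda w, s: errors[(w, s)]))  # (50, 20)
```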


Water, 2021, Vol 13 (18), pp. 2595
Author(s): Hongkai Li, Yu Ye, Chunhui Lu

Aquifer storage and recovery (ASR) refers to injecting freshwater into an aquifer and later withdrawing it. In brackish-to-saline aquifers, density-driven convection and fresh-saline water mixing lead to a reduced recovery efficiency (RE, i.e., the volumetric ratio between recovered potable water and injected freshwater) of ASR. For a layered aquifer, previous studies assume a constant hydraulic conductivity ratio between neighboring layers. In order to reflect the realistic formation of layered aquifers, we systematically investigate 120 layered heterogeneous scenarios with different layer arrangements on multiple-cycle ASR using numerical simulations. Results show that the convection (as is reflected by the tilt of the fresh-saline interface) and mixing phenomena of the ASR system vary significantly among scenarios with different layer arrangements. In particular, the lower permeable layer underlying the higher permeable layer restricts the free convection and leads to the spreading of salinity at the bottom of the higher permeable layer and early salt breakthrough to the well. Correspondingly, the RE values are different among the heterogeneous scenarios, with a maximum absolute RE difference of 22% for the first cycle and 9% for the tenth cycle. Even though the difference in RE decreases with more ASR cycles, it is still non-negligible and needs to be considered after ten ASR cycles. The method to homogenize the layered heterogeneity by simply taking the arithmetic and geometric means of the hydraulic conductivities among different layers as the horizontal and vertical hydraulic conductivities is shown to overestimate the RE for multiple-cycle ASR. The outcomes of this research illustrate the importance of considering the geometric arrangement of layers in assessing the feasibility of multiple-cycle ASR operations in brackish-to-saline layered aquifers.
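The recovery efficiency defined above is a volumetric ratio; a trivial sketch with hypothetical volumes (the simulations in the paper compute the recovered potable volume from the salinity breakthrough at the well):

```python
def recovery_efficiency(recovered_potable_volume: float,
                        injected_volume: float) -> float:
    """RE: ratio of recovered potable water to injected freshwater."""
    return recovered_potable_volume / injected_volume

# Hypothetical cycle: 1000 m3 injected, 620 m3 recovered before the
# produced water exceeds the potability salinity threshold.
print(recovery_efficiency(620.0, 1000.0))  # 0.62
```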


2021
Author(s): Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, ...

2021
Author(s): Ekaterina Egorova, Hari Krishna Vydana, Lukáš Burget, Jan Černocký
