scholarly journals A Speech Quality Classifier based on Tree-CNN Algorithm that Considers Network Degradations

2020 ◽  
Vol 16 (2) ◽  
pp. 180-187
Author(s):  
Samuel Terra Vieira ◽  
Renata Lopes Rosa ◽  
Demóstenes Zegarra Rodríguez

Many factors can affect the users’ quality of experience (QoE) in speech communication services. The impairment factors appear due to physical phenomena that occur in the transmission channel of wireless and wired networks. The monitoring of users’ QoE is important for service providers. In this context, a non-intrusive speech quality classifier based on the Tree Convolutional Neural Network (Tree-CNN) is proposed. The Tree-CNN is an adaptive network structure composed of hierarchical CNNs models, and its main advantage is to decrease the training time that is very relevant on speech quality assessment methods. In the training phase of the proposed classifier model, impaired speech signals caused by wired and wireless network degradation are used as input. Also, in the network scenario, different modulation schemes and channel degradation intensities, such as packet loss rate, signal-to-noise ratio, and maximum Doppler shift frequencies are implemented. Experimental results demonstrated that the proposed model achieves significant reduction of training time, reaching 25% of reduction in relation to another implementation based on DRBM. The accuracy reached by the Tree-CNN model is almost 95% for each quality class. Performance assessment results show that the proposed classifier based on the Tree-CNN overcomes both the current standardized algorithm described in ITU-T Rec. P.563 and the speech quality assessment method called ViSQOL.

2015 ◽  
Vol 781 ◽  
pp. 93-97
Author(s):  
Natradee Anupong ◽  
Patsita Sirawongphatsara ◽  
Pongpisit Wuttidittachotti ◽  
Therdpong Daengsi

This paper presents a study of 3G network performance in the inner city of Bangkok using a video quality assessment method, called Full-Reference Assessment. After the study, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) have been obtained. The results show that 3G networks from two 3G service providers tends to provide better video quality than the rest of 3G service providers, which are consistent with the speed test results.


Author(s):  
Miao Liu ◽  
Jing Wang ◽  
Weiming Yi ◽  
Fang Liu

AbstractRecently, the non-intrusive speech quality assessment method has attracted a lot of attention since it does not require the original reference signals. At the same time, neural networks began to be applied to speech quality assessment and achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using attention pooling function. The proposed systems are based on the convolutional neural networks (CNNs), bidirectional long short-term memory (BLSTM), and CNN-LSTM structure. Comparing four types of pooling functions both theoretically and experimentally, we find the attention pooling function performs the best among the four. Experiments are conducted in a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using attention pooling function achieves state-of-the-art correlation coefficient (R) and root-mean-square error (RMSE) of 0.967 and 0.269, outperforming the performance of standardization ITU-T P.563 and autoencoder-support vector regression method.


2021 ◽  
Author(s):  
◽  
Mouna Hakami

<p><b>This thesis presents two studies on non-intrusive speech quality assessment methods. The first applies supervised learning methods to speech quality assessment, which is a common approach in machine learning based quality assessment. To outperform existing methods, we concentrate on enhancing the feature set. In the second study, we analyse quality assessment from a different point of view inspired by the biological brain and present the first unsupervised learning based non-intrusive quality assessment that removes the need for labelled training data.</b></p> <p>Supervised learning based, non-intrusive quality predictors generally involve the development of a regressor that maps signal features to a representation of perceived quality. The performance of the predictor largely depends on 1) how sensitive the features are to the different types of distortion, and 2) how well the model learns the relation between the features and the quality score. We improve the performance of the quality estimation by enhancing the feature set and using a contemporary machine learning model that fits this objective. We propose an augmented feature set that includes raw features that are presumably redundant. The speech quality assessment system benefits from this redundancy as it results in reducing the impact of unwanted noise in the input. Feature set augmentation generally leads to the inclusion of features that have non-smooth distributions. We introduce a new pre-processing method and re-distribute the features to facilitate the training. The evaluation of the system on the ITU-T Supplement23 database illustrates that the proposed system outperforms the popular standards and contemporary methods in the literature.</p> <p>The unsupervised learning quality assessment approach presented in this thesis is based on a model that is learnt from clean speech signals. Consequently, it does not need to learn the statistics of any corruption that exists in the degraded speech signals and is trained only with unlabelled clean speech samples. The quality has a new definition, which is based on the divergence between 1) the distribution of the spectrograms of test signals, and 2) the pre-existing model that represents the distribution of the spectrograms of good quality speech. The distribution of the spectrogram of the speech is complex, and hence comparing them is not trivial. To tackle this problem, we propose to map the spectrograms of speech signals to a simple latent space.</p> <p>Generative models that map simple latent distributions into complex distributions are excellent platforms for our work. Generative models that are trained on the spectrograms of clean speech signals learned to map the latent variable $Z$ from a simple distribution $P_Z$ into a spectrogram $X$ from the distribution of good quality speech.</p> <p>Consequently, an inference model is developed by inverting the pre-trained generator, which maps spectrograms of the signal under the test, $X_t$, into its relevant latent variable, $Z_t$, in the latent space. We postulate the divergence between the distribution of the latent variable and the prior distribution $P_Z$ is a good measure of the quality of speech.</p> <p>Generative adversarial nets (GAN) are an effective training method and work well in this application. The proposed system is a novel application for a GAN. The experimental results with the TIMIT and NOIZEUS databases show that the proposed measure correlates positively with the objective quality scores.</p>


Sign in / Sign up

Export Citation Format

Share Document