An Auditory Saliency Pooling-Based LSTM Model for Speech Intelligibility Classification

Symmetry ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 1728
Author(s):  
Ascensión Gallardo-Antolín ◽  
Juan M. Montero

Speech intelligibility is a crucial element in oral communication that can be influenced by multiple factors, such as noise, channel characteristics, or speech disorders. In this paper, we address the task of speech intelligibility classification (SIC) in this last circumstance. Taking our previous work, an SIC system based on an attentional long short-term memory (LSTM) network, as a starting point, we deal with the problem of inadequate learning of the attention weights due to training data scarcity. To overcome this issue, the main contribution of this paper is a novel type of weighted pooling (WP) mechanism, called saliency pooling, in which the WP weights are not automatically learned during the training process of the network but are obtained from an external source of information: Kalinli's auditory saliency model. In this way, we aim to take advantage of the apparent symmetry between the human auditory attention mechanism and the attentional models integrated into deep learning networks. The developed systems are assessed on the UA-Speech dataset, which comprises speech uttered by subjects with several dysarthria levels. Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli's saliency can be successfully incorporated into the LSTM architecture as an external cue for estimating the speech intelligibility level.
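The saliency pooling mechanism can be sketched in a few lines of NumPy: per-frame LSTM hidden states are combined with weights supplied by an external saliency model instead of learned attention. The shapes and the placeholder saliency curve below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def saliency_pool(hidden_states: np.ndarray, saliency: np.ndarray) -> np.ndarray:
    """Weighted pooling of per-frame LSTM outputs.

    hidden_states: (T, d) array of per-frame hidden vectors.
    saliency:      (T,) non-negative weights from an external auditory
                   saliency model (fixed, not learned during training).
    Returns the (d,) pooled utterance-level representation.
    """
    w = saliency / saliency.sum()   # normalize weights to a distribution
    return w @ hidden_states        # (T,) @ (T, d) -> (d,)

# Toy example: 5 frames, 4-dimensional hidden states.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))
s = np.array([0.1, 0.5, 2.0, 0.3, 0.1])   # placeholder saliency curve
pooled = saliency_pool(H, s)
```

Because the weights come from an external cue, no attention parameters need to be estimated from the scarce training data, which is the point of the approach.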

2020 ◽  
Vol 1 (2) ◽  
Author(s):  
Ritika Sibal ◽  
Ding Zhang ◽  
Julie Rocho-Levine ◽  
K. Alex Shorter ◽  
Kira Barton

Abstract Behavior of animals living in the wild is often studied using visual observations made by trained experts. However, these observations are typically limited to classifying behavior during discrete time periods and become difficult to sustain when multiple individuals are monitored for days or weeks. In this work, we present automatic tools to enable efficient behavior and dynamic state estimation/classification from data collected with animal-borne bio-logging tags, without the need for statistical feature engineering. A combined framework of a long short-term memory (LSTM) network and a hidden Markov model (HMM) was developed to exploit sequential temporal information in raw motion data at two levels: within and between windows. Taking a moving-window data segmentation approach, the LSTM estimates the dynamic state corresponding to each window by parsing the contiguous raw data points within the window. The HMM then links all of the individual window estimations and further improves the overall estimation. A case study with bottlenose dolphins was conducted to demonstrate the approach. The combined LSTM–HMM method achieved a 6% improvement over conventional methods such as K-nearest neighbor (KNN) and support vector machine (SVM), pushing the accuracy above 90%. In addition to the performance improvement, the proposed method requires a similar amount of training data to traditional machine learning methods, making it easily adaptable to new tasks.
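The two-level idea, per-window LSTM estimates linked by an HMM, can be illustrated with a small Viterbi smoother over the windowed class probabilities. This is a hedged sketch: the transition matrix, window probabilities, and function names are invented for illustration, not taken from the paper.

```python
import numpy as np

def viterbi_smooth(win_probs: np.ndarray, trans: np.ndarray) -> np.ndarray:
    """Link per-window state probabilities with an HMM.

    win_probs: (N, K) per-window class probabilities (e.g. LSTM softmax).
    trans:     (K, K) state transition matrix, rows summing to 1.
    Returns the most likely state sequence (Viterbi path).
    """
    logp = np.log(win_probs + 1e-12)
    logt = np.log(trans + 1e-12)
    N, K = win_probs.shape
    score = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    score[0] = logp[0]
    for t in range(1, N):
        cand = score[t - 1][:, None] + logt     # (K, K): prev -> current
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + logp[t]
    path = np.zeros(N, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(N - 2, -1, -1):              # backtrack the best path
        path[t] = back[t + 1, path[t + 1]]
    return path

# Noisy per-window estimates for 2 states; a "sticky" transition matrix
# suppresses the isolated mislabeled window at index 2.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6],
                  [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
sticky = np.array([[0.9, 0.1], [0.1, 0.9]])
smoothed = viterbi_smooth(probs, sticky)
```

The raw argmax at window 2 would flip to state 1; the HMM's preference for persistent states overrides the single dissenting window, which is exactly the between-window correction the framework exploits.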


Electronics ◽  
2020 ◽  
Vol 9 (5) ◽  
pp. 721 ◽  
Author(s):  
Barath Narayanan Narayanan ◽  
Venkata Salini Priyamvada Davuluru

With the advancement of technology, there is a growing need to classify malware programs that could potentially harm any computer system and/or smaller devices. In this research, an ensemble classification system comprising convolutional and recurrent neural networks is proposed to distinguish malware programs. Microsoft’s Malware Classification Challenge (BIG 2015) dataset with nine distinct classes is utilized for this study. This dataset contains an assembly file and a compiled file for each malware program. Compiled files are visualized as images and are classified using Convolutional Neural Networks (CNNs). Assembly files consist of machine language opcodes that are distinguished among classes using Long Short-Term Memory (LSTM) networks after converting them into sequences. In addition, features are extracted from these architectures (CNNs and LSTM) and are classified using a support vector machine or logistic regression. An accuracy of 97.2% is achieved using the LSTM network for distinguishing assembly files, 99.4% using the CNN architecture for classifying compiled files, and an overall accuracy of 99.8% using the proposed ensemble approach, thereby setting a new benchmark. An independent and automated classification system for assembly and/or compiled files allows anti-malware industry experts to choose the type of system best suited to their available computational resources.
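As a rough illustration of the fusion step (the paper's actual ensemble classifies features extracted from the CNN and LSTM branches with an SVM or logistic regression), a weighted average of the two branches' class probabilities shows the basic idea; all values below are toy numbers.

```python
import numpy as np

def fuse_predictions(p_cnn: np.ndarray, p_lstm: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Late fusion: weighted average of the per-class probabilities from
    the image (CNN) branch and the opcode-sequence (LSTM) branch.
    Returns the fused class index per sample."""
    fused = w * p_cnn + (1.0 - w) * p_lstm
    return fused.argmax(axis=1)

# Two samples, two classes: the branches disagree on sample 0, and the
# more confident LSTM branch wins after averaging.
p_cnn = np.array([[0.55, 0.45], [0.20, 0.80]])
p_lstm = np.array([[0.10, 0.90], [0.30, 0.70]])
labels = fuse_predictions(p_cnn, p_lstm)
```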


Author(s):  
Hongguang Pan ◽  
Tao Su ◽  
Xiangdong Huang ◽  
Zheng Wang

To address the high cost, complicated process, and low accuracy of oxygen content measurement in the flue gas of coal-fired power plants, a method based on a long short-term memory (LSTM) network is proposed in this paper to replace the oxygen sensor in estimating the oxygen content in boiler flue gas. Specifically, first, the LSTM model was built with the Keras deep learning framework, and the accuracy of the model was further improved by selecting appropriate hyper-parameters through experiments. Secondly, the flue gas oxygen content was taken as the leading variable and combined with primary auxiliary variables drawn from the boiler mechanism and process. The data sets were preprocessed based on actual production data collected from a coal-fired power plant in Yulin, China. Moreover, a selection model for auxiliary variables based on grey relational analysis is proposed to construct a new data set and divide the training and testing sets. Finally, this model is compared with traditional soft-sensing modelling methods (i.e. methods based on support vector machines and BP neural networks). The RMSE of the LSTM model is 4.51% lower than that of the GA-SVM model and 3.55% lower than that of the PSO-BP model. The results show that the oxygen content model based on LSTM generalizes better and has practical industrial value.
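The grey relational analysis used for auxiliary-variable selection can be sketched as follows: each candidate series is compared against the leading variable (the oxygen content), and variables with higher grey relational grades are retained. The normalization choice, the distinguishing coefficient, and the toy data are assumptions for illustration.

```python
import numpy as np

def grey_relational_grade(ref: np.ndarray, cand: np.ndarray, rho: float = 0.5) -> np.ndarray:
    """Grey relational grade of each candidate auxiliary variable
    against the reference (leading) variable.

    ref:  (n,) reference series, e.g. measured oxygen content.
    cand: (n, m) candidate series, one column per auxiliary variable.
    rho:  distinguishing coefficient, conventionally 0.5.
    """
    def norm(x):  # min-max normalize so series of different scales compare
        return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0) + 1e-12)
    r, c = norm(ref), norm(cand)
    delta = np.abs(c - r[:, None])                 # pointwise differences
    dmin, dmax = delta.min(), delta.max()
    coeff = (dmin + rho * dmax) / (delta + rho * dmax)
    return coeff.mean(axis=0)                      # one grade per variable

rng = np.random.default_rng(1)
y = np.linspace(0.0, 1.0, 50)                            # leading variable
X = np.column_stack([y + 0.05 * rng.standard_normal(50),  # closely related
                     rng.standard_normal(50)])            # unrelated noise
grades = grey_relational_grade(y, X)
```

Variables whose series track the leading variable score near 1, so ranking by grade gives the auxiliary-variable subset used to build the new data set.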


Sensors ◽  
2022 ◽  
Vol 22 (2) ◽  
pp. 545
Author(s):  
Bor-Jiunn Hwang ◽  
Hui-Hui Chen ◽  
Chaur-Heh Hsieh ◽  
Deng-Yu Huang

Based on experimental observations, there is a correlation between time and consecutive gaze positions in visual behaviors. Previous studies on gaze point estimation usually use images as the input for model training without taking into account the sequential relationship between them. In this paper, temporal features are considered in addition to spatial features to improve accuracy, by using videos instead of images as the input data. To capture spatial and temporal features at the same time, a convolutional neural network (CNN) and a long short-term memory (LSTM) network are combined in a training model: the CNN extracts the spatial features, and the LSTM correlates the temporal features. This paper presents a CNN Concatenating LSTM network (CCLN) that concatenates spatial and temporal features to improve the performance of gaze estimation when time-series videos are the input training data. In addition, the proposed model can be optimized by exploring the number of LSTM layers and the influence of batch normalization (BN) and the global average pooling (GAP) layer on the CCLN. It is generally believed that larger amounts of training data lead to better models, so to provide data for training and prediction, we propose a method for constructing video datasets for gaze point estimation. We also study the effectiveness of different commonly used general models and the impact of transfer learning. Through exhaustive evaluation, the proposed method is shown to achieve better prediction accuracy than existing CNN-based methods. Finally, accuracies of 93.1% with the best model and 92.6% with the general MobileNet model are obtained.


Author(s):  
Farshid Rahmani ◽  
Chaopeng Shen ◽  
Samantha Oliver ◽  
Kathryn Lawson ◽  
Alison Appling

Basin-centric long short-term memory (LSTM) network models have recently been shown to be an exceptionally powerful tool for simulating stream temperature (Ts, temperature measured in rivers), among other hydrological variables. However, spatial extrapolation is a well-known challenge to modeling Ts, and it is uncertain how an LSTM-based daily Ts model will perform in unmonitored or dammed basins. Here we compiled a new benchmark dataset consisting of >400 basins across the contiguous United States (CONUS) in different data availability groups (DAGs, defined by daily sampling frequency), with or without major dams, and studied how to assemble suitable training datasets for predictions in monitored or unmonitored situations. For temporal generalization, the CONUS-median best root-mean-square error (RMSE) values for sites with extensive (99%), intermediate (60%), scarce (10%), and absent (0%, unmonitored) data for training were 0.75, 0.83, 0.88, and 1.59°C, respectively, representing the state of the art. For prediction in unmonitored basins (PUB), the LSTM’s results surpassed those reported in the literature. Even for unmonitored basins with major reservoirs, we obtained a median RMSE of 1.492°C and an R2 of 0.966. The most suitable training set was the DAG matching the basin, e.g., the 60% DAG for a basin with 61% data availability. However, for PUB, a training dataset including all basins with data is preferred. An input-selection ensemble moderately mitigated attribute overfitting. Our results suggest there are influential latent processes not sufficiently described by the inputs (e.g., geology, wetland cover), but temporal fluctuations are well predictable, and LSTM appears to be the more accurate Ts modeling tool when sufficient training data are available.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Cach N. Dang ◽  
María N. Moreno-García ◽  
Fernando De la Prieta

Sentiment analysis of public opinion expressed in social networks, such as Twitter or Facebook, has developed into a wide range of applications, but there are still many challenges to be addressed. Hybrid techniques have been shown to be promising models for reducing sentiment errors on increasingly complex training data. This paper aims to test the reliability of several hybrid techniques on various datasets of different domains. Our research questions are aimed at determining whether it is possible to produce hybrid models that outperform single models across different domains and types of datasets. Hybrid deep sentiment analysis models that combine long short-term memory (LSTM) networks, convolutional neural networks (CNNs), and support vector machines (SVMs) are built and tested on eight tweet and review text datasets of different domains. The hybrid models are compared against three single models: SVM, LSTM, and CNN. Both reliability and computation time were considered in the evaluation of each technique. The hybrid models increased the accuracy of sentiment analysis compared with single models on all types of datasets, especially the combinations of deep learning models with SVM, whose reliability was significantly higher.


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2811
Author(s):  
Waseem Ullah ◽  
Amin Ullah ◽  
Tanveer Hussain ◽  
Zulfiqar Ahmad Khan ◽  
Sung Wook Baik

Video anomaly recognition in smart cities is an important computer vision task that plays a vital role in smart surveillance and public safety, but it is challenging because anomalous events are diverse, complex, and infrequent in real-time surveillance environments. Many deep learning models require significant amounts of training data, yet lack generalization ability and have high time complexity. To overcome these problems, in the current work, we present an efficient light-weight convolutional neural network (CNN)-based anomaly recognition framework that is functional in a surveillance environment with reduced time complexity. We extract spatial CNN features from a series of video frames and feed them to the proposed residual attention-based long short-term memory (LSTM) network, which can precisely recognize anomalous activity in surveillance videos. The representative CNN features combined with the residual-block concept in the LSTM for sequence learning prove to be effective for anomaly detection and recognition, validating our model’s suitability for video surveillance in smart cities. Extensive experiments on real-world benchmark datasets validate the effectiveness of the proposed model within complex surveillance environments and demonstrate that it outperforms state-of-the-art models with a 1.77%, 0.76%, and 8.62% increase in accuracy on the UCF-Crime, UMN, and Avenue datasets, respectively.


SLEEP ◽  
2019 ◽  
Vol 43 (1) ◽  
Author(s):  
Jelena Skorucak ◽  
Anneke Hertig-Godeschalk ◽  
David R Schreier ◽  
Alexander Malafeev ◽  
Johannes Mathis ◽  
...  

Abstract Study Objectives Microsleep episodes (MSEs) are brief episodes of sleep, mostly defined to be shorter than 15 s. In the electroencephalogram (EEG), MSEs are mainly characterized by a slowing in frequency. The identification of early signs of sleepiness and sleep (e.g. MSEs) is of considerable clinical and practical relevance. Under laboratory conditions, the maintenance of wakefulness test (MWT) is often used for assessing vigilance. Methods We analyzed MWT recordings of 76 patients referred to the Sleep-Wake-Epilepsy-Center. MSEs were scored by experts and defined by the occurrence of theta dominance on ≥1 occipital derivation lasting 1–15 s while the eyes were at least 80% closed. To visualize oscillatory activity, we calculated spectrograms using an autoregressive model of order 16 on 1 s epochs moved in 200 ms steps, and derived seven features per derivation: power in the delta, theta, alpha, and beta bands, the ratio theta/(alpha + beta), quantified eye movements, and median frequency. Three algorithms were used for MSE classification: support vector machine (SVM), random forest (RF), and an artificial neural network (long short-term memory [LSTM] network). Data of 53 patients were used for training the classifiers, and 23 for testing. Results MSEs were identified with high performance in terms of sensitivity, specificity, precision, accuracy, and Cohen’s kappa coefficient. Training revealed that delta power and the ratio theta/(alpha + beta) were the most relevant features for the RF classifier, and eye movements for the LSTM network. Conclusions The automatic detection of MSEs was successful for our EEG-based definition of MSEs, with good performance of all algorithms applied.
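The spectral features described above can be approximated with a simple periodogram over a 1 s epoch; the band edges and the pure-tone test signal below are illustrative assumptions (the study itself used an autoregressive model of order 16 rather than the FFT).

```python
import numpy as np

def band_powers(epoch: np.ndarray, fs: float) -> dict:
    """Spectral band powers of a 1 s EEG epoch via the periodogram.

    Band edges (Hz) follow common EEG conventions: delta 0.5-4,
    theta 4-8, alpha 8-12, beta 12-30; the exact edges are assumptions.
    """
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(epoch)) ** 2 / len(epoch)
    bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12), "beta": (12, 30)}
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum()
            for name, (lo, hi) in bands.items()}

fs = 256.0
t = np.arange(0, 1, 1 / fs)
epoch = np.sin(2 * np.pi * 6 * t)     # a pure 6 Hz (theta-band) oscillation
p = band_powers(epoch, fs)
ratio = p["theta"] / (p["alpha"] + p["beta"] + 1e-12)
```

For a theta-dominated epoch like this one, the ratio theta/(alpha + beta) is large, which is the signature the scoring definition above looks for.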


Author(s):  
Keith April Araño ◽  
Peter Gloor ◽  
Carlotta Orsenigo ◽  
Carlo Vercellis

Abstract Speech is one of the most natural communication channels for expressing human emotions. Therefore, speech emotion recognition (SER) has been an active area of research with an extensive range of applications in domains such as biomedical diagnostics in healthcare and human–machine interaction. Recent work in SER has focused on end-to-end deep neural networks (DNNs). However, the scarcity of emotion-labeled speech datasets inhibits the full potential of training a deep network from scratch. In this paper, we propose new approaches for classifying emotions from speech by combining conventional mel-frequency cepstral coefficients (MFCCs) with image features extracted from spectrograms by a pretrained convolutional neural network (CNN). Unlike prior studies that employ end-to-end DNNs, our methods eliminate the resource-intensive network training process. Using the best prediction model obtained, we also build an SER application that predicts emotions in real time. Among the proposed methods, the hybrid feature set fed into a support vector machine (SVM) achieves an accuracy of 0.713 in a 6-class prediction problem evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, which is higher than previously published results. Interestingly, MFCCs taken as the sole input into a long short-term memory (LSTM) network achieve a slightly higher accuracy of 0.735. Our results reveal that the proposed approaches lead to an improvement in prediction accuracy. The empirical findings also demonstrate the effectiveness of using a pretrained CNN as an automatic feature extractor for the task of emotion prediction. Moreover, the success of the MFCC-LSTM model is evidence that, despite being conventional features, MFCCs can still outperform more sophisticated deep-learning feature sets.
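A minimal, self-contained sketch of MFCC extraction (framing, power spectrum, mel filterbank, log compression, DCT-II) clarifies the conventional feature set the paper builds on. The frame sizes, filter counts, and FFT length below are common defaults assumed for illustration, not the paper's settings.

```python
import numpy as np

def mfcc(signal, fs, n_mels=26, n_ceps=13, frame_len=400, hop=160):
    """Minimal MFCC extractor: a teaching sketch, not a drop-in for librosa."""
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)

    # Frame the signal and take the power spectrum of each windowed frame.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mels = np.linspace(hz2mel(0), hz2mel(fs / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mels) / fs).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log-mel energies; keep the first n_ceps.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T                       # (n_frames, n_ceps)

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
feats = mfcc(np.sin(2 * np.pi * 440 * t), fs)   # 1 s of a 440 Hz tone
```

The resulting (frames × coefficients) matrix is exactly the kind of sequence a model such as the MFCC-LSTM above consumes frame by frame.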


2021 ◽  
Vol 2093 (1) ◽  
pp. 012006
Author(s):  
Zhijun Gao ◽  
Qiaoyu Gu ◽  
Zhonghua Han

Abstract Aiming at the problem that existing human skeleton-based action recognition methods cannot fully extract the contextual information before and after an action, resulting in low utilization of skeleton points, we propose a two-layer LSTM (long short-term memory) network with an attention mechanism. The network has two layers: the first LSTM layer is used for skeleton encoding and initialization of the system storage units, and the second LSTM layer integrates an attention mechanism to further process the output of the first layer. An algorithm is designed to assign different weights to skeleton points according to their importance to the human body, which greatly increases recognition accuracy. Action classification is accomplished by multiple support vector machines. Through training and testing, an average recognition rate of 98.5% is achieved on the KTH dataset. The experimental results show that the proposed method is effective for human behavior recognition.
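The joint-weighting idea can be sketched as a fixed softmax reweighting of skeleton points before they enter the LSTM layers; the importance scores and array shapes below are invented for illustration and are not the paper's algorithm.

```python
import numpy as np

def weight_joints(skeleton: np.ndarray, importance: np.ndarray) -> np.ndarray:
    """Scale each skeleton point by a softmax of its importance score
    before the sequence is fed to the LSTM layers.

    skeleton:   (T, J, 3) sequence of J three-dimensional joint positions.
    importance: (J,) raw importance scores for the joints.
    """
    w = np.exp(importance - importance.max())   # numerically stable softmax
    w /= w.sum()
    return skeleton * w[None, :, None]          # broadcast over time and xyz

# Toy skeleton: 4 frames, 3 joints; the first joint is deemed most important.
skel = np.ones((4, 3, 3))
scores = np.array([2.0, 0.0, 0.0])
weighted = weight_joints(skel, scores)
```

Joints with higher scores contribute proportionally more to each time step's input vector, which is the intended effect of weighting skeleton points by importance.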

