An Incremental Class-Learning Approach with Acoustic Novelty Detection for Acoustic Event Recognition

Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6622
Author(s):  
Barış Bayram ◽  
Gökhan İnce

Acoustic scene analysis (ASA) relies on the dynamic sensing and understanding of stationary and non-stationary sounds from various events, background noises and human actions with objects. However, the spatio-temporal nature of the sound signals may not be stationary, and novel events may occur that eventually degrade the performance of the analysis. In this study, a self-learning-based ASA scheme for acoustic event recognition (AER) is presented to detect and incrementally learn novel acoustic events while tackling catastrophic forgetting. The proposed ASA framework comprises six elements: (1) raw acoustic signal pre-processing, (2) low-level and deep audio feature extraction, (3) acoustic novelty detection (AND), (4) acoustic signal augmentation, (5) incremental class-learning (ICL) of the audio features of the novel events and (6) AER. The self-learning on the different types of audio features extracted from the acoustic signals of various events proceeds without human supervision. For the extraction of deep audio representations, in addition to visual geometry group (VGG) and residual neural network (ResNet) models, time-delay neural network (TDNN) and TDNN-based long short-term memory (TDNN–LSTM) networks are pre-trained on a large-scale audio dataset, Google AudioSet. The performance of ICL with AND is validated using Mel-spectrograms, as well as deep features extracted from the Mel-spectrograms by the TDNNs, VGG, and ResNet, on benchmark audio datasets such as ESC-10, ESC-50, UrbanSound8K (US8K), and an audio dataset collected by the authors in a real domestic environment.
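To make the feature-extraction stage concrete, below is a minimal sketch of step (2), computing a log-scaled Mel-spectrogram from a raw clip. It assumes the librosa library; the sampling rate, Mel-band count and hop length are illustrative choices, not the authors' exact configuration.

```python
# A minimal sketch of Mel-spectrogram extraction (step 2), assuming librosa;
# sr, n_mels and hop_length are illustrative, not the authors' configuration.
import librosa
import numpy as np

def mel_spectrogram(path, sr=16000, n_mels=64, hop_length=256):
    """Load an audio clip and convert it to a log-scaled Mel-spectrogram."""
    signal, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=signal, sr=sr,
                                         n_mels=n_mels, hop_length=hop_length)
    # Log compression stabilises the dynamic range before the deep extractors.
    return librosa.power_to_db(mel, ref=np.max)
```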

Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2852
Author(s):  
Parvathaneni Naga Srinivasu ◽  
Jalluri Gnana SivaSai ◽  
Muhammad Fazal Ijaz ◽  
Akash Kumar Bhoi ◽  
Wonjoon Kim ◽  
...  

Deep learning models are efficient at learning the features that assist in understanding complex patterns precisely. This study proposes a computerized process for classifying skin disease through a deep-learning-based combination of MobileNet V2 and Long Short-Term Memory (LSTM). The MobileNet V2 model proved efficient, delivering good accuracy while remaining deployable on lightweight computational devices, and the proposed model maintains stateful information for precise predictions. A grey-level co-occurrence matrix is used to assess the progression of diseased growth. The performance is compared against other state-of-the-art models such as Fine-Tuned Neural Networks (FTNN), a Convolutional Neural Network (CNN), the Very Deep Convolutional Networks for Large-Scale Image Recognition developed by the Visual Geometry Group (VGG), and a convolutional neural network architecture expanded with few changes. On the HAM10000 dataset, the proposed method outperformed the other methods with more than 85% accuracy. It recognizes the affected region considerably faster, requiring almost 2× fewer computations than the conventional MobileNet model, which keeps the computational effort minimal. Furthermore, a mobile application is designed for instant and proper action; it helps patients and dermatologists identify the type of disease from an image of the affected region at the initial stage of the skin disease. These findings suggest that the proposed system can help general practitioners efficiently and effectively diagnose skin conditions, thereby reducing further complications and morbidity.
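As an illustration of the described pipeline, the following is a hedged PyTorch sketch of a MobileNet V2 backbone feeding an LSTM head. It assumes torchvision's pretrained MobileNet V2 and treats each image's feature vector as a length-1 sequence; the hidden size is an illustrative choice, and the seven output classes follow HAM10000's diagnostic categories.

```python
# A hedged sketch of a MobileNet V2 + LSTM classifier, assuming torchvision;
# hidden size is illustrative, num_classes=7 matches HAM10000's categories.
import torch
import torch.nn as nn
from torchvision import models

class MobileNetLSTM(nn.Module):
    def __init__(self, num_classes=7, hidden=128):
        super().__init__()
        backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
        self.features = backbone.features           # convolutional feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)          # 1280-d vector per image
        self.lstm = nn.LSTM(1280, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                            # x: (batch, 3, 224, 224)
        f = self.pool(self.features(x)).flatten(1)   # (batch, 1280)
        # Treat each feature vector as a length-1 sequence so the LSTM can
        # carry stateful information into the prediction.
        out, _ = self.lstm(f.unsqueeze(1))
        return self.fc(out[:, -1])                   # class logits
```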


Author(s):  
Sophia Bano ◽  
Francisco Vasconcelos ◽  
Emmanuel Vander Poorten ◽  
Tom Vercauteren ◽  
Sebastien Ourselin ◽  
...  

Abstract Purpose Fetoscopic laser photocoagulation is a minimally invasive surgery for the treatment of twin-to-twin transfusion syndrome (TTTS). Using a lens/fibre-optic scope inserted into the amniotic cavity, the abnormal placental vascular anastomoses are identified and ablated to regulate blood flow to both fetuses. A limited field of view, occlusions due to the fetus and low visibility make it difficult to identify all vascular anastomoses. Automatic computer-assisted techniques may provide a better understanding of the anatomical structure during surgery for risk-free laser photocoagulation and may facilitate improving mosaics from fetoscopic videos. Methods We propose FetNet, a combined convolutional neural network (CNN) and long short-term memory (LSTM) recurrent neural network architecture for the spatio-temporal identification of fetoscopic events. We adapt an existing CNN architecture for spatial feature extraction and integrate it with the LSTM network for end-to-end spatio-temporal inference. We introduce differential learning rates during model training to effectively utilise the pre-trained CNN weights. This may support computer-assisted interventions (CAI) during fetoscopic laser photocoagulation. Results We perform a quantitative evaluation of our method using 7 in vivo fetoscopic videos captured from different human TTTS cases. The total duration of these videos was 5551 s (138,780 frames). To test the robustness of the proposed approach, we perform 7-fold cross-validation, where each video is treated as a hold-out or test set and training is performed on the remaining videos. Conclusion FetNet achieved superior performance compared with existing CNN-based methods and provided improved inference because of its spatio-temporal information modelling. Online testing of FetNet, using a Tesla V100-DGXS-32GB GPU, achieved a frame rate of 114 fps. These results show that our method could potentially provide a real-time solution for CAI and automate occlusion and photocoagulation identification during fetoscopic procedures.
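The differential-learning-rate idea mentioned in the Methods can be sketched with PyTorch parameter groups: the pre-trained CNN backbone receives a smaller learning rate than the freshly initialised LSTM head. The 10× ratio and the optimizer choice are illustrative assumptions, not the paper's reported settings.

```python
# Differential learning rates via PyTorch parameter groups: the pre-trained
# CNN backbone is fine-tuned gently while the LSTM head trains at full rate.
# The 0.1 ratio and the use of Adam are illustrative assumptions.
import torch

def build_optimizer(cnn_backbone, lstm_head, base_lr=1e-3):
    return torch.optim.Adam([
        {"params": cnn_backbone.parameters(), "lr": base_lr * 0.1},
        {"params": lstm_head.parameters(), "lr": base_lr},
    ])
```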


Author(s):  
Yuheng Hu ◽  
Yili Hong

Residents often rely on newspapers and television to gather hyperlocal news for community awareness and engagement. More recently, social media have emerged as an increasingly important source of hyperlocal news. Thus far, the literature on using social media to create desirable societal benefits, such as civic awareness and engagement, is still in its infancy. One key challenge in this research stream is to distill information from noisy social media data streams to community members in a timely and accurate manner. In this work, we develop SHEDR (social media–based hyperlocal event detection and recommendation), an end-to-end neural event detection and recommendation framework with a particular use case for Twitter to facilitate residents' information seeking about hyperlocal events. The key model innovation in SHEDR lies in the design of the hyperlocal event detector and the event recommender. First, we harness the power of two popular deep neural network models, the convolutional neural network (CNN) and long short-term memory (LSTM), in a novel joint CNN-LSTM model to characterize spatiotemporal dependencies for capturing unusualness in a region of interest, which is classified as a hyperlocal event. Next, we develop a neural pairwise ranking algorithm for recommending detected hyperlocal events to residents based on their interests. To alleviate the sparsity issue and improve personalization, our algorithm incorporates several types of contextual information covering topical, social, and geographical proximities. We perform comprehensive evaluations based on two large-scale data sets comprising geotagged tweets covering Seattle and Chicago. We demonstrate the effectiveness of our framework in comparison with several state-of-the-art approaches, showing that our hyperlocal event detection and recommendation models consistently and significantly outperform other approaches in terms of precision, recall, and F1 scores. Summary of Contribution: In this paper, we focus on a novel and important, yet largely underexplored, application of computing: how to improve civic engagement in local neighborhoods via local news sharing and consumption based on social media feeds. To address this question, we propose two new computational and data-driven methods: (1) a deep learning–based hyperlocal event detection algorithm that scans spatially and temporally to detect hyperlocal events from geotagged Twitter feeds; and (2) a personalized deep learning–based hyperlocal event recommender system that systematically integrates several contextual cues, such as topical, geographical, and social proximity, to recommend the detected hyperlocal events to potential users. We conduct a series of experiments to examine our proposed models. The outcomes demonstrate that our algorithms are significantly better than the state-of-the-art models and can provide users with more relevant information about the local neighborhoods they live in, which in turn may boost their community engagement.
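For the recommender side, one common way to realise a neural pairwise ranking objective is a BPR-style loss, in which an event a user interacted with should score higher than one they did not. The sketch below is a generic stand-in, not SHEDR's exact formulation, which additionally folds in topical, social, and geographical proximity.

```python
# A generic BPR-style pairwise ranking loss: for each user, an observed
# (positive) event should outscore a sampled unobserved (negative) event.
# This is a stand-in for the paper's neural pairwise ranking algorithm.
import torch.nn.functional as F

def pairwise_ranking_loss(score_pos, score_neg):
    """Maximise the score margin between positive and negative events."""
    return -F.logsigmoid(score_pos - score_neg).mean()
```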


2021 ◽  
Vol 14 (8) ◽  
pp. 1289-1297
Author(s):  
Ziquan Fang ◽  
Lu Pan ◽  
Lu Chen ◽  
Yuntao Du ◽  
Yunjun Gao

Traffic prediction has drawn increasing attention for its ubiquitous real-life applications in traffic management, urban computing, public safety, and so on. Recently, the availability of massive trajectory data and the success of deep learning have motivated a plethora of deep traffic prediction studies. However, existing neural-network-based approaches tend to ignore the correlations between multiple types of moving objects located in the same spatio-temporal traffic area, which is suboptimal for traffic prediction analytics. In this paper, we propose a multi-source deep traffic prediction framework over spatio-temporal trajectory data, termed MDTP. The framework includes two phases: spatio-temporal feature modeling and multi-source bridging. We present an enhanced graph convolutional network (GCN) model combined with a long short-term memory (LSTM) network to capture the spatial dependencies and temporal dynamics of traffic in the feature modeling phase. In the multi-source bridging phase, we propose two methods, Sum and Concat, to connect the learned features from different trajectory data sources. Extensive experiments on two real-life datasets show that MDTP i) has superior efficiency compared with classical time-series methods, machine learning methods, and state-of-the-art neural-network-based approaches; ii) offers a significant performance improvement over the single-source traffic prediction approach; and iii) performs traffic predictions in seconds even on tens of millions of trajectory records. Finally, we develop MDTP+, a user-friendly interactive system to demonstrate traffic prediction analysis.
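The two bridging operators are simple to state in code. Below is a hedged sketch assuming per-source feature tensors of identical shape; Sum fuses them element-wise, while Concat widens the joint representation.

```python
# The Sum and Concat bridging operators over features learned from two
# trajectory sources; shapes are illustrative, e.g. (batch, nodes, feat_dim).
import torch

def bridge(features_a, features_b, mode="concat"):
    if mode == "sum":
        return features_a + features_b                  # element-wise fusion
    return torch.cat([features_a, features_b], dim=-1)  # widened joint feature
```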


Information ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 145 ◽  
Author(s):  
Zhenglong Xiang ◽  
Xialei Dong ◽  
Yuanxiang Li ◽  
Fei Yu ◽  
Xing Xu ◽  
...  

Most existing research studies the emotion recognition of Minnan songs from the perspectives of music analysis theory and music appreciation; however, these investigations do not explore the possibility of automatic emotion recognition of Minnan songs. In this paper, we propose a model consisting of four main modules to classify the emotion of Minnan songs using bimodal data: song lyrics and audio. In the proposed model, an attention-based Long Short-Term Memory (LSTM) neural network is applied to extract lyrical features, and a Convolutional Neural Network (CNN) is used to extract audio features from the spectrum. The two kinds of extracted features are then combined by multimodal compact bilinear pooling, and finally, the combined features are input to the classifying module to determine the song emotion. We designed three experiment groups to investigate the classification performance of combinations of the four main modules, to compare the proposed model with current approaches, and to assess the influence of a few key parameters on the performance of emotion recognition. The results show that the proposed model outperforms all other experimental groups, and its accuracy, precision and recall exceed 0.80 under an appropriate combination of parameters.
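Multimodal compact bilinear pooling can be sketched as a Count Sketch projection of each modality followed by an element-wise product in the Fourier domain (after Gao et al., 2016). The output dimension below is an illustrative assumption, and the code is a generic sketch rather than the authors' implementation.

```python
# A generic sketch of multimodal compact bilinear pooling: each modality is
# projected with a Count Sketch, and circular convolution of the two sketches
# is computed as an element-wise product in Fourier space. output_dim is an
# illustrative assumption; inputs are (batch, feature_dim) tensors.
import torch

def mcb_pool(x, y, output_dim=1024, seed=0):
    g = torch.Generator().manual_seed(seed)

    def count_sketch(v):
        d = v.shape[-1]
        h = torch.randint(0, output_dim, (d,), generator=g)            # random buckets
        s = (torch.randint(0, 2, (d,), generator=g) * 2 - 1).float()   # random signs
        out = torch.zeros(v.shape[0], output_dim)
        return out.index_add(1, h, v * s)

    fx, fy = torch.fft.rfft(count_sketch(x)), torch.fft.rfft(count_sketch(y))
    return torch.fft.irfft(fx * fy, n=output_dim)
```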


2019 ◽  
Vol 9 (14) ◽  
pp. 2861 ◽  
Author(s):  
Alessandro Crivellari ◽  
Euro Beinat

Interest in human mobility analysis has increased with the rapid growth of positioning technology and motion tracking, leading to a variety of studies based on trajectory recordings. Mapping the routes that people commonly take has proven very useful for location-based service applications, where individual mobility behaviors can potentially disclose meaningful information about each customer and be fruitfully used for personalized recommendation systems. This paper tackles a novel trajectory labeling problem arising in the context of user profiling in "smart" tourism: inferring the nationality of individual users on the basis of their motion trajectories. In particular, we use large-scale motion traces of short-term foreign visitors as a way of detecting the nationality of individuals. This task is not trivial, relying on the hypothesis that foreign tourists of different nationalities may not only visit different locations, but also move differently between the same locations. The problem is defined as a multinomial classification with a few tens of classes (nationalities) and sparse location-based trajectory data. We propose a machine-learning-based methodology consisting of a long short-term memory (LSTM) neural network trained on vector representations of locations, in order to capture the underlying semantics of user mobility patterns. Experiments conducted on a real-world big dataset demonstrate that our method achieves considerably higher performance than baseline and traditional approaches.
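A minimal sketch of the described classifier, assuming integer location identifiers: an embedding layer learns the vector representations of locations, and an LSTM maps a visit sequence to nationality logits. The vocabulary size, embedding width, hidden size, and class count are illustrative.

```python
# Location-embedding + LSTM nationality classifier; all sizes are illustrative.
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    def __init__(self, num_locations=10000, emb_dim=64, hidden=128, num_classes=30):
        super().__init__()
        self.embed = nn.Embedding(num_locations, emb_dim)  # learned location vectors
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, loc_ids):               # loc_ids: (batch, seq_len) integer ids
        out, _ = self.lstm(self.embed(loc_ids))
        return self.fc(out[:, -1])            # logits over nationality classes
```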


2021 ◽  
Vol 33 (4) ◽  
pp. 593-608
Author(s):  
Chuhao Zhou ◽  
Peiqun Lin ◽  
Xukun Lin ◽  
Yang Cheng

Accurate traffic prediction on a large-scale road network is significant for traffic operations and management. In this study, we propose an equation for achieving a comprehensive and accurate prediction that effectively combines traffic data and non-traffic data, and on this basis we develop a novel prediction model called the adaptive deep neural network (ADNN). In the ADNN, two long short-term memory (LSTM) networks are used to extract spatial-temporal characteristics and temporal characteristics, respectively, and a backpropagation neural network (BPNN) is employed to represent conditions derived from contextual factors such as station index, forecast horizon, and weather. The experimental results show that ADNN's predictions for different stations and different forecast horizons are highly accurate; even one hour ahead, its performance remains satisfactory. A comparison of ADNN with several benchmark prediction models further indicates its robustness.
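A hedged sketch of the ADNN composition as described: two LSTMs encode the spatial-temporal and temporal inputs, an MLP stands in for the BPNN over contextual factors, and the three representations are merged for the final prediction. All widths are illustrative, and the exact combination equation is the paper's contribution, not reproduced here.

```python
# A hedged sketch of the ADNN composition: two LSTM encoders plus an MLP
# (standing in for the BPNN) over contextual factors, merged by concatenation.
# All layer widths are illustrative assumptions.
import torch
import torch.nn as nn

class ADNNSketch(nn.Module):
    def __init__(self, st_dim, t_dim, ctx_dim, hidden=64):
        super().__init__()
        self.st_lstm = nn.LSTM(st_dim, hidden, batch_first=True)  # spatial-temporal
        self.t_lstm = nn.LSTM(t_dim, hidden, batch_first=True)    # temporal
        self.bpnn = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.ReLU())
        self.out = nn.Linear(3 * hidden, 1)                       # predicted value

    def forward(self, x_st, x_t, ctx):
        h_st, _ = self.st_lstm(x_st)
        h_t, _ = self.t_lstm(x_t)
        merged = torch.cat([h_st[:, -1], h_t[:, -1], self.bpnn(ctx)], dim=-1)
        return self.out(merged)
```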


Sentiment analysis can be used to study an individual's or a group's emotions and attitudes towards other people and entities such as products, services, or social events. With the advancements in the field of deep learning, the enormous amount of information available on the internet, chiefly on social media, combined with powerful computing machines, it is only a matter of time before artificial intelligence (AI) systems make their presence felt in every aspect of human life, making our lives more introspective. In this paper, we propose a multimodal sentiment prediction system that analyzes the emotions predicted from different modal sources, namely video, audio and text, and integrates them to recognize the group emotions of students in a classroom. Our experimental setup involves a digital video camera with microphones to capture the live video and audio feeds of the students during a lecture. The students are asked to provide digital feedback on the lecture as tweets addressed to the lecturer's official Twitter account. The audio and video frames are separated from the live video stream using tools such as LAME and FFmpeg, and the Twitter API is used to access and extract messages from the Twitter platform. The audio and video features are extracted using Mel-Frequency Cepstral Coefficients (MFCC) and a Haar cascade classifier, respectively. The extracted video features are passed to a Convolutional Neural Network (CNN) model trained on the FER2013 facial image database to generate the feature vectors for classifying video-based emotions. A Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM), trained on a speech emotion corpus, is used to classify the audio features. For the Twitter texts, a lexicon-based approach with a sentiment-word dictionary is combined with a learning-based approach in which a custom dataset is trained with Support Vector Machines (SVM). A decision-level fusion algorithm is applied to these three modal schemes to integrate the classification results and deduce the overall group emotions of the students. Use cases of the proposed system include student emotion recognition, employee performance feedback, and monitoring or surveillance-based systems. The implemented framework was tested in a classroom environment during a live lecture, and the predicted emotions demonstrated the classification accuracy of our approach.
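Decision-level fusion of the three modal classifiers can be as simple as a weighted average of per-modality class probabilities, with the argmax taken as the group emotion. The weights below are illustrative assumptions, not the paper's tuned values.

```python
# Weighted-average decision-level fusion of the three per-modality probability
# vectors; the weights are illustrative assumptions, not tuned values.
import numpy as np

def fuse_decisions(p_video, p_audio, p_text, weights=(0.4, 0.3, 0.3)):
    probs = np.average([p_video, p_audio, p_text], axis=0, weights=weights)
    return probs.argmax(axis=-1), probs   # predicted emotion index, fused probs
```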

