audio features
Recently Published Documents

Total documents: 284 (five years: 111)
H-index: 20 (five years: 6)

2021 · Vol 3 (4) · pp. 1030-1054
Author(s): Olav Andre Nergård Rongved, Markus Stige, Steven Alexander Hicks, Vajira Lasantha Thambawita, Cise Midoglu, ...

Detecting events in videos is a complex task, and many different approaches, aimed at a large variety of use-cases, have been proposed in the literature. Most approaches, however, are unimodal and only consider the visual information in the videos. This paper presents and evaluates different approaches based on neural networks where we combine visual features with audio features to detect (spot) and classify events in soccer videos. We employ model fusion to combine different modalities such as video and audio, and test these combinations against different state-of-the-art models on the SoccerNet dataset. The results show that a multimodal approach is beneficial. We also analyze how the tolerance for delays in classification and spotting time, and the tolerance for prediction accuracy, influence the results. Our experiments show that using multiple modalities improves event detection performance for certain types of events.
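To make the fusion idea concrete, here is a minimal late-fusion sketch, not the authors' exact architecture: per-chunk visual and audio feature vectors are concatenated and fed to a small classifier over event classes. The feature dimensions, class set, and module names are placeholders.

```python
# Minimal late-fusion sketch (illustrative, not the paper's exact model):
# per-chunk visual and audio feature vectors are concatenated and classified jointly.
import torch
import torch.nn as nn

class FusionSpotter(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, num_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),  # e.g. goal, card, substitution, background
        )

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, visual_dim), audio_feats: (batch, audio_dim)
        fused = torch.cat([visual_feats, audio_feats], dim=-1)
        return self.head(fused)

# Example: score one batch of 8 video chunks
model = FusionSpotter()
logits = model(torch.randn(8, 512), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 4])
```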


Sensors · 2021 · Vol 21 (24) · pp. 8356
Author(s): Ha Thi Phuong Thao, B T Balamurali, Gemma Roig, Dorien Herremans

In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and on fusion techniques that combine the extracted features for predicting affect. Yet, these approaches typically ignore both the correlation between the multiple modality inputs and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, AttendAffectNet (AAN), that uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are considered for predicting emotions, expressed in terms of valence and arousal. We analyze three variants of the proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies self-attention in a novel way to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie, thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. The Mixed AAN combines the strong points of the Feature AAN and the Temporal AAN by applying self-attention first to the feature vectors obtained from the different modalities in each movie segment, and then to the feature representations of all subsequent (temporal) movie segments. We extensively trained and validated the proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments demonstrate that audio features play a more influential role than those extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. Models that use all visual, audio, and text features simultaneously as inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of treating different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.
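For intuition, here is a minimal sketch of modality-level self-attention in the spirit of the Feature AAN, not the authors' exact implementation; the feature dimensionalities, projection layers, and mean-pooling regression head are assumptions.

```python
# Sketch of self-attention across modality-level feature vectors (Feature-AAN style).
# Dimensions, projections, and pooling are assumptions, not the published configuration.
import torch
import torch.nn as nn

class ModalitySelfAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # project each modality's features into a shared space
        self.proj_video = nn.Linear(2048, d_model)
        self.proj_audio = nn.Linear(1582, d_model)
        self.proj_text = nn.Linear(768, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.regressor = nn.Linear(d_model, 2)  # valence, arousal

    def forward(self, video, audio, text):
        # each modality becomes one "token" in the attention sequence
        tokens = torch.stack([self.proj_video(video),
                              self.proj_audio(audio),
                              self.proj_text(text)], dim=1)  # (batch, 3, d_model)
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.regressor(attended.mean(dim=1))          # (batch, 2)

model = ModalitySelfAttention()
out = model(torch.randn(4, 2048), torch.randn(4, 1582), torch.randn(4, 768))
```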


2021
Author(s): Nathan Chi, Peter Washington, Aaron Kline, Arman Husic, Cathy Hou, ...

BACKGROUND: Autism spectrum disorder (ASD) is a neurodevelopmental disorder that results in altered behavior, social development, and communication patterns. In recent years, autism prevalence has tripled, with 1 in 54 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process requiring trained physicians, significant attention has been given to developing systems that automatically diagnose and screen for autism.
OBJECTIVE: Prosody abnormalities are among the clearest signs of autism, with affected children displaying speech idiosyncrasies including echolalia, monotonous intonation, atypical pitch, and irregular linguistic stress patterns. In this work, we present a suite of machine learning approaches to detect autism in self-recorded speech audio captured from autistic and neurotypical (NT) children in home environments.
METHODS: We consider three methods to detect autism in child speech: first, Random Forests trained on extracted audio features (including Mel-frequency cepstral coefficients); second, convolutional neural networks (CNNs) trained on spectrograms; and third, fine-tuned wav2vec 2.0, a state-of-the-art Transformer-based speech recognition model. We train our classifiers on our novel dataset of cellphone-recorded child speech audio curated from Stanford's Guess What? mobile game, an app designed to crowdsource videos of autistic and neurotypical children in a natural home environment.
RESULTS: Using five-fold cross-validation to evaluate model performance, the Random Forest classifier achieves 70% accuracy, the fine-tuned wav2vec 2.0 model achieves 77% accuracy, and the CNN achieves 79% accuracy when classifying children's audio as either ASD or NT.
CONCLUSIONS: Our models were able to predict autism status when trained on a varied selection of home audio clips with inconsistent recording quality, which may generalize better to real-world conditions. The results demonstrate that machine learning methods offer promise for detecting autism automatically from speech without specialized equipment.
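For the first method, a minimal sketch of the Random Forest baseline might look as follows; the MFCC summary statistics, sample rate, and the hypothetical mfcc_summary/evaluate helpers are illustrative choices, not the authors' exact pipeline.

```python
# A minimal sketch of a Random Forest on extracted audio features;
# feature choice, clip paths, and labels are illustrative placeholders.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mfcc_summary(path, n_mfcc=20):
    """Fixed-length clip descriptor: mean and std of each MFCC over time."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def evaluate(paths, labels):
    """Five-fold cross-validation of a Random Forest (labels: 1 = ASD, 0 = NT)."""
    X = np.stack([mfcc_summary(p) for p in paths])
    y = np.asarray(labels)
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()
```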


2021 · Vol 3
Author(s): Alice Baird, Andreas Triantafyllopoulos, Sandra Zänkert, Sandra Ottl, Lukas Christ, ...

Life in modern societies is fast-paced and full of stress-inducing demands. The development of stress monitoring methods is a growing area of research due to the personal and economic advantages that timely detection provides. Studies have shown that speech-based features can be utilised to robustly predict several physiological markers of stress, including emotional state, continuous heart rate, and the stress hormone cortisol. In this contribution, we extend previous works by the authors, utilising three German-language corpora including more than 100 subjects undergoing a Trier Social Stress Test protocol. We present cross-corpus and transfer learning results which explore the efficacy of the speech signal to predict three physiological markers of stress: sequentially measured saliva-based cortisol, continuous heart rate as beats per minute (BPM), and continuous respiration. For this, we extract several features from audio as well as video and apply various machine learning architectures, including a temporal context-based Long Short-Term Memory Recurrent Neural Network (LSTM-RNN). For the task of predicting cortisol levels from speech, deep learning improves on results obtained by conventional support vector regression, yielding a Spearman correlation coefficient (ρ) of 0.770 and 0.698 for cortisol measurements taken 10 and 20 min after the stress period for the two applicable corpora, showing that audio features alone are sufficient for predicting cortisol, with audiovisual fusion improving such results to an extent. We also obtain a Root Mean Square Error (RMSE) of 38 and 22 BPM for continuous heart rate prediction on the two corpora where this information is available, and a normalised RMSE (NRMSE) of 0.120 for respiration prediction (−10: 10). Both of these continuous physiological signals prove to be highly effective markers of stress (based on cortisol grouping analysis), both when available as ground truth and when predicted using speech. This contribution opens up new avenues for future exploration of these signals as proxies for stress in naturalistic settings.
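As a rough illustration of the temporal-context modelling, the sketch below shows an LSTM regressor mapping a sequence of per-frame speech features to a continuous physiological target such as heart rate in BPM; the feature dimensionality (88, in the style of eGeMAPS functionals) and layer sizes are assumptions, not the authors' configuration.

```python
# Sketch of a temporal-context regressor in the spirit of the LSTM-RNN mentioned above.
import torch
import torch.nn as nn

class StressLSTM(nn.Module):
    def __init__(self, feat_dim=88, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)  # e.g. heart rate in BPM per frame

    def forward(self, x):
        # x: (batch, time, feat_dim) sequence of per-frame speech features
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)   # (batch, time)

model = StressLSTM()
pred_bpm = model(torch.randn(2, 500, 88))               # two 500-frame feature sequences
loss = nn.MSELoss()(pred_bpm, torch.randn(2, 500))      # RMSE is reported as sqrt(MSE)
```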


Author(s): Anuranjan Pandey

Abstract: In the tropical jungle, hearing a species is considerably easier than seeing it. If we are in the woods, the sounds of many birds and frogs may be heard even though the animals themselves cannot be seen. In these circumstances it is difficult for an expert to identify the many types of insects and harmful species that may be found in the wild. An audio-input model has been developed in this study. Intelligent signal processing is used to extract patterns and characteristics from the audio signal, and the output is used to identify the species. The sounds of birds and frogs vary according to their species in the tropical environment. In this research we developed a deep learning model that enhances the process of recognizing bird and frog species based on audio features. The model achieved a high level of accuracy in recognizing bird and frog species. The ResNet model, which is built from simple convolutional neural network blocks, is effective in recognizing bird and frog species from the sound of the animal, achieving above 90 percent accuracy on this classification task. Keywords: Bird Frog Detection, Neural Network, ResNet, CNN.
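A rough sketch of the spectrogram-plus-ResNet pipeline described above is shown below; the sample rate, mel-spectrogram settings, and number of species classes are placeholders rather than values from the paper.

```python
# Sketch: mel spectrogram of a recording fed to a ResNet-18 adapted for 1-channel input.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import resnet18

NUM_SPECIES = 24  # placeholder number of bird/frog classes

mel = torchaudio.transforms.MelSpectrogram(sample_rate=32000, n_fft=1024, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

model = resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel spectrograms
model.fc = nn.Linear(model.fc.in_features, NUM_SPECIES)

waveform = torch.randn(1, 32000 * 5)        # stand-in for a 5-second recording
spec = to_db(mel(waveform)).unsqueeze(0)    # (batch=1, channel=1, mels, frames)
logits = model(spec)
species = logits.argmax(dim=1)
```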


2021 · Vol 2021 · pp. 1-10
Author(s): Jie Shan, Muhammad Talha

This article uses a multimodal smart music online teaching method combined with artificial intelligence to address the problem of smart music online teaching and to compensate for the shortcomings of single-modality classification methods that rely on audio features alone. The selection of music intelligence models and classification models, as well as the analysis and processing of music characteristics, are the subjects of this article. It mainly studies how to use lyrics, and how to combine audio and lyrics, to classify music intelligently and to teach multimodal and monomodal smart music online. For lyrics-based online teaching of smart music, three parameters (frequency, concentration, and dispersion) are introduced on top of the traditional wireless network node feature selection method to adjust the statistical values of wireless network nodes, and an improved wireless network node feature selection method is proposed. After feature selection, the TF-IDF method is used to calculate the weights, and artificial intelligence is then used to perform a secondary dimensionality reduction on the lyrics. Experimental data show that, when classifying lyrics intelligently, the accuracy of the traditional wireless network node feature selection method is 58.20%, the accuracy of the improved wireless network node feature selection method is 67.21%, and the accuracy of the improved method combined with artificial intelligence is 69.68%. The third method therefore achieves higher accuracy with lower dimensionality. For multimodal smart music online teaching based on audio and lyrics, this article improves the traditional fusion method to address the problem of multimodal fusion and compares various fusion methods experimentally. The experimental results show that the improved fusion method achieves the best classification performance, reaching 84.43%, which verifies the feasibility and effectiveness of the method.
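The TF-IDF weighting and secondary dimensionality reduction steps can be illustrated with a small hedged sketch using scikit-learn; the lyric snippets are invented, and TruncatedSVD merely stands in for the AI-based reduction described in the article.

```python
# Hedged illustration of TF-IDF weighting of lyrics followed by a dimensionality reduction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

lyrics = [
    "moonlight falls on the quiet river",
    "dancing all night under neon lights",
    "rainfall and thunder over the sleeping town",
]

# term frequency-inverse document frequency weights for each lyric token
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(lyrics)   # sparse matrix, shape (n_songs, n_terms)

# a secondary dimensionality reduction, loosely standing in for the
# AI-based reduction described in the article
svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(weights)
print(reduced.shape)                    # (3, 2)
```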


2021 · Vol 7 · pp. e785
Author(s): Liang Xu, Zaoyi Sun, Xin Wen, Zhengxi Huang, Chi-ju Chao, ...

Melody and lyrics, reflecting two unique human cognitive abilities, are usually combined in music to convey emotions. Although psychologists and computer scientists have made considerable progress in revealing the association between musical structure and the perceived emotions of music, the features of lyrics are relatively less discussed. Using linguistic inquiry and word count (LIWC) technology to extract lyric features in 2,372 Chinese songs, this study investigated the effects of LIWC-based lyric features on the perceived arousal and valence of music. First, correlation analysis shows that, for example, the perceived arousal of music was positively correlated with the total number of lyric words and the mean number of words per sentence and was negatively correlated with the proportion of words related to the past and insight. The perceived valence of music was negatively correlated with the proportion of negative emotion words. Second, we used audio and lyric features as inputs to construct music emotion recognition (MER) models. The performance of random forest regressions reveals that, for the recognition models of perceived valence, adding lyric features can significantly improve the prediction effect of the model using audio features only; for the recognition models of perceived arousal, lyric features are almost useless. Finally, by calculating the feature importance to interpret the MER models, we observed that the audio features played a decisive role in the recognition models of both perceived arousal and perceived valence. Unlike the uselessness of the lyric features in the arousal recognition model, several lyric features, such as the usage frequency of words related to sadness, positive emotions, and tentativeness, played important roles in the valence recognition model.
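The following sketch illustrates the general setup of such an MER model: a random forest regression over concatenated audio and lyric features, with feature importances used to compare the contribution of the two blocks. The feature matrices and ratings are synthetic placeholders, not the study's data.

```python
# Sketch of a music emotion recognition (MER) regression over audio + lyric features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_songs = 200
audio_feats = rng.normal(size=(n_songs, 20))   # stand-in for spectral/rhythmic descriptors
lyric_feats = rng.normal(size=(n_songs, 10))   # stand-in for LIWC category proportions
valence = rng.normal(size=n_songs)             # stand-in for perceived valence ratings

X = np.hstack([audio_feats, lyric_feats])
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, valence)

# feature importances let us ask how much the lyric block contributes
importances = model.feature_importances_
print("audio share:", importances[:20].sum(), "lyric share:", importances[20:].sum())
```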


2021 · Vol 2021 · pp. 1-9
Author(s): Chun Huang, Diao Shen

The music performance system works by identifying the emotional elements of music to control lighting changes; if there is a recognition error, a good stage effect cannot be created. Therefore, this paper proposes an intelligent music emotion recognition and classification algorithm for the music performance system. The first part of the algorithm analyzes the emotional features of music, including acoustic features, melody features, and audio features, and combines the three kinds of features into a feature vector set. The latter part of the algorithm divides the feature vector set into training samples and test samples. A neural-network-based recognition and classification model is trained on the training samples, and the test samples are then fed into the trained model to realize intelligent recognition and classification of music emotion. The results show that the kappa coefficient values calculated by the proposed algorithm are greater than 0.75, indicating that the recognition and classification results are consistent with the actual results and that the accuracy of recognition and classification is high, so the research purpose is achieved.
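As a small illustration of the reported evaluation, the snippet below computes Cohen's kappa between predicted and actual emotion labels; the label sets here are made up, and values above roughly 0.75 are conventionally read as strong agreement.

```python
# Agreement check between predicted and actual emotion labels via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

actual    = ["happy", "sad", "calm", "happy", "sad", "calm", "happy", "calm"]
predicted = ["happy", "sad", "calm", "happy", "calm", "calm", "happy", "calm"]

kappa = cohen_kappa_score(actual, predicted)
print(f"kappa = {kappa:.2f}")  # values above 0.75 indicate strong agreement
```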


2021
Author(s): Javier Andreu-Perez, Humberto Perez-Espinosa, Eva Timonet, Mehrin Kiani, Manuel I. Girón-Pérez, ...

We seek to evaluate the detection performance of a rapid primary screening tool for Covid-19 based solely on the cough sound, using 8,380 clinically validated samples with laboratory molecular tests (2,339 Covid-19 positive and 6,041 Covid-19 negative). Samples were clinically labelled according to the results and severity based on quantitative RT-PCR (qRT-PCR) analysis, cycle threshold, and lymphocyte counts from the patients. Our proposed generic method is an algorithm based on Empirical Mode Decomposition (EMD), with subsequent classification based on a tensor of audio features and a deep artificial neural network classifier with convolutional layers called 'DeepCough'. Two versions of DeepCough, differing in the number of tensor dimensions (DeepCough2D and DeepCough3D), have been investigated. These methods have been deployed in a multi-platform proof-of-concept web app, CoughDetect, to administer this test anonymously. Covid-19 recognition achieved a promising AUC (Area Under the Curve) of 98.80% ± 0.83%, sensitivity of 96.43% ± 1.85%, and specificity of 96.20% ± 1.74%, and an AUC of 81.08% ± 5.05% for the recognition of three severity levels. Our proposed web tool and the underpinning algorithm for the robust, fast, point-of-need identification of Covid-19 facilitate rapid detection of the infection. We believe that it has the potential to significantly hamper the Covid-19 pandemic across the world.
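A hedged sketch of an EMD-based front end is shown below: a cough recording is decomposed into intrinsic mode functions whose spectrograms are stacked into a feature tensor for a convolutional classifier. The library choices (PyEMD, librosa), the file name, and the number of retained IMFs are assumptions, not details from the paper.

```python
# Sketch of an EMD front end: decompose a cough recording into intrinsic mode
# functions (IMFs) and stack per-IMF mel spectrograms into a tensor for a CNN.
import numpy as np
import librosa
from PyEMD import EMD

signal, sr = librosa.load("cough.wav", sr=16000)   # placeholder file name

imfs = EMD()(signal)                               # shape: (n_imfs, n_samples)

tensor = np.stack([
    librosa.power_to_db(librosa.feature.melspectrogram(y=imf, sr=sr, n_mels=64))
    for imf in imfs[:3]                            # keep the first few IMFs
])                                                 # (n_kept_imfs, 64, frames)
print(tensor.shape)
```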


2021 · Vol 8 (11)
Author(s): Ole Adrian Heggli, Jan Stupacher, Peter Vuust

The rhythm of human life is governed by diurnal cycles, as a result of endogenous circadian processes evolved to maximize biological fitness. Even complex aspects of daily life, such as affective states, exhibit systematic diurnal patterns which in turn influence behaviour. As a result, previous research has identified population-level diurnal patterns in affective preference for music. By analysing audio features from over two billion music streaming events on Spotify, we find that the music people listen to divides into five distinct time blocks corresponding to morning, afternoon, evening, night and late night/early morning. By integrating an artificial neural network with Spotify's API, we show a general awareness of diurnal preference in playlists, which is not present to the same extent for individual tracks. Our results demonstrate how music intertwines with our daily lives and highlight how even something as individual as musical preference is influenced by underlying diurnal patterns.

