audio clip
Recently Published Documents


TOTAL DOCUMENTS: 29 (five years: 15)
H-INDEX: 6 (five years: 2)

2021
Author(s): Javier Naranjo-Alcazar, Sergi Perez-Castanos, Maximo Cobos, Francesc J. Ferri, Pedro Zuccarello

Acoustic scene classification (ASC) is one of the most popular problems in the field of machine listening. The objective is to classify an audio clip into one of a set of predefined scenes using only the audio data. The problem has progressed considerably over the different editions of DCASE, which usually offers several subtasks so that it can be tackled with different approaches. The subtask addressed in this report corresponds to an ASC problem that is constrained by model complexity and uses audio recorded from different devices, known as mismatched devices (real and simulated). The work presented here follows the research line carried out by the team in previous years. Specifically, a two-step system is proposed: a two-dimensional representation of the audio obtained with a Gammatone filter bank, followed by a convolutional neural network using squeeze-excitation techniques. The presented system outperforms the baseline by about 17 percentage points.
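As a rough illustration of the two-step system described in this abstract, the sketch below pairs a time-frequency input (standing in for the Gammatone representation) with a convolutional block that applies squeeze-excitation. The layer sizes, reduction ratio, and input shape are assumptions for illustration, not the authors' exact configuration.

```python
# A minimal sketch: a CNN building block with squeeze-and-excitation applied to a
# Gammatone-style time-frequency representation. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel-wise squeeze-and-excitation recalibration."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(                     # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight feature maps per channel

class SEConvBlock(nn.Module):
    """Conv -> BN -> ReLU -> squeeze-excitation, one building block of the classifier."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            SqueezeExcitation(out_ch),
        )

    def forward(self, x):
        return self.conv(x)

# Example: a batch of 8 single-channel Gammatone spectrograms (64 bands x 500 frames).
features = torch.randn(8, 1, 64, 500)
print(SEConvBlock(1, 32)(features).shape)  # torch.Size([8, 32, 64, 500])
```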


2021
Vol 2021, pp. 1-12
Author(s): Gundeep Singh, Sahil Sharma, Vijay Kumar, Manjit Kaur, Mohammed Baz, ...

The process of detecting the language of an audio clip from an unknown speaker, regardless of the speaker's gender, manner of speaking, or age, is defined as spoken language identification (SLID). The central task is to identify features that can distinguish between languages clearly and efficiently. The model takes audio files, converts them into spectrogram images, and applies a convolutional neural network (CNN) to extract the main attributes or features so the output can be detected easily. The objective is to identify the language among English, French, Spanish, German, Estonian, Tamil, Mandarin, Turkish, Chinese, Arabic, Hindi, Indonesian, Portuguese, Japanese, Latin, Dutch, Pushto, Romanian, Korean, Russian, Swedish, Thai, and Urdu. An experiment was conducted on audio files from the Kaggle dataset named "spoken language identification". Each audio file comprises an utterance spanning a fixed duration of 10 seconds. The whole dataset is split into training and test sets. Preliminary results give an overall accuracy of 98%, and extensive testing shows an overall accuracy of 88%.
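A minimal sketch of the pipeline summarized above: a fixed-length clip is converted to a log-mel spectrogram image and passed to a small CNN that scores each candidate language. The sample rate, number of mel bands, and network depth are illustrative assumptions rather than the paper's settings, and "clip.wav" is a placeholder file name.

```python
# A minimal sketch, assuming librosa for the spectrogram and a toy CNN classifier.
import librosa
import numpy as np
import torch
import torch.nn as nn

def clip_to_spectrogram(path: str, sr: int = 22050, duration: float = 10.0) -> torch.Tensor:
    """Load a 10-second clip and return a log-mel spectrogram as a (1, mels, frames) tensor."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return torch.from_numpy(log_mel).unsqueeze(0).float()

class LanguageCNN(nn.Module):
    def __init__(self, n_languages: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_languages)

    def forward(self, x):                       # x: (batch, 1, mels, frames)
        return self.classifier(self.features(x).flatten(1))

model = LanguageCNN(n_languages=23)             # one score per language listed above
dummy = torch.randn(1, 1, 128, 431)             # ~431 frames for a 10 s clip at 22.05 kHz
print(model(dummy).shape)                       # torch.Size([1, 23])
# For a real file: model(clip_to_spectrogram("clip.wav").unsqueeze(0))
```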


2021
Author(s): Xiangjian Liu, Yishan Zou, Yu Sun

Dogs tend to bark at loud noises they perceive as an intruder or a threat, and this hostile barking can last for hours depending on the duration of the noise. These barking sessions are unnecessary and negatively impact the quality of life of others in the community, causing annoyance to neighbors [1]. Neighbors have the right to file noise complaints with the Home Owners Association, potentially resulting in fines or even the removal of the pet [2]. In this paper, we discuss the development of an algorithm that takes audio input through a microphone, processes the audio, identifies through machine learning whether the clip contains dog barking, and ultimately sends a notification to the user. By integrating our application into the everyday life of dog owners, it allows them to accurately determine the status of their dog in real time with minimal false reports.
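The detection loop described in this abstract could look roughly like the sketch below: record short clips from the microphone, run them through a classifier, and notify the owner on a positive detection. `sounddevice` is just one possible capture library, and `is_dog_bark` and `notify_user` are hypothetical stand-ins for the trained model and the mobile notification.

```python
# A minimal sketch of a capture -> classify -> notify loop, under the assumptions above.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
CLIP_SECONDS = 2

def is_dog_bark(clip: np.ndarray) -> bool:
    """Placeholder for the trained machine-learning classifier."""
    return clip.std() > 0.1          # assumption: naive energy threshold stands in for the model

def notify_user(message: str) -> None:
    print(message)                   # in the real app this would push a phone notification

while True:
    clip = sd.rec(int(CLIP_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()                        # block until the recording finishes
    if is_dog_bark(clip.squeeze()):
        notify_user("Barking detected in the last 2 seconds")
```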


Author(s): R Santhoshi

Many internet-based applications for learning a new language focus on teaching words and sentences and do not address the user's pronunciation. Even speakers who are proficient in a language may have pronunciation influenced by their native language. The proposed system was introduced for people interested in improving their pronunciation. It is primarily focused on improving the pronunciation of English words and sentences for non-native speakers, i.e., those for whom English is a second language. For a given audio clip, we scale the audio, extract features, and feed them to the developed model, whose output gives the phonemes spoken in the clip. Many models and methods for phoneme detection have been proposed; the main reason for choosing deep learning is that features we tend to overlook are picked up by the model, provided the dataset is balanced and the model is built properly. The features to consider vary for every speech-processing project, and through previous research and trial and error we chose the features that work best for this task. By comparing the detected phonemes with the expected phonemes, we can tell the speaker which part of their speech they need to work on, and feedback on how to improve their pronunciation is given based on the phoneme.
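The feedback step, comparing the phonemes detected by the model with the expected phonemes, can be sketched with a simple sequence alignment. The phoneme strings and the `pronunciation_feedback` helper below are illustrative, not the system's actual implementation.

```python
# A minimal sketch of phoneme-level feedback via sequence alignment (standard library only).
from difflib import SequenceMatcher

def pronunciation_feedback(predicted: list[str], reference: list[str]) -> list[str]:
    """Return human-readable notes on phonemes that were substituted, missing, or extra."""
    notes = []
    matcher = SequenceMatcher(a=reference, b=predicted)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            notes.append(f"expected {reference[i1:i2]} but heard {predicted[j1:j2]}")
        elif op == "delete":
            notes.append(f"missing phoneme(s) {reference[i1:i2]}")
        elif op == "insert":
            notes.append(f"extra phoneme(s) {predicted[j1:j2]}")
    return notes

# Example: "think" pronounced with /T/ instead of /TH/.
print(pronunciation_feedback(predicted=["T", "IH", "NG", "K"],
                             reference=["TH", "IH", "NG", "K"]))
# -> ["expected ['TH'] but heard ['T']"]
```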


2021
Vol 15 (02), pp. 143-160
Author(s): Ayşegül Özkaya Eren, Mustafa Sert

Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. Previous studies mostly use encoder–decoder-based models without considering semantic information. To fill this gap, we present a novel encoder–decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder–decoder model. To enable semantic embeddings for the test clips, we introduce a Multilayer Perceptron classifier to predict their semantic embeddings. We also present exhaustive experiments to show the effect of different features and datasets on our proposed model for the audio captioning task. To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings. Extensive experiments on two audio captioning datasets, Clotho and AudioCaps, show that our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics and that using the semantic information improves the captioning performance.
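A rough sketch of the encoder side described above: a BiGRU summarizes the audio embedding sequence, and its output is fused with a clip-level semantic embedding, which an MLP predicts for test clips where no caption is available. All dimensions and the concatenation-based fusion are assumptions for illustration, not the paper's exact architecture.

```python
# A minimal sketch of a BiGRU encoder fused with a (predicted) semantic embedding.
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, audio_dim: int = 128, semantic_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.bigru = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(2 * hidden + semantic_dim, hidden)

    def forward(self, audio_seq, semantic_vec):
        # audio_seq: (batch, frames, audio_dim); semantic_vec: (batch, semantic_dim)
        _, h = self.bigru(audio_seq)                     # h: (2, batch, hidden)
        clip_repr = torch.cat([h[0], h[1]], dim=-1)      # concatenate both directions
        return torch.tanh(self.fuse(torch.cat([clip_repr, semantic_vec], dim=-1)))

class SemanticMLP(nn.Module):
    """Predicts a semantic embedding for test clips, where no caption is available."""
    def __init__(self, audio_dim: int = 128, semantic_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU(),
                                 nn.Linear(128, semantic_dim))

    def forward(self, audio_seq):
        return self.net(audio_seq.mean(dim=1))           # pool frames, then map to semantics

audio = torch.randn(4, 300, 128)                         # e.g. 4 clips of frame-level embeddings
semantic = SemanticMLP()(audio)
print(BiGRUEncoder()(audio, semantic).shape)             # torch.Size([4, 256])
```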


Author(s): Sichen Liu, Feiran Yang, Yin Cao, Jun Yang

Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets only contain audio tags without precise temporal information and are therefore classified as weakly labeled datasets. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than the "VGG-like" block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP performs better than the state-of-the-art source separation-based method on the SED task.
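The two components named in this abstract, the DDC-block and a learnable pooling over time, can be sketched as follows. The exact frequency-dependent form of FAP is defined in the paper; this sketch only shows a generic per-class auto-pool and an assumed block layout.

```python
# A minimal sketch: a dilated depthwise separable conv block and a learnable auto-pool
# that turns frame-level event probabilities into clip-level probabilities.
import torch
import torch.nn as nn

class DDCBlock(nn.Module):
    """Dilated depthwise separable convolution: depthwise 3x3 (dilated) + pointwise 1x1."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.act = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):                            # x: (batch, channels, freq, time)
        return self.act(self.pointwise(self.depthwise(x)))

class AutoPool(nn.Module):
    """Learnable soft pooling over time: alpha near 0 acts like mean, large alpha like max."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(n_classes))

    def forward(self, frame_probs):                  # (batch, frames, classes)
        weights = torch.softmax(self.alpha * frame_probs, dim=1)
        return (frame_probs * weights).sum(dim=1)    # (batch, classes)

x = torch.randn(2, 16, 64, 400)                      # (batch, channels, freq, time)
frame_probs = torch.sigmoid(torch.randn(2, 400, 10)) # per-frame probabilities of 10 events
print(DDCBlock(16)(x).shape, AutoPool(10)(frame_probs).shape)
```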


PLoS ONE
2021
Vol 16 (4), pp. e0250173
Author(s): Sadia Sultana, M. Shahidur Rahman, M. Reza Selim, M. Zafar Iqbal

SUBESCO is an audio-only emotional speech corpus for the Bangla language. The total duration of the corpus exceeds 7 hours across 7,000 utterances, making it the largest emotional speech corpus available for this language. Twenty native speakers participated in the gender-balanced set, each recording 10 sentences simulating seven targeted emotions. Fifty university students participated in the evaluation of the corpus. Each audio clip, except those for the Disgust emotion, was validated four times by male and female raters. Raw hit rates and unbiased hit rates were calculated, producing scores above the chance level of responses. The overall recognition rate was above 70% in the human perception tests. Kappa statistics and intra-class correlation coefficient scores indicated a high level of inter-rater reliability and consistency in the corpus evaluation. SUBESCO is an Open Access database, licensed under Creative Commons Attribution 4.0 International, and can be downloaded free of charge from: https://doi.org/10.5281/zenodo.4526477.
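For reference, the two perception-test metrics mentioned above can be computed from a rater confusion matrix as sketched below, using the standard unbiased hit rate formulation (Wagner, 1993). The matrix values here are made up for illustration, not taken from SUBESCO.

```python
# A small sketch: raw and unbiased hit rates from a confusion matrix
# (rows: intended emotion, columns: perceived emotion). Values are illustrative.
import numpy as np

confusion = np.array([[40,  5,  5],     # e.g. Anger, Happiness, Sadness
                      [ 6, 38,  6],
                      [ 4,  7, 39]], dtype=float)

raw_hit_rate = np.diag(confusion) / confusion.sum(axis=1)
# Unbiased hit rate also penalises raters who over-use a response category.
unbiased_hit_rate = np.diag(confusion) ** 2 / (confusion.sum(axis=1) * confusion.sum(axis=0))

print(raw_hit_rate)        # per-emotion proportion of correct identifications
print(unbiased_hit_rate)   # always <= raw hit rate; compared against chance level
```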


2021
Vol 9
Author(s): Ahnjili ZhuParris, Matthijs D. Kruizinga, Max van Gent, Eva Dessing, Vasileios Exadaktylos, ...

Introduction: The duration and frequency of crying of an infant can be indicative of its health. Manual tracking and labeling of crying is laborious, subjective, and sometimes inaccurate. The aim of this study was to develop and technically validate a smartphone-based algorithm able to automatically detect crying.

Methods: For the development of the algorithm, a training dataset containing 897 5-s clips of crying infants and 1,263 clips of non-crying infants and common domestic sounds was assembled from various online sources. OpenSMILE software was used to extract 1,591 audio features per audio clip. A random forest classification algorithm was fitted to identify crying from non-crying in each audio clip. For the validation of the algorithm, an independent dataset consisting of real-life recordings of 15 infants was used. A 29-min audio clip was analyzed repeatedly and under differing circumstances to determine the intra- and inter-device repeatability and robustness of the algorithm.

Results: The algorithm obtained an accuracy of 94% in the training dataset and 99% in the validation dataset. The sensitivity in the validation dataset was 83%, with a specificity of 99% and positive and negative predictive values of 75% and 100%, respectively. Reliability of the algorithm appeared to be robust within and across devices, and the performance was robust to the distance from the sound source and to barriers between the sound source and the microphone.

Conclusion: The algorithm was accurate in detecting cry duration and was robust to various changes in ambient settings.
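A minimal sketch of the training pipeline described in the Methods: openSMILE functionals per clip followed by a random forest. Which openSMILE configuration yields the reported 1,591 features is not stated here, so the emobase feature set serves as a stand-in, and the file names and labels are placeholders.

```python
# A minimal sketch, assuming the opensmile Python package and scikit-learn.
import opensmile
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,          # stand-in feature set (assumption)
    feature_level=opensmile.FeatureLevel.Functionals,  # one feature vector per clip
)

clips = ["cry_001.wav", "no_cry_001.wav"]              # placeholder file names
labels = [1, 0]                                        # 1 = crying, 0 = not crying
features = pd.concat([smile.process_file(p) for p in clips])

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(features.values, labels)
print(clf.predict(features.values))                    # classify new clips the same way
```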


2021
Author(s): Jing Yuan, Zijie Wang, Dehe Yang, Qiao Wang, Zeren Zima, ...

Lightning whistlers, found frequently in electromagnetic satellite observations, are an important tool for studying the electromagnetic environment of near-Earth space. With the growing volume of data from electromagnetic satellites, a considerable amount of time and human effort is needed to detect lightning whistlers in these observations. In recent years, algorithms for the automatic detection of lightning whistlers have been proposed. However, these methods only work on the time-frequency profile (image) of the satellite data and have two major limitations: large storage requirements for the time-frequency profiles and expensive computation for detecting whistlers in them automatically. These limitations hinder the methods from running efficiently on the ZH-1 satellite. To overcome them and realize automatic real-time whistler detection on board the satellite, we propose a novel algorithm that detects lightning whistlers from the original observed data without transforming it into a time-frequency profile (image).

The motivation is that the frequency of lightning whistlers lies in the audio frequency range, which encourages us to apply speech recognition techniques to recognize whistlers in the original data of the SCM VLF instrument on board ZH-1. Firstly, we slide a 0.16-second window over the original data to obtain patches treated as audio clips. Secondly, we extract the Mel-frequency cepstral coefficients (MFCCs) of each patch as a cepstral representation of the audio clip. Thirdly, the MFCCs are fed to a Long Short-Term Memory (LSTM) recurrent neural network for classification. To evaluate the proposed method, we constructed a dataset composed of 10,000 segments of SCM wave data observed by the ZH-1 satellite (5,000 segments containing a whistler and 5,000 segments without any whistler). The proposed method achieves 84% accuracy, 87% recall, and an F1 score of 85.6%. Furthermore, it saves more than 126.7 MB of storage and 0.82 seconds of computation compared to the method that employs the YOLOv3 neural network to detect whistlers in each time-frequency profile.

Keywords: ZH-1 satellite, SCM, lightning whistler, MFCC, LSTM
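A rough sketch of the processing chain described above: slide a 0.16-second window over the raw waveform, compute MFCCs for each window, and classify the MFCC sequence with an LSTM. The sampling rate, MFCC settings, and network sizes are illustrative assumptions, not the ZH-1/SCM processing parameters.

```python
# A minimal sketch of window -> MFCC -> LSTM classification, under the assumptions above.
import librosa
import numpy as np
import torch
import torch.nn as nn

def window_to_mfcc(window: np.ndarray, sr: int) -> torch.Tensor:
    """MFCC representation of one 0.16 s window, shaped (frames, n_mfcc) for the LSTM."""
    mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=20)
    return torch.from_numpy(mfcc.T).float()

class WhistlerLSTM(nn.Module):
    def __init__(self, n_mfcc: int = 20, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)                 # whistler / no whistler

    def forward(self, x):                               # x: (batch, frames, n_mfcc)
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])

sr = 51200                                              # assumed waveform sampling rate
signal = np.random.randn(sr)                            # stand-in for 1 s of SCM data
window = signal[: int(0.16 * sr)]
logits = WhistlerLSTM()(window_to_mfcc(window, sr).unsqueeze(0))
print(torch.softmax(logits, dim=-1))                    # P(no whistler), P(whistler)
```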


2020
Vol 25 (4), pp. 357-367
Author(s): Fayel Mustafiz, Dawn D. Dugan

Students with mental illness can feel stigmatized by their peers and may also perceive less social support. However, people are thought to view someone more favorably if they perceive them as part of their in-group sharing a common identity. Thus, an online survey was administered to 152 undergraduate students to investigate whether high versus low in-group identification would lead to a more favorable view of a peer regardless of their mental health state, and whether a peer with stress would still be favored over a peer with mental illness. First, participants rated group identification with a hypothetical peer describing their Hunter College, CUNY experience in an audio clip. Then, participants heard the peer reveal a mental health state of either mental illness or stress. Finally, they rated perceived similarity and social distance toward the peer. Results of a factorial multivariate analysis of variance indicated significant main effects for both in-group identification, F(2, 147) = 8.01, p < .001, partial η² = .10, and the peer's mental health state, F(2, 147) = 8.00, p = .001, partial η² = .10. Although the peer with mental illness was viewed less favorably than the peer with stress, irrespective of group identification, high in-group identification still led to a more positive evaluation of the peer than low in-group identification. These results are important for understanding how increasing awareness of group identification may reduce stigma toward students with mental illness and ultimately reduce barriers to care.
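For readers who want to run this kind of analysis, a 2 x 2 factorial MANOVA with perceived similarity and social distance as dependent variables can be set up as sketched below. The data frame is randomly generated for illustration and statsmodels is just one possible implementation, not necessarily the tool used in the study.

```python
# A small sketch of a factorial MANOVA, with simulated data and assumed variable names.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n = 152
df = pd.DataFrame({
    "identification": rng.choice(["high", "low"], size=n),        # in-group identification
    "state": rng.choice(["stress", "mental_illness"], size=n),    # peer's mental health state
    "similarity": rng.normal(4.0, 1.0, size=n),                   # simulated rating scale
    "social_distance": rng.normal(3.0, 1.0, size=n),              # simulated rating scale
})

manova = MANOVA.from_formula("similarity + social_distance ~ identification * state", data=df)
print(manova.mv_test())   # multivariate F-tests (Wilks' lambda, Pillai's trace) per effect
```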

