machine listening
Recently Published Documents


TOTAL DOCUMENTS: 34 (five years: 13)

H-INDEX: 4 (five years: 1)

2021
Author(s): Javier Naranjo-Alcazar, Sergi Perez-Castanos, Maximo Cobos, Francesc J. Ferri, Pedro Zuccarello

Acoustic scene classification (ASC) is one of the most popular problems in the field of machine listening. The objective is to classify an audio clip into one of a set of predefined scenes using only the audio data. The problem has progressed considerably over the years through the successive editions of the DCASE challenge, which usually comprises several subtasks that allow it to be tackled with different approaches. The subtask presented in this report corresponds to an ASC problem constrained both by the complexity of the model and by audio recorded with different, mismatched devices (real and simulated). The work presented in this report follows the research line carried out by the team in previous years. Specifically, a two-step system is proposed: a two-dimensional representation of the audio using a Gammatone filter bank, followed by a convolutional neural network using squeeze-excitation techniques. The presented system outperforms the baseline by about 17 percentage points.
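A minimal PyTorch sketch of the squeeze-excitation idea the report builds on: per-channel gates recalibrate a two-dimensional time-frequency representation such as a Gammatone spectrogram. The layer sizes, reduction ratio, and input shape below are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # "Squeeze": global average pooling collapses each feature map to one value.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # "Excitation": a bottleneck MLP produces per-channel gates in (0, 1).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates  # reweight the channels of the 2-D audio representation

# Hypothetical batch of Gammatone-spectrogram feature maps: (batch, channels, bands, frames).
features = torch.randn(8, 64, 128, 431)
se = SqueezeExcitation(channels=64)
print(se(features).shape)  # torch.Size([8, 64, 128, 431])
```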


2021
Author(s): Khaled Koutini, Hamid Eghbal-zadeh, Florian Henkel, Jan Schlüter, Gerhard Widmer

Convolutional Neural Networks (CNNs) have been dominating classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, while CNNs generally exhibit very good generalization capabilities, they are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene classification (DCASE) community. In this study, we investigate the relationship between the over-parameterization of acoustic scene classification models and their resulting generalization abilities. Our results indicate that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.
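The width-scaling experiment can be made concrete with a channel multiplier applied uniformly to a CNN. The sketch below uses an assumed toy architecture, not the authors' model, and serves only to show how width and parameter count are varied.

```python
import torch.nn as nn

def make_cnn(num_classes: int = 10, width: float = 1.0) -> nn.Sequential:
    # Widen every convolutional layer by the same factor.
    c1, c2 = int(32 * width), int(64 * width)
    return nn.Sequential(
        nn.Conv2d(1, c1, 3, padding=1), nn.BatchNorm2d(c1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(c1, c2, 3, padding=1), nn.BatchNorm2d(c2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(c2, num_classes),
    )

for w in (0.5, 1.0, 2.0):
    n_params = sum(p.numel() for p in make_cnn(width=w).parameters())
    print(f"width x{w}: {n_params:,} parameters")
```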


Author(s): Domenico Napolitano, Renato Grieco

The paper investigates new machine listening technologies through a comparison of phenomenological and empirical/media-archeological approaches. While phenomenology associates listening with subjectivity, empiricism takes into account the technical operations involved with listening processes in both human and non-human apparatuses. Based on this theoretical framework, the paper undertakes a media-archeological investigation of two algorithms employed in copyright detection: "acoustic fingerprinting" and "audio watermarking". In the technical operations of sound recognition algorithms, empirical analysis suggests the coexistence of a multiplicity of spatialities: from the "sound event", which occurs in three-dimensional physical space, to its mathematical representation in vector space, and to the one-dimensional informational space of data processing and machine-to-machine communication. Recalling Deleuze's definition of "the fold", we define these coexistent spatial dimensions in techno-culturally mediated sound as "the folded space" of machine listening. We go on to argue that the issue of space in machine listening consists of the virtually infinite variability of the sound event being subjected to automatic recognition. The difficulty lies in reconciling the theoretically enduring information transmitted by sound with the contingent manifestation of sound affected by space. To make machines able to deal with the site-specificity of sound, recognition algorithms need to reconstruct the three-dimensional space on a signal processing level, in a sort of reverse-engineering of the sound phenomenon that recalls the concept of "implicit sonicity" defined by Wolfgang Ernst. While the metaphors and social representations adopted to describe machine listening are often anthropomorphic, and the very term "listening", when referring to numerical operations, can be seen as a metaphor in itself, we argue that both human listening and machine listening are co-defined in a socio-technical network, in which the listening space no longer coincides with the position of the listening subject, but is negotiated between human and nonhuman agencies.
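The paper treats acoustic fingerprinting as a concrete technical operation. Purely as an assumed illustration (a simplified, Shazam-style landmark scheme, not the specific systems the authors analyse), a fingerprint can be built by hashing pairs of spectrogram peaks so that matching depends on relative rather than absolute timing:

```python
import numpy as np
from scipy import signal

def fingerprint(audio: np.ndarray, sr: int = 22050) -> set:
    """Hash pairs of spectral peaks into time-shift-invariant landmarks."""
    _, _, spec = signal.spectrogram(audio, fs=sr, nperseg=1024)
    peaks = np.log1p(spec).argmax(axis=0)  # strongest frequency bin per frame
    hashes = set()
    for i in range(len(peaks)):
        for dt in (1, 2, 3):  # pair each anchor with a few later frames
            if i + dt < len(peaks):
                # (anchor bin, paired bin, frame gap): the hash discards
                # absolute time, abstracting the sound event away from its
                # contingent spatio-temporal manifestation.
                hashes.add((int(peaks[i]), int(peaks[i + dt]), dt))
    return hashes

query = np.random.randn(22050)  # one second of noise as a stand-in signal
print(len(fingerprint(query)), "landmark hashes")
```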


2021
Vol 25, pp. 233121652110461
Author(s): Björn Schuller, Alice Baird, Alexander Gebhard, Shahin Amiriparian, Gil Keren, ...

Computer audition (i.e., intelligent audio) has made great strides in recent years; however, it is still far from achieving holistic hearing abilities that more appropriately mimic human-like understanding. Within an audio scene, a human listener can quickly interpret layers of sound at a single time-point, with each layer varying in characteristics such as location, state, and trait. Current integrated machine listening approaches, on the other hand, mainly recognise only single events. In this context, this contribution aims to provide key insights and approaches that can be applied in computer audition to achieve a more holistic intelligent understanding system, and to identify the challenges in reaching this goal. We first summarise the state of the art in traditional signal-processing-based audio pre-processing and feature representation, as well as automated learning such as by deep neural networks, concerning in particular audio interpretation, decomposition, understanding, and ontologisation. We then present an agent-based approach for integrating these concepts into a holistic audio understanding system. We conclude by outlining avenues towards the ambitious goal of 'holistic human-parity' machine listening abilities.
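The gap between single-event recognition and layered scene understanding can be illustrated with the output head alone: a softmax forces one winning label, while independent sigmoid scores admit any subset of coexisting layers. A minimal sketch under assumed class names and logits, not the agent-based system the paper describes:

```python
import torch

classes = ["speech", "music", "traffic", "birdsong"]
logits = torch.tensor([[2.3, 1.7, -0.4, 0.9]])  # hypothetical network output

# Single-event view: argmax picks exactly one label per time-point.
single_event = classes[logits.argmax(dim=1).item()]

# Layered view: independent sigmoids let several labels be active at once.
layered = [c for c, p in zip(classes, torch.sigmoid(logits)[0]) if p > 0.5]

print("single-event view:", single_event)  # speech
print("layered view:", layered)            # ['speech', 'music', 'birdsong']
```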


2020
pp. 69-78
Author(s): Stefano Fasciani

Expressive sonic interaction with sound synthesizers requires the control of a continuous, high-dimensional space, and the relationship between synthesis variables and the timbre of the generated sound is typically complex or unknown to users. In previous work, we presented an unsupervised mapping method based on machine listening and machine learning techniques which addresses these challenges by providing a low-dimensional, perceptually related timbre control space. The mapping maximizes the breadth of the explorable sonic space covered by the sound synthesizer and minimizes the timbre losses incurred by low-dimensional control. The mapping is generated automatically by a system requiring little input from users. In this paper we present an improved method and an optimized implementation that drastically reduce the time needed for timbre analysis and mapping computation. We introduce the use of extreme learning machines for the regression from control to timbre spaces, and an interactive approach for analysing the synthesizer's sonic response as users explore the parameters of the instrument. This work is implemented in a generic, open-source tool that enables the computation of ad hoc synthesis mappings through timbre spaces, facilitating and speeding up the workflow of building a customized sonic control system.
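Extreme learning machines keep the regression step cheap because only a linear readout is fitted. A minimal numpy sketch under toy dimensions (4 synthesis parameters mapped to a 2-D timbre space), which are assumptions for illustration rather than the tool's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, Y, hidden=256):
    # Random input weights and biases stay fixed; no backpropagation needed.
    W = rng.standard_normal((X.shape[1], hidden))
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W + b)
    # Train only the readout, in closed form, by least squares.
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy mapping: 4 synthesis parameters -> 2-D perceptual timbre coordinates.
X = rng.uniform(size=(500, 4))
Y = np.stack([np.sin(X @ np.array([1., 2., 0., 0.])),
              np.cos(X @ np.array([0., 0., 3., 1.]))], axis=1)
W, b, beta = elm_fit(X, Y)
print("mean absolute fit error:", np.abs(elm_predict(X, W, b, beta) - Y).mean())
```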


2020
Vol 24 (7), pp. 2082-2092
Author(s): Fengquan Dong, Kun Qian, Zhao Ren, Alice Baird, Xinjian Li, ...

Author(s): Saumitra Mishra, Emmanouil Benetos, Bob L. T. Sturm, Simon Dixon

2020
Vol 1, pp. 6
Author(s): Jörn Anemüller

Multi-channel acoustic source localization evaluates direction-dependent inter-microphone differences in order to estimate the position of an acoustic source embedded in an interfering sound field. We here investigate a deep neural network (DNN) approach to source localization that improves on previous work with learned, linear support-vector-machine localizers. DNNs with depths between 4 and 15 layers were trained to predict the azimuth direction of target speech in 72 directional bins of width 5 degrees, embedded in an isotropic, multi-speech-source noise field. Several system parameters were varied; in particular, the number of microphones in the bilateral hearing aid scenario was set to 2, 4, and 6, respectively. Results show that DNNs provide a clear improvement in localization performance over a linear classifier reference system. Increasing the number of microphones from 2 to 4 results in a larger increase of performance for the DNNs than for the linear system. However, 6 microphones provide only a small additional gain. The DNN architectures perform better with 4 microphones than the linear approach does with 6 microphones, thus indicating that location-specific information in source-interference scenarios is encoded non-linearly in the sound field.
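The classification setup described above reduces to a network with 72 output logits, one per 5-degree azimuth bin. A minimal PyTorch sketch follows; the feature dimensionality and hidden width are assumptions, with the depth chosen from the 4-to-15-layer range the study explores:

```python
import torch
import torch.nn as nn

def make_localizer(n_features: int = 512, depth: int = 4,
                   hidden: int = 512, n_bins: int = 72) -> nn.Sequential:
    layers, d_in = [], n_features
    for _ in range(depth):
        layers += [nn.Linear(d_in, hidden), nn.ReLU()]
        d_in = hidden
    layers.append(nn.Linear(d_in, n_bins))  # one logit per 5-degree bin
    return nn.Sequential(*layers)

net = make_localizer()
features = torch.randn(1, 512)  # e.g. inter-microphone difference features
azimuth_bin = net(features).argmax(dim=1).item()
print(f"estimated azimuth: {azimuth_bin * 5} degrees")
```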

