Machine Audition
Latest Publications

Total documents: 20 (five years: 0)
H-index: 4 (five years: 0)
Published by IGI Global
ISBN: 9781615209194, 9781615209200

2010, pp. 447-473
Author(s): Pedro Gómez-Vilda, José Manuel Ferrández-Vicente, Victoria Rodellar-Biarge, Rafael Martínez-Olalla, Víctor Nieto-Lluis, ...

Current efforts to improve well-established technologies that imitate human abilities, such as speech perception, seek inspiration in capabilities of the natural system that are not yet well understood. A typical case is speech recognition, where the semantic gap between spectral time-frequency representations and their symbolic translation into phonemes and words, and the construction of morpho-syntactic and semantic structures, involves many phenomena that remain poorly understood. The present chapter explores some of these issues at a simplified level from two points of view: top-down analysis based on speech perception, and the symmetric bottom-up synthesis based on the biological architecture of the auditory pathways. An application-driven design of a Neuromorphic Speech Processing Architecture is presented and its performance analyzed. Simulation details from a parallel implementation of the architecture on a supercomputer are also shown and discussed.
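As an illustration of the starting point for such top-down analysis, the sketch below computes the kind of spectral time-frequency representation (a power spectrogram) from which phoneme- and word-level symbols must ultimately be derived. It is a generic front end, not the chapter's neuromorphic architecture; the function name and parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft

def power_spectrogram(x, fs, win_len=512, hop=256):
    """Short-time Fourier transform front end: the spectral
    time-frequency representation that feeds later stages."""
    f, t, Z = stft(x, fs=fs, nperseg=win_len, noverlap=win_len - hop)
    return f, t, np.abs(Z) ** 2

# Example: a 440 Hz tone appears as a horizontal ridge near 440 Hz.
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
f, t, S = power_spectrogram(x, fs)
print(S.shape)  # (frequency bins, time frames)
```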


2010, pp. 297-316
Author(s): Ruohua Zhou, Josh D. Reiss

Music onset detection plays an essential role in music signal processing and has a wide range of applications. This chapter provides a step-by-step introduction to the design of music onset detection algorithms. The general scheme and the commonly used time-frequency analyses for onset detection are introduced. Many methods are reviewed, and some typical energy-based, phase-based, pitch-based and supervised learning methods are described in detail. Commonly used performance measures, onset annotation software, public databases and evaluation methods are also introduced. The performance difference between energy-based and pitch-based methods is discussed, and future research directions for music onset detection are described.
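As a concrete illustration of the energy-based family reviewed here, the following sketch computes a spectral-flux onset detection function with simple peak picking. It assumes a mono signal, and the window and threshold parameters are illustrative; it is not a specific algorithm from the chapter.

```python
import numpy as np
from scipy.signal import stft

def spectral_flux_onsets(x, fs, win=1024, hop=512, delta=0.1):
    """Energy-based onset detection via half-wave rectified spectral flux."""
    _, _, Z = stft(x, fs=fs, nperseg=win, noverlap=win - hop)
    mag = np.abs(Z)
    # Frame-to-frame increase in magnitude, summed over frequency.
    flux = np.sum(np.maximum(mag[:, 1:] - mag[:, :-1], 0.0), axis=0)
    flux /= flux.max() + 1e-12
    # Simple peak picking: local maxima above a fixed threshold.
    onsets = [n for n in range(1, len(flux) - 1)
              if flux[n] > flux[n - 1] and flux[n] >= flux[n + 1]
              and flux[n] > delta]
    return np.array(onsets) * hop / fs  # approximate onset times in seconds
```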


2010, pp. 266-296
Author(s): Cédric Févotte

Nonnegative matrix factorization (NMF) is a popular linear regression technique in the fields of machine learning and signal/image processing. Much research on this topic has been driven by applications in audio. NMF has, for example, been successfully applied to automatic music transcription and audio source separation, where the data are usually taken as the magnitude spectrogram of the sound signal, and the Euclidean distance or Kullback-Leibler divergence is used as the measure of fit between the original spectrogram and its approximate factorization. In this chapter the authors give evidence of the relevance of factorizing the power spectrogram instead, with the Itakura-Saito (IS) divergence. Indeed, IS-NMF is shown to be connected to maximum likelihood inference of variance parameters in a well-defined statistical model of superimposed Gaussian components, and this model is in turn shown to be well suited to audio. Furthermore, the statistical setting opens the door to Bayesian approaches and to a variety of computational inference techniques. They discuss in particular model order selection strategies and Markov regularization of the activation matrix, to account for time persistence in audio. The chapter also discusses extensions of NMF to the multichannel case, for both instantaneous and convolutive recordings, possibly underdetermined. The authors present in particular audio source separation results for a real stereo musical excerpt.
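A minimal sketch of IS-NMF with the standard multiplicative updates may help make the factorization concrete. It assumes V is a power spectrogram (frequency × time); the iteration count and initialization are illustrative, and the chapter's Bayesian extensions are not shown.

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """NMF of a power spectrogram V (freq x time) under the
    Itakura-Saito divergence, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        Vh = W @ H + eps
        W *= ((V / Vh**2) @ H.T) / ((1.0 / Vh) @ H.T + eps)
        Vh = W @ H + eps
        H *= (W.T @ (V / Vh**2)) / (W.T @ (1.0 / Vh) + eps)
    return W, H

# Usage: V = |STFT|**2 of the signal; W holds spectral templates,
# H their activations over time.
```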


2010, pp. 246-265
Author(s): Andrew Nesbit, Maria G. Jafari, Emmanuel Vincent, Mark D. Plumbley

The authors address the problem of audio source separation: the recovery of audio signals from recordings of mixtures of those signals. The sparse component analysis framework is a powerful method for achieving this. Sparse orthogonal transforms, in which only a few transform coefficients differ significantly from zero, are developed; once the signal has been transformed, energy is apportioned from each transform coefficient to each estimated source, and, finally, the signal is reconstructed using the inverse transform. The overriding aim of this chapter is to demonstrate how this framework, exemplified here by two different decomposition methods that adapt to the signal to represent it sparsely, can be used to solve different problems in different mixing scenarios. To address the instantaneous (neither delays nor echoes) and underdetermined (more sources than mixtures) mixing model, a lapped orthogonal transform is adapted to the signal by selecting a basis from a library of predetermined bases. This method is closely related to the windowing methods used in the MPEG audio coding framework. For the anechoic (delays but no echoes) and determined (equal numbers of sources and mixtures) mixing case, a greedy adaptive transform is used, based on orthogonal basis functions that are learned from the observed data instead of being selected from a predetermined library. This is found to encode the signal characteristics by introducing a feedback system between the bases and the observed data. Experiments on mixtures of speech and music signals demonstrate that these methods give good signal approximations and separation performance, and they indicate promising directions for future research.
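The following sketch illustrates the sparse-masking idea for the instantaneous, underdetermined case, with two simplifications hedged up front: it uses a fixed STFT rather than the signal-adaptive lapped orthogonal transforms developed in the chapter, and it assumes the mixing matrix is known. Each time-frequency coefficient is apportioned to the source whose mixing direction it best matches.

```python
import numpy as np
from scipy.signal import stft, istft

def sparse_mask_separate(X, A, fs, win=1024):
    """Separate a 2-channel instantaneous mixture X (2 x samples) with
    known mixing matrix A (2 x n_sources): each time-frequency
    coefficient is assigned entirely to the source whose mixing
    direction best matches the observed mixture direction."""
    _, _, Z = stft(X, fs=fs, nperseg=win)         # shape (2, F, T)
    dirs = A / np.linalg.norm(A, axis=0)           # unit mixing directions
    # Score each source direction against each mixture coefficient.
    score = np.abs(np.einsum('cn,cft->nft', dirs, Z))
    best = np.argmax(score, axis=0)                # winning source per bin
    out = []
    for n in range(A.shape[1]):
        mask = (best == n)
        # Project the mixture onto direction n where it wins.
        Sn = np.einsum('c,cft->ft', dirs[:, n], Z) * mask
        _, sn = istft(Sn, fs=fs, nperseg=win)
        out.append(sn)
    return np.array(out)
```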


2010, pp. 186-206
Author(s): Saeid Sanei, Bahador Makkiabadi

Tensor factorization (TF) is introduced as a powerful tool for solving multi-way problems. As an effective and major application of this technique, the separation of sound sources, particularly speech, from their corresponding convolutive mixtures is described and results are demonstrated. The method is flexible and can easily incorporate all relevant parameters or factors into the separation formulation. As a consequence, fewer assumptions (such as uncorrelatedness and independence) are required. The new formulation allows a further degree of freedom over the original parallel factor analysis (PARAFAC) problem, through which the scaling and permutation problems of frequency-domain blind source separation (BSS) can be resolved. Based on experiments using real data in a simulated medium, it is concluded that both objective and subjective results improve over conventional frequency-domain BSS methods when the proposed algorithm is used.
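For reference, a minimal PARAFAC (CP) decomposition by alternating least squares is sketched below for a generic 3-way tensor. The chapter's formulation adds structure to resolve the scaling and permutation ambiguities of frequency-domain BSS; this sketch shows only the basic factorization it builds on.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Khatri-Rao product of B (J x R) and C (K x R)."""
    J, R = B.shape
    K = C.shape[0]
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def parafac(T, R, n_iter=100, seed=0):
    """Rank-R PARAFAC (CP) decomposition of a 3-way tensor T
    (I x J x K) by alternating least squares, so that
    T[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r]."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    T1 = T.reshape(I, J * K)                     # mode-1 unfolding
    T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)  # mode-2 unfolding
    T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)  # mode-3 unfolding
    for _ in range(n_iter):
        A = T1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = T2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = T3 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C
```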


2010, pp. 126-161
Author(s): Banu Günel, Hüseyin Hacihabiboglu

Automatic sound source localization has recently gained interest due to its various applications, which range from surveillance to hearing aids, and from teleconferencing to human-computer interaction. Automatic sound source localization may refer to determining only the direction of a sound source, known as direction-of-arrival estimation, or also its distance, in order to obtain its coordinates. Various methods have previously been proposed for this purpose. Many of them use the time and level differences between the signals captured by each element of a microphone array. An overview of these conventional array processing methods is given, and the factors that affect their performance are discussed. The limitations of these methods for real-time implementation are highlighted. An emerging source localization method based on acoustic intensity is explained. A theoretical evaluation of different microphone array geometries is given. Two well-known problems, the localization of multiple sources and the localization of acoustic reflections, are also addressed.
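As a concrete example of the conventional time-difference methods discussed (not the acoustic-intensity method), the sketch below estimates the direction of arrival for a two-microphone array using GCC-PHAT; the far-field assumption and regularization constant are illustrative.

```python
import numpy as np

def gcc_phat_doa(x1, x2, fs, mic_dist, c=343.0):
    """Direction-of-arrival from a two-microphone time difference,
    estimated with the GCC-PHAT cross-correlation."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    # Phase transform: keep only phase information.
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_lag = max(1, int(fs * mic_dist / c))     # physically possible lags
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    tdoa = (np.argmax(np.abs(cc)) - max_lag) / fs
    # TDOA -> angle, assuming a far-field source in the array plane.
    return np.degrees(np.arcsin(np.clip(tdoa * c / mic_dist, -1.0, 1.0)))
```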


2010, pp. 162-185
Author(s): Emmanuel Vincent, Maria G. Jafari, Samer A. Abdallah, Mark D. Plumbley, Mike E. Davies

Most sound scenes result from the superposition of several sources, which can be separately perceived and analyzed by human listeners. Source separation aims to provide machine listeners with similar skills by extracting the sounds of individual sources from a given scene. Existing separation systems operate either by emulating the human auditory system or by inferring the parameters of probabilistic sound models. In this chapter, the authors focus on the latter approach and provide a joint overview of established and recent models, including independent component analysis, local time-frequency models and spectral template-based models. They show that most models are instances of one of two general paradigms: linear modeling or variance modeling. They compare the merits of each paradigm and report objective performance figures. They conclude by discussing promising combinations of probabilistic priors and inference algorithms that could form the basis of future state-of-the-art systems.
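As a small, concrete instance of the linear-modeling paradigm, the sketch below applies FastICA (via scikit-learn) to a synthetic determined instantaneous mixture; the sources, mixing matrix, and library choice are illustrative, not taken from the chapter.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic sources, instantaneously mixed (determined case).
t = np.arange(16000) / 16000.0
s = np.stack([np.sin(2 * np.pi * 300 * t),           # tone
              np.sign(np.sin(2 * np.pi * 5 * t))])   # square wave
A = np.array([[1.0, 0.6], [0.4, 1.0]])               # mixing matrix
x = A @ s                                            # observed mixtures

# Linear modeling: recover sources up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x.T).T
print(s_hat.shape)  # (2, 16000)
```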


2010, pp. 107-125
Author(s): Syed Mohsen Naqvi, Yonggang Zhang, Miao Yu, Jonathon A. Chambers

A novel multimodal solution is proposed to the problem of blind source separation (BSS) of moving sources. Since the mixing filters for moving sources are time-varying, the unmixing filters must also be time-varying, which makes them difficult to track in real time. In this solution the visual modality is used to facilitate the separation of moving sources. The movement of the sources is detected by a relatively simple 3-D tracker based on video cameras. The tracking process is based on particle filtering, which provides robust tracking performance. Positions and velocities of the sources are obtained from the 3-D tracker, and if the sources are moving, a beamforming algorithm is used to perform real-time speech enhancement and separate the sources. Experimental results show that, by utilizing the visual modality, good BSS performance for moving sources can be achieved in a low-reverberation environment.
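To make the audio side concrete, the sketch below implements a generic delay-and-sum beamformer steered to a source position such as the 3-D tracker would supply. It is a minimal stand-in, not the chapter's beamforming algorithm; the geometry and sampling parameters are assumptions.

```python
import numpy as np

def delay_and_sum(mics, positions, source_pos, fs, c=343.0):
    """Steer a microphone array to a (tracked) source position by
    delaying each channel so the source's wavefronts align, then summing.
    mics: (n_channels, n_samples); positions: (n_channels, 3) mic coords;
    source_pos: (3,) source coordinates in the same frame."""
    dists = np.linalg.norm(positions - source_pos, axis=1)
    delays = (dists - dists.min()) / c           # relative delays, seconds
    n = mics.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for ch, d in zip(mics, delays):
        # Advance each channel by its delay (fractional, via phase shift).
        spec = np.fft.rfft(ch) * np.exp(2j * np.pi * freqs * d)
        out += np.fft.irfft(spec, n)
    return out / len(mics)
```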


2010, pp. 398-423
Author(s): Sanaul Haq, Philip J.B. Jackson

Recent advances in human-computer interaction technology go beyond the successful transfer of data between human and machine by seeking to improve the naturalness and friendliness of user interactions. An important augmentation, and potential source of feedback, comes from recognizing the user's expressed emotion or affect. This chapter presents an overview of research efforts to classify emotion using different modalities: audio, visual, and audio-visual combined. Theories of emotion provide a framework for defining emotional categories or classes. The first step in the study of human affect recognition is therefore the construction of suitable databases. The authors describe fifteen audio, visual and audio-visual data sets, and the types of feature that researchers have used to represent their emotional content. They discuss data-driven methods of feature selection and reduction, which discard noise and irrelevant information to maximize the concentration of useful information. They focus on the popular types of classifier used to decide to which emotion class a given example belongs, and on methods of fusing information from multiple modalities. Finally, the authors point to some interesting areas for future investigation in this field, and conclude.
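As an illustration of one common audio pipeline of the kind surveyed, the sketch below extracts utterance-level MFCC statistics and trains an SVM classifier. The feature set, the classifier, and the `clips`/`labels` variables are hypothetical choices for illustration, not the chapter's specific method.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def utterance_features(y, sr):
    """One fixed-length feature vector per utterance: mean and standard
    deviation of MFCCs, a common audio representation for emotion."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical training data: `clips` is a list of (waveform, sr) pairs
# and `labels` gives the emotion class of each clip (e.g. "angry").
def train_emotion_classifier(clips, labels):
    X = np.array([utterance_features(y, sr) for y, sr in clips])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, labels)
    return clf
```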

