
About this book

This book provides the first comprehensive overview of the fascinating topic of audio source separation based on non-negative matrix factorization, deep neural networks, and sparse component analysis.

The first section of the book covers single channel source separation based on non-negative matrix factorization (NMF). After an introduction to the technique, two further chapters describe separation of known sources using non-negative spectrogram factorization, and temporal NMF models. In section two, NMF methods are extended to multi-channel source separation. Section three introduces deep neural network (DNN) techniques, with chapters on multichannel and single channel separation, and a further chapter on DNN based mask estimation for monaural speech separation. In section four, sparse component analysis (SCA) is discussed, with chapters on source separation using audio directional statistics modelling, multi-microphone MMSE-based techniques and diffusion map methods.

The book brings together leading researchers to provide tutorial-like and in-depth treatments on major audio source separation topics, with the objective of becoming the definitive source for a comprehensive, authoritative, and accessible treatment. This book is written for graduate students and researchers who are interested in audio source separation techniques based on NMF, DNN and SCA.

Table of Contents


Chapter 1. Single-Channel Audio Source Separation with NMF: Divergences, Constraints and Algorithms

Spectral decomposition by nonnegative matrix factorisation (NMF) has become state-of-the-art practice in many audio signal processing tasks, such as source separation, enhancement or transcription. This chapter reviews the fundamentals of NMF-based audio decomposition, in unsupervised and informed settings. We formulate NMF as an optimisation problem and discuss the choice of the measure of fit. We present the standard majorisation-minimisation strategy to address optimisation for NMF with the common β-divergence, a family of measures of fit that takes the quadratic cost, the generalised Kullback-Leibler divergence and the Itakura-Saito divergence as special cases. We discuss the reconstruction of time-domain components from the spectral factorisation and present common variants of NMF-based spectral decomposition: supervised and informed settings, regularised versions, temporal models.

Cédric Févotte, Emmanuel Vincent, Alexey Ozerov
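As a concrete illustration of the multiplicative majorisation-minimisation updates surveyed in this chapter, here is a minimal NumPy sketch of β-NMF; the function name, initialisation, and iteration count are our own choices, not the chapter's exact algorithm.

```python
import numpy as np

def beta_nmf(V, K, beta=1.0, n_iter=200, seed=0):
    """Multiplicative majorisation-minimisation updates for NMF under
    the beta-divergence (beta=2: quadratic cost, beta=1: generalised
    Kullback-Leibler, beta=0: Itakura-Saito)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V * WH ** (beta - 2))) / (W.T @ WH ** (beta - 1) + eps)
        WH = W @ H + eps
        W *= ((V * WH ** (beta - 2)) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
    return W, H

# Fit a small nonnegative matrix with the KL divergence (beta = 1)
V = np.random.default_rng(1).random((8, 20))
W, H = beta_nmf(V, K=3, beta=1.0)
```

Each update multiplies the current factor by a nonnegative ratio, so nonnegativity is preserved automatically, which is the main appeal of this family of algorithms.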

Chapter 2. Separation of Known Sources Using Non-negative Spectrogram Factorisation

This chapter presents non-negative spectrogram factorisation (NMF) techniques which can be used to separate sources in cases where source-specific training material is available in advance. We first present the basic NMF formulation for sound mixtures and then present criteria and algorithms for estimating the model parameters. We introduce selected methods for training the NMF source models by using either vector quantisation, convexity constraints, archetypal analysis, or discriminative methods. We also explain how the learned dictionaries can be adapted to deal with mismatches between the training data and the usage scenario. We also present how semi-supervised learning can be used to deal with unknown noise sources within a mixture. Finally, we introduce a coupled NMF method which can model a large temporal context while retaining low algorithmic latency.

Tuomas Virtanen, Tom Barker
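A minimal sketch of the supervised setting described above, assuming pre-trained per-source dictionaries that are kept fixed while only the activations are estimated (KL multiplicative updates); the function name and the Wiener-style reconstruction are illustrative choices, not the chapter's exact formulation.

```python
import numpy as np

def separate_with_dictionaries(V, dicts, n_iter=100):
    """Supervised NMF separation: the per-source dictionaries are fixed,
    only the activations are updated (KL divergence), and each source is
    reconstructed from the mixture with a Wiener-like soft mask."""
    eps = 1e-12
    W = np.hstack(dicts)                      # concatenated dictionaries
    H = np.full((W.shape[1], V.shape[1]), 0.5)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    WH = W @ H + eps
    sources, k0 = [], 0
    for Ws in dicts:                          # per-source soft mask
        k1 = k0 + Ws.shape[1]
        sources.append(V * ((Ws @ H[k0:k1]) / WH))
        k0 = k1
    return sources

# Two toy "sources" with known dictionaries, mixed additively
rng = np.random.default_rng(0)
W1, W2 = rng.random((8, 2)), rng.random((8, 2))
V = W1 @ rng.random((2, 30)) + W2 @ rng.random((2, 30))
s1, s2 = separate_with_dictionaries(V, [W1, W2])
```

Because the per-source masks sum to one at every time-frequency point, the separated spectrograms add back up to the mixture, a convenient conservation property of this reconstruction.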

Chapter 3. Dynamic Non-negative Models for Audio Source Separation

As seen so far, non-negative models can be quite powerful when it comes to resolving mixtures of sounds. However, in such models we often ignore temporal information, instead focusing on resolving each incoming spectrum independently. In this chapter we will present some methods that learn to incorporate the temporal aspects of sounds and use that information to perform improved separation. We will show three such models: a convolutive model that learns fixed temporal features, a hidden Markov model that learns state transitions and can incorporate language information, and finally a continuous dynamical model that learns how sounds evolve over time and is able to resolve cases where static information is not enough.

Paris Smaragdis, Gautham Mysore, Nasser Mohammadiha
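The convolutive model mentioned above can be illustrated by its synthesis equation: the spectrogram is modelled as a sum of time-shifted activations weighted by per-lag bases, so each basis spans a fixed temporal pattern rather than a single frame. A small sketch, with names of our own choosing:

```python
import numpy as np

def convolutive_model(Ws, H):
    """Convolutive NMF synthesis: V ~ sum_t W_t @ shift(H, t), where
    Ws[t] holds the bases for lag t and shift() delays the activations
    by t frames with zero padding."""
    F, K = Ws[0].shape
    N = H.shape[1]
    V = np.zeros((F, N))
    for t, Wt in enumerate(Ws):
        Ht = np.zeros_like(H)
        Ht[:, t:] = H[:, : N - t]  # shift activations right by t frames
        V += Wt @ Ht
    return V

# One basis whose energy moves from the low bin to the high bin over 2 frames
Ws = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
H = np.array([[1.0, 0.0, 0.0, 0.0]])   # the pattern is triggered at frame 0
V = convolutive_model(Ws, H)
```

A single activation at frame 0 thus produces a two-frame spectro-temporal pattern, which a frame-independent NMF basis could not represent.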

Chapter 4. An Introduction to Multichannel NMF for Audio Source Separation

This chapter introduces multichannel nonnegative matrix factorization (NMF) methods for audio source separation. All the methods and some of their extensions are introduced within a more general local Gaussian modeling (LGM) framework. These methods are very attractive since they allow combining spatial and spectral cues in a joint and principled way, and are also natural extensions and generalizations of many single-channel NMF-based methods to the multichannel case. The chapter introduces the spectral (NMF-based) and spatial models, as well as the way to combine them within the LGM framework. Model estimation criteria and algorithms are also described, with a more detailed treatment of some of them.

Alexey Ozerov, Cédric Févotte, Emmanuel Vincent

Chapter 5. General Formulation of Multichannel Extensions of NMF Variants

Blind source separation (BSS) is generally a mathematically ill-posed problem that involves separating out individual source signals from microphone array inputs. The frequency domain BSS approach is particularly notable in that it provides the flexibility needed to exploit various models for the time-frequency representations of source signals and/or array responses. Many frequency domain BSS approaches can be categorized according to the way in which the source power spectrograms and/or the mixing process are modeled. For source power spectrogram modeling, the non-negative matrix factorization (NMF) model and its variants have recently proved very powerful. For mixing process modeling, one reasonable way involves introducing a plane wave assumption so that the spatial covariances of each source can be described explicitly using the direction of arrival (DOA). This chapter provides a general formulation of the frequency domain BSS that makes it possible to incorporate the models for the source power spectrogram and the source spatial covariance matrix. Through this formulation, we reveal the relationship between the state-of-the-art BSS approaches. We further show that combining these models allows us to solve the problems of source separation, DOA estimation, dereverberation, and voice activity detection in a unified manner.

Hirokazu Kameoka, Hiroshi Sawada, Takuya Higuchi

Chapter 6. Determined Blind Source Separation with Independent Low-Rank Matrix Analysis

In this chapter, we address the determined blind source separation problem and introduce a new effective method of unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between source vectors. However, since the source model in IVA is based on a spherically symmetric multivariate distribution, IVA cannot utilize the characteristics of specific spectral structures such as various sounds appearing in music signals. To solve this problem, we introduce NMF as the source model in IVA to capture the spectral structures. Since this approach is a natural extension of the source model from a vector to a low-rank matrix represented by NMF, the new method is called independent low-rank matrix analysis (ILRMA). We also reveal the relationship between IVA, ILRMA, and multichannel NMF (MNMF), namely, IVA and ILRMA are identical to a special case of MNMF, which employs a rank-1 spatial model. Experimental results show the efficacy of ILRMA compared with IVA and MNMF in terms of separation accuracy and convergence speed.

Daichi Kitamura, Nobutaka Ono, Hiroshi Sawada, Hirokazu Kameoka, Hiroshi Saruwatari

Chapter 7. Deep Neural Network Based Multichannel Audio Source Separation

This chapter presents a multichannel audio source separation framework where deep neural networks (DNNs) are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. Different design choices and their impact on the performance are discussed. They include the cost functions for DNN training, the number of parameter updates, the use of multiple DNNs, and the use of weighted parameter updates. Finally, we present its application to a speech enhancement task and a music separation task. The experimental results show the benefit of the multichannel DNN-based approach over a single-channel DNN-based approach and the multichannel nonnegative matrix factorization based iterative EM framework.

Aditya Arie Nugraha, Antoine Liutkus, Emmanuel Vincent
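The multichannel Wiener filter derived in this framework can be sketched for a single time-frequency bin as follows; this is a toy illustration assuming the target and noise spatial covariance matrices are already known, whereas in the chapter they are estimated by the DNN/EM procedure.

```python
import numpy as np

def multichannel_wiener(x, R_s, R_n):
    """Multichannel Wiener filter for one time-frequency bin: estimate
    the target source image as R_s (R_s + R_n)^{-1} x, given the target
    (R_s) and noise (R_n) spatial covariance matrices."""
    W = R_s @ np.linalg.inv(R_s + R_n)
    return W @ x

# Toy 2-microphone example: rank-1 target covariance, weak white noise
d = np.array([1.0, 1j]) / np.sqrt(2)   # target steering vector
R_s = np.outer(d, d.conj())            # rank-1 target spatial covariance
R_n = 1e-4 * np.eye(2)                 # weak spatially white noise
x = 3.0 * d                            # a target-dominated observation
x_hat = multichannel_wiener(x, R_s, R_n)
```

When the noise is weak, the filter passes the target image almost unchanged; as the noise covariance grows, the estimate is shrunk toward zero, which is the usual MMSE trade-off.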

Chapter 8. Efficient Source Separation Using Bitwise Neural Networks

Efficiency is one of the key issues in single-channel source separation systems due to the fact that they are often employed for real-time processing. More computationally demanding approaches tend to produce better results, but are often not fast enough to be deployed in practical systems. For example, as opposed to the iterative separation algorithms using source-specific dictionaries, a Deep Neural Network (DNN) performs separation via an iteration-free feedforward process. However, even the feedforward process can be very complex depending on the size of the network. In this chapter, we introduce Bitwise Neural Networks (BNN) as an extremely compact form of neural networks, whose feedforward pass uses only efficient bitwise operations (e.g. XNOR instead of multiplication) on binary weight matrices and quantized input signals. As a result, we show that BNNs can perform denoising with a negligible loss of quality as compared to a corresponding network with the same structure, while reducing the network complexity significantly.

Minje Kim, Paris Smaragdis
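The bitwise feedforward pass described above can be sketched in a few lines; for ±1-valued inputs and weights, XNOR of the sign bits equals their product, so the multiply-accumulate collapses to an integer dot product that hardware can realise with XNOR and popcount. A minimal illustration (not the chapter's exact network):

```python
import numpy as np

def bitwise_layer(x_bits, W_bits):
    """One layer of a bitwise neural network: inputs and weights take
    values in {+1, -1}. Since XNOR(a, b) == a * b for sign bits, the
    accumulation below is exactly what XNOR + popcount computes, and
    sign() serves as the activation."""
    pre = W_bits @ x_bits
    return np.where(pre >= 0, 1, -1)

x = np.array([1, -1, 1])                     # binarised input
W = np.array([[1, 1, 1], [-1, -1, -1]])      # binary weight matrix
y = bitwise_layer(x, W)                      # gives [1, -1]
```

The arithmetic here uses full integers for clarity; a real BNN implementation would pack the ±1 values into machine words and replace the dot product with XNOR and popcount instructions.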

Chapter 9. DNN Based Mask Estimation for Supervised Speech Separation

This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation. Originating in computational auditory scene analysis (CASA), this approach treats speech separation as a mask estimation problem. Given a time-frequency (T-F) representation of noisy speech, the ideal binary mask (IBM) or ideal ratio mask (IRM) is defined to differentiate speech-dominant T-F units from noise-dominant ones. Mask estimation is then formulated as a problem of supervised learning, which learns a mapping function from acoustic features extracted from noisy speech to an ideal mask. Three main aspects of supervised learning are learning machines, training targets, and features, which are discussed in separate sections. Subsequently, we describe several representative supervised algorithms, mainly for monaural speech separation. For supervised separation, generalization to unseen conditions is a critical issue. The generalization capability of supervised speech separation is also discussed.

Jitong Chen, DeLiang Wang
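The IBM and IRM training targets mentioned above have simple closed forms; a hedged sketch follows, in which the 0 dB local criterion and the small epsilon floors are our own simplifying choices.

```python
import numpy as np

def ideal_masks(S, N, lc_db=0.0):
    """Ideal binary mask (IBM) and ideal ratio mask (IRM) computed from
    the magnitude spectrograms of clean speech S and noise N. The IBM
    keeps T-F units whose local SNR exceeds lc_db; the IRM is a soft
    ratio in [0, 1]."""
    eps = 1e-12
    snr_db = 20.0 * np.log10((S + eps) / (N + eps))
    ibm = (snr_db > lc_db).astype(float)
    irm = np.sqrt(S**2 / (S**2 + N**2 + eps))
    return ibm, irm

# One frequency bin, two frames: speech-dominant, then equal-energy
S = np.array([[2.0, 1.0]])
N = np.array([[1.0, 1.0]])
ibm, irm = ideal_masks(S, N)
```

A supervised learner is then trained to map noisy-speech features to one of these targets, after which the estimated mask is applied to the noisy T-F representation to resynthesise the speech.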

Chapter 10. Informed Spatial Filtering Based on Constrained Independent Component Analysis

In this work, we present a linearly constrained signal extraction algorithm based on a Minimum Mutual Information (MMI) criterion that allows us to exploit the three fundamental properties of speech and audio signals: nonstationarity, nonwhiteness, and nongaussianity. Hence, the proposed method is very well suited for processing nonstationary nongaussian broadband signals like speech. Furthermore, from the linearly constrained MMI approach, we derive an efficient realization in a generalized sidelobe canceler (GSC) structure. To estimate the relative transfer functions between the microphones, which are needed for the set of linear constraints, we use an informed time-domain independent component analysis algorithm that exploits coarse direction-of-arrival information on the target source. As a decisive advantage, this simplifies the otherwise challenging control mechanism for simultaneous adaptation of the GSC’s blocking matrix and interference-and-noise canceler coefficients. Finally, we establish relations between the proposed method and other well-known multichannel linear filtering approaches for signal extraction based on second-order statistics, and demonstrate the effectiveness of the proposed signal extraction method in a multispeaker scenario.

Hendrik Barfuss, Klaus Reindl, Walter Kellermann

Chapter 11. Recent Advances in Multichannel Source Separation and Denoising Based on Source Sparseness

This chapter deals with multichannel source separation and denoising based on sparseness of source signals in the time-frequency domain. In this approach, time-frequency masks are typically estimated based on clustering of source location features, such as time and level differences between microphones. In this chapter, we describe the approach and its recent advances. In particular, we introduce a recently proposed clustering method, observation vector clustering, which has attracted attention for its effectiveness. We introduce algorithms for observation vector clustering based on a complex Watson mixture model (cWMM), a complex Bingham mixture model (cBMM), and a complex Gaussian mixture model (cGMM). We show through experiments the effectiveness of observation vector clustering in source separation and denoising.

Nobutaka Ito, Shoko Araki, Tomohiro Nakatani
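As a rough stand-in for the model-based clustering methods listed above, the sparseness-based masking idea can be sketched with a simple k-means on interchannel phase differences; this is an illustrative simplification with our own function names, not the cWMM/cBMM/cGMM algorithms themselves.

```python
import numpy as np

def cluster_tf_masks(X1, X2, n_src=2, n_iter=50):
    """Sparseness-based masking sketch: cluster the interchannel phase
    difference (IPD) of each T-F bin with a tiny 1-D k-means and build
    one binary mask per cluster."""
    ipd = np.angle(X2 * X1.conj()).ravel()
    # initialise centres at evenly spaced quantiles of the IPD values
    centers = np.quantile(ipd, (np.arange(n_src) + 0.5) / n_src)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(ipd[:, None] - centers[None, :]), axis=1)
        for k in range(n_src):
            if np.any(labels == k):
                centers[k] = ipd[labels == k].mean()
    return [(labels == k).reshape(X1.shape) for k in range(n_src)]

# Two synthetic sources with distinct phase shifts between the microphones
F, T = 4, 5
phases = np.where(np.arange(F * T).reshape(F, T) % 2 == 0, 0.5, -0.5)
X1 = np.ones((F, T), dtype=complex)
X2 = np.exp(1j * phases)
m0, m1 = cluster_tf_masks(X1, X2)
```

Real mixtures require frequency-dependent handling of the phase wrap-around and probabilistic (soft) masks, which is precisely what the model-based clustering methods in the chapter provide.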

Chapter 12. Multimicrophone MMSE-Based Speech Source Separation

Beamforming methods using a microphone array successfully exploit spatial diversity for speech separation and noise reduction. Adaptive design of the beamformer based on various minimum mean squared error (MMSE) criteria significantly improves performance compared to fixed, data-independent designs. These criteria differ in how they trade off noise minimization against desired-speech distortion. Three common data-dependent beamformers, namely the matched filter (MF), the multichannel Wiener filter (MWF), and the linearly constrained minimum variance (LCMV) beamformer, are presented and analyzed. Estimation methods for implementing the various beamformers are surveyed. Simple examples of applying the various beamformers to simulated narrowband signals in an anechoic environment and to speech signals in a real-life reverberant environment are presented and discussed.

Shmulik Markovich-Golan, Israel Cohen, Sharon Gannot
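One of the simplest members of the LCMV family discussed above is the MVDR beamformer (a single-constraint LCMV), which can be sketched in closed form; the steering vector and noise covariance below are toy values, not estimates from the chapter's methods.

```python
import numpy as np

def mvdr_weights(Phi_n, d):
    """MVDR beamformer: minimise the output noise power subject to the
    distortionless constraint w^H d = 1 in the target direction, giving
    w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)."""
    Phi_inv_d = np.linalg.solve(Phi_n, d)
    return Phi_inv_d / (d.conj() @ Phi_inv_d)

Phi_n = np.eye(2)              # spatially white noise covariance
d = np.array([1.0, 1j])        # target steering vector (2 microphones)
w = mvdr_weights(Phi_n, d)
```

For spatially white noise the MVDR solution reduces to the matched filter, while a full LCMV design would add further linear constraints, e.g. nulls toward interfering speakers.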

Chapter 13. Musical-Noise-Free Blind Speech Extraction Based on Higher-Order Statistics Analysis

In this chapter, we introduce a musical-noise-free blind speech extraction method using a microphone array for application to nonstationary noise. In recent noise reduction studies, it was found that optimized iterative spectral subtraction (SS) yields speech enhancement with almost no musical noise generation, but this method is valid only for stationary noise. The method presented in this chapter consists of iterative blind dynamic noise estimation by, e.g., independent component analysis (ICA) or multichannel Wiener filtering, and musical-noise-free speech extraction by modified iterative SS, where iterative SS is applied to each channel while maintaining the multichannel properties required by the dynamic noise estimator. In relation to the method, we also discuss the justification of applying ICA to signals nonlinearly distorted by SS. Through objective and subjective evaluations simulating a real-world hands-free speech communication system, we show that the method outperforms conventional speech enhancement methods.

Hiroshi Saruwatari, Ryoichi Miyazaki
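The spectral subtraction building block referred to above can be sketched in one pass; the over-subtraction factor and flooring constant are our own illustrative values, and it is exactly the flooring of negative values that produces musical noise when SS is applied naively.

```python
import numpy as np

def spectral_subtraction(Y, N_hat, alpha=2.0, floor=0.01):
    """One pass of power spectral subtraction: subtract an over-estimated
    noise power (alpha > 1) from the noisy power spectrum and floor the
    result to avoid negative values. Isolated floored bins are what is
    perceived as musical noise."""
    P = np.maximum(Y**2 - alpha * N_hat**2, (floor * Y)**2)
    return np.sqrt(P)

# Two bins: one speech-dominant, one where subtraction would go negative
Y = np.array([2.0, 1.0])
N_hat = np.array([1.0, 1.0])
S_hat = spectral_subtraction(Y, N_hat)
```

The chapter's contribution is to iterate a modified form of this operation with blind, dynamically estimated noise so that the floored residuals no longer form audible musical noise.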

Chapter 14. Audio-Visual Source Separation with Alternating Diffusion Maps

In this chapter we consider the separation of multiple sound sources of different types, including multiple speakers and transients, which are measured by a single microphone and by a video camera. We address the problem of separating a particular sound source from all other sources, focusing specifically on obtaining an underlying representation of it while attenuating all other sources. By pointing the video camera only at the desired sound source, the problem becomes equivalent to extracting the source common to the audio and video modalities while ignoring the other sources. We use a kernel-based method, particularly designed for this task, that provides an underlying representation of the common source. We demonstrate the usefulness of the obtained representation for activity detection of the common source and discuss how it may be further used for source separation.

David Dov, Ronen Talmon, Israel Cohen
