Neurocomputing

Volume 72, Issues 1–3, December 2008, Pages 79-87

Efficient Bayesian inference for harmonic models via adaptive posterior factorization

https://doi.org/10.1016/j.neucom.2007.12.050

Abstract

Harmonic sinusoidal models are an essential tool for music audio signal analysis. Bayesian harmonic models are particularly interesting, since they allow the joint exploitation of various priors on the model parameters. However, existing inference methods often rely on specific prior distributions and remain computationally demanding for realistic data. In this article, we investigate a generic inference method based on approximate factorization of the joint posterior into a product of independent distributions over small subsets of parameters. We discuss the conditions under which this factorization holds and propose two criteria for choosing these subsets adaptively. We evaluate the resulting performance experimentally on the task of multiple pitch estimation using different levels of factorization.

Introduction

Music and speech involve different types of sounds, including periodic, transient and noisy sounds. Short-term stationary periodic sounds composed of sinusoidal partials at harmonic or near-harmonic frequencies are perceptually essential, since they contain most of the energy of musical notes and vowels. Harmonicity means that at each instant the frequencies of the partials are multiples of a single frequency called the fundamental frequency. Estimating the periodic sounds underlying a given signal, i.e. estimating their fundamental frequencies and the amplitudes and phases of their partials, is required or useful for many applications, such as speech prosody analysis [10], multiple pitch estimation and instrument recognition [11] and low bit-rate compression [20]. This problem is particularly difficult for polyphonic signals, i.e. signals containing several concurrent periodic sounds, since different periodic sounds may exhibit partials overlapping at the same frequencies.
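For concreteness, in illustrative notation (ours, not necessarily the paper's), a short-term stationary periodic sound with fundamental frequency f_0 and H partials can be written as:

```latex
% Harmonic sinusoidal model for a single periodic sound (illustrative notation):
% a_h and \varphi_h denote the amplitude and phase of partial h.
x(t) = \sum_{h=1}^{H} a_h \cos\!\left( 2\pi h f_0 t + \varphi_h \right)
```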

Existing methods for polyphonic fundamental frequency estimation are often based on one of two approaches [11]: either validation of fundamental frequency candidates given by the peaks of a short-term auto-correlation function [8], [22], [15], or inference of the hidden states of a probabilistic model of the signal's short-term power spectrum based on learned template spectra [14], [18], [6]. These approaches have achieved limited performance on complex polyphonic signals so far [11], [15]. Moreover, neither approach provides estimates of the amplitudes and phases of the partials, which are needed for musical instrument recognition and low bit-rate compression.

A promising way to address these issues is to rely on a probabilistic model of the signal waveform incorporating various kinds of prior knowledge. Two families of such models have been proposed in the literature for music signals. One family, introduced in [4], [3], models each musical note signal in state-space form by a discrete fundamental frequency and a fixed number of damped oscillators at harmonic frequencies with independent transition noises. Decoding is achieved either via linear Kalman filtering or via variational approximation [12], depending on whether the damping factors are fixed or subject to additional transition noises. These inference methods restrict the prior distribution of the transition noises to be Gaussian or to belong to a class of conjugate priors [2], respectively. Another family of models, described in [21], [9], [7], represents musical note signals by continuous fundamental frequency, amplitude and phase parameters, inferred using Markov chain Monte Carlo (MCMC) methods [2]. These methods are applicable to all prior distributions in theory, but tend to be computationally demanding in practice. Thus, the chosen priors are mostly motivated by computational considerations [7]. In particular, the amplitudes of the partials are modeled by independent uniform priors or by conjugate zero-mean Gaussian priors, so that analytical marginalization can be performed.

For both families of models, the above priors differ somewhat from the empirical parameter distributions. In particular, they do not penalize partials with zero amplitude. This typically leads to missing estimated notes for signals composed of several notes at integer fundamental frequency ratios [21], [7], or to erroneous fundamental frequency estimates equal to a multiple or a submultiple of the true fundamental frequencies [7]. To help overcome these limitations, we recently designed a probabilistic harmonic model involving priors motivated by empirical parameter distributions and proposed a variant of the diagonal Laplace method for fast inference [20], since analytical marginalization was no longer feasible with these priors.
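As background on the latter (standard material, in our own notation rather than the paper's), the Laplace method approximates an intractable marginalization integral by a Gaussian fit around the MAP point; the diagonal variant retains only the diagonal of the Hessian, so the determinant reduces to a simple product:

```latex
% Laplace approximation of a d-dimensional marginal (standard form):
% \theta^* is the MAP estimate, H the Hessian of -log p(x, \theta) at \theta^*.
\int p(\mathbf{x}, \boldsymbol{\theta}) \, d\boldsymbol{\theta}
  \approx p(\mathbf{x}, \boldsymbol{\theta}^*) \, (2\pi)^{d/2} \, |\mathbf{H}|^{-1/2},
\qquad
\mathbf{H} = -\nabla^2_{\boldsymbol{\theta}} \log p(\mathbf{x}, \boldsymbol{\theta})
  \big|_{\boldsymbol{\theta} = \boldsymbol{\theta}^*}
```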

In this article, we propose an alternative fast inference method for probabilistic harmonic models, based on approximate factorization of the joint posterior into a product of independent distributions over subsets of parameters. This method is designed for models of the form described in [21], [9], [7], [20], involving explicit frequency, amplitude and phase parameters. It is generic, in that it can be applied to a wide range of priors, and adaptive, since the level of factorization depends on the observed signal and the hypothesized notes. This is a crucial difference from variational approximation methods, where the terms of the factorization are fixed a priori and their parameters can only be computed for certain classes of priors. We complete our preliminary work [19] by discussing the extension of this method to non-Gaussian likelihoods and alternative model structures, investigating a new criterion for the choice of the parameter subsets, and providing a detailed experimental evaluation.
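Schematically (again in our own illustrative notation), writing θ for the full parameter vector under a hypothesized state S, the method seeks a data-dependent partition of θ into disjoint subsets θ_1, …, θ_K over which the posterior approximately factorizes:

```latex
% Approximate posterior factorization over disjoint parameter subsets (schematic):
P(\boldsymbol{\theta} \mid S, \mathbf{x})
  \approx \prod_{k=1}^{K} P(\boldsymbol{\theta}_k \mid S, \mathbf{x})
```

The marginalization needed for inference then reduces to a product of low-dimensional integrals, each small enough to be handled by numerical integration.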

The structure of the rest of the article is as follows. In Section 2, we present a possible Bayesian network structure for harmonic models and make some mild assumptions about the parameter priors. Then, we describe the proposed inference method in Section 3 and extend it to alternative model structures. In Section 4, we evaluate its performance for the task of multiple pitch estimation on short time frames. We conclude in Section 5 and suggest some perspectives for future research.

Section snippets

Assumptions about the model

The harmonic models in [21], [9], [7], [20] are variations of the same concept. They all represent the observed music signal as a sum of note signals, each composed of several sinusoidal partials parameterized by a sequence of random variables spanning successive time frames. However, the chosen variables and their conditional dependency structure are slightly different for each model. For the sake of clarity, we first discuss our approach for the model structure in [20], which involves fewer …
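As a rough illustration of this shared structure (a minimal sketch of ours, not the authors' code; all names are hypothetical), a polyphonic frame can be synthesized as a sum of note signals, each itself a sum of harmonic partials:

```python
import numpy as np

def note_signal(f0, amplitudes, phases, t):
    """One note: a sum of sinusoidal partials at multiples of f0."""
    h = np.arange(1, len(amplitudes) + 1)          # partial indices 1..H
    return np.sum(amplitudes[:, None] *
                  np.cos(2 * np.pi * np.outer(h, f0 * t) + phases[:, None]),
                  axis=0)

def polyphonic_frame(notes, sr=22050, duration=0.046):
    """Observed frame: a sum of note signals (plus residual noise in practice)."""
    t = np.arange(int(sr * duration)) / sr
    return sum(note_signal(f0, a, p, t) for f0, a, p in notes)

# Two concurrent notes (A3 and E4), four partials each, random phases.
rng = np.random.default_rng(0)
notes = [(220.0, np.array([1.0, 0.5, 0.3, 0.2]), rng.uniform(0, 2 * np.pi, 4)),
         (330.0, np.array([0.8, 0.4, 0.2, 0.1]), rng.uniform(0, 2 * np.pi, 4))]
x = polyphonic_frame(notes)
```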

Bayesian inference via adaptive posterior factorization

Harmonic models are typically employed to solve the multiple pitch estimation task, which consists of estimating the number of active notes and their pitches on each time frame. In the present framework, this task translates into finding the maximum a posteriori (MAP) activity state vector Ŝ = argmax_S P(S|x), which is achieved by trying a number of candidate vectors S, computing their posterior probabilities P(S|x) and selecting the largest. These probabilities are defined by P(S|x) = ∫ P(S, f, r, a, φ | x) df dr da dφ …
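To make the decision rule concrete, here is a minimal sketch (ours; `factorize`, the grid-based integrator, and all names are hypothetical stand-ins for the paper's adaptive procedure) of selecting the MAP state by numerically marginalizing each factored parameter subset:

```python
import numpy as np

def marginal_likelihood(log_joint, grids):
    """Numerically integrate exp(log_joint) over one small parameter subset,
    given a 1-D grid per parameter (iterated trapezoidal rule)."""
    shape = tuple(len(g) for g in grids)
    vals = np.empty(shape)
    for idx in np.ndindex(*shape):
        vals[idx] = log_joint([g[i] for g, i in zip(grids, idx)])
    m = vals.max()
    out = np.exp(vals - m)                     # stabilize before integrating
    for g in reversed(grids):
        out = np.trapz(out, g, axis=-1)        # integrate out the last axis
    return np.log(out) + m                     # back to the log domain

def log_posterior(S, x, factorize):
    """Approximate log P(S|x) up to a constant: a sum of independently
    marginalized subsets, as produced by the (hypothetical) factorization."""
    return sum(marginal_likelihood(lj, grids) for lj, grids in factorize(S, x))

def map_state(candidates, x, factorize):
    """Try each candidate activity state vector and keep the MAP one."""
    return max(candidates, key=lambda S: log_posterior(S, x, factorize))
```

The key point is that each call to `marginal_likelihood` integrates only a few parameters at a time, which is exactly what makes small subsets desirable.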

Evaluation

The precision of the integral estimates obtained by the proposed marginalization method cannot be assessed for realistic signals, since ground-truth integral values are not available. However, the aim of marginalization is often not to compute the exact values of the state posteriors P(S|x), but rather to provide accurate multiple pitch estimation, that is, to select the right MAP activity state vector Ŝ. Therefore, we evaluated the performance of the proposed method with respect to the latter …
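For concreteness, a typical frame-level scoring scheme for this task (our own illustration; the paper's exact metric may differ) counts an estimated pitch as correct when it falls within a small relative tolerance of an unmatched reference pitch:

```python
def frame_accuracy(estimated, reference, tol=0.03):
    """Greedy matching of estimated to reference fundamental frequencies.
    A pitch counts as correct if within a relative tolerance (3% is roughly
    half a semitone) of a still-unmatched reference pitch."""
    unmatched = list(reference)
    correct = 0
    for f in estimated:
        hit = next((r for r in unmatched if abs(f - r) <= tol * r), None)
        if hit is not None:
            unmatched.remove(hit)
            correct += 1
    return correct, len(estimated) - correct, len(unmatched)  # hits, false alarms, misses

hits, false_alarms, misses = frame_accuracy([221.0, 329.0, 440.0], [220.0, 330.0])
# -> 2 hits, 1 false alarm, 0 misses
```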

Conclusion

We proposed a fast inference method for Bayesian harmonic models based on approximate factorization of the joint posterior into a product of distributions over disjoint parameter subsets and numerical integration of these distributions. A local posterior dependence criterion was exploited to determine relevant subsets. Although factorization based on this criterion is theoretically feasible for any Bayesian model, it does not necessarily provide small subsets, which are needed for subsequent …

Acknowledgments

The authors would like to thank the anonymous reviewers for useful comments on the first version of this article. The first author also wishes to thank Cédric Févotte and Simon J. Godsill for motivating discussions about the influence of prior distributions on the performance of harmonic models and the practical use of MCMC methods.

References (22)

  • J.-F. Cardoso, Multidimensional independent component analysis.
  • G. Casella et al., Monte Carlo Statistical Methods (2005).
  • A.T. Cemgil et al., Efficient variational inference for the dynamic harmonic model.
  • A.T. Cemgil et al., A generative model for music transcription, IEEE Trans. Audio Speech Lang. Process. (2006).
  • D.M. Chickering et al., Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables.
  • A. Cont, Realtime multiple pitch observation using sparse non-negative constraints.
  • M. Davy et al., Bayesian analysis of western tonal music, J. Acoust. Soc. Am. (2006).
  • D.P.W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Thesis, Department of Electrical...
  • S.J. Godsill et al., Bayesian computational models for inharmonicity in musical instruments.
  • A. Klapuri et al., Signal Processing Methods for Music Transcription (2006).
Emmanuel Vincent received the Mathematics degree from the École Normale Supérieure, Paris, France, in 2001 and the Ph.D. degree in Acoustics, Signal Processing and Computer Science Applied to Music from the University of Paris-VI Pierre et Marie Curie, Paris, in 2004. From 2004 to 2006, he was a Research Assistant with the Centre for Digital Music at Queen Mary, University of London, London, UK. He is now a Permanent Researcher with the French National Institute for Research in Computer Science and Control (INRIA). His research focuses on structured probabilistic modeling of audio signals applied to blind source separation, indexing and object coding of musical audio.

Mark D. Plumbley began his research in the area of neural networks in 1987, as a Ph.D. Research Student at Cambridge University Engineering Department. Following his Ph.D., he joined King's College London in 1991, and in 2002 moved to Queen Mary University of London to help establish the new Centre for Digital Music. He is currently working on the analysis of musical audio, including automatic music transcription, beat tracking, audio source separation, independent component analysis and sparse coding. Dr. Plumbley currently coordinates two UK Research Networks: the Digital Music Research Network (www.dmrn.org) and the ICA Research Network (www.icarn.org).

1. This author was funded by EPSRC Grant GR/S75802/01.
