Joint DOA and multi-pitch estimation based on subspace techniques

Zhang, Johan Xi; Christensen, Mads Græsbøll; Jensen, Søren Holdt; Moonen, Marc

doi:10.1186/1687-6180-2012-1

Research
Open access
Published: 02 January 2012

Joint DOA and multi-pitch estimation based on subspace techniques

Johan Xi Zhang¹,
Mads Græsbøll Christensen²,
Søren Holdt Jensen¹ &
…
Marc Moonen³

EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 1 (2012) Cite this article

4081 Accesses
170 Citations
Metrics details

Abstract

In this article, we present a novel method for high-resolution joint direction-of-arrivals (DOA) and multi-pitch estimation based on subspaces decomposed from a spatio-temporal data model. The resulting estimator is termed multi-channel harmonic MUSIC (MC-HMUSIC). It is capable of resolving sources under adverse conditions, unlike traditional methods, for example when multiple sources are impinging on the array from approximately the same angle or similar pitches. The effectiveness of the method is demonstrated on a simulated an-echoic array recordings with source signals from real recorded speech and clarinet. Furthermore, statistical evaluation with synthetic signals shows the increased robustness in DOA and fundamental frequency estimation, as compared with to a state-of-the-art reference method.

1. Introduction

The problem of estimating the fundamental frequency, or pitch, of a period waveform has been of interest to the signal processing community for many years. Fundamental frequency estimators are important for many practical applications such as automatic note transcription in music, audio and speech coding, classification of music, and speech analysis. Numerous algorithms have been proposed for both the single- and multi-pitch scenarios [1–5]. The problem for single-pitch scenarios is considered as well-posed. However, in real-world signals, the multi-pitch scenario occurs quite frequently [2, 6]. The multi-pitch estimation algorithms are often based on, i.e., various modification of the auto-correlation function [1, 7], maximum likelihood, optimal filtering, and subspace techniques [2, 3, 8]. In real-life recordings, problems such as frequency overlap of sources, reverberation, and colored noise will strongly limit the performance of multi-pitch estimator and estimator designed for single channel recordings often use simplified signal models. One widely used signal simplification in multi-pitch estimators, for example, is the sparseness of the signal, where the frequency spectrum of sources are assumed to not overlap [2]. This assumption may be appropriate when sources consist of mixture of several speech signals having different pitches [9]. However, for audio signals it is less likely to be true. This is especially so in western music, where instruments are most often played in accord, something that causes the harmonics to overlap or even coincide. With only single-channel recording it is, therefore, hard, or perhaps even impossible, to estimate pitches with overlapping harmonics, unless additional information, such as a temporal or spectral model, is included.

Recently, multi-channel approaches have attracted considerable attention both in single- and multi-pitch scenarios. By exploring the spatial information of the sources, more robust pitch estimators have been proposed [10–14]. Most of those multi-channel methods are still mainly based on auto-correlation function-related approaches, however, although a few exceptions can be found in [15–18]. In direction-of-arrival (DOA) estimators, audio and speech signals are often modeled as broadband signal, and standard subspace methods such as MUSIC and ESPRIT are only defined for narrow-band signal model, which then fail to directly operate on broadband signals [19]. One often used concept is band-pass filtering of broadband signals into subbands, where narrow-band estimators can be applied to each subband [20]. In the narrow-band case, a delay in the signal is equivalent to a phase shifts according to the frequencies of complex exponentials. An alternative method is, however, as follows: since harmonic signals consist of sinusoidal components, we can model each source as multiple narrow-band signal with distinct frequencies arriving at the same DOA.

In this article, we propose a parametric method for solving the problem of joint fundamental frequency and DOA estimation based on subspace techniques where the quantities of interest are jointly estimated using a MUSIC-like approach. We term the proposed estimator Multi-channel multi-pitch Harmonic MUSIC (MC-HMUSIC). The spatio-temporal data model used in MC-HMUSIC is based on the JAFE data model [21, 22]. Originally, the JAFE data model was used for estimating joint unconstrained frequencies and DOAs estimates of complex exponential using ESPRIT, which is referred as joint angle-frequency estimation (JAFE) algorithm. Other-related work with joint frequency-DOA methods includes [23–25]. In this article, we have parametrized the harmonic structure of periodic signals in the signal model to model the fundamental frequency and the DOA of individual sources. An estimator is constructed for jointly estimating the parameters of interest. Incorporating the DOA parameter in finding the fundamental frequency may give better robustness against a signal with overlapping harmonics. Similarly, it can be expected that the DOA can be found more accurately when the nature of the signal of interest is taken into account.

The remainder of this article is comprised four sections: Section 2, in which we will introduce some notation, the spatio-temporal signal model, for which we also derive the associated Cramér-Rao lower bound, along with the JAFE data mode; Section 3, where we then present the proposed method; Section 4, in which we present the experimental results obtained using the proposed method; and, finally, Section 5, where we conclude on our work.

2. Fundamentals

2.1. Spatio-temporal signal model

Next, the signal model employed throughout the article will be presented. Without multi-path propagation of sources, it is given as follows: the signal x_i received by microphone element i arranged in a uniform linear array (ULA) configuration, i = 1,..., M, is given by

\begin{array}{l} x_{i} (n) & = \sum_{k = 1}^{K} \sum_{l = 1}^{L_{k}} β_{l, k} e^{j (ω_{k} l n + ϕ_{k} l (i - 1))} + e_{i} (n), \\ β_{l, k} & = A_{l, k} e^{j γ_{l, k}}, \end{array}

(1)

for sample index n = 0,..., N - 1, where subscript k denotes the k th source and l the l th harmonic. Moreover, A_l,k is the real-valued positive amplitude of the complex exponential, L_k is the number of harmonics, K is number of sources, γ_l,k is the phase of the individual harmonics, ϕ_k is the phase shift caused by the DOA, and e_i(n) is complex symmetric white Gaussian noise. The phase shift between array elements is given as $ϕ_{k} = ω_{k} f_{s} \frac{d}{c} sin (θ_{k})$ , where d is the spacing between the elements measured in wavelengths, c is the speed of propagation in unit [m/s], θ_k is the DOA defined for θ_k ∈ [-90°, 90°], f_s is the signal sampling frequency. The problem of interest is to estimate ω_k and θ_k. We in the following assume that the number of sources K is known and the number of harmonics L_k of individual sources is known or found in some other, possibly joint, way. We note that a number of ways of doing this has been proposed in the past [26–28, 2].

2.2. Cramér-Rao lower bound

We will now proceed to derive the exact Cramér-Rao lower bound (CRLB) for the problem of estimating the parameters of interest. First, we define the M × 1 deterministic signal model vector s(n, μ) with column element as

s_{i} (n, μ) = \sum_{k = 1}^{K} \sum_{l = 1}^{L_{k}} β_{l, k} e^{j (ω_{k} l n + ϕ_{k} l (i - 1))}, β_{l, k} = A_{l, k} e^{j γ_{l, k}},

(2)

where s(n, μ) = [s₁(n, μ) ... s_M(n, μ)]^T. Furthermore, the parameter vector μ is given by

\begin{matrix} μ = [ω_{1} \cdot \cdot \cdot ω_{K} θ_{1} \cdot \cdot \cdot θ_{K} \\ A_{1, 1} γ_{1, 1} \cdot \cdot \cdot A_{L_{K}, K} γ_{L_{K}, K}] . \end{matrix}

(3)

Recall that the observed signal vector with additive white noise is given by

x (n) = s (n, μ) + e (n) = [\begin{matrix} s_{1} (n, μ) \\ ⋮ \\ s_{M} (n, μ) \end{matrix}] + e (n),

(4)

with e(n) being the noise column vector. The CRLB is defined as the variance of an unbiased estimate of the p th element of μ, which is lower bounded as

var (μ_{p}) \geq {[C^{- 1}]}_{p p},

(5)

where C is the so-called Fisher information matrix given by

C = \frac{2}{σ^{2}} Re (\sum_{n = 0}^{N - 1} \frac{\partial s {(n, μ)}^{H}}{\partial μ} \frac{\partial s (n, μ)}{\partial μ^{T}}) .

(6)

The partial derivative matrix is denoted as

\frac{\partial s (n, μ)}{\partial μ} = [\frac{\partial s_{1} (n, μ)}{\partial μ} \cdot \cdot \cdot \frac{\partial s_{M} (n, μ)}{\partial μ}],

(7)

where vector $\frac{\partial s_{i} (n, μ)}{\partial μ}$ is the partial derivatives with respect to the entries in the vector μ. The expression for the columns in $\frac{\partial s (n, μ)}{\partial μ}$ is given as

\frac{\partial s_{i} (n, μ)}{\partial μ} = [\begin{gathered} \sum_{l = 1}^{L_{1}} j l (n + (i - 1) f_{s} \frac{d}{c} sin (θ_{1})) β_{l, 1} e^{j (ω_{1} l n + ϕ_{1} l (i - 1))} \\ ⋮ \\ \sum_{l = 1}^{L_{K}} j l (n + (i - 1) f_{s} \frac{d}{c} sin (θ_{K})) β_{l, K} e^{j (ω_{K} l n + ϕ_{K} l (i - 1))} \\ \sum_{l = 1}^{L_{1}} (j l (i - 1) ω_{1} f_{s} \frac{d}{c} cos (θ_{1})) β_{l, 1} e^{j (ω_{1} l n + ϕ_{1} l (i - 1))} \\ ⋮ \\ \sum_{l = 1}^{L_{K}} (j l (i - 1) ω_{K} f_{s} \frac{d}{c} cos (θ_{K})) β_{l, K} e^{j (ω_{K} l n + ϕ_{K} l (i - 1))} \\ e^{j γ_{1, 1}} e^{j (ω_{1} n + ϕ_{1} (i - 1))} \\ j A_{1, 1} e^{j γ_{1, 1}} e^{j (ω_{1} n + ϕ_{1} (i - 1))} \\ ⋮ \\ e^{j γ_{L_{K}, K}} e^{j (ω_{K} L_{K} n + ϕ_{K} L_{K} (i - 1))} \\ j A_{L_{K}, K} e^{j γ_{L_{K}, K}} e^{j (ω_{K} L_{K} n + ϕ_{K} L_{K} (i - 1))} \end{gathered}] .

(8)

2.3. The JAFE data model

Next, we will introduce the specifics of the JAFE data model [22, 29] that our method is based on. At a time instant n the received signal from the M array elements are x(n) = [x₁(n) x₂(n) ... x_M(n)]^T, which can be written as

x (n) = A Φ b + e (n),

(9)

where e(n) ∈ ℂ^M^×1 is the noise vector, and A = [A₁ ... A_K] is a Vandermonde matrix containing parameters ω_k and θ_k for sources k = 1, . . . , K, i.e.,

A_{k} = [a (θ_{k}, ω_{k} 1) \cdot \cdot \cdot a (θ_{k}, ω_{k} L_{k})],

(10)

with a(θ, ω) being the array steering vector given by

a (θ, ω) = {[1 \cdot \cdot \cdot e^{j ω f_{s} \frac{d}{c} (M - 1) sin (θ)}]}^{T} .

(11)

Here, (·)^T denotes the vector transpose. Unlike the steering vector defined in [22, 21], where only the DOA is parametrized, here, a general definition of the vector (11) is used, in which it depends on both θ and ω [29]. The frequency components are expressed in $Φ^{n} = diag ([Φ_{1}^{n} \dots Φ_{K}^{n}])$ where the matrix for each source is given by

Φ_{k} = diag ([e^{j ω_{k}} \cdot \cdot \cdot e^{j ω_{k} L_{k}}]) .

(12)

The complex amplitudes for involving components are represented by the following vector:

b = {[β_{1, 1} \cdot \cdot \cdot β_{L_{1}, 1} \cdot \cdot \cdot β_{1, K} \cdot \cdot \cdot β_{L_{K}, K}]}^{T} .

(13)

To capture the temporal behavior, N time-domain data samples of the array output x(n) are collected to form the M × N data matrix X, which is defined as

X = [x (n) \cdot \cdot \cdot x (N)] .

(14)

Due to the structure of the harmonic components, the data matrix is given by

X = A [b Φ b \cdot \cdot \cdot Φ^{N - 1} b] + E,

(15)

where E ∈ ℂ^M×N is a matrix containing N sample of the noise vector e(n).

In speech and audio signal processing, it is common to model each source as a set of multiple harmonics with model order L_k > 1. Due to the narrow-band approximation of the steering vector, the multiple complex components with distinct frequencies impinge on the array with identical DOA will result in a non-unique spatial frequencies which cause a harmonic structure in the spatial frequencies ϕ_kl ∀l as well. The multiple sources impinge on the array with different DOAs consisting of various frequency components may, for certain frequency combinations, give the same array steering vector, which cause the matrix A to be rank deficient. Normally, this ambiguous mapping of the steering vector is mitigated by band-pass filtering the signal into its subbands, where the DOA of the signal is uniquely modeled by the narrow-band steering vector [20, Chap. 9].

Here, the ambiguities and the rank-deficiency are avoided by introducing temporal smoothness in order to restore the rank of A. The temporally smoothed data matrix is obtained by stacking t times temporally shifted versions of the original data matrix [22, 21, 29], given as

X_{t} = [\begin{gathered} A [b Φ b \cdot \cdot \cdot Φ^{N - t} b] \\ A Φ [b Φ b \cdot \cdot \cdot Φ^{N - t} b] \\ ⋮ \\ A Φ^{t - 1} [b Φ b \cdot \cdot \cdot Φ^{N - t} b] \end{gathered}] + E_{t},

(16)

where X_t ∈ ℂ^tM^×^N^-^t⁺¹ is the temporally smoothed data matrix, and E_t is the noise term constructed from E in a similar way as X_t. In using the signal model where the amplitudes are assumed stationary for n = 0, . . . , N - 1, X_t can be factorized as

X_{t} = [\begin{matrix} A \\ A Φ \\ ⋮ \\ A Φ^{t - 1} \end{matrix}] [b Φ b \cdot \cdot \cdot Φ^{N - t} b] + E_{t} .

(17)

With some additional definitions, we can also write this expression more compactly as

X_{t} = {\bar{A}}_{t} B_{t} + E_{t},

(18)

where Ā_t = [A AΦ ... AΦ^t^-1]^T and B_t = [b Φb ... Φ^N-t b]. The temporally smoothed data matrix X_t can maximally resole up to $t M \geq \sum_{k = 1}^{K} L_{k}$ complex exponentials, where Ā_t is linearly independent for any distinct θ and ω [30].

When multiple sources with distinct DOA with the same fundamental frequency impinge on the array, it will result in correlation between the underlying signals, which will make it harder to separate the corresponding components into its eigenvectors [22, 31]. To mitigate this problem, spatial smoothing is introduced, which works as follows. An array of M sensors is subdivided into S subarrays. In this article, the subarrays are spatially shifted with one element in each subarrays, the number of elements in each subarray being M_S = M - S + 1. For s = 1, . . . , S, let $J_{s} \in ℂ^{t M_{s} \times t M}$ be the selection matrix corresponding to the s th subarray for the data matrix X_t. Then, the spatio-temporally smoothed data matrix $X_{t, S} \in ℂ^{t M_{s} \times S (N - t + 1)}$ is given by

X_{t, s} = [J_{1} X_{t} \cdot \cdot \cdot J_{S} X_{t}] .

(19)

Furthermore, X_t,s can be factorized as

X_{t, s} = [J_{1} {\bar{A}}_{t} \cdot \cdot \cdot J_{S} {\bar{A}}_{t}] [\begin{matrix} B_{t} \\ ⋱ \\ B_{t} \end{matrix}] + E_{t, s},

(20)

where E_t,s is the noise term constructed from E in a similar way as X_t,s. Using the shift invariance structure in A_m, the term J_sA_m for s = 1, . . . , S is given by

J_{s} {\bar{A}}_{t} = J_{1} {\bar{A}}_{t} Θ^{s - 1},

(21)

where

Θ = diag \{[e^{j ϕ_{1} 1} \dots e^{j ϕ_{1} L_{1}} \cdot \cdot \cdot e^{j ϕ_{K} 1} \dots e^{j ϕ_{K} L_{K}}]\},

(22)

which is simply the phase difference between array elements. With (21), the matrix X_t,s can be written in a compact form as

X_{t, s} = J_{1} {\bar{A}}_{t} [B_{t} Θ B_{t} \cdot \cdot \cdot Θ^{S - 1} B_{t}] + E_{t, s},

(23)

with selection matrix expressed as

J_{1} = I_{t} \otimes [I_{M_{s}} 0],

(24)

where I_t ∈ ℝ^t×t and $I_{M_{s}} \in ℝ^{M_{s} \times M_{s}}$ are the identity matrices, ⊗ is the Kroneker product as defined in [22].

It is interesting to note that the noise term E_t,s is no longer white due to the spatio-temporal smoothing procedure, as correlation between the different rows of (23) is obtained. A pre-whitening step can be implemented in (23) to mitigate this. We note, however, that according to results reported in [22], pre-whitening step is only interesting for signals with low SNR where minor estimation improvement can be achieved. In this article, the main interest is to propose a multi-channel joint DOA and multi-pitch estimator, for which reason the whitening process is left without further description, but we refer the interested reader to [22]. We also note that aside from spatial smoothing, forward-backward averaging could also be implemented to reduce the influence of the correlated sources [22, 31, 19].

3. The proposed method

3.1. Coarse estimates

From the final spatio-temporally smoothed data matrix, a basis for the signal and noise subspaces can be obtained as follows. The singular value decomposition (SVD) of the data matrix (23) is given by

X_{t, s} = U Σ V^{H},

(25)

where the columns of U are the singular vectors, i.e.,

U = [u_{1} \cdot \cdot \cdot u_{t M_{S}}] .

(26)

A basis of the orthogonal complement of the signal subspace, also called the noise subspace, is formed from singular vector associated with the mM_S - Q least significant singular values, i.e.,

G = [u_{Q + 1} \cdot \cdot \cdot u_{m M_{S}}],

(27)

with $Q = \sum_{k = 1}^{K} L_{k}$ being the total number of complex exponentials in the signal. Similarly, the signal subspace is spanned by the Q largest singular values, i.e.,

S = [u_{1} \cdot \cdot \cdot u_{Q}] .

(28)

The defined signal subspace and noise subspace have similar property as traditional subspaces where estimators such as joint DOA and frequency, or fundamental frequency estimators can be constructed using the principle used in MUSIC [19, 32, 27, 26, 4]. According to the signal noise subspace orthogonality principle, the following relationship holds:

J_{1} {\bar{A}}_{t} G = 0,

(29)

where we, for notational simplicity, have introduced J₁Ā_t = A_ts. The matrix A_ts is comprised Vandermonde matrices for sources k = 1, . . . , K. The matrix for each individual source is given by

A_{t s, k} = [\begin{matrix} 1 & \dots & 1 \\ e^{j ϕ_{k}} & e^{j ϕ_{k} L_{k}} \\ ⋮ & ⋮ \\ e^{j ϕ_{k} S} & \dots & e^{j ϕ_{k} L_{k} S} \\ ⋮ & ⋮ \\ e^{j ω_{k} (t - 1)} & \dots & e^{j ω_{k} L_{k} (t - 1)} \\ e^{j ϕ_{k}} e^{j ω_{k} (t - 1)} & e^{j ϕ_{k} L_{k}} e^{j ω_{k} L_{k} (t - 1)} \\ ⋮ & ⋮ \\ e^{j ϕ_{k} S} e^{j ω_{k} (t - 1)} & \dots & e^{j ϕ_{k} L_{k} S} e^{j ω_{k} L_{k} (t - 1)} \end{matrix}] .

(30)

The cost function of the proposed joint DOA and multi-pitch estimator is then

J (ω_{k}, θ_{k}) = {∥A_{t s, k}^{H} G∥}_{F}^{2},

(31)

where ||·||_F is the Frobenius Norm. Note that this measure is closely related to the angles between the subspaces as explained in [33] and can hence be used as a measure of the extent to which (29) holds for a candidate fundamental frequency and DOA. The pair of fundamental frequency and DOA can, therefore, be found as the combination that is the closest to being orthogonal to G, i.e.,

{ω_{k}, θ_{k}}_{k = 1}^{K} = arg min_{{θ_{k}}_{1}^{K}, {ω_{k}}_{1}^{K}} {∥A_{t s, k}^{H} G∥}_{F}^{2} .

(32)

The multi-channel estimators will have a cost function which is more well-behaved compared to those of single channel multi-pitch estimators (see, e.g., [26, 32, 28] for some examples of such).

3.2. Refined estimates

For many applications, only a coarse estimate of involved fundamental frequencies and DOAs are needed, in which case the cost function in (32) is evaluated on pre-defined search region with some specified granularity. If, however, very accurate estimates are desired, a refined estimate can be found as described next. For a rough estimate of the parameter of interests, refined estimates are obtained by minimizing the cost function in (32) using a cyclic minimization approach. The gradient of the cost function (32) for fundamental frequency and DOA are given as

\frac{\partial}{\partial ω_{k}} J (ω_{k}, θ_{k}) = 2 Re (Tr \{A_{t s, k}^{H} G G^{H} \frac{\partial}{\partial ω_{k}} A_{t s, k}\}),

(33)

\frac{\partial}{\partial θ_{k}} J (ω_{k}, θ_{k}) = 2 Re (Tr \{A_{t s, k}^{H} G G^{H} \frac{\partial}{\partial θ_{k}} A_{t s, k}\}),

(34)

with Re (·) denoting the real value. The gradient can be used for finding refined estimate using standard methods.

Here, we iteratively find a refined estimate using a cyclic approach. During an iteration, ω_k is first estimated with

{\hat{ω}}_{k}^{i + 1} = {\hat{ω}}_{k}^{i} - δ \frac{\partial}{\partial ω_{k}} J ({\hat{ω}}_{k}^{i}, {\hat{θ}}_{k}^{i}),

(35)

where i is the iteration index and δ is a small positive constant that is found using line search. The estimated ${\hat{ω}}_{k}^{i + 1}$ is then used to initialize the minimization function for DOA, which is then found as

{\hat{θ}}_{k}^{i + 1} = {\hat{θ}}_{k}^{i} - δ \frac{\partial}{\partial θ_{k}} J ({\hat{ω}}_{k}^{i + 1}, {\hat{θ}}_{k}^{i}) .

(36)

The method is initialized for i = 0 using the coarse estimates obtained from (32).

4. Experimental results

4.1. Signal examples

We start the experimental part of this article by illustrating the application of the proposed method to analyzing a mixed signal consisting of speech and clarinet signals, sampled at f_s = 8000 Hz. The single-channel signals are converted into a multi-channel signal by introducing different delays according to two pre-determined DOA to simulate a microphone array with M = 8 channels. The simulated DOAs of the speech and the clarinet signals are, respectively, θ₁ = -45° and θ₂ = 45°. The spectrogram of the mixed signal of the first channel is illustrated in Figure 1. To avoid spatial ambiguities, the distance between two sensor is half the wavelength of the highest frequency in the observed signal, here d = 0.0425 m. The mixed signal is segmented into 50% overlapped signal segments with N = 128. The user parameter selected in this experiment is $t = ⌊\frac{2 N}{3}⌋$ and $s = ⌊\frac{M}{2}⌋$ . The cost function is evaluated with a Vandermonde matrix with L = 5 complex exponentials, and the noise subspace is formed from an overestimated signal subspace with assumption of signal subspace containing N/2 = 64 complex exponentials. The signal subspace overestimation technique is usually used when the true order of the signal subspace is unknown, the signal subspace is assumed to be larger than the true one which can minimize the signal subspace components in the noise subspace. An added benefit of posing the problem as a joint estimation problem is that the multi-pitch estimation problem can be seen as several single-pitch problems for a distinct set of DOAs, one per source. Therefore, it is less important to select an exact signal model order than single-channel multi-pitch estimators would need [28]. The cost function is evaluated for frequencies from 100 to 500 with granularity of 0.52 Hz. The evaluated results are illustrated in Figure 2 where the upper panel contains the fundamental frequency estimates and lower panel the DOA estimates. It can be seen that the proposed algorithm can track the fundamental frequency and the DOA of the speech signal well, with only a few observed errors on regions with low signal energy. The clarinet signal's DOA and fundamental frequencies have also been estimated well for all segments.

For the purpose of further comparison, the same signal will be analyzed using a standard time delay-and-sum beamformer [34] for DOA estimates and a single-channel maximum-likelihood based pitch estimator applied on the beamformed output signals [2]. The results are shown in Figure 3. The figure clearly shows that the delay-sum beamformer cannot satisfactory resolve the DOAs with M = 8 array elements which will further affect the performance of the single-channel pitch estimator, as shown in the upper panel. In this example, the proposed algorithm shown in Figure 2 is superior compared to reference method shown in Figure 3. The low resolution performance of the reference method will make the statistical evaluation of this method uninteresting, and we, therefore, will not be using it any further in the experiments to follow.

4.2. Statistical evaluation

Next, we use Monte Carlo simulations evaluated on synthetic signals embedded in noise in assessing the statistical properties of the proposed method and compare it with the exact CRLB. As a reference method for pitch and DOA estimation, we use the JAFE algorithm proposed in [22] for jointly estimating unconstrained frequencies and DOAs. Next, the unconstrained frequencies are grouped according to their corresponding DOAs where closely related directions are grouped together. A fundamental frequency is formed from these grouped frequencies in a weighted way as proposed in [35]. We refer this as the WLS estimator. In order to remove the errors due to the erroneous estimate of amplitudes, we assume WLS having the exact signal amplitude given. The WLS estimator is a computationally efficient pitch estimation method with good statistical properties. The reference DOA estimate is easily obtained in a similar way from the mean value of these grouped DOAs according to [22].

Here, we consider a M = 8 element ULA with sensor distance d = 0.0425 with a sampling frequency of f_s = 8000. The estimators are evaluated for two signal setups, first with two sources having ω₁ = 252.123 and ω₂ = 300.321 with L_1,2 = 3, and second with one harmonic source of ω₁ = 252.123 and L₁ = 3. All amplitudes on individual harmonics are set to unity A_k,l = 1 for tractability. Both sources are assumed to be far-field sources impinging on the array with DOAs at θ₁ = -43.23° and θ₂ = 70°, respectively, and for one source having a DOA of θ₁ = -43.23°. All simulation results are based on 100 Monte Carlo runs. The performance is measured using the root mean squared estimation error (RMSE) as defined in [28, 32, 26, 27]. The user parameter for JAFE data model is selected to the optimal values as proposed in [22] with temporal and spatial smoothness parameters, $t = ⌊\frac{2 N}{3}⌋$ and $s = ⌊\frac{M}{2}⌋$ , respectively. We note that in practical applications, the computational complexity has to also be considered in selecting the appropriate parameters t and s. An example of the 2-dimensional (2D) cost function of our proposed method evaluated on two mixed signal is illustrated in Figure 4, where a coarser estimate of the DOA and fundamental estimates can be identified from the two peaks in the 2D cost function.

In the first simulation, we evaluate the proposed method's statistical properties in a single source scenario for varying sample lengths and SNRs. The RMSEs on signal with varying N are shown in Figure 5, and with varying SNR in Figure 6. It can be seen from these figures that both estimators perform well for all SNR above 0 dB with WLS being slightly better for fundamental frequency estimation while the proposed estimator is better in DOA estimation. Both methods are also able to follow CRLB closely for around sample length N > 60. The better DOA estimation capabilities of the proposed method can be explained by the joint estimation of the fundamental frequency and DOA, which leads to increased robustness under adverse conditions. Both estimators can be considered as consistent in the single-pitch scenario.

Next, we evaluate our method for the multi-pitch scenario. The so-obtained RMSEs for varying N and SNR are depicted in Figures 7 and 8. In Figure 7, it clearly shows that the proposed method is better than the WLS estimator for short sample lengths. The WLS estimator is not following CRLB until N > 80 samples while the proposed estimator is for N > 64. The remaining gap between CRLB and both evaluated estimators for N > 80 are due to the mutual interference between the harmonic sources. The slowly converging performance of WLS is mainly due to the bad estimate of the unconstrained frequency estimate using the JAFE method. With our selected simulation setup, the JAFE estimator is not giving consistent estimates for all harmonic components, which, in turn, results in poor performance in the WLS estimates. In general, the WLS estimator is sensitive to spurious estimate of the unconstrained frequencies. Moreover, the proposed estimator, which is jointly estimating both the DOA and the fundamental frequency, yields better estimates for smaller sample length N. The results in terms of RMSEs for varying SNRs are shown in Figure 8. This figure shows that the proposed estimator is again more robust than the WLS estimator for both DOA and fundamental frequency estimation.

In next two experiments, we will study the performance as a function of the difference in fundamental frequencies and DOAs for multiple sources. We start with studying the RMSE as a function of the difference between the fundamental frequencies of two harmonic sources, i.e., Δω = |ω₁ - ω₂|, with θ₁ = -43.321° and θ₂ = 70°. Here, we use an SNR set to 40 dB, and a sample length N = 64 with M = 8 array elements. The obtained RMSEs are shown in Figure 9. The figure clearly shows that both methods can successfully estimate the fundamental frequencies and DOAs. Once again the proposed estimator gives more robust estimates, close to the CRLB. Additionally, it should be noted that both methods are correctly estimating the DOA even when the both fundamental frequencies are identical ω₁ = ω₂, something that would not be possible with only a single channel. MC-HMUSIC has the ability to estimate the fundamental frequencies when both harmonics are identical provided that the DOAs are distinct and vice versa. Estimation of the parameters of signals with overlapping harmonics is a crucial limitation in multi-pitch estimation using only single-channel recordings. In the final experiment, the RMSE as a function of the difference between the DOAs of two harmonic sources Δθ = |θ₁ - θ₂| is analyzed for an SNR set to 40 dB and a sample length of N = 64 with M = 8 array elements. The fundamental frequencies are ω₁ = 252.123 and ω₂ = 300.321, respectively. The observations and conclusions are basically the same as before, with the proposed method outperforming the reference method so far.

5. Conclusion

In this article, we have generalized the single-channel multi-pitch problem into a multi-channel multi-pitch estimation problem. To solve this new problem, we propose an estimator for joint estimation of fundamental frequencies and DOAs of multiple sources. The proposed estimator is based on subspace analysis using a time-space data model. The method is shown to have potential in applications to real signals with simulated anechoic array recording, and a statistical evaluation demonstrates its robustness in DOA and fundamental frequency estimation as compared to a state-of-the-art reference method. Furthermore, the proposed method is shown to have good statistical performance under adverse conditions, for example for sources with similar DOA or fundamental frequency.

References

Klapuri A: Automatic music transcription as we know it today. J New Music Res 2004, 33: 269-282.
Article Google Scholar
Christensen MG, Jakobsson A: Multi-Pitch Estimation. Synthesis Lectures on Speech and Audio Processing 2009.
Google Scholar
Rabiner L: On the use of autocorrelation analysis for pitch detection. IEEE Trans Signal Process 1996, 44: 2229-2244.
Article Google Scholar
Zhang JX, Christensen MG, Jensen SH, Moonen M: A robust and computationally efficient subspace-based fundamental frequency estimator. IEEE Trans Acoust Speech Language Process 2010, 18(3):487-497.
Article Google Scholar
de Cheveigne A, Kawahara H: YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Am 2002, 111(4):1917-1930.
Article Google Scholar
Wang DL, Brown GJ: Computational Auditory Scene Analysis: Principle, Algorithm, and Applications. Wiley, IEEE Press, New York; 2006.
Chapter Google Scholar
Klapuri A: Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Trans Speech Audio Process 2003, 11: 804-816.
Article Google Scholar
Emiya V, Bertrand D, Badeau R: A parametric method for pitch estimation of piano tones. IEEE International Conference on Acoustics, Speech, and Signal Processing 2007, 1: 249-252.
Google Scholar
Rickard S, Yilmaz O: Blind separation of speech mixtures via time-frequency masking. IEEE Trans Signal Process 2004, 52: 1830-1847.
Article MathSciNet Google Scholar
Wohmayr M, Kepsi M: Joint position-pitch extraction from multichannel audio. Proceedings of the Interspeech 2007.
Google Scholar
Qian X, Kumaresan R: Joint estimation of time delay and pitch of voiced speech signals. Record of the Asilomar Conference on Signals, Systems, and Computers 1996., 2:
Google Scholar
Wrigley SN, Brown GJ: Recurrent timing neural networks for joint F0-localisation based speech separation. IEEE International Conference on Acoustics, Speech and Signal Processing 2007.
Google Scholar
Flego F, Omologo M: Robust F0 estimation based on a multi-microphone periodicity function for distant-talking speech. EUSIPCO 2006.
Google Scholar
Armani L, Omologo M: Weighted auto-correlation-based F0 estimation for distant-talking interaction with a distributed microphone network. IEEE International Conference on Acoustics, Speech and Signal Processing 2004, 1: 113-116.
Google Scholar
Chazan D, Stettiner Y, Malah D: Optimal multi-pitch estimation using the em algorithm for co-channel speech separation. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 1993.
Google Scholar
Liao G, So HC, Ching PC: Joint time delay and frequency estimation of multiple sinusoids. IEEE International Conference on Acoustics, Speech and Signal Processing 2001, 5: 3121-3124.
Google Scholar
Wu Y, So HC, Tan Y: Joint time-delay and frequency estimation using parallel factor analysis. Elsevier Signal Process 2009, 89: 1667-1670.
Article Google Scholar
Ngan LY, Wu Y, So HC, Ching PC, Lee SW: Joint time delay and pitch estimation for speaker localization. Proceedings of the IEEE International Symposium on Circuits and Systems 2003, 722-725.
Google Scholar
Stoica P, Moses R: Spectral Analysis of Signals. Prentice-Hall, Upper Saddle River; 2005.
Google Scholar
Brandstein M, Ward D: Microphone Arrays. Springer, Berlin; 2001.
Book Google Scholar
van der Veen AJ, Vanderveen M, Paulraj A: Joint angle and delay estimation using shift invariance techniques. IEEE Trans Signal Process 1998, 46: 405-418.
Article Google Scholar
Lemma AN, van der Veen AJ, Deprettere EF: Analysis of joint angle-frequency estimation using ESPRIT. IEEE Trans Signal Process 2003, 51: 1264-1283.
Article MathSciNet Google Scholar
Viberg M, Stoica P: A computationally efficient method for joint direction finding and frequency estimation in colored noise. Record of the Asilomar Conference on Signals, Systems, and Computers 1998, 2: 1547-1551.
Google Scholar
Lin JD, Fang WH, Wang YY, Chen JT: FSF MUSIC for joint DOA and frequency estimation and its performance analysis. IEEE Trans Signal Process 2006, 54: 4529-4542.
Article Google Scholar
Wang S, Caffery J, Zhou X: Analysis of a joint space-time doa/foa estimator using MUSIC. IEEE International Symposium on Personal, Indoor and Mobile Radio Communications 2001, B138-B142.
Google Scholar
Christensen MG, Stoica P, Jakobsson A, Jensen SH: Multi-pitch estimation. Elsevier Signal Process 2008, 88(4):972-983.
Article Google Scholar
Christensen MG, Jakobsson A, Jensen SH: Joint high-resolution fundamental frequency and order estimation. IEEE Trans. Acoust Speech Signal Process 2007, 15(5):1635-1644.
Google Scholar
Zhang JX, Christensen MG, Jensen SH, Moonen M: An iterative subspace-based multi-pitch estimation algorithm. Elsevier Signal Process 2011, 91: 150-154.
Article Google Scholar
Lemma AN: ESPRIT based joint angle-frequency estimation algorithms and simulations. PhD Thesis Delft University 1999.
Google Scholar
Shu T, Liu XZ: Robust and computationally efficient signal-dependent method for joint DOA and frequency estimation. EURASIP J Adv Signal Process 2008., 2008: Article ID 10.1155/2008/134853
Google Scholar
Krim H, Viberg M: Two decades of array processing research-the parametric approach. IEEE SP Mag 1996.
Google Scholar
Christensen MG, Jakobsson A, Jensen SH: Multi-pitch estimation using Harmonic MUSIC. Record of the Asilomar Conference on Signals, Systems, and Computers 2006, 521-525.
Google Scholar
Christensen MG, Jakobsson A, Jensen SH: Sinusoidal order estimation using angles between subspaces. EURASIP J Adv Signal Process 2009, 1-11. Article ID 948756
Google Scholar
Veen BDV, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag 1988, 5: 4-24.
Article Google Scholar
Li H, Stoica P, Li J: Computationally efficient parameter estimation for harmonic sinusoidal signals. Elsevier Signal Process 2000, 1937-1944.
Google Scholar

Download references

Acknowledgements

The study of Zhang was supported by the Marie Curie EST-SIGNAL Fellowship, Contract No. MEST-CT-2005-021175.

Author information

Authors and Affiliations

Department of Electronic Systems (ES-MISP), Aalborg University, Aalborg, Denmark
Johan Xi Zhang & Søren Holdt Jensen
Department of Architecture, Design and Media Technology, Aalborg University, Denmark
Mads Græsbøll Christensen
Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Leuven, Belgium
Marc Moonen

Authors

Johan Xi Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Mads Græsbøll Christensen
View author publications
You can also search for this author in PubMed Google Scholar
Søren Holdt Jensen
View author publications
You can also search for this author in PubMed Google Scholar
Marc Moonen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Johan Xi Zhang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Authors’ original file for figure 12

Authors’ original file for figure 13

Authors’ original file for figure 14

Authors’ original file for figure 15

Authors’ original file for figure 16

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Zhang, J.X., Christensen, M.G., Jensen, S.H. et al. Joint DOA and multi-pitch estimation based on subspace techniques. EURASIP J. Adv. Signal Process. 2012, 1 (2012). https://doi.org/10.1186/1687-6180-2012-1

Download citation

Received: 26 March 2011
Accepted: 02 January 2012
Published: 02 January 2012
DOI: https://doi.org/10.1186/1687-6180-2012-1

Joint DOA and multi-pitch estimation based on subspace techniques

Abstract

1. Introduction

2. Fundamentals

2.1. Spatio-temporal signal model

2.2. Cramér-Rao lower bound

2.3. The JAFE data model

3. The proposed method

3.1. Coarse estimates

3.2. Refined estimates

4. Experimental results

4.1. Signal examples

4.2. Statistical evaluation

5. Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords