
Open Access 14.06.2022

A Music Cognition–Guided Framework for Multi-pitch Estimation

Authors: Xiaoquan Li, Yijun Yan, John Soraghan, Zheng Wang, Jinchang Ren

Published in: Cognitive Computation | Issue 1/2023


Abstract

As one of the most important subtasks of automatic music transcription (AMT), multi-pitch estimation (MPE) has been studied extensively over the past decade for predicting the fundamental frequencies in the frames of audio recordings. However, how to use music perception and cognition for MPE has not yet been thoroughly investigated. Motivated by this, this paper demonstrates how to effectively detect the fundamental frequency and the harmonic structure of polyphonic music using a cognitive framework. Inspired by cognitive neuroscience, an integration of the constant Q transform and a state-of-the-art matrix factorization method, shift-invariant probabilistic latent component analysis (SI-PLCA), is proposed to resolve the polyphonic short-time magnitude log-spectra for multiple pitch estimation and source-specific feature extraction. The cognition of rhythm, harmonic periodicity and instrument timbre is used to guide the characterisation of contiguous notes and the analysis of the relationship between the fundamental frequency and the harmonic frequencies, from which the pitches are detected in the outcomes of SI-PLCA. In the experiments, we compare the performance of the proposed MPE system with a number of existing state-of-the-art approaches (seven shallow learning methods and four deep learning methods) on three widely used datasets (i.e. MAPS, BACH10 and TRIOS) in terms of the F-measure (\({F}_{1}\)). The experimental results show that the proposed MPE method provides the best overall performance against the other existing methods.

Introduction

Estimation and tracking of multiple fundamental frequencies is one of the major tasks in automatic music transcription (AMT) of polyphonic music [1] and in music information retrieval (MIR) [2], and it is included as a subtask in the Music Information Retrieval Evaluation eXchange (MIREX). Multiple fundamental frequency estimation (MFE), also known as multi-pitch estimation (MPE), is challenging because it must process simultaneous notes from multiple instruments in polyphonic music [3, 4]. Compared with single-pitch estimation, this higher complexity often forces a trade-off between the robustness and the efficiency of the algorithms.
According to Benetos et al. [5], MPE approaches can be categorised into three types, i.e. feature-based, spectrogram-factorization-based and statistical model–based methods. In feature-based methods, signal processing techniques such as the pitch salience function [6] and the pitch candidate set score function [7] are used. In spectrogram-factorization methods, both nonnegative matrix factorisation (NMF) and probabilistic latent component analysis (PLCA) have received much attention in recent years [6], and numerous improved versions of both [8, 9] have been published and are recognised as leading spectrogram-factorization-based methods in the MPE domain. Statistical model–based methods employ maximum a posteriori (MAP) estimation [3], maximum likelihood (ML) or Bayesian theory [10] to detect the fundamental frequencies. It is worth noting that these three types of MPE approaches can also be combined [6] for a variety of applications.
In recent years, many deep learning (DL)–based supervised MPE approaches have also been developed. Cheuk et al. [11] presented a DL model for AMT by combining U-Net and bidirectional long short-term memory (BiLSTM) modules. Mukherjee et al. [12] used statistical characteristics and an extreme learning machine for musical instrument segregation, and LSTM and recurrent neural network (RNN) modules [13] were combined to differentiate musical chords for AMT. Fan et al. [14] proposed a deep neural network to extract the singing voice, followed by a dynamic unbroken pitch determination algorithm to track pitches. Sigtia et al. [15] developed a supervised approach for polyphonic piano music transcription that combines an RNN with a probabilistic graphical model. Although DL approaches may provide adequate music transcriptions, they often require high-performance computers and powerful graphics processing units (GPUs) to speed up the lengthy training process [16]. Furthermore, DL algorithms may suffer from inaccurately labelled data, and their performance can be sensitive to the training samples and the learning procedures used. To this end, this paper focuses mainly on a cognitive method, in which prior cognitive theories and assumptions from previous studies [17–19] are used to guide the fundamental pitch detection in polyphonic music pieces.
To distinguish pitches using harmonic analysis, two types of statistical models are often used: expectation–maximization (EM)-based algorithms [20] and Bayesian algorithms [21]. For EM-based methods, Emiya et al. [22] proposed a maximum likelihood–based method for multi-pitch estimation, and Duan and Temperley [23] proposed a three-stage music transcription system that applies maximum likelihood for the final note tracking. For Bayesian methods, Alvarado Duran [24] combined Gaussian processes and Bayesian models for multi-pitch estimation, and Nishikimi et al. [25] integrated a hidden Markov model with Bayesian inference to precisely detect the vocal pitch. These statistical models can also be considered shallow learning methods: data are first observed to obtain prior knowledge, the experiments are then conducted, and after the information of new samples is repeatedly added to the prior distribution, the posterior inference is delivered along with the final results. Although shallow learning approaches have been widely investigated [26], they still have much room for improvement.
Apart from the aforementioned issues, most MPE methods are designed from the viewpoint of signal processing rather than music cognition, resulting in a lack of sufficient underpinning theory and inefficient modelling. To tackle this issue, we propose a general framework in which music cognition guides the entire MPE process. In the pre-processing, inspired by the cognitive neuroscience of music [19], the constant Q transform (CQT) [27] is employed to transform the audio signal into a time–frequency spectrogram. The pianoroll transcription is then generated using a conventional matrix factorization approach, shift-invariant probabilistic latent component analysis (SI-PLCA) [9]. In the harmonic structure detection (HSD) process, the cognition of harmonic periodicity and instrument timbre [18] guides the extraction of multiple pitches. The efficacy of the proposed methodology has been fully validated by experiments on three publicly available datasets.
The major contributions of this paper are highlighted as follows. First, a new HSD model that incorporates music cognition is proposed for multiple fundamental frequency extraction. Second, a new note tracking method guided by music connectivity and a multi-pitch probability model is proposed. Third, by combining conventional pianoroll transcription approaches with the proposed HSD model, a new music cognition–guided optimization framework is introduced for MPE. Experimental results on three datasets demonstrate the merits of our approach when benchmarked against 11 state-of-the-art methods.
The rest of the paper is structured as follows: “Cognition-guided multiple pitch estimation” describes pre-processing for MPE including time–frequency representation, matrix factorization and the implementation of the proposed harmonic structure detection method. “Experimental results” presents the experimental results and performance analysis. Finally, a thorough conclusion is drawn in “Conclusion”.

Cognition-Guided Multiple Pitch Estimation

System Overview

The objective of this work is to detect multiple pitches from music pieces of mixed instruments. The proposed MPE system contains three key modules, i.e. pre-processing, harmonic structure detection and note tracking. Pre-processing covers a standard procedure in which the input music signal goes through time–frequency (TF) representation and matrix factorization for feature extraction. The overall diagram of the MPE framework is illustrated in Fig. 1, and the implementation details are presented as follows.

Pre-processing

According to the cognitive neuroscience of music [19, 28], before the auditory cortex is selectively stimulated, different frequencies within the music are first filtered by the human cochlea. As human auditory perception of frequency is logarithmically distributed [27], discrimination is greater at relatively lower frequencies. The constant Q transform (CQT) [29], which can be computed efficiently via the FFT, applies a logarithmic frequency compression similar to that performed by the helical structure of the human cochlea [29]. The CQT is therefore employed as the TF representation module to derive the TF spectrogram, as it offers finer resolution at lower frequencies. Fewer frequency bins are required to cover a given range, which is useful because musical pitches are discretely distributed over several octaves. Meanwhile, increasing the frequency resolution reduces the temporal resolution, a trade-off that suits auditory applications. A spectral resolution of 60 bins per octave is used as suggested by Brown [27]. Figure 2a shows the resulting TF spectrogram; by contrast, analysing the frequency with a plain fast Fourier transform (FFT) would yield a linearly spaced frequency axis.
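As an illustrative sketch (not the authors' implementation), a CQT spectrogram with 60 bins per octave can be computed with the librosa library; the file name, frequency range and hop length below are assumptions for demonstration only.

```python
import numpy as np
import librosa

# Hypothetical input file; 60 bins per octave as suggested by Brown [27].
y, sr = librosa.load("music_piece.wav", sr=44100)
C = librosa.cqt(
    y, sr=sr,
    hop_length=512,                    # ~11.6 ms hop (the paper uses 10 ms frames)
    fmin=librosa.note_to_hz("A0"),     # lowest piano note, ~27.5 Hz
    n_bins=60 * 7,                     # seven octaves at 60 bins per octave
    bins_per_octave=60,
)
V = np.abs(C)                          # log-frequency magnitude spectrogram V[z, t]
```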
In the matrix factorization module, the CQT spectrogram is used as the input and approximately modelled as a bivariate probability distribution \({\varvec{P}}\left(p,t\right)\). The output of this module is a 2-dimensional non-binary pianoroll transcription (a pitch vs. time matrix, shown in Fig. 2b). In this paper, the fast shift-invariant probabilistic latent component analysis (SI-PLCA) [30] is used for automatic transcription of polyphonic music, as it is particularly suited to log-frequency spectrograms, where all periodic sounds share the same inter-harmonic spacing [31]. Given an input signal \({{\varvec{X}}}_{t}\), the output of the CQT is a log-frequency spectrogram \({{\varvec{V}}}_{z,t}\) that can be considered a joint time–frequency distribution \({\varvec{P}}\left(z,t\right)\), where \(z\) and \(t\) denote frequency and time, respectively. After applying SI-PLCA, \({\varvec{P}}\left(z,t\right)\) can be further decomposed into several components by [30]:
$${{\varvec{V}}}_{z,t}={\varvec{P}}\left(z,t\right)={\varvec{P}}\left(t\right)\sum_{p,f,s}{\varvec{P}}\left(z-f|s,p\right){{\varvec{P}}}_{t}\left(f|p\right){{\varvec{P}}}_{t}\left(s|p\right){{\varvec{P}}}_{t}\left(p\right)$$
(1)
where \(p,f,s\) are latent variables which denote respectively the pitch index, pitch-shifting parameter and instrument source. In Eq. (1), \({\varvec{P}}\left(t\right)\) is the energy distribution of the spectrogram, which is known from the input signal. \({\varvec{P}}\left(z-f|s,p\right)\) denotes the spectral templates for a given pitch p and instrument source s with f pitch shifting across the log-frequency. \({{\varvec{P}}}_{t}\left(f|p\right)\) is the log-frequency shift for each pitch on the time frame t, \({{\varvec{P}}}_{t}\left(s|p\right)\) represents instrumentation contribution for the pitch in the time frame t, and \({{\varvec{P}}}_{t}\left(p\right)\) is the pitch contribution which can be considered as transcription matrix on the time frame t. Since there are latent variables in this model, the expectation maximization (EM) algorithm [20] is often used to iteratively estimate the corresponding unknown variables.
In the Expectation step, Bayes' theorem is adopted to estimate the contribution of the latent variables p, f, s for reconstruction of the model:
$${{\varvec{P}}}_{t}\left(p,f,s|z\right)=\frac{{\varvec{P}}\left(z-f|s,p\right){{\varvec{P}}}_{t}\left(f|p\right){{\varvec{P}}}_{t}\left(s|p\right){{\varvec{P}}}_{t}\left(p\right)}{\sum_{p,f,s}{\varvec{P}}\left(z-f|s,p\right){{\varvec{P}}}_{t}\left(f|p\right){{\varvec{P}}}_{t}\left(s|p\right){{\varvec{P}}}_{t}\left(p\right)}$$
(2)
In the Maximization step, the posterior of Eq. (2) is used to maximise the log-likelihood function in Eq. (3), which leads to the update of Eqs. (4)–(7). As suggested in [30], this step can converge after 15–20 iterations. The final result of the pianoroll transcription is derived by \({\varvec{P}}\left(p,t\right)={\varvec{P}}\left(t\right){{\varvec{P}}}_{t}\left(p\right)\):
$$\mathcal{L}=\sum_{z,t}{{\varvec{V}}}_{z,t}\ \mathrm{log}\ ({\varvec{P}}\left(z,t\right))$$
(3)
$${\varvec{P}}\left(z|s,p\right)=\frac{\sum_{f,t}{{\varvec{P}}}_{t}\left(p,f,s|z+f\right){\varvec{P}}\left(z+f,t\right)}{\sum_{z,f,t}{{\varvec{P}}}_{t}\left(p,f,s|z+f\right){\varvec{P}}\left(z+f,t\right)}$$
(4)
$${{\varvec{P}}}_{t}\left(f|p\right)=\frac{\sum_{z,s}{{\varvec{P}}}_{t}\left(p,f,s|z\right){\varvec{P}}\left(z,t\right)}{\sum_{f,z,s}{{\varvec{P}}}_{t}\left(p,f,s|z\right){\varvec{P}}\left(z,t\right)}$$
(5)
$${{\varvec{P}}}_{t}\left(s|p\right)=\frac{\sum_{z,f}{{\varvec{P}}}_{t}\left(p,f,s|z\right){\varvec{P}}\left(z,t\right)}{\sum_{s,z,f}{{\varvec{P}}}_{t}\left(p,f,s|z\right){\varvec{P}}\left(z,t\right)}$$
(6)
$${{\varvec{P}}}_{t}\left(p\right)=\frac{\sum_{z,f,s}{{\varvec{P}}}_{t}\left(p,f,s|z\right){\varvec{P}}\left(z,t\right)}{\sum_{p,z,f,s}{{\varvec{P}}}_{t}\left(p,f,s|z\right){\varvec{P}}\left(z,t\right)}$$
(7)
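To make the E- and M-steps concrete, the following is a minimal sketch of the EM iteration for a plain (non-shift-invariant) PLCA decomposition \({{\varvec{V}}}_{z,t}\approx {\varvec{P}}\left(t\right)\sum_{p}{\varvec{P}}\left(z|p\right){{\varvec{P}}}_{t}\left(p\right)\); the full SI-PLCA of Eqs. (1)–(7) additionally estimates the shift variable f and the source variable s. Function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def plca_em(V, n_pitches=88, n_iter=20, seed=0):
    """Simplified PLCA via EM: V[z, t] ~ P(t) * sum_p P(z|p) * P_t(p)."""
    rng = np.random.default_rng(seed)
    Z, T = V.shape
    Pt = V.sum(axis=0)                                   # frame energy P(t)

    Pz_p = rng.random((Z, n_pitches))                    # spectral templates P(z|p)
    Pz_p /= Pz_p.sum(axis=0, keepdims=True)
    Pp_t = np.full((n_pitches, T), 1.0 / n_pitches)      # activations P_t(p)

    for _ in range(n_iter):
        # E-step: posterior P_t(p|z) by Bayes' theorem (cf. Eq. (2))
        joint = Pz_p[:, :, None] * Pp_t[None, :, :]      # shape (Z, P, T)
        post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        # M-step: re-estimate templates and activations (cf. Eqs. (4)-(7))
        weighted = post * V[:, None, :]
        Pz_p = weighted.sum(axis=2)
        Pz_p /= np.maximum(Pz_p.sum(axis=0, keepdims=True), 1e-12)
        Pp_t = weighted.sum(axis=0)
        Pp_t /= np.maximum(Pp_t.sum(axis=0, keepdims=True), 1e-12)

    return Pt * Pp_t                                     # pianoroll P(p, t) = P(t) * P_t(p)
```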

Harmonic Structure Detection

This section describes the core of the proposed MPE system, in which two pieces of music knowledge, namely the statistics of beat lengths and the assumption of equal energy between mixed monophonic and polyphonic music pieces, are used to guide the model in extracting the multiple fundamental frequencies from a mixture of music sources.
For a given piece of music, the time-domain representation is illustrated in the input module in Fig. 1, and the results of CQT and SI-PLCA are given in Fig. 2a and b, respectively. In Fig. 2b, the fundamental pitches and their harmonics are highlighted by the shaded black and grey strips. However, there is considerable noise and redundant information, represented by small grey dots, which may be mistaken for pitches at lower frequencies. Furthermore, the white gaps in the black and grey strips indicate frequency information that has been lost in the analysis. This suggests that the consistency of the fundamental pitch is insufficient when considered frame by frame (each frame was set to 10 ms). To address these inconsistencies, the HSD method is proposed, followed by a note tracking process (Fig. 1).
The proposed HSD includes two main stages. In the first stage, the pianoroll transcription \({\varvec{P}}\left(p,t\right)\) is normalised into \(\left[\mathrm{0,1}\right]\) by using the following max-mean sigmoid activation function [32]:
$${\varvec{P}}{\varvec{N}}=\frac{1}{1+{e}^{-{\varvec{z}}}}$$
(8)
$${\varvec{z}}=\frac{{\varvec{P}}\left(p,t\right)-\mathrm{mean}\ \left({\varvec{P}}\left(p,t\right)\right)}{\mathrm{max}\ \left({\varvec{P}}\left(p,t\right)\right)-\mathrm{min}\ \left({\varvec{P}}\left(p,t\right)\right)}$$
(9)
where \({\varvec{P}}{\varvec{N}}\) represents the normalised \({\varvec{P}}\left(p,t\right)\). The mean-based normalisation in Eqs. (8) and (9) smooths the spectrogram and moderates values that are much larger or smaller than expected. For any \({\varvec{P}}{\varvec{N}}\), the value of \({{\varvec{P}}{\varvec{N}}}_{t}\) at time t is then smoothed with a three-frame mean filter, as expressed in Eq. (10).
$${{\varvec{P}}{\varvec{N}}}_{t}=({{\varvec{P}}{\varvec{N}}}_{t-1}+{{\varvec{P}}{\varvec{N}}}_{t}+{{\varvec{P}}{\varvec{N}}}_{t+1})/3$$
(10)
$$\begin{array}{cc}{{\varvec{P}}{\varvec{F}}}_{t}={{\varvec{P}}{\varvec{N}}}_{t}\times s;& s=\left\{\begin{array}{lr}1,& if\ {\varvec{P}}{\varvec{N}}>{TH}_{1}\\ 0,& \mathrm{otherwise}\end{array}\right.\end{array}$$
(11)
Inspired by the music theory observation that most high-order harmonic components lie in the high-frequency range with low amplitude [17], a two-step hard constraint is used to remove most of the high-frequency components, noise and redundancy. In the first step, a fixed threshold \({TH}_{1}\) is applied in Eq. (11) to remove small values; based on the characteristics of the sigmoid function (Eq. (8)), \({TH}_{1}\) is set to 0.5. The filtered result \({\varvec{P}}{\varvec{F}}\) over all frames is then obtained, as shown in Fig. 3a.
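A minimal sketch of Eqs. (8)–(11) is given below, assuming the pianoroll transcription is stored as a pitch-by-time NumPy array; the function name is illustrative.

```python
import numpy as np

def normalise_and_filter(P, th1=0.5):
    """Max-mean sigmoid normalisation (Eqs. (8)-(9)), three-frame mean
    smoothing (Eq. (10)) and hard thresholding with TH1 = 0.5 (Eq. (11))."""
    z = (P - P.mean()) / (P.max() - P.min() + 1e-12)      # Eq. (9)
    PN = 1.0 / (1.0 + np.exp(-z))                         # Eq. (8)
    PN_s = PN.copy()                                      # Eq. (10): smooth along time
    PN_s[:, 1:-1] = (PN[:, :-2] + PN[:, 1:-1] + PN[:, 2:]) / 3.0
    return np.where(PN_s > th1, PN_s, 0.0)                # Eq. (11): keep values above TH1
```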
In the second step, the statistics of the beat length are used to guide the removal of noise and redundant information. According to the cognition of music perception, most notes in musical rhythms are crotchets and quavers, with far fewer semiquavers and demisemiquavers [33]. The rate of occurrence of different note lengths in the BACH10 database was measured against the ground truth and plotted in Fig. 4, with the labelled fractions (i.e. \(\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{16},\frac{1}{32}\)) denoting minim, crotchet, quaver, semiquaver and demisemiquaver, respectively. Figure 4 shows that crotchets and quavers occur much more often than demisemiquavers, semiquavers and minims; in particular, the numbers of demisemiquavers and semiquavers are extremely low. Furthermore, if the length of a semibreve is defined as \(\tau\), the length of a demisemiquaver is \(\tau /32\). Any notes shorter than a demisemiquaver are therefore removed from \({\varvec{P}}{\varvec{F}}\) before any further processing in the second stage.
In Fig. 4, a peak appears at the initial (shortest) note lengths, which may be due to two reasons. Firstly, manually played music may contain timing errors: holding every note for its precise duration throughout a piece is practically impossible. Secondly, ornaments such as vibrato and glissando may be performed even though they are not present in the music score, and the length of such vibrato and glissando is equal to a demisemiquaver or shorter [34]. To extract the main body of the multiple pitches, factors such as human playing habits and ornaments are ignored in the proposed work. The results in “Experimental results” demonstrate that the multiple pitches are highlighted whilst most of the unwanted noise is removed.
After filtering the amplitudes from SI-PLCA, the second stage of the HSD framework detects the fundamental pitches. The flowchart in Fig. 5 outlines the HSD process, and Table 1 lists the description of each parameter. As described in Fig. 5, the output from the previous steps is analysed in two domains, i.e. the pitch domain \({\varvec{P}}{\varvec{D}}\) and the energy domain \({\varvec{E}}{\varvec{D}}\). In this context, each frame of \({\varvec{P}}{\varvec{F}}\) is split into two vectors, \({\varvec{P}}{\varvec{D}}(n)\) and \({\varvec{E}}{\varvec{D}}(n)\): \({\varvec{P}}{\varvec{D}}(n)\epsilon {\mathbb{R}}^{N*1}\) holds the indices of the non-zero notes in each frame, \({\varvec{E}}{\varvec{D}}(n)\epsilon {\mathbb{R}}^{N*1}\) holds the amplitudes of \({\varvec{P}}{\varvec{D}}(n)\), and N is the number of non-zero notes. For efficiency, the process is applied only to the non-zero notes rather than the whole frame, since the zero-valued notes need no analysis.
Table 1
Description of parameters

Parameter | Definition | Index/Dimension
N | Number of non-zero fundamental pitches | \(n\epsilon \left[1,N\right]\)
H | Number of harmonic pitches (default 5) | \(h\epsilon \left[1,H\right]\)
I | Number of instruments in the music piece | \(i\epsilon \left[1,I\right]\)
m | Pianoroll index vector | \({\mathbb{R}}^{N*1}\)
PF | Spectrogram of SI-PLCA after filtering | \({\mathbb{R}}^{88*Time}\)
PD | Pitch values of PF | \({\mathbb{R}}^{N*1}\)
PCH | Values of pitch candidates and their corresponding harmonics | \({\mathbb{R}}^{N*H}\)
PCP | Values of harmonics and their potential corresponding pitches | \({\mathbb{R}}^{N*H}\)
PHC | Values of harmonics and selected pitches | \({\mathbb{R}}^{N*H}\)
ED | Energy values of PF | \({\mathbb{R}}^{N*1}\)
EDG | Amplitudes of fundamental pitches and their corresponding harmonics | \({\mathbb{R}}^{N*H}\)
EHC | Amplitudes of harmonic components present in pitch n | \({\mathbb{R}}^{N*H}\)
EFF | Final pitch amplitudes | \({\mathbb{R}}^{N*1}\)

Pitch Domain Analysis

From \({\varvec{P}}{\varvec{D}}(n)\), a matrix of pitch candidates and their corresponding harmonics \({\varvec{P}}{\varvec{C}}{\varvec{H}}\epsilon {\mathbb{R}}^{N*H}\) can be built. The first column of this matrix holds the non-zero pitch values, and the remaining columns hold the associated harmonic pitches of each non-zero pitch, where a harmonic pitch is the pitch value corresponding to a harmonic frequency. A harmonic map \({\varvec{H}}{\varvec{M}}{\varvec{a}}{\varvec{p}}\epsilon {\mathbb{R}}^{M*H}\) is employed to guide this extension; it records, for every note, the pianoroll number (m) of the fundamental frequency (\({{\varvec{F}}}_{0}\)) and of the corresponding harmonic frequencies. Following the MIDI tuning standard, the nth non-zero fundamental frequency is transferred to its corresponding pianoroll number using Eq. (12), where 20 is subtracted from \({\varvec{P}}{\varvec{D}}\) to account for the offset between the pianoroll index and the MIDI number:
$$\begin{array}{c}{\varvec{P}}{\varvec{D}}\left(n\right)=69+12\ {\mathrm{log}}_{2}\ \left(\frac{{{\varvec{F}}}_{0}\left(n\right)}{440Hz}\right)\\ \begin{array}{cc}{\varvec{m}}\left(n\right)={\varvec{P}}{\varvec{D}}\left(n\right)-20,& \left|{\varvec{m}}\right| \epsilon \left[\mathrm{1,88}\right]\end{array}\end{array}$$
(12)
where 69 and 440 are the MIDI number and the frequency (in Hz) of the standard A, respectively, and twelve is the number of notes in one octave. Given a frequency of the input signal, its harmonic frequencies are integer multiples of the fundamental frequency. In this study, we set concert A to 440 Hz for fast implementation. Note that concert A is not always tuned to exactly 440 Hz; its frequency may vary with the instruments and the music pieces. Nevertheless, our algorithm does not rely on the exact tuning of concert A, as it analyses the relationship between the fundamental frequency and its harmonic frequencies, which depends mainly on the musical temperament.
An example of calculating MIDI number of harmonic frequency in \({\varvec{H}}{\varvec{M}}{\varvec{a}}{\varvec{p}}\) is given in Table 2.
Table 2
Example of calculating A4 in the HMap

Attribute | Fundamental frequency \({F}_{0}\) | 2\({F}_{0}\) | 3\({F}_{0}\) | 4\({F}_{0}\) | 5\({F}_{0}\)
Frequency (Hz) | 440 | 880 | 1320 | 1760 | 2200
Pianoroll | 49 | 61 | 68 | 73 | 77
MIDI number | 69 | 81 | 88 | 93 | 97
Letter name | A4 | A5 | E6 | A6 | C#7/Db7
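The conversion of Eq. (12) and the construction of HMap (of which Table 2 shows one row) can be sketched as follows; 440 Hz concert pitch and an 88-key pianoroll are assumed as in the paper, and the helper names are illustrative.

```python
import numpy as np

def f0_to_pianoroll(f_hz):
    """MIDI number of a frequency (Eq. (12)), shifted by 20 to a pianoroll index."""
    midi = 69 + 12 * np.log2(f_hz / 440.0)
    return int(round(midi)) - 20                   # pianoroll index in [1, 88]

def build_hmap(n_keys=88, n_harmonics=5):
    """HMap[m-1, h-1] = pianoroll index of the h-th harmonic of key m."""
    hmap = np.zeros((n_keys, n_harmonics), dtype=int)
    for m in range(1, n_keys + 1):
        f0 = 440.0 * 2 ** ((m + 20 - 69) / 12)     # invert Eq. (12)
        for h in range(1, n_harmonics + 1):
            hmap[m - 1, h - 1] = f0_to_pianoroll(h * f0)   # may exceed 88 for high keys
    return hmap

# Reproduces Table 2: A4 (key 49) -> harmonics at pianoroll 49, 61, 68, 73, 77
print(build_hmap()[48])
```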
\({\varvec{P}}{\varvec{C}}{\varvec{H}}(n,h)\) is the \({h}^{th}\) harmonic pitch component of the pitch n, where n lies within [1, N] and h within [1, H]. H is set to 5 in the experiments, and N is the number of non-zero values in each frame:
$$\begin{array}{cc}{\varvec{P}}{\varvec{C}}{\varvec{H}}\left(n,h\right)={\varvec{H}}{\varvec{M}}{\varvec{a}}{\varvec{p}}\left({\varvec{m}}\left(n\right), h\right),& {\varvec{P}}{\varvec{C}}{\varvec{H}} \epsilon {\mathbb{R}}^{N\times H}\end{array}$$
(13)
Let \({\varvec{P}}{\varvec{C}}{\varvec{P}}\) be a matrix of the harmonics and their potential corresponding pitches, which contains the harmonic components and their associated pitches being calculated from the original pitch at a specific value of h as follows:
$$\delta \left(x-y\right)=\left\{\begin{array}{lr}1,& \mathrm{if}\ x=y\\ 0,& \mathrm{otherwise}.\end{array}\right.$$
(14)
$$\begin{array}{cc}{\varvec{P}}{\varvec{C}}{\varvec{P}}\left(n,h\right)={\varvec{P}}{\varvec{C}}{\varvec{H}}\left(n,h\right)\cdot \delta \left[{\varvec{P}}{\varvec{C}}{\varvec{H}}\left(n,h\right)-{\varvec{P}}{\varvec{C}}{\varvec{H}}\left(n,1\right)\right],& {\varvec{P}}{\varvec{C}}{\varvec{P}}\boldsymbol{ }\epsilon {\mathbb{R}}^{N\times H}\end{array}$$
(15)
where \(\delta (x-y)\) is an equivalence gate with two inputs: its output is 1 if the two inputs are equal (as is trivially the case at h = 1) and 0 otherwise. Using Eqs. (14) and (15), \({\varvec{P}}{\varvec{C}}{\varvec{P}}\left(n,h\right)\) can be identified for each harmonic component.
Let \({\varvec{P}}{\varvec{H}}{\varvec{C}}(n,1)\) be a harmonic component, and let \({\varvec{P}}{\varvec{H}}{\varvec{C}}(n,h) \left(h=2,\dots ,H\right)\) represent the relative associated pitches. \({\varvec{P}}{\varvec{H}}{\varvec{C}}\) is related to \({\varvec{P}}{\varvec{C}}{\varvec{P}}\) and identifies the potential original pitch values; the matrix of all potential original pitch values is estimated below. If \({\varvec{P}}{\varvec{C}}{\varvec{P}}(n,h)={\varvec{P}}{\varvec{C}}{\varvec{P}}\left(n,1\right)\), the equivalence gate takes the value 1, and the output of the square brackets in Eq. (16) becomes 1:
$$\begin{array}{c}{\boldsymbol{PHC}}\left(n,h\right)={\boldsymbol{PCP}}\left(n,1\right)\cdot \delta \left[{\boldsymbol{PCP}}\left(n,h\right)-{\boldsymbol{PCP}}\left(n,1\right)\right],\\ \begin{array}{cc}{\boldsymbol{PHC}}\ \epsilon\ {\mathbb{R}}^{N\times H},& n\ \epsilon\ \left[1,N\right],\ h\ \epsilon\ \left[1,H\right]\end{array}\end{array}$$
(16)

Energy Domain Analysis

In the energy domain, \({\varvec{E}}{\varvec{D}}{\varvec{G}}(n,h)\) is generated from \({\varvec{E}}{\varvec{D}}\epsilon {\mathbb{R}}^{N*1}\) and \({\varvec{P}}{\varvec{H}}{\varvec{C}}\left(n,h\right)\) as defined below:
$$\begin{array}{cc}{\varvec{E}}{\varvec{D}}{\varvec{G}}\left(n,h\right)={\varvec{E}}{\varvec{D}}\left(n\right)\cdot \delta \left[{\varvec{P}}{\varvec{H}}{\varvec{C}}\left(n,h\right)-{\varvec{P}}{\varvec{H}}{\varvec{C}}\left(n,1\right)\right],& {\varvec{E}}{\varvec{D}}{\varvec{G}}\boldsymbol{ }\epsilon {\mathbb{R}}^{N\times H}\end{array}$$
(17)
In the following, we describe two cognitive theories that have inspired the proposed guided weight mechanism for fundamental frequency detection. First, according to harmonic periodicity and instrument timbre theory [18], the harmonic periodicity of different instruments is the same, although their sounds differ in timbre, as reflected in the ratio of the harmonic amplitude to the fundamental amplitude [35]. Instruments from different families show very different ratios, and vice versa. For instruments that produce sound from strings, such as the piano and violin (Fig. 6d), the harmonic amplitudes generally decrease gradually, whereas for woodwind instruments such as the clarinet (Fig. 6c) and bassoon (Fig. 6a), the amplitude of the first harmonic is lower than that of the second harmonic. Therefore, the energy ratio between the fundamental frequency and the harmonic frequencies (the timbre) is unaffected by monophonic or polyphonic textures, but is unique to individual instruments. Second, according to acoustic theory [36], when two or more sound waves occupy the same space, they move through rather than bounce off each other, so the result of any combination of sound waves is simply the addition of these waves. Theoretically, the energy of mixed monophonic audio and of the corresponding polyphonic audio should therefore be the same, although there are unavoidable differences in practice.

The results of a single frame after the first stage of the harmonic structure detection (HSD) are plotted as profiles of pitch values in Fig. 6. The profiles of four single music sources are shown in Fig. 6a–d; the profile of the mixed monophonic notes, composed of the four single sources (notes no. 1–no. 4), is given in Fig. 6e; and the profile of the polyphonic notes, generated from one mixed channel, is shown in Fig. 6f. The profile of the mixed monophonic notes can be regarded as the ideal value, and the profile of the polyphonic notes as the predicted actual value. As seen in Fig. 6f, there are small amplitude differences between the polyphonic and monophonic profiles, due to resonance in the polyphonic notes and channel distortion during data recording and transmission, but the overall trends of the two profiles are very similar.
Motivated by these observations, we propose a guided weight mechanism, given by Eq. (18), to improve the detection of the fundamental frequency. The guiding weight is calculated as the averaged ratio of the harmonic amplitude \({\varvec{E}}{\varvec{D}}\_{\varvec{m}}{\varvec{o}}{\varvec{n}}{\varvec{o}}\left(h\right)\) to the fundamental amplitude \({\varvec{E}}{\varvec{D}}\_{\varvec{m}}{\varvec{o}}{\varvec{n}}{\varvec{o}}\left(1\right)\) in the monophonic data, before being applied to the polyphonic data. The variable \(I\) is the number of known instruments that can be identified in the music piece:
$$\begin{array}{cc}{{\boldsymbol{W}}}_{i}\left(h\right)=\frac{1}{T}\sum\limits_{t=1}^{T}\frac{{{\boldsymbol{ED}}\_{\boldsymbol{mono}}}_{t}\left(h\right)}{{{\boldsymbol{ED}}\_{\boldsymbol{mono}}}_{t}\left(1\right)},& h\ \epsilon\ \left[1,H\right],\ i\ \epsilon\ \left[1,I\right]\end{array}$$
(18)
where T is the number of time frames in the monophonic data, the first non-zero value of \({{\varvec{E}}{\varvec{D}}\_{\varvec{m}}{\varvec{o}}{\varvec{n}}{\varvec{o}}}_{t}\left(1\right)\) is always the fundamental frequency, and the remaining non-zero values \({{\varvec{E}}{\varvec{D}}\_{\varvec{m}}{\varvec{o}}{\varvec{n}}{\varvec{o}}}_{t}\left(h\right)\) are the harmonic frequencies.
Equation (19) estimates the amplitude of the harmonic components (\({\varvec{E}}{\varvec{H}}{\varvec{C}}\)) present in pitch n by multiplying \({\varvec{E}}{\varvec{D}}{\varvec{G}}\) with the guided weight of the selected instrument. Theoretically, the amplitude of a harmonic should be a fraction of the amplitude of its fundamental frequency. Note that the fundamental frequency occurs at h = 1 and the harmonic frequencies occur at h = 2, …, H.
$$\begin{array}{c}{{\boldsymbol{EHC}}}_{i}\left(n,h\right)={\boldsymbol{EDG}}\left(n,h\right)\cdot {{\boldsymbol{W}}}_{i}\left(h\right),\\ \begin{array}{cc}{{\boldsymbol{EHC}}}_{i}\epsilon {\mathbb{R}}^{N*H},& n\ \epsilon\ \left[1,N\right],\ h\ \epsilon\ \left[1,H\right]\end{array}\end{array}$$
(19)
Based on the \({{\varvec{E}}{\varvec{H}}{\varvec{C}}}_{i}\) determined from Eq. (19), the amplitude of the fundamental frequency in pitch n is obtained by subtracting the summed amplitude of the harmonic components, and this update is repeated until the fundamental frequencies of all instruments have been estimated.
$${\varvec{ED}}\left(n\right)={{\varvec{EHC}}}_{i}\left(n,1\right)-{\sum }_{h=2}^{H}{{\varvec{EHC}}}_{i}\left(n,h\right)$$
(20)
Eventually, the amplitude of fundamental frequency in pitch n, represented as \({\varvec{E}}{\varvec{F}}{\varvec{F}}\), can be obtained by Eq. (21).
$$\begin{array}{cc}{\varvec{EFF}}\left(n\right)={\varvec{ED}}\left(n\right),& {\varvec{EFF}}\epsilon {\mathbb{R}}^{N\times 1}\end{array}$$
(21)
Each non-zero pitch n in each frame t is assigned a rank value \({\varvec{R}}\left(n\right)\) according to \({\varvec{E}}{\varvec{F}}{\varvec{F}}(n)\); a 2D rank map \(R\left(n,t\right)\) (pitch/pianoroll vs. time frame, as shown in Fig. 3b) is then generated for the whole music piece and used to represent the detected harmonic structure. A brief implementation of the energy-domain procedure is summarised in Algorithm 1.
Algorithm 1
 
Inputs:\({\varvec{E}}{\varvec{D}}({\varvec{n}})\)
Step 1: Generate a matrix including the amplitude of fundamental pitch and their corresponding harmonic pitches using Eq. (17)
Step 2: Calculate the weight for each type of instrument using Eq. (18)
Step 3: Estimate the amplitude of harmonic components (\({\varvec{E}}{\varvec{H}}{\varvec{C}}\)) presented in the pitch n using Eq. (19)
Step 4: Update \({\varvec{E}}{\varvec{D}}\) by Eq. (20)
Step 5: Repeat steps 1–4 until the fundamental frequencies from all instruments are estimated
  Obtain the final estimated amplitude of fundamental frequency in pitch n by Eq. (21)
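A minimal sketch of the energy-domain procedure (Eqs. (17)–(21), Algorithm 1) for a single frame is given below; the clamping to non-negative amplitudes and the per-instrument loop structure are assumptions, and all names are illustrative.

```python
import numpy as np

def guided_weights(ED_mono):
    """Eq. (18): average harmonic-to-fundamental amplitude ratio over the T
    frames of monophonic training data; ED_mono has shape (T, H)."""
    return (ED_mono / np.maximum(ED_mono[:, :1], 1e-12)).mean(axis=0)

def estimate_fundamentals(ED, PHC, weights):
    """Algorithm 1 for one frame: ED (N,) pitch amplitudes, PHC (N, H) from the
    pitch domain, weights = {instrument: W_i of shape (H,)}."""
    ED = ED.astype(float).copy()
    for W in weights.values():                    # repeat for every known instrument
        EDG = ED[:, None] * (PHC == PHC[:, :1])   # Eq. (17)
        EHC = EDG * W[None, :]                    # Eq. (19)
        ED = EHC[:, 0] - EHC[:, 1:].sum(axis=1)   # Eq. (20)
        ED = np.maximum(ED, 0.0)                  # assumed safeguard, not in the paper
    return ED                                     # EFF, Eq. (21)
```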

Note Tracking

As seen in Fig. 3b, although most fundamental pitches have been extracted, the notes still show poor consistency. To improve this, a note tracking method based on music perception and a multi-pitch probability weight is proposed. According to music theory [33], demisemiquavers occur rarely in music pieces, so notes shorter than a demisemiquaver are filtered out. The averaged rank of each connected pitch group in the rank map is calculated and denoted as \(\overline{R }\). If \(\overline{R }\) is larger than an adaptive threshold \({TH}_{2}\), the pitch group is considered a harmonic and is excluded from the analysis. As the polyphonic music pitches vary over time, \({TH}_{2}\) also changes accordingly. To account for this, a fitting function was generated for \({TH}_{2}\) (Fig. 7a), which adapts to the number of notes \(x\in \left[\mathrm{1,12}\right]\) in each frame, as given by:
$${TH}_{2}=1.26{x}^{0.9}$$
(22)
The fitting curve of \({TH}_{2}\) is obtained by minimising the fitting error between ground truth and our estimate. Figure 7b displays the note tracking results where most of the noise and the inconsistencies have been filtered out. The result has also achieved a similar profile to that of the ground truth data.
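The note-tracking rules can be sketched as follows, assuming a 10-ms frame length and a two-second semibreve as illustrative values; the way the per-segment note count x is estimated here is also an assumption.

```python
import numpy as np

def track_notes(rank_map, frame_s=0.01, semibreve_s=2.0):
    """Drop note segments shorter than a demisemiquaver and segments whose
    mean rank exceeds the adaptive threshold TH2 of Eq. (22)."""
    min_len = int((semibreve_s / 32) / frame_s)          # demisemiquaver length in frames
    out = np.zeros_like(rank_map)
    for p in range(rank_map.shape[0]):                   # each pitch row
        active = np.concatenate(([0], (rank_map[p] > 0).astype(int), [0]))
        edges = np.diff(active)
        starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
        for s, e in zip(starts, ends):
            if e - s < min_len:
                continue                                 # shorter than a demisemiquaver
            seg = rank_map[p, s:e]
            x = max(int((rank_map[:, s:e] > 0).any(axis=1).sum()), 1)  # notes in segment
            if seg.mean() <= 1.26 * x ** 0.9:            # Eq. (22): keep likely fundamentals
                out[p, s:e] = seg
    return out
```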

Experimental Results

Experimental Settings

To validate the effectiveness of the proposed approach, the first dataset used for evaluation is the MIDI Aligned Piano Sounds (MAPS) [37], in which all music pieces are initially recorded in MIDI format and then converted into “.wav” format. MAPS also has subsets designed for different purposes, such as monophonic excerpts and chords; here, only the subset containing polyphonic music pieces is used. In addition, MAPS covers several instruments and recording conditions. The “ENSTDkCI” subset is chosen because its music is played on a real piano rather than a virtual (software-synthesised) instrument and is recorded in soundproofed conditions. The second dataset is BACH10 [38], which contains 10 pieces from J.S. Bach chorales played on violin, clarinet, saxophone and bassoon, where each piece lasts approximately 30 s. The third dataset is TRIOS [39], which is the most complex of the three as it contains five multitrack chamber music trio pieces. The sampling rate of all music pieces is 44,100 Hz.
For objective assessment, the most commonly used frame-based metric, the F-measure (\({F}_{1}\)) [40, 41], is adopted. It combines the positive predictive value (PPV, also known as precision) and the true positive rate (TPR, also known as recall) for a comprehensive evaluation as follows:
$${F}_{1}=\frac{2\cdot PPV\cdot TPR}{PPV+TPR}$$
(23)
where \(TPR=\frac{{T}_{p}}{{T}_{p}+{F}_{n}}\), \(PPV=\frac{{T}_{p}}{{T}_{p}+{F}_{p}}\), and \({T}_{p}\), \({F}_{p}\) and \({F}_{n}\) refer respectively to the numbers of correctly detected, incorrectly detected and missed \({F}_{0}\). These three quantities are calculated by comparing the binary masks of the detected MPE results and the ground truth.
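As a sketch, the frame-level metrics of Eq. (23) can be computed from binary pianoroll masks as follows; the function name is illustrative.

```python
import numpy as np

def frame_f_measure(pred, truth):
    """Precision (PPV), recall (TPR) and F-measure (Eq. (23)) from two binary
    pianoroll masks of identical shape (pitch x time)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    ppv = tp / max(tp + fp, 1)
    tpr = tp / max(tp + fn, 1)
    f1 = 2 * ppv * tpr / max(ppv + tpr, 1e-12)
    return ppv, tpr, f1
```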

Performance Evaluation

Table 3 shows the quantitative assessment of 11 benchmarking methods on the MAPS, BACH10 and TRIOS datasets. We divide the benchmarking methods into two categories: shallow learning methods and DL methods. Shallow learning methods include traditional machine learning models and prior knowledge-based models, whereas DL methods include deep neural networks and deep convolutional neural networks.
Table 3
Frame-level performance (\({F}_{1}\)) of different methods on three datasets

Category | Methods | MAPS | BACH10 | TRIOS
Shallow learning | Benetos [43] | 64.17 | 68.40 | 66.46
Shallow learning | Benetos [31] | 59.31 | 70.57 | 64.93
Shallow learning | Vincent [42] | 72.35 | 79.78 | 59.40
Shallow learning | Duan [38] | 67.41 | 70.90 | 45.80
Shallow learning | Klapuri [3] | 60.10 | 68.30 | 50.50
Shallow learning | CFP [8] | 68.67 | 85.51 | 64.64
Shallow learning | SONIC [44] | 63.60 | 66.49 | 56.65
Shallow learning | HSD (proposed) | 76.30 | 80.17 | 67.63
Deep learning | ConvNet [15] | 64.14 | – | –
Deep learning | RNN [15] | 57.67 | – | –
Deep learning | Li [40] | 69.42 | – | –
Deep learning | INN [41] | 72.29 | – | –
Top two methods are bold with the second also italic
Many MPE approaches select a pair of methods from CQT, PLCA, equivalent rectangular bandwidth (ERB) and NMF for pianoroll transcription. Therefore, two of the most representative combinations, i.e. CQT + PLCA proposed by Benetos and Dixon [31] and ERB + NMF proposed by Vincent et al. [42], are chosen for benchmarking. In Table 3, Benetos et al. [43] and Vincent et al. [42] produce the second-best performance on the MAPS and TRIOS datasets, respectively, which validates the effectiveness of CQT + PLCA and ERB + NMF. However, due to the lack of efficient harmonic analysis, the performance of both methods is inferior to that of the proposed HSD method.

Unlike the methods from Benetos and Vincent, the remaining methods adopt different ideas for MPE. SONIC [44] is a connectionist approach in which an adaptive oscillator network is used to track the partials in the music signal; however, without a matrix factorization process, its performance is limited on all three datasets. Su and Yang [8] proposed a combined frequency and periodicity (CFP) method to detect pitch in both the frequency domain and the lag (time) domain. The CFP method gives the best performance on the BACH10 dataset in Table 3, but relatively poorer results on the other two datasets. The main reason is possibly that the music pieces in the MAPS and TRIOS datasets contain more short notes than those in BACH10, and CFP has limited ability to detect short notes, although it exhibits fewer errors on continuous long notes. Furthermore, the assumption behind CFP does not hold for high-pitched piano notes, and both MAPS and TRIOS contain many piano pieces. In addition, the music pieces in the MAPS dataset contain multiple notes in most frames, which adds extra difficulty to polyphonic detection. The proposed method can still solve this problem by effectively analysing the positional and energy relationships between the fundamental frequency and the harmonic frequencies of the notes; as a result, its performance on MAPS is the best, 8% higher than that of CFP. Klapuri [3] proposed an auditory model–based \({F}_{0}\) estimator and Duan et al. [38] proposed a maximum-likelihood approach for multiple \({F}_{0}\) estimation, but both methods perform worse than those of Benetos et al. [31, 43], Vincent et al. [42] and CFP [8]. Moreover, the methods of Klapuri [3] and Duan et al. [38] lack an effective pre-processing stage (i.e. TF representation and matrix factorization) or harmonic analysis, which is the main reason why their overall performance falls behind ours.
The proposed method was also compared with four deep learning–based supervised approaches on the MAPS dataset. Due to the lack of publicly available source code, only the results reported in the original papers are used for comparison. The first two methods, proposed by Sigtia et al. [15], are mainly based on music language models (MLMs). However, the insufficiently labelled data in existing polyphonic music databases limits the training of such DL-based approaches. Furthermore, the MLM model is not robust to ambient noise, whereas real music recordings generally contain considerable ambient noise. These factors prevent the DL-based methods from fully analysing the inner structure of the music pieces, so they cannot match the performance of the HSD method or of some unsupervised methods, such as Benetos et al. [43], on the MAPS dataset. Su [40] and Kelz [41] also proposed DL-based methods for AMT; although better than [15], their performance is still not ideal, as insufficient music knowledge is embedded in the models. To this end, more music theory should be introduced for improved AMT.
In summary, referring to Table 3, the proposed method yields the best results on both the MAPS and TRIOS datasets and the second-best result on BACH10 in terms of the \({F}_{1}\) value, thanks to the guidance of music cognition. However, the method can still be improved, especially in terms of computation cost: it takes about 2 min to process a 30-s music piece, which is longer than some other methods. In addition, although the profile of the real polyphonic notes is close to that of the expected mixed monophonic notes, as shown in Fig. 6e, f, there are still some differences between the final values of the monophonic and polyphonic profiles, which can be further reduced.

Key Stage Analysis

In this section, the contribution of several major stages in the proposed MPE system is discussed, where the performance of each stage is evaluated on the MAPS dataset in terms of the precision, recall and \({F}_{1}\). To calculate these three metrics, the result of each stage is normalised by using Eqs. (8) and (9), and the results are binarized with a fixed threshold value of 0.5.
We generalize our proposed MPE system into four key stages detailed as follows:
  • Stage A: The transcription map from SI-PLCA and CQT.
  • Stage B: The result after applying the first-step HSD.
  • Stage C: The result after applying the second-step HSD.
  • Stage D: The result after applying note tracking.
Table 4 details the system configurations. By combining different key stages, the corresponding systems are built up for evaluation; each stage contributes specific components that are indispensable to the system's results. Stage A shows the highest recall and lowest precision after applying CQT and SI-PLCA: the presence of \({F}_{0}\) and the harmonics is detected, but many amplitudes are concentrated in the higher-frequency (harmonic) regions, which inhibits the identification of \({F}_{0}\). After adding stage B, the recall decreases by 0.3% but the precision increases by almost 3%, mainly due to the removal of noise in HSD. Stage C, the core of the MPE system, contributes an increase of nearly 30% in precision and 15–18% in \({F}_{1}\) compared with the previous combinations. Finally, after applying the proposed note tracking step (stage D), the recall is further improved by 5.5%, which raises the final \({F}_{1}\) by 3.8% compared with the previous stage.
Table 4
System configuration

Configurations | Precision | Recall | F-measure
A | 0.408 | 0.879 | 0.545
A + B | 0.438 | 0.876 | 0.571
A + B + C | 0.747 | 0.718 | 0.725
A + B + C + D | 0.753 | 0.773 | 0.763
The bold value indicates the best result in each column

Assessment of CQT and ERB

In our proposed MPE system, CQT is employed to model human cochlear perception. However, cochlear perception does not have a strictly constant Q. Therefore, apart from CQT, the equivalent rectangular bandwidth (ERB) method is also widely used for time–frequency transformation [42]. Most ERB methods are based on a gammatone filter bank that models the human auditory system [45]: the signal is decomposed by a bank of gammatone filters equally spaced on the ERB scale. However, ERB methods do not necessarily produce better MPE performance than CQT. To validate this, we combined CQT [27] and ERB [42] pairwise with PLCA [43] and NMF [42] to form four hybrid methods, i.e. CQT + PLCA, CQT + NMF, ERB + PLCA and ERB + NMF, and compared them quantitatively in terms of the precision–recall, ROC and F-measure curves (Fig. 8) and the AUC, MAE and maxF values (Table 5). Here, AUC, MAE and maxF denote the area under the ROC curve, the mean absolute error and the maximum value of the F-measure curve, respectively, and the three criteria are given equal importance. As seen in Fig. 8, ERB + NMF and CQT + PLCA show comparable results, and both outperform the other two methods. In Table 5, although ERB + NMF gives the best maxF value, CQT + PLCA gives the best AUC and the lowest MAE, indicating fewer false alarms. Therefore, CQT + PLCA is the best of the four methods, which is the main reason why it is used in our proposed MPE system.
Table 5
Time–frequency transform and pianoroll transcription comparison

Methods | AUC | MAE | maxF
ERB + PLCA | 0.922 | 0.0403 | 0.6687
ERB + CNMF | 0.939 | 0.0487 | 0.7213
CQT + PLCA | 0.942 | 0.0390 | 0.7089
CQT + CNMF | 0.906 | 0.0411 | 0.6296
The bold value indicates the best result in each column

Conclusion

In this paper, a harmonic analysis method inspired by music cognition and perception is proposed for the MPE system. CQT and SI-PLCA are employed in the pre-processing stage for pianoroll transcription of mixed music audio signals, from which the proposed HSD extracts the multi-pitch pianorolls. The proposed MPE system is not limited by the number of instruments: for multi-instrument cases (e.g. the ensemble pieces in the BACH10 and TRIOS datasets), the mixture characteristics of each instrument can be extracted for adaptive detection of the fundamental frequencies. According to the experimental results, the proposed MPE system yields the best performance on the MAPS and TRIOS datasets and the second-best on the BACH10 dataset. The investigation of the key components shows that the HSD provides the greatest contribution to the system, which validates the value of adding an efficient harmonic analysis model for significantly improving the performance of the MPE system. Furthermore, adding note tracking further improves the efficacy of the MPE system.
However, the proposed MPE system still has much room for improvement. First, the expectation maximization (EM) algorithm has some limitations, in particular its slow convergence, its sensitivity to the initial settings and the local optima caused by the inherently non-convex objective. As a result, PLCA is very time-consuming and can be unsuitable for processing large datasets; how to better select the initial values and speed up the convergence is therefore a valuable topic for future investigation. Second, the assumption that the types of instruments in the music pieces are known is often unrealistic in real scenarios; blind source separation can be integrated into our model to tackle this limitation. Third, analysis of the beat and chord, together with integrated deep-learning models such as transformer networks [46] and long short-term memory [47], can be considered to further enhance the accuracy of pitch estimation. In addition, introducing more music perception, such as ornaments and rhythm, into the model would help interpret the music pieces more precisely. Furthermore, an improved note tracking process could be introduced by fusing self-attention [48] and natural language processing models [49]. Finally, testing on larger datasets such as MusicNet [50] and MAESTRO [51] would be beneficial for more comprehensive modelling and validation.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (grant 61876125) and the University of Strathclyde PhD Studentship.

Declarations

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Benetos E, Dixon S, Duan Z, Ewert S. Automatic music transcription: an overview. IEEE Signal Process Mag. 2018;36(1):20–30.
2. Emiya V, Badeau R, David B. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans Audio Speech Lang Process. 2010;18(6):1643–54.
3. Klapuri A. Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Trans Audio Speech Lang Process. 2008;16(2):255–66.
4. Bay M, Ehmann AF, Downie JS. Evaluation of multiple-F0 estimation and tracking systems. In: ISMIR; 2009. p. 315–20.
5. Benetos E, Dixon S, Giannoulis D, Kirchhoff H, Klapuri A. Automatic music transcription: challenges and future directions. J Intell Inf Syst. 2013;41(3):407–34.
6. Chunghsin Y. Multiple fundamental frequency estimation of polyphonic recordings. Ph.D. dissertation, University Paris 6; 2008.
7. Benetos E, Dixon S. Joint multi-pitch detection using harmonic envelope estimation for polyphonic music transcription. IEEE Journal of Selected Topics in Signal Processing. 2011;5(6):1111–23.
8. Su L, Yang Y-H. Combining spectral and temporal representations for multipitch estimation of polyphonic music. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP). 2015;23(10):1600–12.
9. Fuentes B, Badeau R, Richard G. Adaptive harmonic time-frequency decomposition of audio using shift-invariant PLCA. In: Proc. ICASSP; 2011. p. 401–4.
10. Vincent E, Plumbley MD. Efficient Bayesian inference for harmonic models via adaptive posterior factorization. Neurocomputing. 2008;72(1–3):79–87.
11. Cheuk KW, Luo Y-J, Benetos E, Herremans D. The effect of spectrogram reconstruction on automatic music transcription: an alternative approach to improve transcription accuracy. In: Proc. ICPR; 2021. p. 9091–8.
12. Mukherjee H, Obaidullah SM, Phadikar S, Roy K. MISNA-a musical instrument segregation system from noisy audio with LPCC-S features and extreme learning. Multimedia Tools and Applications. 2018;77(21):27997–8022.
13. Mukherjee H, Dhar A, Obaidullah SM, Santosh K, Phadikar S, Roy K. Segregating musical chords for automatic music transcription: a LSTM-RNN approach. In: International Conference on Pattern Recognition and Machine Intelligence. Springer; 2019. p. 427–35.
14. Fan Z-C, Jang J-SR, Lu C-L. Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking. In: Proc. Multimedia Big Data (BigMM); 2016. p. 178–85.
15. Sigtia S, Benetos E, Dixon S. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP). 2016;24(5):927–39.
16. Yan Y, et al. Unsupervised image saliency detection with Gestalt-laws guided optimization and visual attention based refinement. Pattern Recogn. 2018;79:65–78.
17. Pichevar R, Rouat J. Monophonic sound source separation with an unsupervised network of spiking neurones. Neurocomputing. 2007;71(1–3):109–20.
18. Fletcher NH, Rossing TD. The physics of musical instruments. Springer Science & Business Media; 2012.
19. Justus TC, Bharucha JJ. Music perception and cognition. In: Stevens' Handbook of Experimental Psychology, Sensation and Perception. John Wiley & Sons Inc; 2002. p. 453.
20. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B. 1977;39:1–38.
21. Bernardo JM, Smith AF. Bayesian theory. IOP Publishing; 2001.
22. Emiya V, Badeau R, David B. Multipitch estimation of quasi-harmonic sounds in colored noise. In: 10th Int. Conf. on Digital Audio Effects (DAFx-07); 2007.
23. Duan Z, Temperley D. Note-level music transcription by maximum likelihood sampling. In: ISMIR; 2014. p. 181–6.
24. Alvarado Duran PA. Acoustically inspired probabilistic time-domain music transcription and source separation. Queen Mary University of London; 2020.
25. Nishikimi R, Nakamura E, Itoyama K, Yoshii K. Musical note estimation for F0 trajectories of singing voices based on a Bayesian semi-beat-synchronous HMM. In: ISMIR; 2016. p. 461–7.
26. Gowrishankar BS, Bhajantri NU. An exhaustive review of automatic music transcription techniques: survey of music transcription techniques. In: Proc. Signal Processing, Communication, Power and Embedded System; 2016. p. 140–52.
27. Brown JC. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America. 1991;89(1):425–34.
28. Bendor D, Wang X. The neuronal representation of pitch in primate auditory cortex. Nature. 2005;436(7054):1161–5.
29. Schörkhuber C, Klapuri A. Constant-Q transform toolbox for music processing. In: 7th Sound and Music Computing Conference, Barcelona, Spain; 2010. p. 3–64.
30. Smaragdis P, Brown JC. Non-negative matrix factorization for polyphonic music transcription. In: Proc. Applications of Signal Processing to Audio and Acoustics; 2003. p. 177–80.
31. Benetos E, Dixon S. A shift-invariant latent variable model for automatic music transcription. Comput Music J. 2012;36(4):81–94.
32. Han J, Moraga C. The influence of the sigmoid function parameters on the speed of backpropagation learning. In: Proc. Artificial Neural Networks; 1995. p. 195–201.
33. Smith LM. A multiresolution time-frequency analysis and interpretation of musical rhythm. University of Western Australia, Perth; 2000.
34. d'Alessandro C, Castellengo M. The pitch of short-duration vibrato tones. The Journal of the Acoustical Society of America. 1994;95(3):1617–30.
35.
Zurück zum Zitat Li X, Wang K, Soraghan J, Ren J. Fusion of Hilbert-Huang transform and deep convolutional neural network for predominant musical instruments recognition. In: Proc. Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar); 2020. p. 80–9. Li X, Wang K, Soraghan J, Ren J. Fusion of Hilbert-Huang transform and deep convolutional neural network for predominant musical instruments recognition. In: Proc. Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar); 2020. p. 80–9.
36.
Zurück zum Zitat Kinsler LE, Frey AR, Coppens AB, Sanders JV. Fundamentals of acoustics. John Wiley & Sons; 2000. Kinsler LE, Frey AR, Coppens AB, Sanders JV. Fundamentals of acoustics. John Wiley & Sons; 2000.
38.
Zurück zum Zitat Duan Z, Pardo B, Zhang C. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Trans Audio Speech Lang Process. 2010;18(8):2121–33.CrossRef Duan Z, Pardo B, Zhang C. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Trans Audio Speech Lang Process. 2010;18(8):2121–33.CrossRef
39.
Zurück zum Zitat Fritsch J, Plumbley MD. Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis. In: Proc. Acoustics, Speech and Signal Processing (ICASSP); 2013. p. 888–91. Fritsch J, Plumbley MD. Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis. In: Proc. Acoustics, Speech and Signal Processing (ICASSP); 2013. p. 888–91.
40.
Zurück zum Zitat Su L. Between homomorphic signal processing and deep neural networks: constructing deep algorithms for polyphonic music transcription. In: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC); 2017. p. 884–91. Su L. Between homomorphic signal processing and deep neural networks: constructing deep algorithms for polyphonic music transcription. In: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC); 2017. p. 884–91.
42.
Zurück zum Zitat Vincent E, Bertin N, Badeau R. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans Audio Speech Lang Process. 2010;18(3):528–37.CrossRef Vincent E, Bertin N, Badeau R. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans Audio Speech Lang Process. 2010;18(3):528–37.CrossRef
43.
Zurück zum Zitat Benetos E, Cherla S, Weyde T. An efficient shift-invariant model for polyphonic music transcription. In: 6th International Workshop on Machine Learning and Music; 2013. Benetos E, Cherla S, Weyde T. An efficient shift-invariant model for polyphonic music transcription. In: 6th International Workshop on Machine Learning and Music; 2013.
44.
Zurück zum Zitat Marolt M. A connectionist approach to automatic transcription of polyphonic piano music. IEEE Trans Multimedia. 2004;6(3):439–49.CrossRef Marolt M. A connectionist approach to automatic transcription of polyphonic piano music. IEEE Trans Multimedia. 2004;6(3):439–49.CrossRef
45.
Zurück zum Zitat Smith JO, Abel JS. Bark and ERB bilinear transforms. IEEE Transactions on speech and Audio Processing. 1999;7(6):697–708.CrossRef Smith JO, Abel JS. Bark and ERB bilinear transforms. IEEE Transactions on speech and Audio Processing. 1999;7(6):697–708.CrossRef
46.
Zurück zum Zitat Vaswani A, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30. Vaswani A, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
47.
Zurück zum Zitat Chen N, Wang S. High-level music descriptor extraction algorithm based on combination of multi-channel CNNs and LSTM. In: ISMIR; 2017. p. 509–14. Chen N, Wang S. High-level music descriptor extraction algorithm based on combination of multi-channel CNNs and LSTM. In: ISMIR; 2017. p. 509–14.
48.
Zurück zum Zitat Parmar N, et al. Image transformer. In: International Conference on Machine Learning. PMLR; 2018. p. 4055–64. Parmar N, et al. Image transformer. In: International Conference on Machine Learning. PMLR; 2018. p. 4055–64.
50.
Zurück zum Zitat Draguns A, Ozoliņš E, Šostaks A, Apinis M, Freivalds K. Residual shuffle-exchange networks for fast processing of long sequences. Proc AAAI Conf Artif Intell. 2021;35(8):7245–53. Draguns A, Ozoliņš E, Šostaks A, Apinis M, Freivalds K. Residual shuffle-exchange networks for fast processing of long sequences. Proc AAAI Conf Artif Intell. 2021;35(8):7245–53.
51.
Zurück zum Zitat Hawthorne C, et al. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In: International Conference on Learning Representations; 2018. Hawthorne C, et al. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In: International Conference on Learning Representations; 2018.
Metadata
Title: A Music Cognition–Guided Framework for Multi-pitch Estimation
Authors: Xiaoquan Li, Yijun Yan, John Soraghan, Zheng Wang, Jinchang Ren
Publication date: 14.06.2022
Publisher: Springer US
Published in: Cognitive Computation / Issue 1/2023
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI: https://doi.org/10.1007/s12559-022-10031-5