Introduction
Estimation and tracking of multiple fundamental frequencies is one of the major tasks in automatic music transcription (AMT) of polyphonic music [1] and music information retrieval (MIR) [2], and is included as a subtask in the Music Information Retrieval Evaluation eXchange (MIREX).
Multiple fundamental frequency estimation (MFE), also known as multiple pitch estimation (MPE), is challenging because simultaneous notes from multiple instruments must be processed in polyphonic music [3, 4]. Compared with single-pitch estimation, the much higher complexity of this task often forces a trade-off between the robustness and efficiency of the algorithms.
According to Benetos et al. [5], MPE approaches can be categorised into three types: feature-based, spectrogram-factorisation-based and statistical-model-based methods. Feature-based methods rely on signal processing techniques such as the pitch salience function [6] and the pitch candidate set score function [7]. Among spectrogram-factorisation methods, both nonnegative matrix factorisation (NMF) and probabilistic latent component analysis (PLCA) have received much attention in recent years [6], and numerous improved versions [8, 9] of both are recognised as leading spectrogram-factorisation-based methods in the MPE domain. Statistical-model-based methods employ maximum a posteriori (MAP) estimation [3], maximum likelihood (ML) estimation or Bayesian theory [10] to detect the fundamental frequencies. It is worth noting that these three distinct types of MPE approaches can also be combined or made to interact [6] for a variety of applications.
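To make the spectrogram-factorisation idea concrete, the following minimal sketch factorises a toy non-negative spectrogram into spectral templates and temporal activations using the standard Lee-Seung multiplicative updates for NMF. This is an illustration only, not the SI-PLCA variant used later in this paper; the matrix sizes and toy data are assumptions for the example.

```python
import numpy as np

def nmf(V, rank, n_iter=300, eps=1e-9, seed=0):
    """Factorise a non-negative spectrogram V (freq x time) as W @ H
    using Lee-Seung multiplicative updates (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # spectral templates (freq x rank)
    H = rng.random((rank, T)) + eps   # temporal activations (rank x time)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy spectrogram: two hypothetical note templates with overlapping activations
templates = np.array([[1.0, 0.0], [0.5, 0.2], [0.0, 1.0], [0.1, 0.6]])
activations = np.array([[1, 1, 0, 0, 1], [0, 1, 1, 1, 0]], dtype=float)
V = templates @ activations
W, H = nmf(V, rank=2)
print(np.linalg.norm(V - W @ H))  # reconstruction error, should be small
```

In an MPE setting, the columns of `W` would play the role of per-note spectra and the rows of `H` the pianoroll-like activations; PLCA reaches a similar decomposition through a probabilistic formulation.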
In recent years, many deep learning (DL)-based supervised MPE approaches have also been developed. Cheuk et al. [11] presented a DL model for AMT that combines U-Net and bidirectional long short-term memory (BiLSTM) modules. Mukherjee et al. [12] used statistical characteristics and an extreme learning machine for musical instrument segregation, where an LSTM and a recurrent neural network (RNN) [13] were combined to differentiate musical chords for AMT. Fan et al. [14] proposed a deep neural network to extract the singing voice, followed by a dynamic unbroken pitch determination algorithm to track pitches. Sigtia et al. [15] developed a supervised approach for polyphonic piano music transcription that combines an RNN with a probabilistic graphical model. Although DL approaches can produce adequate music transcriptions, they often require high-performance computers and powerful graphics processing units (GPUs) to speed up the lengthy training process [16]. Furthermore, DL algorithms may suffer from inaccurately labelled data, and their performance can be sensitive to the training samples and learning procedures used. To this end, in this paper we focus mainly on cognition-guided methods, in which prior cognitive theories and assumptions from previous studies [17-19] are used to guide fundamental pitch detection in polyphonic music pieces.
To distinguish pitches using harmonic analysis, two types of statistical models are often used: expectation-maximisation (EM)-based algorithms [20] and Bayesian algorithms [21]. Among EM-based methods, Emiya et al. [22] proposed a maximum-likelihood method for multi-pitch estimation, and Duan and Temperley [23] proposed a three-stage music transcription system that applies maximum likelihood for final note tracking. Among Bayesian methods, Alvarado Duran [24] combined Gaussian processes and Bayesian models for multi-pitch estimation, and Nishikimi et al. [25] integrated a hidden Markov model with Bayesian inference to precisely detect the vocal pitch. These statistical models can also be considered shallow learning methods: data are first observed to obtain prior knowledge, on which the experiments are then based. As information from new samples is continually incorporated into the prior distribution, posterior inference yields the final results. Although shallow learning approaches have been widely investigated [26], they still leave much room for improvement.
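The sequential prior-to-posterior updating described above can be illustrated with a deliberately simple conjugate model: a Beta-Bernoulli update of a hypothetical pitch-voicing probability. The prior and the frame labels below are invented for illustration and are unrelated to the specific cited methods.

```python
# Toy illustration of sequential Bayesian updating. The unknown quantity is
# the probability that a given pitch is voiced in a frame; each new frame
# observation (1 = pitch present, 0 = absent) refines the Beta posterior.
alpha, beta = 1.0, 1.0                   # uniform Beta(1, 1) prior (assumed)
observations = [1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical frame labels

for x in observations:                   # conjugacy: the posterior stays Beta
    alpha += x
    beta += 1 - x

posterior_mean = alpha / (alpha + beta)
print(posterior_mean)  # (1 + 6) / (2 + 8) = 0.7
```

The point of the sketch is structural: each observation updates the prior in closed form, so inference can proceed incrementally, which is the "shallow learning" behaviour described above.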
Apart from the aforementioned issues, most MPE methods are designed from the viewpoint of signal processing rather than music cognition, resulting in a lack of sufficient underpinning theory and inefficient modelling. To tackle this issue, we propose a general framework in which music cognition guides the entire MPE process. In the pre-processing stage, inspired by the cognitive neuroscience of music [19], the constant-Q transform (CQT) [27] is employed to transform the audio signal into a time-frequency spectrogram. A pianoroll transcription is then generated using a conventional matrix factorisation approach, shift-invariant probabilistic latent component analysis (SI-PLCA) [9]. In the harmonic structure detection (HSD) process, cognition of harmonic periodicity and instrument timbre [18] is used to guide the extraction of multiple pitches. The efficacy of the proposed methods has been fully validated by experiments on three publicly available datasets.
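As a rough sketch of how a constant-Q (log-frequency) grid supports harmonic analysis, the toy code below builds geometrically spaced CQT bin frequencies and scores an f0 candidate by summing spectral magnitude at its harmonic positions, exploiting the fact that on a log-frequency axis the harmonics of any f0 sit at fixed bin offsets. The bin resolution, the 1/h partial weighting and the toy spectrum are assumptions for illustration, not the HSD model proposed in this paper.

```python
import numpy as np

BINS_PER_OCTAVE = 36
FMIN = 55.0  # Hz (assumed lowest analysis frequency)

def cqt_frequencies(n_bins, fmin=FMIN, bpo=BINS_PER_OCTAVE):
    """Geometrically spaced centre frequencies of a constant-Q grid."""
    return fmin * 2.0 ** (np.arange(n_bins) / bpo)

def harmonic_salience(spectrum, f0_bin, n_harm=5, bpo=BINS_PER_OCTAVE):
    """Sum magnitudes at the bins nearest the harmonics of a candidate f0.
    On a log-frequency axis, harmonic h lies bpo*log2(h) bins above f0."""
    s = 0.0
    for h in range(1, n_harm + 1):
        k = f0_bin + int(round(bpo * np.log2(h)))
        if k < len(spectrum):
            s += spectrum[k] / h  # 1/h partial decay (assumed timbre model)
    return s

# toy spectrum: one "note" at bin 12 with five decaying partials
n_bins = 180
spec = np.zeros(n_bins)
for h in range(1, 6):
    spec[12 + int(round(BINS_PER_OCTAVE * np.log2(h)))] = 1.0 / h

freqs = cqt_frequencies(n_bins)
best = max(range(60), key=lambda b: harmonic_salience(spec, b))
print(best, round(freqs[best], 1))  # the true f0 bin scores highest
```

Because the harmonic offsets are shift-invariant on this grid, the same template applies to every candidate pitch, which is also what makes shift-invariant factorisations such as SI-PLCA a natural fit for CQT spectrograms.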
The major contributions of this paper can be highlighted as follows. First, a new HSD model that incorporates music cognition is proposed for multiple fundamental frequency extraction. Second, a new note tracking method guided by music connectivity and a multi-pitch model is proposed. Third, by combining conventional pianoroll transcription approaches with the proposed HSD model, a new music cognition-guided optimisation framework is introduced for MPE. Experimental results on three datasets demonstrate the merits of our approach when benchmarked against 11 state-of-the-art methods.
The rest of the paper is structured as follows: “Cognition-guided multiple pitch estimation” describes pre-processing for MPE, including the time-frequency representation and matrix factorisation, and the implementation of the proposed harmonic structure detection method. “Experimental results” presents the experimental results and performance analysis. Finally, a thorough conclusion is drawn in “Conclusion”.
Conclusion
In this paper, a harmonic analysis method inspired by music cognition and perception is proposed for the MPE system. CQT and SI-PLCA are employed in the pre-processing stage to obtain a pianoroll transcription from the mixture music audio signal, from which the proposed HSD extracts the multi-pitch pianorolls. The proposed MPE system is not limited by the number of instruments: for multi-instrument cases (e.g. the ensemble pieces in the BACH10 and TRIOS datasets), the mixture characteristics of each instrument can be extracted for adaptive detection of the fundamental frequencies. According to the experimental results, the proposed MPE system yields the best performance on the MAPS and TRIOS datasets, and the second-best on the BACH10 dataset. An investigation of the key components shows that the HSD makes the greatest contribution to the system, which validates the value of an efficient harmonic analysis model in significantly improving the performance of the MPE system. Furthermore, adding note tracking further improves the efficacy of the MPE system.
However, the proposed MPE system still has much room for improvement. First, it is worth mentioning that the expectation-maximisation (EM) algorithm has some limitations, notably its slow convergence, its sensitivity to initial settings, and the local optima caused by its inherently non-convex objective. As a result, PLCA is very time-consuming and even unsuitable for processing large datasets; how to better select the initial values and speed up convergence is therefore a valuable direction for future investigation. Second, the assumption that the types of instruments in the music pieces are known is often unrealistic in real scenarios, so blind source separation could be integrated into our model to tackle this limitation. Third, analysis of beat and chord, together with integrated deep learning models such as transformer networks [46] and long short-term memory [47], can be considered to further enhance the accuracy of pitch estimation. In addition, introducing more music perceptions, such as ornaments and rhythm, into the model will be helpful for more precise interpretation of the music pieces. Furthermore, an improved note tracking process could be introduced by fusing self-attention [48] and natural language processing models [49]. Finally, testing on larger datasets such as MusicNet [50] and MAESTRO [51] will be beneficial for more comprehensive modelling and validation.