1 Introduction
1.1 Motivation
1.2 Core technical contributions
- We develop a multiple-feature-based music class profiling model to characterize different music categories. Functionally, it is a probabilistic classifier that estimates the correct label of input music. The scheme effectively combines multiple features to enhance categorization performance and thereby substantially improve overall retrieval accuracy.
- Distinguished from previous approaches, EMIF's architecture is designed around a "Classify-and-Indexing" principle and applies a multiple-layer structure consisting of two basic components: a classification module and an indexing module. This design enables superior scalability and efficiency and significantly reduces system reconstruction cost, a major overhead for existing solutions.
- We develop a novel deep-learning-based music signature generation scheme, DMSG, to compute a compact yet comprehensive music descriptor: the deep music signature (DMS). The approach effectively combines various kinds of acoustic features into a small feature vector, enhancing indexing and retrieval with existing access methods.
- We conduct a set of detailed experimental studies and result analyses on three large test collections. The results demonstrate that EMIF enjoys superior scalability, effectiveness and efficiency over existing approaches.
2 Related work
2.1 Multidimensional indexing structure
2.2 Music signature generation
| Notation | Definition |
|---|---|
| \(c\) | Music class |
| \(f\) | Feature type |
| \(L\) | Loss function of the deep learning framework |
| \(C\) | Number of classes in the database |
| \(B\) | Number of blocks for music segmentation |
| \(F\) | Number of acoustic features extracted |
| \(M\) | Number of training examples for the logistic fusion function |
| \(CF\) | Score combination function |
| DMS | Deep music signature |
| \(\varTheta _{f}^{s}\) | Parameter set for the GMM |
| \({\varvec{v}}_{bf}\) | Feature vector extracted from block \(b\) for feature type \(f\) |
| \({\varvec{V}}_{f}\) | Set of feature vectors extracted from different blocks for feature type \(f\) |
| \(L^{c}\) | Final score generated by the logistic combination function for class \(c\) |
| \(l_{f}^{c}\) | Likelihood value generated by class \(c\)'s profile model using feature type \(f\) |
| \(\mathbf {W^{c}}\) | Fusion weight vector of the logistic fusion function for class \(c\) |
| \(L_{sqe}\) | Squared error loss |
| \(Extract_{f}\) | Feature extraction scheme for feature type \(f\) |
3 System architecture
3.1 Multifeature-based music category modeling
3.1.1 Feature extraction
- Timbre feature: Timbral texture is a global statistical property of music used to differentiate a mixture of sounds; it has been widely applied to speech recognition and audio classification. The 33-dimensional timbre feature vector comprises the means and variances of the spectral centroid, spectral flux, time-domain zero crossings and 13 MFCC coefficients (32 dimensions), plus low energy (1 dimension).
- Rhythm feature: Rhythmic content captures the repetition of the musical signal over time and can be represented by beat strength and temporal patterns. The beat histogram (BH) proposed by Tzanetakis et al. [40] is used to describe rhythmic content. The 18-dimensional rhythm feature vector comprises the relative amplitudes of the first six histogram peaks (each divided by the sum of amplitudes), the ratios of the amplitudes of the second through sixth peaks to the amplitude of the first, the periods of the first six peaks, and the overall sum of the histogram.
- Pitch feature: Pitch is an important acoustic feature used to characterize melody and harmony information in a music file; it can be extracted via multi-pitch detection techniques [44]. The 18-dimensional pitch feature vector comprises the amplitudes and periods of the six highest peaks in the histogram, the pitch intervals between the six most prominent peaks, and the overall sums of the histograms.
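As a concrete illustration of how a block-level descriptor such as the timbre vector is assembled, here is a minimal numpy sketch of the mean/variance aggregation, assuming per-frame features and RMS values have already been computed by some front end (the input data below is synthetic):

```python
import numpy as np

def timbre_vector(frame_feats: np.ndarray, frame_rms: np.ndarray) -> np.ndarray:
    """Aggregate per-frame features (n_frames x 16: spectral centroid,
    spectral flux, zero-crossing rate, 13 MFCCs) into the 33-dim
    timbre descriptor: 16 means + 16 variances + low energy (1)."""
    means = frame_feats.mean(axis=0)
    variances = frame_feats.var(axis=0)
    # low energy: fraction of frames whose RMS is below the mean RMS
    low_energy = np.mean(frame_rms < frame_rms.mean())
    return np.concatenate([means, variances, [low_energy]])

rng = np.random.default_rng(0)
v = timbre_vector(rng.random((500, 16)), rng.random(500))
print(v.shape)  # (33,)
```

The rhythm and pitch vectors are built analogously, but from histogram peak statistics rather than frame-level means and variances.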
3.1.2 Statistical category profiling with linear discriminative mixture model
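One way to realize the profile model named above: each class \(c\) keeps a Gaussian mixture model per feature type \(f\) (the parameter set \(\varTheta _{f}^{s}\) in the notation table), and the likelihood \(l_{f}^{c}\) of a query is the score of its block vectors under class \(c\)'s mixture. A toy sketch with scikit-learn; the class means, dimensionality and component count are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# toy training data: per-class block feature vectors for one feature type f
train = {c: rng.normal(loc=3 * c, size=(200, 4)) for c in range(3)}

# fit one GMM per class as its statistical profile for this feature type
profiles = {c: GaussianMixture(n_components=2, random_state=0).fit(X)
            for c, X in train.items()}

# likelihood l_f^c of a query's block vectors under each class profile
query = rng.normal(loc=3 * 2, size=(8, 4))  # drawn near class 2
scores = {c: gmm.score(query) for c, gmm in profiles.items()}  # mean log-likelihood
best = max(scores, key=scores.get)
print(best)  # class 2 scores highest on this well-separated toy data
```

In the full system these per-feature likelihoods are not used in isolation; they are combined by the fusion function described next.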
3.2 Fusion weight estimation
3.2.1 Logistic regression-based scheme
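In the notation of the table above, the final score \(L^{c}\) is produced by a logistic function over the per-feature likelihoods \(l_{f}^{c}\) with weights \(\mathbf {W^{c}}\), trained against the squared error loss \(L_{sqe}\) on \(M\) examples. A minimal numpy sketch of this fusion-weight estimation; the learning rate, epoch count and toy data are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_fusion_weights(likelihoods, targets, lr=0.5, epochs=2000):
    """Learn one class's fusion weights W^c (plus bias) by gradient
    descent on the squared error between sigmoid(W^c . l + b) and the
    0/1 target. `likelihoods` is M x F: likelihood values l_f^c."""
    M, F = likelihoods.shape
    w = np.zeros(F)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(likelihoods @ w + b)
        err = p - targets                    # dL_sqe/dp (up to a factor of 2)
        grad = err * p * (1 - p)             # chain rule through the sigmoid
        w -= lr * likelihoods.T @ grad / M
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(2)
L = rng.normal(size=(100, 3))               # M=100 examples, F=3 feature types
y = (L @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w, b = fit_fusion_weights(L, y)
acc = np.mean((sigmoid(L @ w + b) > 0.5) == y)
```

On this linearly separable toy data the learned weights recover the discriminative direction, so training accuracy is high; one such weight vector is learned per class.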
3.3 Deep music signature generation and music retrieval
- PCA is used to preprocess the raw input features from different blocks via a linear transformation, which also speeds up the learning of the SDA.
- The SDA is adopted to pretrain the neural network for each block on unlabeled data.
- For each block of the input music documents, the parameters of the SDA are optimized via stochastic gradient descent [50].
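The three steps above can be sketched end to end in numpy. The layer sizes, corruption rate and learning rate below are illustrative assumptions, and only a single denoising-autoencoder layer (with tied weights) is shown rather than a full stacked network:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(256, 64))                 # unlabeled block features

# Step 1: PCA preprocessing via SVD (linear transformation to k dims)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 16
Z = Xc @ Vt[:k].T                              # PCA-reduced inputs for the SDA

# Step 2/3: pretrain one denoising-autoencoder layer with SGD
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h, lr, noise = 8, 0.05, 0.3
W = rng.normal(scale=0.1, size=(k, h))
b_enc, b_dec = np.zeros(h), np.zeros(k)

def recon_error(Z):
    return np.mean((sigmoid(Z @ W + b_enc) @ W.T + b_dec - Z) ** 2)

before = recon_error(Z)
for epoch in range(30):
    for i in rng.permutation(len(Z)):
        x = Z[i]
        x_tilde = x * (rng.random(k) > noise)  # masking corruption
        a = sigmoid(x_tilde @ W + b_enc)       # encode the corrupted input
        r = a @ W.T + b_dec                    # decode with tied weights
        d_r = r - x                            # reconstruct the CLEAN input
        d_a = d_r @ W * a * (1 - a)
        W -= lr * (np.outer(x_tilde, d_a) + np.outer(d_r, a))
        b_enc -= lr * d_a
        b_dec -= lr * d_r
```

After SGD the reconstruction error on the clean inputs drops below its initial value; in a full SDA the hidden activations of this layer would feed the next layer's pretraining.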
4 Experimental configuration
4.1 Music testbed
4.2 Evaluation metrics and tasks
- Type I: Search for music of a similar genre in the database constructed from Dataset I.
- Type II: Search for music performed by the same artist in the database constructed from Dataset II.
- Type III: Search for music played with the same instrument in the database constructed from Dataset III.
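Retrieval effectiveness is reported below as P@10 and MAP. Both metrics can be computed as follows (the ranked list and relevance set here are toy values):

```python
def precision_at_k(relevant, ranked, k=10):
    """P@k: fraction of the top-k ranked items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(relevant, ranked):
    """AP for one query; MAP is the mean of AP over all queries."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, 1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

rel = {1, 3, 5}
ranked = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(precision_at_k(rel, ranked))               # 0.3
print(round(average_precision(rel, ranked), 3))  # 0.756
```

P@10 rewards putting relevant tracks anywhere in the top ten, while AP additionally rewards ranking them early.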
4.3 Competitors
- EMIF: In this study, a CBMIR system is built based on EMIF, with the Hybrid tree selected as the multidimensional indexing structure to speed up music search.
- DWCH + hybrid tree (DWCH+HT): The Daubechies wavelet coefficient histogram (DWCH) technique is used to extract wavelet-based music signatures describing music content. As in EMIF, the Hybrid tree is the indexing structure used to speed up the search process.
- MARSYAS + hybrid tree (MARSYAS+HT): The MARSYAS framework is used to extract signatures that linearly combine three acoustic features: timbral texture, pitch content and rhythm. As in EMIF and DWCH+HT, the Hybrid tree is the indexing structure used to speed up the search process.
5 An empirical study
5.1 Effectiveness comparison
| Query method | P@10 | MAP |
|---|---|---|
| EMIF | 0.617 | 0.511 |
| EMIF-W | 0.527 | 0.452 |
| DWCH+HT | 0.372 | 0.302 |
| MARSYAS+HT | 0.297 | 0.275 |
| Query method | P@10 | MAP |
|---|---|---|
| EMIF | 0.603 | 0.505 |
| EMIF-W | 0.515 | 0.452 |
| DWCH+HT | 0.361 | 0.292 |
| MARSYAS+HT | 0.285 | 0.266 |
| Query method | P@10 | MAP |
|---|---|---|
| EMIF | 0.725 | 0.617 |
| EMIF-W | 0.605 | 0.526 |
| DWCH+HT | 0.435 | 0.382 |
| MARSYAS+HT | 0.365 | 0.291 |
5.2 Efficiency comparison
System reconstruction time (s):

| Total classes | EMIF | DWCH+HT | MARSYAS+HT |
|---|---|---|---|
| 1 | 407 | 206 | 210 |
| 2 | 409 | 400 | 408 |
| 3 | 367 | 720 | 705 |
| 4 | 390 | 900 | 890 |
| 5 | 398 | 1200 | 1250 |
| 6 | 402 | 1805 | 1890 |
| 7 | 387 | 1951 | 2100 |
| 8 | 364 | 2345 | 2580 |
| 9 | 309 | 2876 | 2900 |
| 10 | 399 | 3320 | 3421 |
Query accuracy (P@10):

| Total classes | EMIF | DWCH+HT | MARSYAS+HT |
|---|---|---|---|
| 1 | 0.661 | 0.543 | 0.537 |
| 2 | 0.654 | 0.512 | 0.525 |
| 3 | 0.642 | 0.493 | 0.489 |
| 4 | 0.637 | 0.472 | 0.476 |
| 5 | 0.632 | 0.467 | 0.425 |
| 6 | 0.629 | 0.445 | 0.411 |
| 7 | 0.625 | 0.431 | 0.386 |
| 8 | 0.620 | 0.389 | 0.352 |
| 9 | 0.617 | 0.378 | 0.325 |
| 10 | 0.617 | 0.372 | 0.297 |
5.3 Scalability comparison
- Case I (static): The system is initially trained and tested with 1000 music tracks. We then increase the dataset to 2000 tracks, retrain the system and evaluate it again. This process is repeated until the collection reaches 5000 tracks.
- Case II (incremental): The system is first trained and evaluated with 1000 music tracks. Then 1000 tracks are added to the system without rerunning the training process, and the evaluation is carried out again. The process is repeated until the collection reaches 5000 tracks.
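The two cases differ only in whether training is rerun after each 1000-track addition. A schematic sketch of both evaluation loops, using a stand-in nearest-centroid model in place of a real CBMIR system (all data and the model itself are placeholders, not EMIF):

```python
import numpy as np

rng = np.random.default_rng(4)

def make_batch(n=1000):
    """Stand-in for 1000 labeled tracks: 8-dim signatures, 5 classes."""
    y = rng.integers(0, 5, size=n)
    X = rng.normal(size=(n, 8)) + y[:, None]
    return X, y

def train(X, y):
    """Placeholder trainer: per-class centroids."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def evaluate(model, X, y):
    """Fraction of queries whose nearest centroid matches the label."""
    cents = np.stack([model[c] for c in sorted(model)])
    pred = np.argmin(((X[:, None, :] - cents) ** 2).sum(-1), axis=1)
    return float((pred == y).mean())

# Case I (static): retrain on the whole collection at every size step
X, y = make_batch()
static_scores = []
for step in range(5):
    model = train(X, y)                      # retrained each step
    static_scores.append(evaluate(model, X, y))
    if step < 4:
        Xn, yn = make_batch()
        X, y = np.vstack([X, Xn]), np.concatenate([y, yn])

# Case II (incremental): train once at 1000 tracks, then only add data
X, y = make_batch()
model = train(X, y)                          # never retrained
incr_scores = []
for step in range(5):
    incr_scores.append(evaluate(model, X, y))
    if step < 4:
        Xn, yn = make_batch()
        X, y = np.vstack([X, Xn]), np.concatenate([y, yn])
```

The comparison of interest is how quickly `incr_scores` degrades relative to `static_scores` as the collection grows without retraining.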
Query accuracy (P@10):

| Music size | EMIF | DWCH+HT | MARSYAS+HT |
|---|---|---|---|
| 1000 | 0.657 | 0.531 | 0.502 |
| 2000 | 0.645 | 0.506 | 0.461 |
| 3000 | 0.635 | 0.488 | 0.413 |
| 4000 | 0.629 | 0.419 | 0.375 |
| 5000 | 0.617 | 0.372 | 0.297 |
Query accuracy (P@10):

| Music size | EMIF | DWCH+HT | MARSYAS+HT |
|---|---|---|---|
| 1000 | 0.657 | 0.531 | 0.502 |
| 2000 | 0.625 | 0.506 | 0.461 |
| 3000 | 0.609 | 0.488 | 0.413 |
| 4000 | 0.595 | 0.419 | 0.375 |
| 5000 | 0.590 | 0.372 | 0.297 |