2019 | OriginalPaper | Chapter

A Practical Singing Voice Detection System Based on GRU-RNN

Authors: Zhigao Chen, Xulong Zhang, Jin Deng, Juanjuan Li, Yiliang Jiang, Wei Li

Published in: Proceedings of the 6th Conference on Sound and Music Technology (CSMT)

Publisher: Springer Singapore

Abstract

In this paper, we present a practical three-step approach to singing voice detection based on a gated recurrent unit (GRU) recurrent neural network (RNN); the proposed method achieves results comparable to those of state-of-the-art methods. We combine four classic features: Mel-frequency cepstral coefficients (MFCC), Mel filter bank, linear predictive cepstral coefficients (LPCC), and chroma. The mixed signal is first preprocessed by singing voice separation (SVS) with a deep U-Net convolutional network. Long short-term memory (LSTM) and GRU were both proposed to address the vanishing gradient problem in RNNs. In our experiments, we set the block duration to 120 ms and 720 ms, and obtain results comparable to or better than those of state-of-the-art methods, although the results on Jamendo are not as good as those on RWC-Pop.
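The GRU recurrence mentioned in the abstract can be illustrated with a minimal sketch. It follows the standard update-gate/reset-gate formulation from Chung et al. (2014), without bias terms, and is not the authors' implementation: the weights are random and untrained, and the names `init_params`, `gru_step`, and `vocal_probabilities`, the layer sizes, and the single sigmoid output unit are all assumptions made for illustration. Each input frame stands in for one concatenated feature vector (for example MFCC, Mel filter bank, LPCC, and chroma values).

```python
import math
import random


def sigmoid(a):
    """Logistic function, used for the gates and the output unit."""
    return 1.0 / (1.0 + math.exp(-a))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def init_params(n_in, n_hidden, seed=0):
    """Random, untrained weights (hypothetical sizes, illustration only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = {name: mat(n_hidden, n_in) for name in ("Wz", "Wr", "Wh")}
    params.update({name: mat(n_hidden, n_hidden) for name in ("Uz", "Ur", "Uh")})
    params["w_out"] = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return params


def gru_step(x, h_prev, params):
    """One GRU update for a single feature frame x and previous state h_prev."""
    # Update gate z and reset gate r, both in (0, 1).
    z = [sigmoid(a + b) for a, b in zip(matvec(params["Wz"], x), matvec(params["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(params["Wr"], x), matvec(params["Ur"], h_prev))]
    # Candidate state uses the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(params["Wh"], x), matvec(params["Uh"], gated))]
    # New state interpolates between the previous state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def vocal_probabilities(frames, params):
    """Run the GRU over a frame sequence; return one probability per frame."""
    h = [0.0] * len(params["w_out"])
    probs = []
    for x in frames:
        h = gru_step(x, h, params)
        probs.append(sigmoid(sum(w * hi for w, hi in zip(params["w_out"], h))))
    return probs
```

Calling `vocal_probabilities(frames, init_params(n_in, n_hidden))` on a list of equal-length feature vectors returns one value in (0, 1) per frame. In a trained system such values would be thresholded to obtain vocal/non-vocal decisions; with random weights they carry no meaning.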

Metadata

Title: A Practical Singing Voice Detection System Based on GRU-RNN
Authors: Zhigao Chen, Xulong Zhang, Jin Deng, Juanjuan Li, Yiliang Jiang, Wei Li
Copyright Year: 2019
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-13-8707-4_2