2020 | OriginalPaper | Chapter

Gen-Res-Net: A Novel Generative Model for Singing Voice Separation

Authors : Congzhou Tian, Hangyu Li, Deshun Yang, Xiaoou Chen

Published in: MultiMedia Modeling

Publisher: Springer International Publishing


Abstract

Modeling in the time-frequency domain is the most common way to approach singing voice separation, since the frequency characteristics of the sources differ. In recent years, researchers have mostly tackled the problem by applying recurrent neural networks (RNNs) to sequences of spectrogram frames. More recently, however, the success of the U-net has shifted attention to treating the spectrogram as a two-dimensional image processed by an auto-encoder, which suggests that methods from image analysis can help solve this problem. In this setting, we propose a novel spectrogram-generative model that separates the two sources in the time-frequency domain, inspired by residual blocks, squeeze-and-excitation blocks, and WaveNet. In the main path we apply residual blocks that preserve the feature-map size, combined with squeeze-and-excitation blocks, to extract features from the input spectrogram, while aggregating the block outputs through WaveNet-style skip connections. Experimental results on two datasets (MUSDB18 and CCMixter) show that the proposed network outperforms the current state-of-the-art approach operating on mixture spectrograms, the deep U-net architecture.
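To make the described architecture concrete, the sketch below shows one plausible reading of the abstract in PyTorch: residual blocks whose convolutions keep the spectrogram size unchanged, a squeeze-and-excitation recalibration of each block's channels, and a WaveNet-style summation of every block's output before a soft mask is produced. All layer names, channel counts, and the final 1x1 convolution are illustrative assumptions, not the authors' exact configuration:

# Hedged sketch of a Gen-Res-Net-style model (PyTorch).
# Only the structural ideas (size-preserving residual blocks, SE
# recalibration, WaveNet-style skip aggregation) come from the abstract;
# depths and channel counts are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by global statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        # Squeeze: global average pool over the time-frequency plane.
        s = x.mean(dim=(2, 3))
        # Excitation: two fully connected layers give per-channel gates.
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        return x * s.view(x.size(0), -1, 1, 1)


class ResSEBlock(nn.Module):
    """Residual block that preserves the spectrogram size, plus SE."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.se(self.bn2(self.conv2(h)))
        return F.relu(x + h)  # identity shortcut, no down-sampling


class GenResNetSketch(nn.Module):
    """Stack of Res-SE blocks whose outputs are summed WaveNet-style."""
    def __init__(self, channels=32, n_blocks=8):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.blocks = nn.ModuleList([ResSEBlock(channels) for _ in range(n_blocks)])
        self.out = nn.Conv2d(channels, 1, 1)  # soft mask over the spectrogram

    def forward(self, spec):  # spec: (batch, 1, freq, time)
        h = F.relu(self.stem(spec))
        skips = 0
        for block in self.blocks:
            h = block(h)
            skips = skips + h  # gather every block's output (WaveNet-style)
        mask = torch.sigmoid(self.out(skips))
        return mask * spec  # masked mixture spectrogram = estimated source

For example, GenResNetSketch()(torch.rand(1, 1, 512, 128)) returns a masked magnitude spectrogram of the same shape; the actual model's depth, loss function, and post-processing are described in the full chapter.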


Literature
1.
van der Merwe, P.: Origins of the Popular Style: The Antecedents of Twentieth-Century Popular Music, p. 320. Clarendon Press, Oxford (1989). ISBN 0-19-316121-4
2.
Fujihara, H., Goto, M., Ogata, J., Komatani, K., Ogata, T., Okuno, H.G.: Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In: Proceedings of ISM, pp. 257–264, December 2006
3.
Yang, Y.-H., Chen, H.H.: Machine recognition of music emotion: a review. ACM Trans. Intell. Syst. Technol. 40, 1–30 (2012)
4.
Berenzweig, A., Ellis, D.P.W., Lawrence, S.: Using voice segments to improve artist classification of music. In: AES 22nd International Conference: Virtual, Synthetic, and Entertainment Audio (2002)
5.
Li, Y., Wang, D.: Separation of singing voice from music accompaniment for monaural recordings. IEEE Trans. Audio Speech Lang. Process. 15(4), 1475–1487 (2007)
6.
Rafii, Z., Pardo, B.: Repeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Trans. Audio Speech Lang. Process. 21(1), 73–84 (2013)
8.
Huang, P., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2136–2147 (2015)
9.
Mimilakis, S.I., Drossos, K., Virtanen, T., Schuller, G.: A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation. In: 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), Tokyo, pp. 1–6 (2017)
10.
Luo, Y., Chen, Z., Hershey, J.R., Le Roux, J., Mesgarani, N.: Deep clustering and conventional networks for music separation: stronger together. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, pp. 61–65 (2017)
11.
Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., Weyde, T.: Singing voice separation with deep U-net convolutional networks (2017)
13.
Grais, E.M., Ward, D., Plumbley, M.D.: Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders. In: 2018 26th European Signal Processing Conference (EUSIPCO), pp. 1577–1581. IEEE (2018)
14.
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
15.
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141 (2018)
17.
Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S.I., Bittner, R.: The MUSDB18 corpus for music separation (2017)
18.
Liutkus, A., Fitzgerald, D., Rafii, Z.: Scalable audio separation with light kernel additive modelling. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 76–80. IEEE (2015)
19.
Stoller, D., Ewert, S., Dixon, S.: Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185 (2018)
20.
Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14(4), 1462–1469 (2006)
Metadata
Title
Gen-Res-Net: A Novel Generative Model for Singing Voice Separation
Authors
Congzhou Tian
Hangyu Li
Deshun Yang
Xiaoou Chen
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-37731-1_3