Published in: Neural Computing and Applications 4/2020

07.02.2019 | Deep learning for music and audio

Applying visual domain style transfer and texture synthesis techniques to audio: insights and challenges

Authors: Muhammad Huzaifah bin Md Shahrin, Lonce Wyse


Abstract

Style transfer is a technique for combining two images based on the activations and feature statistics in a deep learning neural network architecture. This paper studies the analogous task in the audio domain and takes a critical look at the problems that arise when adapting the original vision-based framework to handle spectrogram representations. We conclude that CNN architectures with features based on 2D representations and convolutions are better suited for visual images than for time–frequency representations of audio. Despite the awkward fit, experiments show that the Gram-matrix-determined "style" for audio is more closely aligned with timbral signatures without temporal structure, whereas the network layer activity determining audio "content" seems to capture more of the pitch and rhythmic structures. We shed light on several reasons for the domain differences with illustrative examples. We motivate the use of several types of one-dimensional CNNs that generate results better aligned with intuitive notions of audio texture than those based on existing architectures built for images. These ideas also prompt an exploration of audio texture synthesis with architectural variants for extensions to infinite textures, multi-textures, parametric control of receptive fields and the constant-Q transform as an alternative frequency scaling for the spectrogram.
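The distinction the abstract draws between "style" (time-averaged Gram statistics) and "content" (raw layer activations) can be made concrete with a short sketch. The following is a minimal, illustrative implementation of the two loss terms in the Gatys-style framework, not the authors' actual code; the feature arrays here are hypothetical stand-ins for the activations of a 1D CNN applied along the time axis of a spectrogram.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlations of CNN feature maps.

    `features` has shape (channels, time). Averaging over the time
    axis discards temporal ordering, which is why Gram statistics
    capture timbre-like "style" rather than rhythmic structure.
    """
    channels, time_steps = features.shape
    return features @ features.T / time_steps  # (channels, channels)

def style_loss(f_style: np.ndarray, f_gen: np.ndarray) -> float:
    """Squared Frobenius distance between Gram matrices."""
    g_style = gram_matrix(f_style)
    g_gen = gram_matrix(f_gen)
    return float(np.sum((g_style - g_gen) ** 2))

def content_loss(f_content: np.ndarray, f_gen: np.ndarray) -> float:
    """Direct activation matching: positions in time are preserved,
    so pitch and rhythmic structure survive in the "content" term."""
    return float(np.sum((f_content - f_gen) ** 2))
```

In the full optimization, a candidate spectrogram is iteratively updated so that a weighted sum of these two losses, computed over one or more network layers, decreases; the time-averaging in `gram_matrix` is the single step responsible for the style term's insensitivity to temporal structure noted in the abstract.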


Appendix
Available only to authorized users
Footnotes
1. See Sect. 5.3.4 (frequency scaling).
2. When referring to layers, the nomenclature used in this paper is as follows: relu1 refers to the ReLU layer within the 1st convolutional stack, conv3 refers to the convolutional layer in the 3rd stack, and so on.
Metadata
Title
Applying visual domain style transfer and texture synthesis techniques to audio: insights and challenges
Authors
Muhammad Huzaifah bin Md Shahrin
Lonce Wyse
Publication date
07.02.2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 4/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04053-8
