Published in: Neural Computing and Applications 4/2020

07.02.2019 | Deep learning for music and audio

Applying visual domain style transfer and texture synthesis techniques to audio: insights and challenges

Authors: Muhammad Huzaifah bin Md Shahrin, Lonce Wyse


Abstract

Style transfer is a technique for combining two images based on the activations and feature statistics in a deep learning neural network architecture. This paper studies the analogous task in the audio domain and takes a critical look at the problems that arise when adapting the original vision-based framework to handle spectrogram representations. We conclude that CNN architectures with features based on 2D representations and convolutions are better suited for visual images than for time–frequency representations of audio. Despite the awkward fit, experiments show that the Gram-matrix-determined "style" for audio is more closely aligned with timbral signatures without temporal structure, whereas the network layer activity determining audio "content" seems to capture more of the pitch and rhythmic structures. We shed light on several reasons for the domain differences with illustrative examples. We motivate the use of several types of one-dimensional CNNs that generate results better aligned with intuitive notions of audio texture than those based on existing architectures built for images. These ideas also prompt an exploration of audio texture synthesis with architectural variants for extensions to infinite textures, multi-textures, parametric control of receptive fields and the constant-Q transform as an alternative frequency scaling for the spectrogram.
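The distinction the abstract draws between "style" (time-averaged Gram statistics) and "content" (raw layer activations) can be made concrete with a short sketch. The following is a minimal, illustrative implementation of the two loss terms in the Gatys-style framework, not the authors' actual code; the feature arrays here are hypothetical stand-ins for the activations of a 1D CNN applied along the time axis of a spectrogram.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlations of CNN feature maps.

    `features` has shape (channels, time). Averaging over the time
    axis discards temporal ordering, which is why Gram statistics
    capture timbre-like "style" rather than rhythmic structure.
    """
    channels, time_steps = features.shape
    return features @ features.T / time_steps  # (channels, channels)

def style_loss(f_style: np.ndarray, f_gen: np.ndarray) -> float:
    """Squared Frobenius distance between Gram matrices."""
    g_style = gram_matrix(f_style)
    g_gen = gram_matrix(f_gen)
    return float(np.sum((g_style - g_gen) ** 2))

def content_loss(f_content: np.ndarray, f_gen: np.ndarray) -> float:
    """Direct activation matching: positions in time are preserved,
    so pitch and rhythmic structure survive in the "content" term."""
    return float(np.sum((f_content - f_gen) ** 2))
```

In the full optimization, a candidate spectrogram is iteratively updated so that a weighted sum of these two losses, computed over one or more network layers, decreases; the time-averaging in `gram_matrix` is the single step responsible for the style term's insensitivity to temporal structure noted in the abstract.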


Appendix
Available only to authorized users
Footnotes
1. See Sect. 5.3.4 (frequency scaling).
2. When referring to layers, the nomenclature used in this paper is as follows: relu1 refers to the ReLU layer within the 1st convolutional stack, conv3 refers to the convolutional layer in the 3rd stack, and so on.
Metadata
Title
Applying visual domain style transfer and texture synthesis techniques to audio: insights and challenges
Authors
Muhammad Huzaifah bin Md Shahrin
Lonce Wyse
Publication date
07.02.2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 4/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04053-8
