Published in: Neural Computing and Applications 4/2020

16.10.2018 | Deep learning for music and audio

Deep learning for music generation: challenges and directions

Authors: Jean-Pierre Briot, François Pachet


Abstract

In addition to traditional tasks such as prediction, classification and translation, deep learning is receiving growing attention as an approach for music generation, as witnessed by recent research groups such as Magenta at Google and CTRL (Creator Technology Research Lab) at Spotify. The motivation lies in using the capacity of deep learning architectures and training techniques to automatically learn musical styles from arbitrary musical corpora and then to generate samples from the estimated distribution. However, a direct application of deep learning to content generation rapidly reaches limits, as the generated content tends to mimic the training set without exhibiting true creativity. Moreover, deep learning architectures do not offer direct ways of controlling generation (e.g., imposing some tonality or other arbitrary constraints). Furthermore, deep learning architectures alone are autistic automata which generate music autonomously, without human user interaction, far from the objective of interactively assisting musicians to compose and refine music. Issues such as control, structure, creativity and interactivity are the focus of our analysis. In this paper, we select some limitations of the direct application of deep learning to music generation, analyze why these issues remain unresolved, and discuss possible approaches for addressing them. Various recent systems are cited as examples of promising directions.


Footnotes
1
With many variants such as convolutional networks, recurrent networks, autoencoders, restricted Boltzmann machines [15].
 
2
Music that is difficult to distinguish from the original corpus.
 
3
Additional challenges are analyzed in [2].
 
4
Two examples are Markov constraints [31] and factor graphs [30].
 
5
The model can be stochastic, such as a restricted Boltzmann machine (RBM) [15], or deterministic, such as a feedforward or a recurrent network. In the latter case, it is common practice to sample from the softmax output in order to introduce variability into the generated content [2].
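As a minimal illustration, a numpy sketch of sampling from a softmax output (our own sketch; the temperature parameter is an added, commonly used knob, not something this footnote mentions):

    import numpy as np

    def sample_from_softmax(logits, temperature=1.0):
        rng = np.random.default_rng()
        # temperature < 1 sharpens the distribution (more conservative),
        # temperature > 1 flattens it (more variability)
        scaled = np.asarray(logits, dtype=np.float64) / temperature
        scaled -= scaled.max()  # subtract max for numerical stability
        probs = np.exp(scaled) / np.exp(scaled).sum()
        return rng.choice(len(probs), p=probs)

    # e.g., logits over a pitch vocabulary produced by a recurrent network
    next_note = sample_from_softmax([2.1, 0.3, -1.0, 1.7], temperature=0.8)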
 
6
Note that this may be a very costly process and, moreover, one with no guarantee of success.
 
7
An important specific feature of the architecture (not discussed here) is the notion of dilated convolution, where convolution filters are incrementally dilated in order to provide very large receptive fields with just a few layers, while preserving input resolution and computational efficiency [40].
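As a small illustration (our own sketch, not code from [40]), with kernel size 2 and dilations doubling at each layer as in WaveNet, the receptive field grows exponentially with depth:

    def receptive_field(num_layers, kernel_size=2):
        # each layer with dilation 2**i adds (kernel_size - 1) * 2**i time steps
        field = 1
        for i in range(num_layers):
            field += (kernel_size - 1) * 2 ** i
        return field

    print([receptive_field(n) for n in (1, 4, 10)])  # [2, 16, 1024]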
 
8
Both are two-layer LSTMs [22].
 
9
Autoencoders are trained with the same data as input and output and therefore have to discover significant features in order to be able to reconstruct the input from its compressed representation.
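A minimal sketch of this training setup, assuming a Keras-style model over flattened piano-roll vectors (the dimensions and input encoding are illustrative assumptions, not from the paper):

    import tensorflow as tf

    input_dim, bottleneck_dim = 512, 32  # illustrative sizes
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(bottleneck_dim, activation="relu"),  # encoder
        tf.keras.layers.Dense(input_dim, activation="sigmoid"),    # decoder
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # The defining trait: the training target is the input itself.
    # model.fit(x_train, x_train, epochs=10)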
 
10
Note that this is a simple example of transfer learning [15], with the same domain and the same training data, but for a different task.
 
11
A variational autoencoder (VAE) [25] is an autoencoder with the added constraint that the encoded representation (its latent variables) follows some prior probability distribution (usually a Gaussian distribution). Therefore, a variational autoencoder is able to learn a “smooth” latent space mapping to realistic examples.
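For concreteness, a numpy sketch (our illustration) of the two ingredients that distinguish a VAE from a plain autoencoder: reparameterized sampling of the latent code, and the Kullback-Leibler term that pulls the latent distribution toward the Gaussian prior:

    import numpy as np

    def reparameterize(mu, log_var):
        # z ~ N(mu, sigma^2), written as a deterministic function of noise
        # so that gradients can flow through mu and log_var
        eps = np.random.default_rng().standard_normal(np.shape(mu))
        return mu + np.exp(0.5 * log_var) * eps

    def kl_to_standard_normal(mu, log_var):
        # closed-form KL(N(mu, sigma^2) || N(0, 1))
        return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

    # total VAE loss = reconstruction loss + KL term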
 
12
Note that one may balance between the content and style objectives through the \(\alpha \) and \(\beta \) parameters in the \(\mathcal {L}_{total}\) combined loss function shown at the top of Fig. 5.
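For reference, the combined loss of Gatys et al. [14] has the general form (with \(\vec{p}\) the content image, \(\vec{a}\) the style image and \(\vec{x}\) the generated image):

\[ \mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \, \mathcal{L}_{style}(\vec{a}, \vec{x}) \]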
 
13
In the case of an image, the correlations between visual elements (pixels) are equivalent whatever the direction (horizontal axis, vertical axis, diagonal axis or any arbitrary direction); in other words, correlations are isotropic. In the case of a global representation of musical content (see, e.g., Fig. 12), where the horizontal dimension represents time and the vertical dimension represents the notes, horizontal correlations represent temporal correlations and vertical correlations represent harmonic correlations, which are of a very different nature.
 
14
Note that this is also a kind of style transfer [5], although of a high-level structure rather than a low-level timbre as in Sect. 2.4.3.
 
15
They use a deep learning implementation of the Q-learning algorithm: a Q network is trained in parallel with a target Q network, which estimates the value of the expected gain [41].
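Schematically (a generic sketch of the double Q-learning target described in [41], not the actual implementation of the system discussed):

    import numpy as np

    def double_q_target(reward, next_q_online, next_q_target, gamma=0.99):
        # the online network selects the best next action...
        a = int(np.argmax(next_q_online))
        # ...while the target network (a periodically synchronized copy
        # of the online network) evaluates it
        return reward + gamma * next_q_target[a]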
 
16
At a finer-grained (note-to-note) level than the previous one.
 
17
In order to prioritize the conductor RNN over the bottom-layer RNN, the bottom-layer RNN's initial state is reinitialized with the conductor-generated embedding for each new subsequence.
 
18
On this issue, see a recent paper [6].
 
19
Note that this addresses the issue of avoiding a significant recopy from the training corpus, but it does not prevent reinventing existing music from outside the training corpus.
 
20
An example of interactive composition environment is FlowComposer [33]. It is based on various techniques such as Markov models, constraint solving and rules.
 
21
The representation shown is a piano roll with two simultaneous voices (tracks). Parts already processed are in light gray; parts currently being processed have a thick line and are marked as "current"; notes to be played are in blue.
 
22
J. S. Bach chose various given melodies for the soprano voice and composed the three additional voices (alto, tenor and bass) in counterpoint.
 
23
The two bottom lines correspond to metadata (fermata and beat information), not detailed here.
 
24
A more complete survey and analysis may be found in [2].
 
References
1. Bretan M, Weinberg G, Heck L (2016) A unit selection methodology for music generation using deep neural networks. arXiv:1612.03789v1
2. Briot JP, Hadjeres G, Pachet F (2018) Deep learning techniques for music generation. Computational synthesis and creative systems. Springer, London
3. Cope D (2000) The algorithmic composer. A-R Editions
4. Cun YL, Bengio Y (1998) Convolutional networks for images, speech, and time-series. In: The handbook of brain theory and neural networks. MIT Press, Cambridge, pp 255–258
5. Dai S, Zhang Z, Xia GG (2018) Music style transfer issues: a position paper. arXiv:1803.06841v1
6. Deltorn JM (2017) Deep creations: intellectual property and the automata. Front Digit Humanit 4:3
7. Doya K, Uchibe E (2005) The cyber rodent project: exploration of adaptive mechanisms for self-preservation and self-reproduction. Adapt Behav 13(2):149–160
8. Ebcioğlu K (1988) An expert system for harmonizing four-part chorales. Comput Music J 12(3):43–51
9. Elgammal A, Liu B, Elhoseiny M, Mazzone M (2017) CAN: creative adversarial networks generating "art" by learning about styles and deviating from style norms. arXiv:1706.07068v1
10. Fabius O, van Amersfoort JR (2015) Variational recurrent auto-encoders. arXiv:1412.6581v6
11. Fernández JD, Vico F (2013) AI methods in algorithmic composition: a comprehensive survey. J Artif Intell Res 48:513–582
12. Fiebrink R, Caramiaux B (2016) The machine learning algorithm as creative musical tool. arXiv:1611.00379v1
14. Gatys LA, Ecker AS, Bethge M (2015) A neural algorithm of artistic style. arXiv:1508.06576v2
15. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
16. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozairy S, Courville A, Bengio Y (2014) Generative adversarial nets. arXiv:1406.2661v1
17. Graves A (2014) Generating sequences with recurrent neural networks. arXiv:1308.0850v5
18. Hadjeres G, Nielsen F (2017) Interactive music generation with positional constraints using anticipation-RNN. arXiv:1709.06404v1
19. Hadjeres G, Pachet F, Nielsen F (2017) DeepBach: a steerable model for Bach chorales generation. arXiv:1612.01010v2
20. Herremans D, Chuan CH, Chew E (2017) A functional taxonomy of music generation systems. ACM Comput Surv 50(5):69
21. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
22. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
23. Hofstadter D (2001) Staring Emmy straight in the eye—and doing my best not to flinch. In: Cope D (ed) Virtual music—computer synthesis of musical style. MIT Press, Cambridge, pp 33–82
24. Jaques N, Gu S, Turner RE, Eck D (2016) Tuning recurrent neural networks with reinforcement learning. arXiv:1611.02796
25. Kingma DP, Welling M (2014) Auto-encoding variational Bayes. arXiv:1312.6114v10
26. Lattner S, Grachten M, Widmer G (2016) Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints. arXiv:1612.04742v2
27. Makris D, Kaliakatsos-Papakostas M, Karydis I, Kermanidis KL (2017) Combining LSTM and feed forward neural networks for conditional rhythm composition. In: Boracchi G, Iliadis L, Jayne C, Likas A (eds) Engineering applications of neural networks: 18th international conference, EANN 2017, Athens, Greece, Aug 25–27, 2017, proceedings. Springer, London, pp 570–582
29. Nierhaus G (2009) Algorithmic composition: paradigms of automated music generation. Springer, Berlin
30. Pachet F, Papadopoulos A, Roy P (2017) Sampling variations of sequences for structured music generation. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR 2017), Suzhou, China, Oct 23–27, pp 167–173
31. Pachet F, Roy P, Barbieri G (2011) Finite-length Markov processes with constraints. In: Proceedings of the 22nd international joint conference on artificial intelligence (IJCAI 2011), Barcelona, Spain, pp 635–642
32. Papadopoulos A, Roy P, Pachet F (2014) Avoiding plagiarism in Markov sequence generation. In: Proceedings of the 28th AAAI conference on artificial intelligence (AAAI 2014), Québec, PQ, Canada, pp 2731–2737
33. Papadopoulos A, Roy P, Pachet F (2016) Assisted lead sheet composition using FlowComposer. In: Rueher M (ed) Principles and practice of constraint programming: 22nd international conference, CP 2016, Toulouse, France, Sept 5–9, 2016, proceedings. Springer, London, pp 769–785
34. Papadopoulos G, Wiggins G (1999) AI methods for algorithmic composition: a survey, a critical view and future prospects. In: AISB 1999 symposium on musical creativity, pp 110–117
35. Roberts A, Engel J, Raffel C, Hawthorne C, Eck D (2018) A hierarchical latent vector model for learning long-term structure in music. arXiv:1803.05428v2
36. Roberts A, Engel J, Raffel C, Hawthorne C, Eck D (2018) A hierarchical latent vector model for learning long-term structure in music. In: Proceedings of the 35th international conference on machine learning (ICML 2018). ACM, Montréal
37. Steedman M (1984) A generative grammar for Jazz chord sequences. Music Percept 2(1):52–77
40. van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) WaveNet: a generative model for raw audio. arXiv:1609.03499v2
41. van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. arXiv:1509.06461v3
42. Yang LC, Chou SY, Yang YH (2017) MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR 2017). Suzhou, China
Metadata
Title
Deep learning for music generation: challenges and directions
Authors
Jean-Pierre Briot
François Pachet
Publication date
16.10.2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 4/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-018-3813-6
