Published in: Neural Computing and Applications 4/2020

16.10.2018 | Deep learning for music and audio

Deep learning for music generation: challenges and directions

Authors: Jean-Pierre Briot, François Pachet


Abstract

In addition to traditional tasks such as prediction, classification and translation, deep learning is receiving growing attention as an approach for music generation, as witnessed by recent research groups such as Magenta at Google and CTRL (Creator Technology Research Lab) at Spotify. The motivation lies in using the capacity of deep learning architectures and training techniques to automatically learn musical styles from arbitrary musical corpora and then to generate samples from the estimated distribution. However, a direct application of deep learning to content generation rapidly reaches limits, as the generated content tends to mimic the training set without exhibiting true creativity. Moreover, deep learning architectures do not offer direct ways of controlling generation (e.g., imposing some tonality or other arbitrary constraints). Furthermore, deep learning architectures alone are autistic automata which generate music autonomously, without human user interaction, far from the objective of interactively assisting musicians to compose and refine music. Issues such as control, structure, creativity and interactivity are the focus of our analysis. In this paper, we select some limitations of the direct application of deep learning to music generation, analyze why these issues remain unresolved, and discuss possible approaches for addressing them. Various recent systems are cited as examples of promising directions.


Footnotes
1
With many variants such as convolutional networks, recurrent networks, autoencoders, restricted Boltzmann machines [15].
 
2
Music that is difficult to distinguish from the original corpus.
 
3
Additional challenges are analyzed in [2].
 
4
Two examples are Markov constraints [31] and factor graphs [30].
 
5
The model can be stochastic, such as a restricted Boltzmann machine (RBM) [15], or deterministic, such as a feedforward or a recurrent network. In the latter case, it is common practice to sample from the softmax output in order to introduce variability into the generated content [2].
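As a minimal illustration, a numpy sketch of sampling from a softmax output (our own sketch; the temperature parameter is an added, commonly used knob, not something this footnote mentions):

    import numpy as np

    def sample_from_softmax(logits, temperature=1.0):
        rng = np.random.default_rng()
        # temperature < 1 sharpens the distribution (more conservative),
        # temperature > 1 flattens it (more variability)
        scaled = np.asarray(logits, dtype=np.float64) / temperature
        scaled -= scaled.max()  # subtract max for numerical stability
        probs = np.exp(scaled) / np.exp(scaled).sum()
        return rng.choice(len(probs), p=probs)

    # e.g., logits over a pitch vocabulary produced by a recurrent network
    next_note = sample_from_softmax([2.1, 0.3, -1.0, 1.7], temperature=0.8)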
 
6
Note that this may be a very costly process and, moreover, one with no guarantee of success.
 
7
An important specific feature of the architecture (not discussed here) is the notion of dilated convolution, where convolution filters are incrementally dilated in order to provide very large receptive fields with just a few layers, while preserving input resolution and computational efficiency [40].
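As a small illustration (our own sketch, not code from [40]), with kernel size 2 and dilations doubling at each layer as in WaveNet, the receptive field grows exponentially with depth:

    def receptive_field(num_layers, kernel_size=2):
        # each layer with dilation 2**i adds (kernel_size - 1) * 2**i time steps
        field = 1
        for i in range(num_layers):
            field += (kernel_size - 1) * 2 ** i
        return field

    print([receptive_field(n) for n in (1, 4, 10)])  # [2, 16, 1024]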
 
8
Both are two-layer LSTMs [22].
 
9
Autoencoders are trained with the same data as input and output and therefore have to discover significant features in order to be able to reconstruct the input from its compressed representation.
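A minimal sketch of this training setup, assuming a Keras-style model over flattened piano-roll vectors (the dimensions and input encoding are illustrative assumptions, not from the paper):

    import tensorflow as tf

    input_dim, bottleneck_dim = 512, 32  # illustrative sizes
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(bottleneck_dim, activation="relu"),  # encoder
        tf.keras.layers.Dense(input_dim, activation="sigmoid"),    # decoder
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # The defining trait: the training target is the input itself.
    # model.fit(x_train, x_train, epochs=10)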
 
10
Note that this is a simple example of transfer learning [15], with the same domain and the same training data, but for a different task.
 
11
A variational autoencoder (VAE) [25] is an autoencoder with the added constraint that the encoded representation (its latent variables) follows some prior probability distribution (usually a Gaussian distribution). Therefore, a variational autoencoder is able to learn a “smooth” latent space mapping to realistic examples.
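For concreteness, a numpy sketch (our illustration) of the two ingredients that distinguish a VAE from a plain autoencoder: reparameterized sampling of the latent code, and the Kullback-Leibler term that pulls the latent distribution toward the Gaussian prior:

    import numpy as np

    def reparameterize(mu, log_var):
        # z ~ N(mu, sigma^2), written as a deterministic function of noise
        # so that gradients can flow through mu and log_var
        eps = np.random.default_rng().standard_normal(np.shape(mu))
        return mu + np.exp(0.5 * log_var) * eps

    def kl_to_standard_normal(mu, log_var):
        # closed-form KL(N(mu, sigma^2) || N(0, 1))
        return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

    # total VAE loss = reconstruction loss + KL term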
 
12
Note that one may balance between the content and style objectives through the \(\alpha \) and \(\beta \) parameters in the \(\mathcal {L}_{total}\) combined loss function shown at the top of Fig. 5.
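For reference, the combined loss of Gatys et al. [14] has the general form (with \(\vec{p}\) the content image, \(\vec{a}\) the style image and \(\vec{x}\) the generated image):

\[ \mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \, \mathcal{L}_{style}(\vec{a}, \vec{x}) \]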
 
13
In the case of an image, the correlations between visual elements (pixels) are equivalent whatever the direction (horizontal axis, vertical axis, diagonal axis or any arbitrary direction); in other words, correlations are isotropic. In the case of a global representation of musical content (see, e.g., Fig. 12), where the horizontal dimension represents time and the vertical dimension represents the notes, horizontal correlations represent temporal correlations and vertical correlations represent harmonic correlations, which are of a very different nature.
 
14
Note that this is also a kind of style transfer [5], although of a high-level structure rather than a low-level timbre as in Sect. 2.4.3.
 
15
They use a deep learning implementation of the Q-learning algorithm: a Q network is trained in parallel with a target Q network, which estimates the value of the expected gain [41].
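Schematically (a generic sketch of the double Q-learning target described in [41], not the actual implementation of the system discussed):

    import numpy as np

    def double_q_target(reward, next_q_online, next_q_target, gamma=0.99):
        # the online network selects the best next action...
        a = int(np.argmax(next_q_online))
        # ...while the target network (a periodically synchronized copy
        # of the online network) evaluates it
        return reward + gamma * next_q_target[a]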
 
16
At a finer-grained (note-to-note) level than the previous one.
 
17
In order to prioritize the conductor RNN over the bottom-layer RNN, the bottom-layer RNN's initial state is reinitialized with the conductor-generated embedding for each new subsequence.
 
18
On this issue, see a recent paper [6].
 
19
Note that this addresses the issue of avoiding a significant recopy from the training corpus, but it does not prevent reinventing existing music from outside the training corpus.
 
20
An example of interactive composition environment is FlowComposer [33]. It is based on various techniques such as Markov models, constraint solving and rules.
 
21
The representation shown is a piano roll with two simultaneous voices (tracks). Parts already processed are in light gray; parts currently being processed have a thick line and are marked as "current"; notes to be played are in blue.
 
22
J. S. Bach chose various given melodies for the soprano voice and composed the three additional voices (alto, tenor and bass) in counterpoint.
 
23
The two bottom lines correspond to metadata (fermata and beat information), not detailed here.
 
24
A more complete survey and analysis may be found in [2].
 
References
1. Bretan M, Weinberg G, Heck L (2016) A unit selection methodology for music generation using deep neural networks. arXiv:1612.03789v1
2. Briot JP, Hadjeres G, Pachet F (2018) Deep learning techniques for music generation. Computational synthesis and creative systems. Springer, London
3. Cope D (2000) The algorithmic composer. A-R Editions
4. Cun YL, Bengio Y (1998) Convolutional networks for images, speech, and time-series. In: The handbook of brain theory and neural networks. MIT Press, Cambridge, pp 255–258
5. Dai S, Zhang Z, Xia GG (2018) Music style transfer issues: a position paper. arXiv:1803.06841v1
6. Deltorn JM (2017) Deep creations: intellectual property and the automata. Front Digit Humanit 4:3
7. Doya K, Uchibe E (2005) The cyber rodent project: exploration of adaptive mechanisms for self-preservation and self-reproduction. Adapt Behav 13(2):149–160
8. Ebcioğlu K (1988) An expert system for harmonizing four-part chorales. Comput Music J 12(3):43–51
9. Elgammal A, Liu B, Elhoseiny M, Mazzone M (2017) CAN: creative adversarial networks generating "art" by learning about styles and deviating from style norms. arXiv:1706.07068v1
10. Fabius O, van Amersfoort JR (2015) Variational recurrent auto-encoders. arXiv:1412.6581v6
11. Fernández JD, Vico F (2013) AI methods in algorithmic composition: a comprehensive survey. J Artif Intell Res 48:513–582
12. Fiebrink R, Caramiaux B (2016) The machine learning algorithm as creative musical tool. arXiv:1611.00379v1
14. Gatys LA, Ecker AS, Bethge M (2015) A neural algorithm of artistic style. arXiv:1508.06576v2
15. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
16. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozairy S, Courville A, Bengio Y (2014) Generative adversarial nets. arXiv:1406.2661v1
17. Graves A (2014) Generating sequences with recurrent neural networks. arXiv:1308.0850v5
18. Hadjeres G, Nielsen F (2017) Interactive music generation with positional constraints using anticipation-RNN. arXiv:1709.06404v1
19. Hadjeres G, Pachet F, Nielsen F (2017) DeepBach: a steerable model for Bach chorales generation. arXiv:1612.01010v2
20. Herremans D, Chuan CH, Chew E (2017) A functional taxonomy of music generation systems. ACM Comput Surv 50(5):69
21. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
22. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
23. Hofstadter D (2001) Staring Emmy straight in the eye—and doing my best not to flinch. In: Cope D (ed) Virtual music—computer synthesis of musical style. MIT Press, Cambridge, pp 33–82
24. Jaques N, Gu S, Turner RE, Eck D (2016) Tuning recurrent neural networks with reinforcement learning. arXiv:1611.02796
25. Kingma DP, Welling M (2014) Auto-encoding variational Bayes. arXiv:1312.6114v10
26. Lattner S, Grachten M, Widmer G (2016) Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints. arXiv:1612.04742v2
27. Makris D, Kaliakatsos-Papakostas M, Karydis I, Kermanidis KL (2017) Combining LSTM and feed forward neural networks for conditional rhythm composition. In: Boracchi G, Iliadis L, Jayne C, Likas A (eds) Engineering applications of neural networks: 18th international conference, EANN 2017, Athens, Greece, Aug 25–27, 2017, proceedings. Springer, London, pp 570–582
29. Nierhaus G (2009) Algorithmic composition: paradigms of automated music generation. Springer, Berlin
30. Pachet F, Papadopoulos A, Roy P (2017) Sampling variations of sequences for structured music generation. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR 2017), Suzhou, China, Oct 23–27, pp 167–173
31. Pachet F, Roy P, Barbieri G (2011) Finite-length Markov processes with constraints. In: Proceedings of the 22nd international joint conference on artificial intelligence (IJCAI 2011), Barcelona, Spain, pp 635–642
32. Papadopoulos A, Roy P, Pachet F (2014) Avoiding plagiarism in Markov sequence generation. In: Proceedings of the 28th AAAI conference on artificial intelligence (AAAI 2014), Québec, PQ, Canada, pp 2731–2737
33. Papadopoulos A, Roy P, Pachet F (2016) Assisted lead sheet composition using FlowComposer. In: Rueher M (ed) Principles and practice of constraint programming: 22nd international conference, CP 2016, Toulouse, France, Sept 5–9, 2016, proceedings. Springer, London, pp 769–785
34. Papadopoulos G, Wiggins G (1999) AI methods for algorithmic composition: a survey, a critical view and future prospects. In: AISB 1999 symposium on musical creativity, pp 110–117
35. Roberts A, Engel J, Raffel C, Hawthorne C, Eck D (2018) A hierarchical latent vector model for learning long-term structure in music. arXiv:1803.05428v2
36. Roberts A, Engel J, Raffel C, Hawthorne C, Eck D (2018) A hierarchical latent vector model for learning long-term structure in music. In: Proceedings of the 35th international conference on machine learning (ICML 2018). ACM, Montréal
37. Steedman M (1984) A generative grammar for Jazz chord sequences. Music Percept 2(1):52–77
40. van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) WaveNet: a generative model for raw audio. arXiv:1609.03499v2
41. van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. arXiv:1509.06461v3
42. Yang LC, Chou SY, Yang YH (2017) MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR 2017). Suzhou, China
Metadata
Title
Deep learning for music generation: challenges and directions
Authors
Jean-Pierre Briot
François Pachet
Publication date
16.10.2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 4/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-018-3813-6
