2021 | OriginalPaper | Book Chapter

The Deep Learning Revolution in MIR: The Pros and Cons, the Needs and the Challenges

Author: Geoffroy Peeters

Published in: Perception, Representations, Image, Sound, Music

Publisher: Springer International Publishing

Abstract

This paper deals with the deep learning revolution in Music Information Research (MIR), i.e. the switch from knowledge-driven hand-crafted systems to data-driven deep-learning systems. To discuss the pros and cons of this revolution, we first review the basic elements of deep learning and explain how they can be used for audio feature learning or for solving difficult MIR tasks. We then discuss the case of hand-crafted features and demonstrate that, while these were indeed shallow and explainable at first, they tended to become deep, data-driven and unexplainable over time, even before the reign of deep learning. The development of these data-driven approaches was enabled by increasing access to large annotated datasets. We therefore argue that such annotated datasets are today the central and most sustainable element of any MIR research, and we propose new ways to obtain them at scale. Finally, we highlight a set of challenges that the deep learning revolution in MIR must face, especially concerning the consideration of music specificities, the explainability of the models (X-AI) and their environmental cost (Green-AI).

Footnotes
2
“More recently, deep learning techniques have been used for automatic feature learning in MIR tasks, where they have been reported to be superior to the use of hand-crafted feature sets for classification tasks, although these results have not yet been replicated in MIREX evaluations. It should be noted however that automatically generated features might not be musically meaningful, which limits their usefulness.”
 
3
such as the adjacent pixels that form a “cat’s ear”.
 
4
such as \(\vec{W}_{ij}^{[l]}\) representing a “cat’s ear” detector.
 
5
1 s of an audio signal with a sampling rate of 44,100 Hz is a vector of dimension 44,100.
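For illustration, a minimal NumPy sketch of this dimensionality (the 440 Hz sine is an arbitrary stand-in for real audio):

import numpy as np

sr = 44_100                          # sampling rate in Hz
t = np.arange(int(1.0 * sr)) / sr    # one second of time samples
x = np.sin(2 * np.pi * 440.0 * t)    # a 440 Hz sine as stand-in audio
print(x.shape)                       # (44100,): one second is already a 44,100-dimensional vector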
 
6
non-synthetic.
 
7
or more elaborated versions of it.
 
8
or more elaborated algorithms.
 
9
Consider the case of “Blurred Lines” by Pharrell Williams and Robin Thicke and “Got to Give It Up” by Marvin Gaye.
 
11
The “timbre spaces” are the results of a Multi-Dimensional-Scaling (MDS) analysis of similarity/dissimilarity user ratings between pairs of sounds as obtained through perceptual experiments [78].
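For illustration, a minimal scikit-learn sketch of such an MDS analysis; the dissimilarity matrix below is random stand-in data, not the perceptual ratings of [78]:

import numpy as np
from sklearn.manifold import MDS

# Stand-in for averaged pairwise dissimilarity ratings between 10 sounds
# (symmetric, zero diagonal); studies such as [78] collect these from listeners.
rng = np.random.default_rng(0)
d = rng.random((10, 10))
dissim = (d + d.T) / 2
np.fill_diagonal(dissim, 0.0)

# Embed each sound as a point in a low-dimensional "timbre space"
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
timbre_space = mds.fit_transform(dissim)   # shape (10, 3): one point per sound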
 
12
The idea of using DL for representation learning in audio was initially proposed in the case of speech as described in [52].
 
13
using an Expectation-Maximization algorithm.
 
14
The citation figures are derived from Google Scholar as of December 15th, 2020.
 
15
The first period encloses all the models from the “connectionist speech recognition” approaches [14], “tandem features” [48], and “bottleneck features” [45], up to the seminal paper [52] (which defines the new baseline for speech recognition systems as the DNN-HMM model).
 
16
where a sound \(x(t)\) is considered as the result of the convolution of a periodic source signal \(s(t)\) with a filter \(h(t)\): \(x(t) = (s * h)(t)\).
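A minimal NumPy sketch of this source-filter model (the impulse-train source and the decaying resonance are arbitrary illustrative choices):

import numpy as np

sr = 16_000
f0 = 100                     # source periodicity: 100 Hz
s = np.zeros(sr)             # 1 s of signal
s[:: sr // f0] = 1.0         # periodic source s(t): an impulse train
n = np.arange(400)
h = np.exp(-n / 80.0) * np.cos(2 * np.pi * 800 * n / sr)   # toy resonant filter h(t)
x = np.convolve(s, h)[:sr]   # x(t) = (s * h)(t)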
 
17
where a sound with a pitch \(f_0\) is represented in the spectral domain as a set of harmonically related components at frequencies \(h f_0\), \(h \in \mathbb{N}^+\), with amplitudes \(a_h\).
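A minimal additive-synthesis sketch of this harmonic model (the 1/h amplitude decay is an arbitrary illustrative choice):

import numpy as np

sr, f0 = 44_100, 220.0
t = np.arange(sr) / sr                       # 1 s time axis
x = np.zeros_like(t)
for h in range(1, int((sr / 2) // f0) + 1):  # keep h * f0 below the Nyquist frequency
    a_h = 1.0 / h                            # amplitude a_h of the h-th harmonic
    x += a_h * np.sin(2 * np.pi * h * f0 * t)
x /= np.abs(x).max()                         # normalize to avoid clipping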
 
References
1. Andén, J., Lostanlen, V., Mallat, S.: Joint time-frequency scattering for audio classification. In: Proceedings of IEEE MLSP (International Workshop on Machine Learning for Signal Processing) (2015)
2. Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of IEEE ICCV (International Conference on Computer Vision) (2017)
4. Atlas, L., Shamma, S.A.: Joint acoustic and modulation frequency. EURASIP J. Adv. Signal Process. 2003(7), 1–8 (2003)
5. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: learning sound representations from unlabeled video. In: Proceedings of NIPS (Conference on Neural Information Processing Systems) (2016)
6. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR (International Conference on Learning Representations) (2015)
7. Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
8. Ballet, G., Borghesi, R., Hoffman, P., Lévy, F.: Studio Online 3.0: an internet ’killer application’ for remote access to IRCAM sounds and processing tools. In: Proceedings of JIM (Journées d’Informatique Musicale), Issy-les-Moulineaux, France (1999)
9. Basaran, D., Essid, S., Peeters, G.: Main melody extraction with source-filter NMF and C-RNN. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Paris, France, 23–27 September 2018
10. Bertin-Mahieux, T., Ellis, D.P., Whitman, B., Lamere, P.: The Million Song Dataset. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Miami, Florida, USA (2011)
11. Bittner, R., McFee, B., Salamon, J., Li, P., Bello, J.P.: Deep salience representations for f0 estimation in polyphonic music. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017
12. Bittner, R.M., Salamon, J., Tierney, M., Mauch, M., Cannam, C., Bello, J.P.: MedleyDB: a multitrack dataset for annotation-intensive MIR research. ISMIR 14, 155–160 (2014)
13. Bogdanov, D., et al.: Essentia: an audio analysis library for music information retrieval. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Curitiba, PR, Brazil (2013)
14. Bourlard, H.A., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach, vol. 247. Springer, US (1994)
15. Bridle, J.S., Brown, M.D.: An experimental automatic word recognition system. JSRU Report 1003(5), 33 (1974)
16. Brown, G.J., Cooke, M.: Computational auditory scene analysis. Comput. Speech Lang. 8(4), 297–336 (1994)
17. Charbuillet, C., Tardieu, D., Peeters, G.: GMM supervector for content based music similarity. In: Proceedings of DAFx (International Conference on Digital Audio Effects), Paris, France, pp. 425–428, September 2011
18. Chi, T., Ru, P., Shamma, S.A.: Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118(2), 887–906 (2005)
19. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of EMNLP (Conference on Empirical Methods in Natural Language Processing) (2014)
20. Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. In: Proceedings of ISMIR (International Society for Music Information Retrieval), New York, USA (2016)
21. Cohen-Hadria, A., Roebel, A., Peeters, G.: Improving singing voice separation using deep U-Net and Wave-U-Net with data augmentation. In: Proceedings of EUSIPCO (European Signal Processing Conference), Coruña, Spain, 2–6 September 2019
22. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: a dataset for music analysis. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017
23. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)
24. Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., Sutskever, I.: Jukebox: a generative model for music. arXiv preprint arXiv:2005.00341 (2020)
26. Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968. IEEE (2014)
27. Doras, G., Peeters, G.: Cover detection using dominant melody embeddings. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Delft, The Netherlands, 4–8 November 2019
28. Doras, G., Peeters, G.: A prototypical triplet loss for cover detection. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Barcelona, Spain, 4–8 May 2020
29. Doras, G., Yesiler, F., Serra, J., Gomez, E., Peeters, G.: Combining musical features for cover detection. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Montreal, Canada, 11–15 October 2020
30. Durrieu, J.L., Richard, G., David, B., Févotte, C.: Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Trans. Audio Speech Lang. Process. 18(3), 564–575 (2010)
31. Eghbal-zadeh, H., Lehner, B., Schedl, M., Widmer, G.: I-vectors for timbre-based music similarity and music artist classification. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Malaga, Spain (2015)
32. Elizalde, B., Lei, H., Friedland, G.: An i-vector representation of acoustic environments for audio-based video event detection on user generated content. In: 2013 IEEE International Symposium on Multimedia, pp. 114–117. IEEE (2013)
33. Ellis, D.P.W., Zeng, X., McDermott, J.: Classifying soundtracks with audio texture features. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), pp. 5880–5883. IEEE (2011)
34. Emiya, V., Badeau, R., David, B.: Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio Speech Lang. Process. 18(6), 1643–1654 (2010)
35. Engel, J., Hantrakul, L., Gu, C., Roberts, A.: DDSP: differentiable digital signal processing. In: Proceedings of ICLR (International Conference on Learning Representations) (2020)
36. Esling, P., Bazin, T., Bitton, A., Carsault, T., Devis, N.: Ultra-light deep MIR by trimming lottery tickets. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Montreal, Canada, 11–15 October 2020
37. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: Proceedings of ICLR (International Conference on Learning Representations) (2019)
38. Fujishima, T.: Realtime chord recognition of musical sound: a system using Common Lisp Music. In: Proceedings of ICMC (International Computer Music Conference), Beijing, China, pp. 464–467 (1999)
39. Fukushima, K., Miyake, S.: Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In: Amari, S., Arbib, M.A. (eds.) Competition and Cooperation in Neural Nets, pp. 267–285. Springer, Heidelberg (1982)
40. Gfeller, B., Frank, C., Roblek, D., Sharifi, M., Tagliasacchi, M., Velimirović, M.: SPICE: self-supervised pitch estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1118–1128 (2020)
41. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
42. Goto, M.: AIST annotation for the RWC music database. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Victoria, BC, Canada, pp. 359–360 (2006)
43. Goto, M., Hashiguchi, H., Nishimura, T., Oka, R.: RWC music database: popular, classical, and jazz music databases. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Paris, France, pp. 287–288 (2002)
44. Greenberg, S., Kingsbury, B.E.: The modulation spectrogram: in pursuit of an invariant representation of speech. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), vol. 3, pp. 1647–1650. IEEE (1997)
45. Grézl, F., Karafiát, M., Kontár, S., Cernocky, J.: Probabilistic and bottle-neck features for LVCSR of meetings. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), vol. 4, pp. IV-757. IEEE (2007)
46. Henderson, P., Hu, J., Romoff, J., Brunskill, E., Jurafsky, D., Pineau, J.: Towards the systematic reporting of the energy and carbon footprints of machine learning. arXiv preprint arXiv:2002.05651 (2020)
48. Hermansky, H., Ellis, D.P., Sharma, S.: Tandem connectionist feature extraction for conventional HMM systems. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), vol. 3, pp. 1635–1638. IEEE (2000)
49. Herrera, P., Yeterian, A., Gouyon, F.: Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques. In: Proceedings of ICMAI (International Conference on Music and Artificial Intelligence), Edinburgh, Scotland (2002)
50. Herrera, P.: MIRages: an account of music audio extractors, semantic description and context-awareness, in the three ages of MIR. Ph.D. thesis, Music Technology Group (MTG), Universitat Pompeu Fabra, Barcelona (2018)
51. Hiller Jr., L.A., Isaacson, L.M.: Musical composition with a high speed digital computer. In: Audio Engineering Society Convention 9. Audio Engineering Society (1957)
52. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
53. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
54. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
55. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)
56. Humphrey, E.J.: Tutorial: deep learning in music informatics, demystifying the dark art, Part III - practicum. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Curitiba, PR, Brazil (2013)
57. Humphrey, E.J., Bello, J.P., LeCun, Y.: Moving beyond feature design: deep architectures and automatic feature learning in music informatics. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Porto, Portugal (2012)
58. Jansson, A.: Musical source separation with deep learning and large-scale datasets. Ph.D. thesis, City, University of London (2020)
59. Jansson, A., Humphrey, E.J., Montecchio, N., Bittner, R., Kumar, A., Weyde, T.: Singing voice separation with deep U-Net convolutional networks. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017
60. Jensen, K., Arnspang, K.: Binary decision tree classification of musical sounds. In: Proceedings of ICMC (International Computer Music Conference), Beijing, China (1999)
61. Kenny, P., Boulianne, G., Ouellet, P., Dumouchel, P.: Speaker and session variability in GMM-based speaker verification. IEEE Trans. Audio Speech Lang. Process. 15(4), 1448–1460 (2007)
62. Kereliuk, C., Sturm, B.L., Larsen, J.: Deep learning and music adversaries. IEEE Trans. Multimedia 17(11), 2059–2071 (2015)
63. Kim, T., Lee, J., Nam, J.: Sample-level CNN architectures for music auto-tagging using raw waveforms. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing) (2018)
64. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of ICLR (International Conference on Learning Representations) (2013)
65. Korzeniowski, F., Widmer, G.: Feature learning for chord recognition: the deep chroma extractor. In: Proceedings of ISMIR (International Society for Music Information Retrieval), New York, USA, 7–11 August 2016
66. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NIPS (Conference on Neural Information Processing Systems), pp. 1097–1105 (2012)
67. Lartillot, O., Toiviainen, P.: A Matlab toolbox for musical feature extraction from audio. In: Proceedings of DAFx (International Conference on Digital Audio Effects), Bordeaux, France, pp. 237–244 (2007)
68. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
69. Lee, J., Park, J., Kim, K.L., Nam, J.: Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. arXiv preprint arXiv:1703.01789 (2017)
70. Leveau, P., Vincent, E., Richard, G., Daudet, L.: Instrument-specific harmonic atoms for mid-level music representation. IEEE Trans. Audio Speech Lang. Process. 16(1), 116–128 (2007)
71. Liu, G., Hansen, J.H.: An investigation into back-end advancements for speaker recognition in multi-session and noisy enrollment scenarios. IEEE Trans. Audio Speech Lang. Process. 22(12), 1978–1992 (2014)
72. Logan, B.: Mel frequency cepstral coefficients for music modeling. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Plymouth, Massachusetts, USA (2000)
73. Lundberg, S.M., et al.: Explainable AI for trees: from local explanations to global understanding. arXiv preprint arXiv:1905.04610 (2019)
74. Mallat, S.: Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A (2016)
75. Marchetto, E., Peeters, G.: Automatic recognition of sound categories from their vocal imitation using audio primitives automatically found by SI-PLCA and HMM. In: Aramaki, M., Davies, M.E.P., Kronland-Martinet, R., Ystad, S. (eds.) CMMR 2017. LNCS, vol. 11265, pp. 3–22. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01692-0_1
76. Mathieu, B., Essid, S., Fillon, T., Prado, J., Richard, G.: YAAFE, an easy to use and efficient audio feature extraction software. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Utrecht, The Netherlands, pp. 441–446 (2010)
77. Mauch, M., et al.: OMRAS2 metadata project 2009. In: Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval), Kobe, Japan (2009)
78. McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimphoff, J.: Perceptual scaling of synthesized musical timbres: common dimensions, specificities and latent subject classes. Psychol. Res. 58, 177–192 (1995)
79. McDermott, J., Simoncelli, E.: Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron 71(5), 926–940 (2011)
80. McFee, B., et al.: librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, vol. 8, pp. 18–25 (2015)
81. Meseguer-Brocal, G., Cohen-Hadria, A., Peeters, G.: DALI: a large dataset of synchronized audio, lyrics and pitch, automatically created using teacher-student. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Paris, France, 23–27 September 2018
82. Meseguer-Brocal, G., Peeters, G.: Creation of a large dataset of synchronised audio, lyrics and notes, automatically created using a teacher-student paradigm. Trans. Int. Soc. Music Inf. Retrieval 3(1), 55–67 (2020). https://doi.org/10.5334/tismir.30
83. Mor, N., Wolf, L., Polyak, A., Taigman, Y.: A universal music translation network. In: Proceedings of ICLR (International Conference on Learning Representations) (2019)
84. Nieto, O., McCallum, M., Davies, M., Robertson, A., Stark, A., Egozy, E.: The Harmonix set: beats, downbeats, and functional segment annotations of western popular music. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Delft, The Netherlands, 4–8 November 2019
85. Noé, P.G., Parcollet, T., Morchid, M.: CGCNN: complex Gabor convolutional neural network on raw speech. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Barcelona, Spain, 4–8 May 2020
87. Opolko, F., Wapnick, J.: McGill University Master Samples CD-ROM for SampleCell, vol. 1 (1991)
88. Pachet, F., Zils, A.: Automatic extraction of music descriptors from acoustic signals. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Barcelona, Spain (2004)
89. Parekh, J., Mozharovskyi, P., d’Alché-Buc, F.: A framework to learn with interpretation. arXiv preprint arXiv:2010.09345 (2020)
90. Peeters, G.: A large set of audio features for sound description (similarity and classification) in the CUIDADO project. CUIDADO project report, IRCAM (2004)
91. Peeters, G., Rodet, X.: Hierarchical Gaussian tree with inertia ratio maximization for the classification of large musical instrument databases. In: Proceedings of DAFx (International Conference on Digital Audio Effects), London, UK, pp. 318–323 (2003)
92. Pons, J., Lidy, T., Serra, X.: Experimenting with musically motivated convolutional neural networks. In: Proceedings of IEEE CBMI (International Workshop on Content-Based Multimedia Indexing) (2016)
93. Pons, J., Serra, X.: Randomly weighted CNNs for (music) audio classification. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing) (2019)
94. Ramona, M., Richard, G., David, B.: Vocal detection in music with support vector machines. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Las Vegas, Nevada, USA, pp. 1885–1888 (2008)
95. Ravanelli, M., Bengio, Y.: Speaker recognition from raw waveform with SincNet. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028. IEEE (2018)
96. Reynolds, D., Quatieri, T., Dunn, R.: Speaker verification using adapted Gaussian mixture models. Digit. Signal Proc. 10(1–3), 19–41 (2000)
97. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
99. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
100. Sainath, T.N.: Towards end-to-end speech recognition using deep neural networks. In: Proceedings of ICML (International Conference on Machine Learning) (2015)
101. Sainath, T.N., Vinyals, O., Senior, A., Sak, H.: Convolutional, long short-term memory, fully connected deep neural networks. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), pp. 4580–4584. IEEE (2015)
102. Sainath, T.N., Weiss, R.J., Senior, A., Wilson, K.W., Vinyals, O.: Learning the speech front-end with raw waveform CLDNNs. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)
103. Saxe, A.M., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., Ng, A.Y.: On random weights and unsupervised feature learning. In: Proceedings of ICML (International Conference on Machine Learning), vol. 2, p. 6 (2011)
104. Schreiner, C.E., Urbas, J.V.: Representation of amplitude modulation in the auditory cortex of the cat. I. The anterior auditory field (AAF). Hearing Res. 21(3), 227–241 (1986)
105. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recognition), pp. 815–823 (2015)
106. Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O.: Green AI. Commun. ACM 63, 54–63 (2020)
107. Serrà, J., Gomez, E., Herrera, P., Serra, X.: Chroma binary similarity and local alignment applied to cover song identification. IEEE Trans. Audio Speech Lang. Process. (2008)
108. Serra, X., et al.: Roadmap for Music Information Research. Creative Commons BY-NC-ND 3.0 license (2013). ISBN: 978-2-9540351-1-6
109. Serra, X., Smith, J.: Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Comput. Music J. 14(4), 12–24 (1990)
110. Seyerlehner, K.: Content-based music recommender systems: beyond simple frame-level audio similarity. Ph.D. thesis, Johannes Kepler Universität, Linz, Austria, December 2010
111. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. In: Proceedings of ICLR (International Conference on Learning Representations) (2014)
112. Smaragdis, P., Brown, J.C.: Non-negative matrix factorization for polyphonic music transcription. In: Proceedings of IEEE WASPAA (Workshop on Applications of Signal Processing to Audio and Acoustics), New Paltz, NY, USA, pp. 177–180. IEEE (2003)
113. Smaragdis, P., Venkataramani, S.: A neural network alternative to non-negative audio models. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), pp. 86–90. IEEE (2017)
114. Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. In: Proceedings of ACL (Conference of the Association for Computational Linguistics) (2019)
115. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of NIPS (Conference on Neural Information Processing Systems), pp. 3104–3112 (2014)
116. Szegedy, C., et al.: Intriguing properties of neural networks. In: Proceedings of ICLR (International Conference on Learning Representations) (2013)
117. Tzanetakis, G., Cook, P.: MARSYAS: a framework for audio analysis. Organised Sound 4(3) (1999)
118. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 10(5), 293–302 (2002)
119. Vaswani, A., et al.: Attention is all you need. In: Proceedings of NIPS (Conference on Neural Information Processing Systems), pp. 5998–6008 (2017)
120. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recognition), pp. 3156–3164 (2015)
121. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. In: Readings in Speech Recognition, pp. 393–404. Elsevier (1990)
122. Wakefield, G.H.: Mathematical representation of joint time-chroma distributions. In: Proceedings of SPIE Conference on Advanced Signal Processing Algorithms, Architectures and Implementations, Denver, Colorado, USA, pp. 637–645 (1999)
123. Won, M., Chun, S., Nieto, O., Serra, X.: Data-driven harmonic filters for audio representation learning. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Barcelona, Spain, 4–8 May 2020
124. Wu, C.W., Lerch, A.: Automatic drum transcription using the student-teacher learning paradigm with unlabeled music data. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017
125. Zalkow, F., Müller, M.: Using weakly aligned score-audio pairs to train deep chroma models for cross-modal music retrieval. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Montreal, Canada, 11–15 October 2020, pp. 184–191
Metadata
Title
The Deep Learning Revolution in MIR: The Pros and Cons, the Needs and the Challenges
Author
Geoffroy Peeters
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-70210-6_1
