
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

Published: 13 February 2019

Abstract

Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, where the aim is to learn joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, must be taken into account. Motivated by the inherently temporal structure of music, we learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from the different modalities are projected into the same canonical space, where intermodal canonical correlation analysis is used as the objective function to measure the similarity of temporal structures. This is the first study to use deep architectures for learning the temporal correlation between audio and lyrics. A pretrained Doc2Vec model followed by fully connected layers represents the lyrics. The audio branch makes two significant contributions: (i) we propose an end-to-end network that learns the cross-modal correlation between audio and lyrics, in which feature extraction and correlation learning are performed simultaneously and the joint representation is learned with temporal structures taken into account; and (ii) for feature extraction, we represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results on using audio to retrieve lyrics and using lyrics to retrieve audio verify the effectiveness of the proposed deep correlation learning architecture in cross-modal music retrieval.
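The abstract outlines a two-branch design: a recurrent network over a short sequence of VGG16-style audio summaries on one side, a pretrained Doc2Vec lyrics embedding followed by fully connected layers on the other, with an intermodal CCA objective applied in a shared canonical space. As a rough illustration only, the sketch below shows how such a two-branch setup might be wired together; it is not the authors' implementation. PyTorch, all layer sizes, the 512- and 300-dimensional inputs, and the simplified correlation loss (standing in for the paper's full CCA objective) are assumptions made for the example.

```python
# Minimal two-branch sketch (illustrative only; not the paper's exact model).
# Audio branch: LSTM over a sequence of pre-extracted frame features
# (e.g., VGG16-style vectors). Lyrics branch: MLP over a Doc2Vec embedding.
# Both project into a shared space where a correlation objective is maximized.
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    def __init__(self, frame_dim=512, hidden=256, out_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                  # x: (batch, time, frame_dim)
        _, (h, _) = self.rnn(x)             # final hidden state summarizes the sequence
        return self.fc(h[-1])               # (batch, out_dim)

class LyricsBranch(nn.Module):
    def __init__(self, doc_dim=300, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(doc_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, d):                   # d: (batch, doc_dim) Doc2Vec vectors
        return self.net(d)

def correlation_loss(a, b, eps=1e-6):
    """Negative mean per-dimension correlation between the two projections.
    A simplified stand-in for the intermodal CCA objective described above."""
    a = (a - a.mean(0)) / (a.std(0) + eps)
    b = (b - b.mean(0)) / (b.std(0) + eps)
    return -(a * b).mean()

# One training step with random stand-in data (shapes are assumptions).
audio_net, lyric_net = AudioBranch(), LyricsBranch()
opt = torch.optim.Adam(list(audio_net.parameters()) + list(lyric_net.parameters()), lr=1e-4)
audio = torch.randn(8, 30, 512)             # 8 clips, 30 frames of VGG16-like features
lyrics = torch.randn(8, 300)                # 8 Doc2Vec lyric embeddings
loss = correlation_loss(audio_net(audio), lyric_net(lyrics))
opt.zero_grad(); loss.backward(); opt.step()
```

At retrieval time, both modalities would be embedded with their trained branches and ranked by similarity in the shared space; the full intermodal CCA objective used in the paper additionally whitens the two views rather than correlating dimensions independently as this simplified loss does.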




      • Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1
February 2019
265 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3309717

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 February 2019
        • Revised: 1 September 2018
        • Accepted: 1 September 2018
        • Received: 1 March 2018
Published in TOMM Volume 15, Issue 1


        Qualifiers

        • research-article
        • Research
        • Refereed
