Abstract
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, where the aim is to learn joint representations across different data modalities. Unfortunately, little research has focused on cross-modal correlation learning in which the temporal structures of different modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we learn deep sequential correlations between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture consisting of two-branch deep neural networks, one for the audio modality and one for the text modality (lyrics). Data from the two modalities are projected into a shared canonical space, where intermodal canonical correlation analysis serves as the objective function for measuring the similarity of temporal structures. To our knowledge, this is the first study to use deep architectures for learning the temporal correlation between audio and lyrics. Lyrics are represented by a pretrained Doc2Vec model followed by fully connected layers. The audio branch makes two significant contributions: (i) an end-to-end network that learns the cross-modal correlation between audio and lyrics, performing feature extraction and correlation learning simultaneously so that the learned joint representation reflects temporal structure; and (ii) a feature extractor that represents an audio signal as a short sequence of local summaries (VGG16 features) and applies a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results on retrieving lyrics from audio and retrieving audio from lyrics verify the effectiveness of the proposed deep correlation learning architecture in cross-modal music retrieval.
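As a hedged illustration of the objective named above, the following NumPy sketch computes the total canonical correlation between two paired sets of branch embeddings (e.g. audio and lyrics outputs projected into the shared space). The function name, the regularization constant, and the use of a batch-level covariance estimate are our own illustrative assumptions, not the paper's implementation; in the actual architecture the negative of such a quantity would be minimized end-to-end through the two deep branches.

```python
import numpy as np

def cca_correlation(X, Y, reg=1e-4):
    """Sum of canonical correlations between two paired views.

    X, Y: (n_samples, dim) embeddings from the two branches.
    reg: small ridge term (illustrative choice) to keep the
    covariance matrices invertible. Returns a scalar in
    [0, min(dim_x, dim_y)]; its negative can serve as a
    correlation-maximizing training loss.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each view
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # inverse matrix square root of a symmetric PD matrix
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of T are the canonical correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False).sum()
```

When the two views are identical, every canonical correlation is close to 1, so the sum approaches the embedding dimension; independent views score markedly lower, which is what makes the quantity usable as a cross-modal similarity objective.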