Abstract
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, where the aim is to learn joint representations across different data modalities. Unfortunately, little research has focused on cross-modal correlation learning in which the temporal structures of different modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we learn deep sequential correlations between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture consisting of two-branch deep neural networks, one for the audio modality and one for the text modality (lyrics). Data from the two modalities are projected into a shared canonical space, where intermodal canonical correlation analysis serves as the objective function for measuring the similarity of temporal structures. To our knowledge, this is the first study to use deep architectures for learning the temporal correlation between audio and lyrics. Lyrics are represented by a pretrained Doc2Vec model followed by fully connected layers. The audio branch makes two significant contributions: (i) an end-to-end network that learns the cross-modal correlation between audio and lyrics, performing feature extraction and correlation learning simultaneously so that the learned joint representation reflects temporal structure; and (ii) a feature extractor that represents an audio signal as a short sequence of local summaries (VGG16 features) and applies a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results on retrieving lyrics from audio and retrieving audio from lyrics verify the effectiveness of the proposed deep correlation learning architecture in cross-modal music retrieval.
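As a hedged illustration of the objective named above, the following NumPy sketch computes the total canonical correlation between two paired sets of branch embeddings (e.g. audio and lyrics outputs projected into the shared space). The function name, the regularization constant, and the use of a batch-level covariance estimate are our own illustrative assumptions, not the paper's implementation; in the actual architecture the negative of such a quantity would be minimized end-to-end through the two deep branches.

```python
import numpy as np

def cca_correlation(X, Y, reg=1e-4):
    """Sum of canonical correlations between two paired views.

    X, Y: (n_samples, dim) embeddings from the two branches.
    reg: small ridge term (illustrative choice) to keep the
    covariance matrices invertible. Returns a scalar in
    [0, min(dim_x, dim_y)]; its negative can serve as a
    correlation-maximizing training loss.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each view
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # inverse matrix square root of a symmetric PD matrix
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of T are the canonical correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False).sum()
```

When the two views are identical, every canonical correlation is close to 1, so the sum approaches the embedding dimension; independent views score markedly lower, which is what makes the quantity usable as a cross-modal similarity objective.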