ABSTRACT
Multimodal datasets contain an enormous amount of relational information, which grows exponentially with the introduction of new modalities. Learning representations in such a scenario is inherently complex due to the presence of multiple heterogeneous information channels. These channels can encode both (a) inter-relations between items of different modalities and (b) intra-relations between items of the same modality. Encoding multimedia items into a continuous low-dimensional semantic space such that both types of relations are captured and preserved is extremely challenging, especially if the goal is a unified end-to-end learning framework. Two key challenges need to be addressed: 1) the framework must be able to merge complex intra- and inter-relations without losing any valuable information, and 2) the learning model should be invariant to the addition of new and potentially very different modalities. In this paper, we propose a flexible framework that can scale to data streams from many modalities. To that end, we introduce a hypergraph-based model for data representation and deploy Graph Convolutional Networks to fuse relational information within and across modalities. Our approach provides an efficient solution for distributing otherwise extremely computationally expensive or even infeasible training processes across multiple GPUs, without sacrificing accuracy. Moreover, adding a new modality to our model requires only an additional GPU unit and keeps the computational time unchanged, which brings representation learning to truly multimodal datasets. We demonstrate the feasibility of our approach in experiments on multimedia datasets featuring second-, third-, and fourth-order relations.
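To make the hypergraph idea concrete, the following is a minimal illustrative sketch, not the architecture proposed in the paper. It assumes the widely used normalized hypergraph operator Θ = D_v^{-1/2} H W D_e^{-1} Hᵀ D_v^{-1/2} and a single generic graph-convolution-style propagation step. The toy incidence matrix, the feature sizes, and the function names `hypergraph_propagation_matrix` and `hypergraph_conv` are all hypothetical.

```python
import numpy as np


def hypergraph_propagation_matrix(H, edge_weights=None):
    """Return Theta = Dv^-1/2 H W De^-1 H^T Dv^-1/2 for an incidence matrix H.

    H has shape (n_nodes, n_edges); H[i, e] = 1 if node i belongs to hyperedge e.
    This is a generic normalization, assumed here for illustration only.
    """
    H = np.asarray(H, dtype=float)
    n_edges = H.shape[1]
    if edge_weights is None:
        w = np.ones(n_edges)
    else:
        w = np.asarray(edge_weights, dtype=float)

    node_degree = H @ w          # weighted number of hyperedges per node
    edge_degree = H.sum(axis=0)  # number of nodes per hyperedge

    # Invert degrees safely so isolated nodes or empty hyperedges map to zero.
    inv_sqrt_dv = np.zeros_like(node_degree)
    nonzero_nodes = node_degree > 0
    inv_sqrt_dv[nonzero_nodes] = node_degree[nonzero_nodes] ** -0.5

    inv_de = np.zeros_like(edge_degree)
    nonzero_edges = edge_degree > 0
    inv_de[nonzero_edges] = 1.0 / edge_degree[nonzero_edges]

    scaled_H = inv_sqrt_dv[:, None] * H
    return scaled_H @ np.diag(w * inv_de) @ scaled_H.T


def hypergraph_conv(theta, X, weight):
    """One propagation step: ReLU(Theta @ X @ weight)."""
    return np.maximum(theta @ X @ weight, 0.0)


# Toy example (hypothetical data): 4 items and 2 hyperedges.
# Hyperedge 0 joins items 0, 1, 2 (a third-order relation, e.g. user-image-tag).
# Hyperedge 1 joins items 1, 3 (a second-order relation).
H = np.array([
    [1, 0],
    [1, 1],
    [1, 0],
    [0, 1],
])

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))       # 5 input features per item
weight = rng.normal(size=(5, 3))  # stand-in for learnable layer parameters

theta = hypergraph_propagation_matrix(H)
out = hypergraph_conv(theta, X, weight)
print(out.shape)
```

Each hyperedge can join any number of items, so relations of different orders fit into the same incidence matrix. Item 1 belongs to both hyperedges and therefore mixes information from both relations in a single propagation step.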
Index Terms
- HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities