Abstract
Conventional cross-modal retrieval models typically assume that the training and testing sets share the same set of classes. This assumption limits their applicability to zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint from the seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few recently proposed approaches draw inspiration from zero-shot learning (ZSL): they estimate the distribution underlying multimodal data with generative models and transfer knowledge from seen to unseen classes by leveraging class embeddings. However, directly borrowing ideas from ZSL is not fully suited to the retrieval task, whose core is learning the common space. To address these issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strengths of the AutoEncoder (AE) and the Generative Adversarial Network (GAN) to jointly perform common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. Moreover, instead of using class embeddings as the common space, AAEGAN maps all multimodal data into a learned latent space with distribution alignment via three coupled AEs. We empirically show remarkable improvement on the ZS-CMR task, establishing state-of-the-art or competitive performance on four image-text retrieval datasets.
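To make the architecture described above concrete, the following is a minimal PyTorch sketch of three coupled autoencoders (image, text, and class embedding) whose encoders map into a shared latent space, with a GAN-style modality discriminator providing distribution alignment. All module names, dimensions, and loss terms here are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch (assumed, not the authors' code): three coupled AEs
# encode each modality into a shared latent space; a discriminator tries to
# identify which modality a latent code came from, and the encoders are
# trained to fool it, aligning the latent distributions.
import torch
import torch.nn as nn

class ModalityAE(nn.Module):
    """Autoencoder for one modality, encoding into the shared latent space."""
    def __init__(self, in_dim: int, latent_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

class LatentDiscriminator(nn.Module):
    """Classifies the source modality of a latent code (the GAN component)."""
    def __init__(self, latent_dim: int = 256, num_modalities: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_modalities))

    def forward(self, z):
        return self.net(z)

# Feature dimensions are assumptions (e.g., 4096-d CNN image features,
# 300-d text and class embeddings).
img_ae, txt_ae, cls_ae = ModalityAE(4096), ModalityAE(300), ModalityAE(300)
disc = LatentDiscriminator()

def training_losses(img, txt, cls_emb):
    """One illustrative pass: reconstruction + coupling + adversarial terms."""
    z_img, rec_img = img_ae(img)
    z_txt, rec_txt = txt_ae(txt)
    z_cls, rec_cls = cls_ae(cls_emb)
    mse = nn.functional.mse_loss
    # Reconstruction preserves modality-specific information.
    loss_rec = mse(rec_img, img) + mse(rec_txt, txt) + mse(rec_cls, cls_emb)
    # Coupling pulls latent codes of matched instances together.
    loss_couple = mse(z_img, z_txt) + mse(z_img, z_cls) + mse(z_txt, z_cls)
    # Adversarial term: discriminator predicts the source modality.
    ce = nn.functional.cross_entropy
    labels = lambda i, n: torch.full((n,), i, dtype=torch.long)
    logits = torch.cat([disc(z_img), disc(z_txt), disc(z_cls)])
    targets = torch.cat([labels(0, len(img)), labels(1, len(txt)),
                         labels(2, len(cls_emb))])
    loss_disc = ce(logits, targets)  # discriminator minimizes this
    # Encoders maximize discriminator confusion; in practice this is done
    # with alternating updates or a gradient-reversal layer.
    loss_enc = loss_rec + loss_couple - loss_disc
    return loss_enc, loss_disc
```

At retrieval time, under this sketch, a query and all candidates from the other modality would simply be encoded into the shared latent space and ranked by distance, which is what makes a learned common space (rather than fixed class embeddings) the natural target for the retrieval objective.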