Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network

Abstract

Conventional cross-modal retrieval models generally assume that the training set and the testing set share the same set of classes. This assumption limits their extensibility to zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint from the seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few recently proposed approaches, inspired by zero-shot learning (ZSL), estimate the distribution underlying multimodal data with generative models and transfer knowledge from seen to unseen classes by leveraging class embeddings. However, directly borrowing ideas from ZSL is not fully suited to the retrieval task, since the core of retrieval is learning the common space. To address these issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strengths of the AutoEncoder (AE) and the Generative Adversarial Network (GAN) to jointly perform common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. Moreover, instead of using class embeddings as the common space, AAEGAN maps all multimodal data into a learned latent space whose distributions are aligned via three coupled AEs. We empirically show remarkable improvements on the ZS-CMR task and establish state-of-the-art or competitive performance on four image-text retrieval datasets.
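
To make the described architecture concrete, the following is a minimal PyTorch sketch of the general idea: three coupled autoencoders map image features, text features, and class embeddings into one shared latent space, and a GAN-style discriminator encourages the latent distributions of the modalities to align. This is not the authors' implementation; the layer sizes, feature dimensions (4096-d image, 300-d text/class features), and the simple modality-classification discriminator are all illustrative assumptions.

    # Minimal sketch of coupled AEs + adversarial latent alignment (assumed details).
    import torch
    import torch.nn as nn

    class ModalityAE(nn.Module):
        """One autoencoder branch: modality feature -> shared latent -> reconstruction."""
        def __init__(self, in_dim, latent_dim=256):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
            self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, in_dim))

        def forward(self, x):
            z = self.enc(x)
            return z, self.dec(z)

    class LatentDiscriminator(nn.Module):
        """Predicts which modality a latent code came from; the encoders are trained to fool it."""
        def __init__(self, latent_dim=256, num_modalities=3):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, num_modalities))

        def forward(self, z):
            return self.net(z)

    # Toy forward pass with random tensors standing in for image features,
    # text features, and class embeddings (dimensions are assumptions).
    img_ae, txt_ae, cls_ae = ModalityAE(4096), ModalityAE(300), ModalityAE(300)
    disc = LatentDiscriminator()
    recon_loss, ce = nn.MSELoss(), nn.CrossEntropyLoss()

    img, txt, cls_emb = torch.randn(8, 4096), torch.randn(8, 300), torch.randn(8, 300)
    (z_i, r_i), (z_t, r_t), (z_c, r_c) = img_ae(img), txt_ae(txt), cls_ae(cls_emb)

    # Reconstruction keeps modality-specific information in the shared latent space.
    loss_rec = recon_loss(r_i, img) + recon_loss(r_t, txt) + recon_loss(r_c, cls_emb)

    # Adversarial alignment: the discriminator learns to tell modalities apart ...
    z_all = torch.cat([z_i, z_t, z_c])
    labels = torch.cat([torch.full((8,), m, dtype=torch.long) for m in range(3)])
    loss_disc = ce(disc(z_all.detach()), labels)
    # ... while the encoders are updated to make the latents indistinguishable
    # (here, simply by maximizing the discriminator's loss).
    loss_align = -ce(disc(z_all), labels)
    print(loss_rec.item(), loss_disc.item(), loss_align.item())

In a full system, knowledge transfer to unseen classes and feature synthesis would build on this aligned latent space; the sketch only illustrates the joint reconstruction-plus-alignment objective named in the abstract.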



      • Published in

        ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 1s
        January 2021
        353 pages
        ISSN: 1551-6857
        EISSN: 1551-6865
        DOI: 10.1145/3453990

        Copyright © 2021 Association for Computing Machinery.

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 31 March 2021
        • Accepted: 1 September 2020
        • Revised: 1 August 2020
        • Received: 1 April 2020
        Published in TOMM Volume 17, Issue 1s


        Qualifiers

        • research-article
        • Refereed
