
Effective multi-modal retrieval based on stacked auto-encoders

Published: 01 April 2014

Abstract

Multi-modal retrieval is emerging as a new search paradigm that enables seamless information retrieval from various types of media. For example, users can simply snap a movie poster to search for relevant reviews and trailers. To solve the problem, a set of mapping functions is learned to project high-dimensional features extracted from data of different media types into a common low-dimensional space, so that metric distance measures can be applied. In this paper, we propose an effective mapping mechanism based on deep learning (i.e., stacked auto-encoders) for multi-modal retrieval. Mapping functions are learned by optimizing a new objective function, which effectively captures both intra-modal and inter-modal semantic relationships of data from heterogeneous sources. Compared with previous works, which require a substantial amount of prior knowledge (such as similarity matrices of intra-modal data and ranking examples), our method requires little prior knowledge. Given a large training dataset, we split it into mini-batches and continually adjust the mapping functions for each batch of input. Hence, our method is memory efficient with respect to the data volume. Experiments on three real datasets illustrate that our proposed method achieves significant improvement in search accuracy over the state-of-the-art methods.
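The core idea in the abstract can be sketched in a few lines of NumPy. The following is a minimal, hypothetical illustration, not the paper's actual architecture: it uses one single-layer auto-encoder per modality (the paper stacks several layers), and the layer sizes, tanh activation, learning rate, and the inter-modal weight `alpha` are all illustrative assumptions. It shows the two ingredients the abstract names: an intra-modal term (per-modality reconstruction error) and an inter-modal term (latent distance between paired samples), optimized over mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG, D_TXT, D_LATENT = 20, 30, 5  # illustrative feature/latent sizes

def init_ae(d_in, d_latent):
    """One single-layer auto-encoder per modality (encoder + decoder weights)."""
    return {"We": rng.normal(0, 0.1, (d_in, d_latent)),
            "Wd": rng.normal(0, 0.1, (d_latent, d_in))}

def encode(ae, x):
    return np.tanh(x @ ae["We"])   # project into the common latent space

def decode(ae, h):
    return h @ ae["Wd"]            # reconstruct the original features

def step(ae_a, ae_b, xa, xb, alpha=0.5, lr=0.1):
    """One mini-batch update; returns the combined loss before the update."""
    ha, hb = encode(ae_a, xa), encode(ae_b, xb)
    ra, rb = decode(ae_a, ha), decode(ae_b, hb)
    # Intra-modal term: reconstruction error per modality.
    # Inter-modal term: latent distance between paired image/text samples.
    loss = (np.mean((ra - xa) ** 2) + np.mean((rb - xb) ** 2)
            + alpha * np.mean((ha - hb) ** 2))
    # Manual backprop for this small network.
    dra = 2 * (ra - xa) / ra.size
    drb = 2 * (rb - xb) / rb.size
    dha = dra @ ae_a["Wd"].T + alpha * 2 * (ha - hb) / ha.size
    dhb = drb @ ae_b["Wd"].T + alpha * 2 * (hb - ha) / hb.size
    for ae, x, h, dr, dh in ((ae_a, xa, ha, dra, dha),
                             (ae_b, xb, hb, drb, dhb)):
        ae["Wd"] -= lr * h.T @ dr
        ae["We"] -= lr * x.T @ (dh * (1 - h ** 2))  # tanh'(z) = 1 - tanh(z)^2
    return loss

# Paired training data: each image vector is aligned with one text vector.
n = 200
imgs = rng.normal(size=(n, D_IMG))
txts = rng.normal(size=(n, D_TXT))

ae_img, ae_txt = init_ae(D_IMG, D_LATENT), init_ae(D_TXT, D_LATENT)
losses = []
for epoch in range(30):
    for i in range(0, n, 50):  # mini-batches, as described in the abstract
        losses.append(step(ae_img, ae_txt, imgs[i:i+50], txts[i:i+50]))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, `encode(ae_img, x)` and `encode(ae_txt, y)` land in the same 5-dimensional space, so a query from one modality can be matched against data from the other by ordinary Euclidean distance, which is what makes cross-modal search possible.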



• Published in: Proceedings of the VLDB Endowment, Volume 7, Issue 8 (April 2014), 60 pages. ISSN 2150-8097.
• Publisher: VLDB Endowment
• Qualifier: research-article
