International Journal of Computer Vision

Publication date: 01-01-2014

A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics

Authors: Yunchao Gong, Qifa Ke, Michael Isard, Svetlana Lazebnik

Published in: International Journal of Computer Vision | Issue 2/2014

Abstract

This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.
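As a rough illustration of the two-view starting point described in the abstract, the NumPy sketch below fits a regularized linear CCA to two feature matrices and ranks images for a tag query in the shared space. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`fit_cca`, `embed`, `rank_images`), the regularization value, and the synthetic data are all hypothetical. The third semantic view and the explicit kernel mappings are omitted, and plain cosine similarity stands in for the paper's specially designed similarity function, which this page does not specify.

```python
import numpy as np


def fit_cca(X, Y, dim, reg=1e-3):
    """Regularized linear CCA for two views (rows are paired samples)."""
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    n = X.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # Whiten each view with a Cholesky factor: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # M = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical
    # correlations of the regularized problem.
    M = np.linalg.solve(Ly, np.linalg.solve(Lx, Sxy).T).T
    U, corrs, Vt = np.linalg.svd(M, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :dim])
    Wy = np.linalg.solve(Ly.T, Vt[:dim].T)
    return mean_x, Wx, mean_y, Wy, corrs[:dim]


def embed(Z, mean, W):
    """Project a view into the shared space and L2-normalize each row."""
    E = (Z - mean) @ W
    return E / np.linalg.norm(E, axis=1, keepdims=True)


def rank_images(tag_vec, mean_y, Wy, image_embeddings):
    """Return image indices sorted from most to least similar to a tag query."""
    q = embed(tag_vec[None, :], mean_y, Wy)[0]
    return np.argsort(-(image_embeddings @ q))


# Synthetic stand-in data (an assumption for illustration only): both views
# are noisy linear functions of a shared 3-dimensional latent variable.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
images = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
tags = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))

mean_x, Wx, mean_y, Wy, corrs = fit_cca(images, tags, dim=3)
image_embeddings = embed(images, mean_x, Wx)
ranking = rank_images(tags[0], mean_y, Wy, image_embeddings)
print("canonical correlations:", np.round(corrs, 3))
print("top 5 images for the tags of image 0:", ranking[:5])
```

The same embedded vectors could serve image-to-image search (compare image embeddings with each other) or image-to-tag search (swap the roles of the two views). How well a plain cosine score works on real data is not something this sketch demonstrates.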

It can be shown that CCA with labels as one of the views is equivalent to Linear Discriminant Analysis (LDA) (Bartlett 1938).
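The equivalence can be checked numerically in the simplest setting. The sketch below covers only the two-class case, with the label view reduced to a single 0/1 indicator; the helper names and the synthetic data are hypothetical. In that case the centered cross-covariance between features and labels is a positive multiple of the class-mean difference, and the total scatter differs from the within-class scatter by a rank-one term along that same difference, so the CCA direction and the Fisher LDA direction should agree up to a positive scale factor.

```python
import numpy as np


def cca_direction(X, y):
    """First canonical direction for X when the other view is a single 0/1 label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # With a one-dimensional second view, CCA reduces to w proportional to
    # Sxx^{-1} Sxy (the direction maximizing correlation with the label).
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)


def lda_direction(X, y):
    """Fisher discriminant direction Sw^{-1} (m1 - m0) for two classes."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter pooled over both classes.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    return np.linalg.solve(Sw, m1 - m0)


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic two-class data (an assumption for illustration only).
rng = np.random.default_rng(0)
d = 5
mix = rng.normal(size=(d, d))  # shared within-class covariance factor
X0 = rng.normal(size=(200, d)) @ mix
X1 = rng.normal(size=(300, d)) @ mix + rng.normal(size=d)
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(200), np.ones(300)])

similarity = cosine(cca_direction(X, y), lda_direction(X, y))
print("cosine between CCA and LDA directions:", similarity)
```

A cosine of 1, up to floating-point error, indicates that the two directions coincide. The multi-class case involves several discriminant directions and is not covered by this sketch.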

Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.

Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In ICCV (Vol. 2, pp. 408–415).

Bartlett, M. S. (1938). Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society, 34(1), 33–40.

Berg, T., & Forsyth, D. (2006). Animals on the web. In CVPR.

Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In Second workshop on Internet vision at CVPR.

Blaschko, M., & Lampert, C. (2008). Correlational spectral clustering. In CVPR.

Blei, D., & Jordan, M. (2003). Modeling annotated data. In ACM SIGIR (pp. 127–134).

Blei, D., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Carneiro, G., Chan, A., Moreno, P., & Vasconcelos, N. (2007). Supervised learning of semantic classes for image annotation and retrieval. In PAMI.

Chapelle, O., Weston, J., & Scholkopf, B. (2003). Cluster kernels for semi-supervised learning. In NIPS.

Chen, N., Zhu, J., Sun, F., & Xing, E. P. (2012). Large-margin predictive latent subspace learning for multi-view data analysis. In PAMI.

Chen, X., Yuan, X.-T., Chen, Q., Yan, S., & Chua, T.-S. (2011). Multi-label visual classification with label exclusive context. In ICCV.

Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y.-T. (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of ACM conference on image and video retrieval (CIVR’09), Santorini, Greece.

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2), 1–60.

Deng, J., Dong, W., Socher, R., Li, L., & Li, K. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV.

Fan, J., Shen, Y., Zhou, N., & Gao, Y. (2010). Harvesting large-scale weakly-tagged image databases from the web. In CVPR (pp. 802–809).

Farhadi, A., Hejrati, M., Sadeghi, A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D. A. (2010). Every picture tells a story: Generating sentences for images. In ECCV.

Foster, D. P., Johnson, R., Kakade, S. M., & Zhang, T. (2010). Multi-view dimensionality reduction via canonical correlation analysis. Tech Report. Rutgers University.

Frankel, C., Swain, M. J., & Athitsos, V. (1997). Webseer: An image search engine for the World Wide Web. In CVPR.

Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In ICCV.

Gong, Y., & Lazebnik, S. (2011). Iterative quantization: A procrustean approach to learning binary codes. In CVPR.

Globerson, A., & Roweis, S. (2005). Metric Learning by collapsing classes. In NIPS.

Goldberger, J., Roweis, S., & Hinton, G. (2004). Neighbourhood components analysis. In NIPS.

Grangier, D., & Bengio, S. (2008). A discriminative kernel-based model to rank images from text queries. In PAMI.

Grubinger, M., Clough, P. D., Müller, H., & Deselaers, T. (2006). The IAPR TC-12 benchmark—A new evaluation resource for visual information systems. In Proceedings of the international workshop OntoImage’2006 language resources for content-based image retrieval (pp. 13–23).

Gordo, A., Rodriguez-Serrano, J., Perronnin, F., & Valveny, E. (2012). Leveraging category-level labels for instance-level image retrieval. In CVPR.

Guillaumin, M., Mensink, T., Verbeek, J., & Schmid, C. (2009). TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV.

Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In CVPR.

Hardoon, D., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: an overview with application. Neural Computation, 16(12), 2639–2664.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In SIGIR.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.

Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In NIPS.

Hwang, S. J., & Grauman, K. (2010). Accounting for the relative importance of objects in image retrieval. In BMVC.

Hwang, S. J., & Grauman, K. (2011). Learning the relative importance of objects from tagged images for retrieval and cross-modal search. In IJCV.

Krapac, J., Allan, M., Verbeek, J., & Jurie, F. (2010). Improving web-image search results using query-relative classifiers. In CVPR.

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Tech Report. University of Toronto.

Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., & Berg, T. L. (2011). Babytalk: Understanding and generating simple image descriptions. In CVPR.

Larsen, R. M. (1998). Lanczos bidiagonalization with partial reorthogonalization. Technical report, Department of Computer Science, Aarhus University.

Lavrenko, V., Manmatha, R., & Jeon, J. (2003). A model for learning the semantics of pictures. In NIPS.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

Li, J., & Wang, J. (2008). Real-time computerized annotation of pictures. In PAMI.

Liu, C., Yuen, J., & Torralba, A. (2010). SIFT flow: Dense correspondence across different scenes. In PAMI.

Liu, Y., Xu, D., Tsang, I., & Luo, J. (2009). Using large-scale web data to facilitate textual query based retrieval of consumer photos. In ACM MM.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. In IJCV.

Lucchi, A., & Weston, J. (2012). Joint image and word sense discrimination for image retrieval. In ECCV.

Maji, S., & Berg, A. (2009). Max-margin additive classifiers for detection. In CVPR.

Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In ECCV.

Mensink, T., Verbeek, J., Csurka, G., & Perronnin, F. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

Monay, F., & Gatica-Perez, D. (2004). PLSA-based image auto-annotation: Constraining the latent space. In ACM Multimedia.

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. In IJCV.

Ordonez, V., Kulkarni, G., & Berg, T. L. (2011). Im2text: Describing images using 1 million captioned photographs. In NIPS.

Perronnin, F., Sanchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.

Quadrianto, N., & Lampert, C. H. (2011). Learning multi-view neighborhood preserving projections. In ICML.

Quattoni, A., Collins, M., & Darrell, T. (2007). Learning visual representations using images with captions. In CVPR.

Raguram, R., & Lazebnik, S. (2008). Computing iconic summaries for general visual concepts. In First workshop on Internet vision at CVPR.

Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In NIPS.

Rai, P., & Daumé, H. (2009). Multi-label prediction via sparse infinite CCA. In NIPS.

Rasiwasia, N., & Vasconcelos, N. (2007). Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 9(5), 923–938.

Rasiwasia, N., Pereira, J. C., Coviello, E., Doyle, G., Lanckriet, G., Levy, R., et al. (2010). A new approach to cross-modal multimedia retrieval. In ACM MM.

Scholkopf, B., Smola, A., & Muller, K.-R. (1997). Kernel principal component analysis. In ICANN.

Schroff, F., Criminisi, A., & Zisserman, A. (2007). Harvesting image databases from the Web. In ICCV.

Sharma, A., Kumar, A., Daumé, H., & Jacobs, D. (2012). Generalized multiview analysis: A discriminative latent space. In CVPR.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. In PAMI.

Smeulders, A. W., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. The IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349–1380.

Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV.

Udupa, R., & Khapra, M. (2010). Improving the multilingual user experience of Wikipedia using cross-language name search. In NAACL.

van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. In PAMI.

Vedaldi, A., & Zisserman, A. (2010). Efficient additive kernels via explicit feature maps. In CVPR.

Verma, Y., & Jawahar, C. V. (2012). Image annotation using metric learning in semantic neighbourhoods. In ECCV.

Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2002). Inferring a semantic representation of text via cross-language correlation analysis. In NIPS.

von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In ACM SIGCHI.

Wang, C., Blei, D., & Li, F. (2009a). Simultaneous image classification and annotation. In CVPR (pp. 1903–1910).

Wang, G., Hoiem, D., & Forsyth, D. (2009b). Building text features for object image classification. In CVPR.

Wang, G., Hoiem, D., & Forsyth, D. (2009c). Learning image similarity from Flickr groups using stochastic intersection kernel machines. In ICCV.

Wang, X.-J., Zhang, L., Li, X., & Ma, W.-Y. (2008). Annotating images by mining image search results. The IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1919–1932.

Weston, J., Bengio, S., & Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In IJCAI.

Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. In SIGIR.

Weinberger, K., Blitzer, J., & Saul, L. (2005). Distance metric learning for large margin nearest neighbor classification. In NIPS.

Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR.

Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In SIGIR.

Yakhnenko, O., & Honavar, V. (2009). Multiple label prediction for image annotation with multiple kernel correlation models. In Workshop on visual context learning (in conjunction with CVPR).

Zhang, Y., & Schneider, J. (2011). Multi-label output codes using canonical correlation analysis. In AISTATS.

Zhu, S., Ji, X., Xu, W., & Gong, Y. (2005). Multi-labelled classification using maximum entropy method. In ACM SIGIR.

Title: A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics
Authors: Yunchao Gong, Qifa Ke, Michael Isard, Svetlana Lazebnik
Publication date: 01-01-2014
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 2/2014
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-013-0658-4
