Skip to main content
Top
Published in: International Journal of Multimedia Information Retrieval 1/2018

01-12-2017 | Regular Paper

Estimating the information gap between textual and visual representations

Authors: Christian Henning, Ralph Ewerth

Published in: International Journal of Multimedia Information Retrieval | Issue 1/2018

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides are utilized to supplement an oral presentation, or photographs, drawings, figures, etc. are exploited in online news or scientific publications to complement textual information. However, the utilization of different modalities and their interrelations can be quite diverse. Sometimes, the transfer of information or knowledge may even be not eased, for instance, in case of contradictory information. The variety of possible interrelations of textual and graphical information and the question, how they can be described and automatically estimated have not been addressed yet by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, two novel deep learning systems are suggested to estimate CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then exploited in order to train classifiers for SC and CMI. An advantage of this representation is that only a small set of labeled training examples is required for the supervised learning process. Third, three different and large datasets are combined for autoencoder training to increase the diversity of (unlabeled) image–text pairs such that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications for the proposed system and outline areas for future work.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
In fact, there was even an Oscar awarded for Best Writing—Title Cards in the first Academy Awards ceremony in 1929, but there was never again an award for intertitles.
 
2
It is assumed that image and text are jointly placed on purpose.
 
3
This would also allow us to view the CMI relation of captioning samples as an inclusion, since the text does not express concepts that are not covered by the image.
 
5
The remaining samples are allocated for future usage.
 
6
A suitable value for weight-decay has been found via grid search.
 
7
Note, concepts should be weighted by their visualness (cmp. Sect. 4.4).
 
Literature
1.
go back to reference Agosti M, Fuhr N, Toms E, Vakkari P (2014) Evaluation methodologies in information retrieval (Dagstuhl Seminar 13441). Dagstuhl Rep 3(10):92–126 Agosti M, Fuhr N, Toms E, Vakkari P (2014) Evaluation methodologies in information retrieval (Dagstuhl Seminar 13441). Dagstuhl Rep 3(10):92–126
2.
go back to reference Barnard K, Yanai K (2006) Mutual information of words and pictures. Inf Theory Appl 2:1–5 Barnard K, Yanai K (2006) Mutual information of words and pictures. Inf Theory Appl 2:1–5
3.
go back to reference Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei D, Jordan M (2003) Matching words and pictures. J Mach Learn Res 3(2):1107–1135MATH Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei D, Jordan M (2003) Matching words and pictures. J Mach Learn Res 3(2):1107–1135MATH
4.
go back to reference Bateman J (2014) Text and image: a critical introduction to the visual/verbal divide. Routledge, London Bateman J (2014) Text and image: a critical introduction to the visual/verbal divide. Routledge, London
5.
go back to reference Chen X, Fang H, Lin T, Vedantam R, Gupta S, Dollár P, Zitnick L (2015) Microsoft COCO captions: data collection and evaluation server. arxiv:1504.00325 Chen X, Fang H, Lin T, Vedantam R, Gupta S, Dollár P, Zitnick L (2015) Microsoft COCO captions: data collection and evaluation server. arxiv:​1504.​00325
6.
go back to reference Crammer K, Singer Y (2002) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2(12):265–292MATH Crammer K, Singer Y (2002) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2(12):265–292MATH
7.
go back to reference Eickhoff C, Teevan J, White R, Dumais S (2014) Lessons from the journey: a query log analysis of within-session learning. In: Proceedings of the 7th ACM international conference on web search and data mining, pp 223–232 Eickhoff C, Teevan J, White R, Dumais S (2014) Lessons from the journey: a query log analysis of within-session learning. In: Proceedings of the 7th ACM international conference on web search and data mining, pp 223–232
8.
go back to reference Feng Y, Lapata M (2008) Automatic image annotation using auxiliary text information. In: Proceedings of Association for Computational Linguistics, vol 8, pp 272–280 Feng Y, Lapata M (2008) Automatic image annotation using auxiliary text information. In: Proceedings of Association for Computational Linguistics, vol 8, pp 272–280
9.
go back to reference Feng Y, Lapata M (2013) Automatic caption generation for news images. IEEE Trans Pattern Anal Mach Intell 35(4):797–812CrossRef Feng Y, Lapata M (2013) Automatic caption generation for news images. IEEE Trans Pattern Anal Mach Intell 35(4):797–812CrossRef
10.
go back to reference Frome A, Corrado G, Shlens J, Bengio S, Dean J, Ranzato MA, Mikolov T (2013) Devise: a deep visual-semantic embedding model. In: Proceedings of neural information processing systems, vol 26, pp 2121–2129 Frome A, Corrado G, Shlens J, Bengio S, Dean J, Ranzato MA, Mikolov T (2013) Devise: a deep visual-semantic embedding model. In: Proceedings of neural information processing systems, vol 26, pp 2121–2129
11.
go back to reference Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In: Proceedings of European conference on computer vision, vol 13, pp 529–545 Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In: Proceedings of European conference on computer vision, vol 13, pp 529–545
12.
go back to reference Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Proceedings of neural information processing systems, vol 26, pp 2672–2680 Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Proceedings of neural information processing systems, vol 26, pp 2672–2680
13.
go back to reference Izadinia H, Sadeghi F, Divvala S, Hajishirzi H, Choi Y, Farhadi A (2015) Segment-phrase table for semantic segmentation, visual entailment and paraphrasing. In: Proceedings of the IEEE international conference on computer vision, pp 10–18 Izadinia H, Sadeghi F, Divvala S, Hajishirzi H, Choi Y, Farhadi A (2015) Segment-phrase table for semantic segmentation, visual entailment and paraphrasing. In: Proceedings of the IEEE international conference on computer vision, pp 10–18
16.
go back to reference Liu W, Tang X (2005) Learning an image-word embedding for image auto-annotation on the nonlinear latent space. In: Proceedings of ACM international conference on multimedia, vol 13, pp 451–454 Liu W, Tang X (2005) Learning an image-word embedding for image auto-annotation on the nonlinear latent space. In: Proceedings of ACM international conference on multimedia, vol 13, pp 451–454
17.
18.
go back to reference Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of neural information processing systems, vol 26, pp 3111–3119 Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of neural information processing systems, vol 26, pp 3111–3119
19.
go back to reference Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng A (2011) Multimodal deep learning. In: Proceedings of international conference on machine learning, vol 28, pp 689–696 Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng A (2011) Multimodal deep learning. In: Proceedings of international conference on machine learning, vol 28, pp 689–696
20.
go back to reference Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:​1511.​06434
21.
go back to reference Ramisa A, Yan F, Moreno-Noguer F, Mikolajczyk K (2016) Breakingnews: Article annotation by image and text processing. arXiv arXiv:1603.07141 Ramisa A, Yan F, Moreno-Noguer F, Mikolajczyk K (2016) Breakingnews: Article annotation by image and text processing. arXiv arXiv:​1603.​07141
22.
23.
go back to reference Vakkari P (2016) Searching as learning: a systematization based on literature. J Inf Sci 42(1):7–18CrossRef Vakkari P (2016) Searching as learning: a systematization based on literature. J Inf Sci 42(1):7–18CrossRef
25.
go back to reference Vinyals O, Toshev A, Bengio S, Erhan D (2016) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans Pattern Anal Mach Intell 39(4):652–663CrossRef Vinyals O, Toshev A, Bengio S, Erhan D (2016) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans Pattern Anal Mach Intell 39(4):652–663CrossRef
26.
go back to reference Wu Q, Shen C, Liu L, Dick A, van den Hengel A (2016) What value do explicit high level concepts have in vision to language problems? In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 203–212 Wu Q, Shen C, Liu L, Dick A, van den Hengel A (2016) What value do explicit high level concepts have in vision to language problems? In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 203–212
27.
go back to reference Xue J, Du Y, Shui H (2015) Semantic correlation mining between images and texts with global semantics and local mapping. In: Proceedings of international conference on multimedia modeling, vol 8936, pp 427–435 Xue J, Du Y, Shui H (2015) Semantic correlation mining between images and texts with global semantics and local mapping. In: Proceedings of international conference on multimedia modeling, vol 8936, pp 427–435
28.
go back to reference Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3441–3450 Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3441–3450
29.
go back to reference Yanai K, Barnard K (2005) Image region entropy: a measure of visualness of web images associated with one concept. In: Proceedings of the annual ACM international conference on multimedia, vol 13, pp 419–422 Yanai K, Barnard K (2005) Image region entropy: a measure of visualness of web images associated with one concept. In: Proceedings of the annual ACM international conference on multimedia, vol 13, pp 419–422
30.
go back to reference Zhang Y, Schneider J, Dubrawski A (2008) Learning the semantic correlation: An alternative way to gain from unlabeled text. In: Proceedings of the international conference on neural information processing systems, vol 21, pp 1945–1952 Zhang Y, Schneider J, Dubrawski A (2008) Learning the semantic correlation: An alternative way to gain from unlabeled text. In: Proceedings of the international conference on neural information processing systems, vol 21, pp 1945–1952
31.
go back to reference Zhuang YT, Yang Y, Wu F (2008) Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Trans Multimed 10(2):221–229CrossRef Zhuang YT, Yang Y, Wu F (2008) Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Trans Multimed 10(2):221–229CrossRef
Metadata
Title
Estimating the information gap between textual and visual representations
Authors
Christian Henning
Ralph Ewerth
Publication date
01-12-2017
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2018
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-017-0142-y

Other articles of this Issue 1/2018

International Journal of Multimedia Information Retrieval 1/2018 Go to the issue

Premium Partner