Published in: International Journal of Multimedia Information Retrieval 1/2019

12.01.2019 | Regular Paper

Joint embeddings with multimodal cues for video-text retrieval

Authors: Niluthpol C. Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury


Abstract

For multimedia applications, a joint representation that carries information from multiple modalities can greatly benefit downstream use cases. In this paper, we study how to effectively utilize the multimodal cues available in videos to learn joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are very limited in size relative to the enormous diversity of the visual world, which makes it extremely difficult to build a robust video-text retrieval system on deep neural network models. We therefore propose a framework that simultaneously utilizes multiple visual cues through a "mixture of experts" approach to retrieval. In addition, we propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Extensive experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
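The abstract does not spell out the modified pairwise ranking loss. As a rough illustration only, here is a minimal NumPy sketch of the standard bidirectional max-margin ranking loss used for joint video-text embeddings, with an optional hard-negative variant in the style of VSE++ (Faghri et al. 2018), which this line of work builds on. The function name, the `hard_negatives` flag, and the default margin are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def pairwise_ranking_loss(sim, margin=0.2, hard_negatives=True):
    """Bidirectional max-margin ranking loss over a video-text similarity matrix.

    sim[i, j] is the similarity between video i and caption j; the diagonal
    holds the matching (positive) pairs.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    # Cost of ranking a wrong caption above the true one (video -> text),
    # and of ranking a wrong video above the true one (text -> video).
    cost_t = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v = np.maximum(0.0, margin + sim - pos[None, :])
    mask = 1.0 - np.eye(n)  # zero out the positive pairs
    cost_t *= mask
    cost_v *= mask
    if hard_negatives:
        # VSE++-style: penalize only the hardest negative per query.
        return float(cost_t.max(axis=1).sum() + cost_v.max(axis=0).sum())
    # Classic formulation: sum over all negatives.
    return float(cost_t.sum() + cost_v.sum())
```

When all positives beat every negative by more than the margin, the loss is zero; otherwise the hard-negative variant concentrates the gradient on the single most confusing negative per query, which is the key difference from summing over all negatives.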


Metadata
Title
Joint embeddings with multimodal cues for video-text retrieval
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Publication date
12.01.2019
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2019
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-018-00166-3
