Published in: International Journal of Multimedia Information Retrieval 1/2019

12.01.2019 | Regular Paper

Joint embeddings with multimodal cues for video-text retrieval

Authors: Niluthpol C. Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury


Abstract

For multimedia applications, a joint representation that carries information from multiple modalities can greatly benefit downstream use cases. In this paper, we study how to effectively utilize the multimodal cues available in videos to learn joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are very limited in size relative to the enormous diversity of the visual world, which makes it extremely difficult to build a robust video-text retrieval system on deep neural network models. We therefore propose a framework that simultaneously utilizes multiple visual cues through a "mixture of experts" approach to retrieval. In addition, we propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Extensive experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
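The abstract does not spell out the modified pairwise ranking loss. As a rough illustration only, here is a minimal NumPy sketch of the standard bidirectional max-margin ranking loss used for joint video-text embeddings, with an optional hard-negative variant in the style of VSE++ (Faghri et al. 2018), which this line of work builds on. The function name, the `hard_negatives` flag, and the default margin are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def pairwise_ranking_loss(sim, margin=0.2, hard_negatives=True):
    """Bidirectional max-margin ranking loss over a video-text similarity matrix.

    sim[i, j] is the similarity between video i and caption j; the diagonal
    holds the matching (positive) pairs.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    # Cost of ranking a wrong caption above the true one (video -> text),
    # and of ranking a wrong video above the true one (text -> video).
    cost_t = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v = np.maximum(0.0, margin + sim - pos[None, :])
    mask = 1.0 - np.eye(n)  # zero out the positive pairs
    cost_t *= mask
    cost_v *= mask
    if hard_negatives:
        # VSE++-style: penalize only the hardest negative per query.
        return float(cost_t.max(axis=1).sum() + cost_v.max(axis=0).sum())
    # Classic formulation: sum over all negatives.
    return float(cost_t.sum() + cost_v.sum())
```

When all positives beat every negative by more than the margin, the loss is zero; otherwise the hard-negative variant concentrates the gradient on the single most confusing negative per query, which is the key difference from summing over all negatives.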


Metadata
Title
Joint embeddings with multimodal cues for video-text retrieval
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Publication date
12.01.2019
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2019
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-018-00166-3
