
01.06.2023 | Trends & surveys

Deep learning for video-text retrieval: a review

Authors: Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo, Yu Liu

Published in: International Journal of Multimedia Information Retrieval | Issue 1/2023

Abstract

Video-Text Retrieval (VTR) aims to find the video most relevant to the semantics of a given sentence, and vice versa. In general, the task comprises four successive steps: video feature extraction, textual feature extraction, feature embedding and matching, and objective functions. In the last step, a list of samples retrieved from the dataset is ranked by their matching similarity to the query. In recent years, deep learning techniques have brought significant progress; nevertheless, VTR remains challenging because of problems such as learning efficient spatio-temporal video features and narrowing the cross-modal gap. In this survey, we review and summarize over 100 research papers on VTR, report state-of-the-art performance on several commonly used benchmark datasets, and discuss potential challenges and directions, with the aim of providing insights for researchers in the field of video-text retrieval.
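To make the pipeline outlined above concrete, the following is a minimal sketch of a dual-encoder retrieval model in PyTorch. It assumes pre-extracted frame and sentence features; the module names, feature dimensions, and the symmetric InfoNCE objective are illustrative assumptions, not the specific method of any surveyed paper.

```python
# Illustrative dual-encoder sketch of the VTR pipeline: embed both
# modalities into a shared space, train with a contrastive objective,
# then rank videos by similarity to a text query. All names and
# dimensions here are assumptions for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Toy dual encoder: projects pre-extracted video and text
    features into a shared embedding space for matching."""
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text branch

    def forward(self, video_feats, text_feats):
        # Mean-pool frame features over time, project, and L2-normalize
        # so that dot products become cosine similarities.
        v = F.normalize(self.video_proj(video_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def info_nce_loss(v, t, temperature=0.05):
    """Symmetric contrastive objective over in-batch negatives;
    matched video-text pairs sit on the diagonal of the logits."""
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Text-to-video retrieval: rank all videos by similarity to one query.
model = DualEncoder()
video_feats = torch.randn(100, 32, 2048)  # 100 videos, 32 frames each
text_feats = torch.randn(100, 768)        # 100 sentence embeddings
v, t = model(video_feats, text_feats)
loss = info_nce_loss(v, t)                        # training objective
ranking = (t[0] @ v.T).argsort(descending=True)   # ranked list for query 0
```

In the methods surveyed below, the mean pooling and linear projections are replaced by stronger encoders (e.g., 3D CNNs, Transformers, or CLIP backbones), and the objective may be a triplet ranking loss instead of InfoNCE, but the embed-match-rank structure is the same.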


Metadata
Title
Deep learning for video-text retrieval: a review
Authors
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
Publication date
01.06.2023
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2023
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-023-00267-8
