
01.06.2023 | Trends & surveys

Deep learning for video-text retrieval: a review

Authors: Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo, Yu Liu

Published in: International Journal of Multimedia Information Retrieval | Issue 1/2023

Abstract

Video-Text Retrieval (VTR) aims to find the video most relevant to the semantics of a given sentence, and vice versa. In general, the task comprises four successive steps: video feature extraction, textual feature extraction, feature embedding and matching, and objective functions. In the last step, a list of samples retrieved from the dataset is ranked by their matching similarity to the query. In recent years, deep learning techniques have brought significant progress; nevertheless, VTR remains challenging because of problems such as learning efficient spatio-temporal video features and narrowing the cross-modal gap. In this survey, we review and summarize over 100 research papers on VTR, report state-of-the-art performance on several commonly used benchmark datasets, and discuss potential challenges and directions, with the aim of providing insights for researchers in the field of video-text retrieval.
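To make the pipeline outlined above concrete, the following is a minimal sketch of a dual-encoder retrieval model in PyTorch. It assumes pre-extracted frame and sentence features; the module names, feature dimensions, and the symmetric InfoNCE objective are illustrative assumptions, not the specific method of any surveyed paper.

```python
# Illustrative dual-encoder sketch of the VTR pipeline: embed both
# modalities into a shared space, train with a contrastive objective,
# then rank videos by similarity to a text query. All names and
# dimensions here are assumptions for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Toy dual encoder: projects pre-extracted video and text
    features into a shared embedding space for matching."""
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text branch

    def forward(self, video_feats, text_feats):
        # Mean-pool frame features over time, project, and L2-normalize
        # so that dot products become cosine similarities.
        v = F.normalize(self.video_proj(video_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def info_nce_loss(v, t, temperature=0.05):
    """Symmetric contrastive objective over in-batch negatives;
    matched video-text pairs sit on the diagonal of the logits."""
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Text-to-video retrieval: rank all videos by similarity to one query.
model = DualEncoder()
video_feats = torch.randn(100, 32, 2048)  # 100 videos, 32 frames each
text_feats = torch.randn(100, 768)        # 100 sentence embeddings
v, t = model(video_feats, text_feats)
loss = info_nce_loss(v, t)                        # training objective
ranking = (t[0] @ v.T).argsort(descending=True)   # ranked list for query 0
```

In the methods surveyed below, the mean pooling and linear projections are replaced by stronger encoders (e.g., 3D CNNs, Transformers, or CLIP backbones), and the objective may be a triplet ranking loss instead of InfoNCE, but the embed-match-rank structure is the same.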


Metadata
Title
Deep learning for video-text retrieval: a review
Authors
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
Publication date
01.06.2023
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2023
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-023-00267-8
