2018 | Original Paper | Book Chapter

Find and Focus: Retrieve and Localize Video Events with Natural Language Queries

Authors: Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao, Dahua Lin

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing


Abstract

The thriving of video sharing services brings new challenges to video retrieval, e.g. the rapid growth in video duration and content diversity. Meeting such challenges calls for new techniques that can effectively retrieve videos with natural language queries. Existing methods along this line, which mostly embed each video as a whole, remain far from satisfactory for real-world applications due to their limited expressive power. In this work, we aim to move beyond this limitation by delving into the internal structures of both sides, the queries and the videos. Specifically, we propose a new framework called Find and Focus (FIFO), which not only performs top-level matching (paragraph vs. video) but also makes part-level associations, localizing a video clip for each sentence in the query with the help of a focusing guide. These two levels are complementary: the top-level matching narrows the search, while the part-level localization refines the results. On both the ActivityNet Captions and modified LSMDC datasets, the proposed framework achieves remarkable performance gains (Project Page: https://ycxioooong.github.io/projects/fifo).
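The coarse-to-fine retrieval described in the abstract can be sketched as follows. This is a minimal illustration under assumed conditions (precomputed embeddings that are comparable by cosine similarity, a simple additive score combination), not the paper's actual model; all function and variable names here are hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between a query vector `a` and each row of matrix `b`.
    a = a / (np.linalg.norm(a) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return b @ a

def find_and_focus(paragraph_emb, sentence_embs, video_embs, clip_embs, top_k=2):
    """Two-stage retrieval: coarse video ranking, then per-sentence clip matching.

    paragraph_emb : (d,) embedding of the whole query paragraph
    sentence_embs : list of (d,) embeddings, one per query sentence
    video_embs    : (n_videos, d) top-level video embeddings
    clip_embs     : list of (n_clips_i, d) clip embeddings, one array per video
    """
    # Find: shortlist videos by top-level paragraph-vs-video similarity.
    top_scores = cosine(paragraph_emb, video_embs)
    shortlist = np.argsort(-top_scores)[:top_k]

    results = []
    for v in shortlist:
        # Focus: localize the best-matching clip for each sentence within video v.
        clip_scores = [cosine(s, clip_embs[v]) for s in sentence_embs]
        best_clips = [int(np.argmax(cs)) for cs in clip_scores]
        part_score = float(np.mean([cs.max() for cs in clip_scores]))
        # Combine coarse and fine scores to re-rank the shortlisted videos.
        results.append((int(v), float(top_scores[v]) + part_score, best_clips))
    results.sort(key=lambda r: -r[1])
    return results
```

The sketch only captures the division of labor between the two levels: the top-level match prunes the candidate set cheaply, and the part-level clip localization both refines the ranking and yields a clip index per query sentence.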


Footnotes
1
The technical details of this scheme are provided in the supplemental materials.
 
Metadata
Title
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries
Authors
Dian Shao
Yu Xiong
Yue Zhao
Qingqiu Huang
Yu Qiao
Dahua Lin
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01240-3_13