Published in: Neural Processing Letters 3/2020

24.02.2020

Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding

By: Zhou Yu, Yijun Song, Jun Yu, Meng Wang, Qingming Huang

Abstract

Video grounding aims to temporally localize an action in an untrimmed video referred to by a natural-language query, and plays an important role in fine-grained video understanding. Given temporal proposals of limited granularity, the task is challenging in that it requires both fusing multi-modal features from queries and videos effectively and localizing the referred action accurately. For multi-modal feature fusion, we present an Intra- and Inter-modal Multilinear pooling (IIM) model that effectively combines multi-modal features while considering both intra- and inter-modal feature interactions. Compared to existing multi-modal fusion models, IIM captures high-order interactions and is better suited to modeling the temporal information of videos. For action localization, we propose a simple yet effective multi-task learning framework that simultaneously predicts the action label, alignment score, and refined location in an end-to-end manner. Experimental results on the real-world TACoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.
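The abstract's fusion idea can be made concrete. Below is a minimal sketch, assuming PyTorch, of low-rank bilinear pooling between a query embedding and a video-proposal feature, the family of multilinear interaction that models like IIM build on; the module name, the dimensions, and the factor rank k are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearFusion(nn.Module):
    """Illustrative low-rank bilinear pooling of a query embedding and a
    video-proposal feature. A sketch of the kind of multilinear interaction
    IIM builds on, not the paper's implementation; dim_q, dim_v, dim_out,
    and the factor rank k are assumptions."""

    def __init__(self, dim_q=1024, dim_v=1024, dim_out=512, k=5):
        super().__init__()
        self.k = k
        self.proj_q = nn.Linear(dim_q, dim_out * k)  # query -> k rank factors
        self.proj_v = nn.Linear(dim_v, dim_out * k)  # video -> k rank factors

    def forward(self, q, v):
        # The Hadamard product of the two projections captures pairwise
        # (second-order) interactions between the modalities.
        joint = self.proj_q(q) * self.proj_v(v)               # (B, dim_out * k)
        # Sum-pool over the k rank factors to obtain the fused feature.
        fused = joint.view(joint.size(0), -1, self.k).sum(2)  # (B, dim_out)
        # Signed square-root and L2 normalization, as is common for
        # bilinear-pooling models.
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-12)
        return F.normalize(fused, dim=-1)

q = torch.randn(8, 1024)  # e.g., a sentence embedding of the query
v = torch.randn(8, 1024)  # e.g., a C3D feature of one temporal proposal
z = LowRankBilinearFusion()(q, v)  # (8, 512) fused multimodal feature
```

In the same spirit, an intra-modal variant would pool a modality's features against themselves (e.g., video features of neighboring clips), which is one way temporal context can enter the fused representation.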


Footnotes
1
The strategy used here is as follows: if a sentence is detected to contain more than one verb, we prefer the verb tagged 'VBZ' (i.e., a verb in the third-person singular present), since the subjects of the queries are usually in the third person singular (e.g., 'the person', 'she', or 'he'). If no 'VBZ' is detected, we choose the last verb as the representative verb, since the remaining verbs are likely to describe the subject rather than the action.
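As a concrete reading of this heuristic, the following is a minimal sketch assuming NLTK for POS tagging; the footnote does not name a tagger, so the library choice and function name are illustrative:

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def representative_verb(sentence):
    """Sketch of the footnote's verb-selection heuristic (assumed reading)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    verbs = [(word, tag) for word, tag in tagged if tag.startswith("VB")]
    if not verbs:
        return None
    # Prefer a verb in the 3rd person singular present ('VBZ'), since query
    # subjects are typically 'the person', 'she', or 'he'.
    for word, tag in verbs:
        if tag == "VBZ":
            return word
    # Otherwise fall back to the last verb: earlier verbs tend to describe
    # the subject rather than the action.
    return verbs[-1][0]

print(representative_verb("A person sitting at the table picks up a sandwich."))
# -> 'picks' ('sitting' is tagged VBG and passed over in favor of the VBZ verb)
```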
 
Metadata
Title
Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding
By
Zhou Yu
Yijun Song
Jun Yu
Meng Wang
Qingming Huang
Publication date
24.02.2020
Publisher
Springer US
Published in
Neural Processing Letters / Issue 3/2020
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-020-10205-y
