Published in: Neural Processing Letters 3/2020

24.02.2020

Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding

By: Zhou Yu, Yijun Song, Jun Yu, Meng Wang, Qingming Huang

Abstract

Video grounding aims to temporally localize an action in an untrimmed video referred to by a natural-language query, and plays an important role in fine-grained video understanding. Given temporal proposals of limited granularity, the task is challenging in that it requires both fusing multi-modal features from queries and videos effectively and localizing the referred action accurately. For multi-modal feature fusion, we present an Intra- and Inter-modal Multilinear pooling (IIM) model that effectively combines multi-modal features while considering both intra- and inter-modal feature interactions. Compared to existing multi-modal fusion models, IIM captures high-order interactions and is better suited to modeling the temporal information of videos. For action localization, we propose a simple yet effective multi-task learning framework that simultaneously predicts the action label, alignment score, and refined location in an end-to-end manner. Experimental results on the real-world TACoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.
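The abstract's fusion idea can be made concrete. Below is a minimal sketch, assuming PyTorch, of low-rank bilinear pooling between a query embedding and a video-proposal feature, the family of multilinear interaction that models like IIM build on; the module name, the dimensions, and the factor rank k are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearFusion(nn.Module):
    """Illustrative low-rank bilinear pooling of a query embedding and a
    video-proposal feature. A sketch of the kind of multilinear interaction
    IIM builds on, not the paper's implementation; dim_q, dim_v, dim_out,
    and the factor rank k are assumptions."""

    def __init__(self, dim_q=1024, dim_v=1024, dim_out=512, k=5):
        super().__init__()
        self.k = k
        self.proj_q = nn.Linear(dim_q, dim_out * k)  # query -> k rank factors
        self.proj_v = nn.Linear(dim_v, dim_out * k)  # video -> k rank factors

    def forward(self, q, v):
        # The Hadamard product of the two projections captures pairwise
        # (second-order) interactions between the modalities.
        joint = self.proj_q(q) * self.proj_v(v)               # (B, dim_out * k)
        # Sum-pool over the k rank factors to obtain the fused feature.
        fused = joint.view(joint.size(0), -1, self.k).sum(2)  # (B, dim_out)
        # Signed square-root and L2 normalization, as is common for
        # bilinear-pooling models.
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-12)
        return F.normalize(fused, dim=-1)

q = torch.randn(8, 1024)  # e.g., a sentence embedding of the query
v = torch.randn(8, 1024)  # e.g., a C3D feature of one temporal proposal
z = LowRankBilinearFusion()(q, v)  # (8, 512) fused multimodal feature
```

In the same spirit, an intra-modal variant would pool a modality's features against themselves (e.g., video features of neighboring clips), which is one way temporal context can enter the fused representation.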


Footnotes
1
The strategy used here is as follows: if a sentence is detected to contain more than one verb, we prefer the verb tagged 'VBZ' (i.e., a verb in the third-person singular present), since the subjects of the queries are usually in the third person singular (e.g., 'the person', 'she', or 'he'). If no 'VBZ' is detected, we choose the last verb as the representative verb, since the remaining verbs are likely to describe the subject rather than the action.
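As a concrete reading of this heuristic, the following is a minimal sketch assuming NLTK for POS tagging; the footnote does not name a tagger, so the library choice and function name are illustrative:

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def representative_verb(sentence):
    """Sketch of the footnote's verb-selection heuristic (assumed reading)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    verbs = [(word, tag) for word, tag in tagged if tag.startswith("VB")]
    if not verbs:
        return None
    # Prefer a verb in the 3rd person singular present ('VBZ'), since query
    # subjects are typically 'the person', 'she', or 'he'.
    for word, tag in verbs:
        if tag == "VBZ":
            return word
    # Otherwise fall back to the last verb: earlier verbs tend to describe
    # the subject rather than the action.
    return verbs[-1][0]

print(representative_verb("A person sitting at the table picks up a sandwich."))
# -> 'picks' ('sitting' is tagged VBG and passed over in favor of the VBZ verb)
```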
 
Metadata
Title
Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding
By
Zhou Yu
Yijun Song
Jun Yu
Meng Wang
Qingming Huang
Publication date
24.02.2020
Publisher
Springer US
Published in
Neural Processing Letters / Issue 3/2020
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-020-10205-y
