Published in: International Journal of Computer Vision, Issue 10/2019

18.06.2019

Video Question Answering with Spatio-Temporal Reasoning

Authors: Yunseok Jang, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, Gunhee Kim


Abstract

Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among the many tasks in this line of research, visual question answering (VQA) has been one of the most successful: the goal is to learn a model that understands visual content at region-level detail and associates it with pairs of questions and answers in natural language. Despite rapid progress over the past few years, most existing work in VQA has focused primarily on images. In this paper, we extend VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA, named TGIF-QA, that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM-based approach with both spatial and temporal attention and show its effectiveness over conventional VQA techniques through empirical evaluations.
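
As a rough illustration of the dual-LSTM idea mentioned in the abstract, the sketch below encodes per-frame CNN features and question word embeddings with separate LSTMs, applies temporal attention over the video states, and fuses the attended video context with the question representation to predict an answer. It is a minimal PyTorch sketch under assumed names and dimensions, not the authors' implementation (which additionally employs spatial attention over frame regions).

# Minimal sketch (assumptions, not the authors' model): dual LSTMs plus
# question-conditioned temporal attention over precomputed frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLSTMVideoQA(nn.Module):
    def __init__(self, frame_dim=2048, word_dim=300, hidden=512, num_answers=1000):
        super().__init__()
        self.video_lstm = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.text_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.att_proj = nn.Linear(hidden * 2, hidden)
        self.att_score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden * 2, num_answers)

    def forward(self, frames, words):
        # frames: (B, T, frame_dim) per-frame CNN features
        # words:  (B, L, word_dim) question word embeddings
        v_out, _ = self.video_lstm(frames)        # (B, T, H) video states
        _, (q_h, _) = self.text_lstm(words)       # final question state
        q = q_h[-1]                               # (B, H)

        # Temporal attention: score each frame state against the question.
        q_exp = q.unsqueeze(1).expand(-1, v_out.size(1), -1)
        scores = self.att_score(torch.tanh(self.att_proj(torch.cat([v_out, q_exp], dim=-1))))
        alpha = F.softmax(scores, dim=1)          # (B, T, 1) attention weights
        v_att = (alpha * v_out).sum(dim=1)        # (B, H) attended video context

        # Fuse attended video context with the question representation.
        return self.classifier(torch.cat([v_att, q], dim=-1))

# Example usage with random tensors standing in for real features.
model = DualLSTMVideoQA()
logits = model(torch.randn(2, 16, 2048), torch.randn(2, 10, 300))
print(logits.shape)  # torch.Size([2, 1000])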

Metadata
Title
Video Question Answering with Spatio-Temporal Reasoning
Authors
Yunseok Jang
Yale Song
Chris Dongjoo Kim
Youngjae Yu
Youngjin Kim
Gunhee Kim
Publication date
18.06.2019
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2019
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-019-01189-x
