2025 | OriginalPaper | Chapter

Vamos: Versatile Action Models for Video Understanding

Authors: Shijie Wang, Qi Zhao, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

Published in: Computer Vision – ECCV 2024

Publisher: Springer Nature Switzerland

Abstract

What makes good representations for video understanding, such as anticipating future activities or answering video-conditioned questions? While earlier approaches focus on end-to-end learning directly from video pixels, we propose to revisit text-based representations, such as general-purpose video captions, which are interpretable and can be directly consumed by large language models (LLMs). Intuitively, different video understanding tasks may require representations that are complementary and at different granularities. To this end, we propose versatile action models (Vamos), a learning framework powered by a large language model serving as the “reasoner”, which can flexibly leverage visual embeddings and free-form text descriptions as its input. To interpret the important text evidence for question answering, we generalize the concept bottleneck model to work with tokens and nonlinear models; it uses hard attention to select a small subset of tokens from the free-form text as inputs to the LLM reasoner. We evaluate Vamos on five complementary benchmarks, Ego4D, NeXT-QA, IntentQA, Spacewalk-18, and EgoSchema, for its capability to model temporal dynamics, encode visual history, and perform reasoning. Surprisingly, we observe that text-based representations consistently achieve competitive performance on all benchmarks, and that visual embeddings provide marginal or no performance improvement, demonstrating the effectiveness of text-based video representation in the LLM era. We also demonstrate that our token bottleneck model is able to select relevant evidence from free-form text, supports test-time intervention, and achieves a nearly 5× inference speedup while maintaining competitive question answering performance. Code and models are publicly released at https://brown-palm.github.io/Vamos/.
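
To make the token-bottleneck idea concrete, the sketch below shows one plausible way to implement hard top-k selection over caption-token embeddings in PyTorch. It is an illustrative reconstruction based only on the abstract, not the authors' released code; the class name, the linear scorer, and the straight-through-style gradient trick are all assumptions.

import torch
import torch.nn as nn


class TokenBottleneck(nn.Module):
    """Keep only the k highest-scoring caption tokens as input to the reasoner."""

    def __init__(self, dim: int, k: int = 32):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(dim, 1)  # per-token relevance score (assumed form)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, dim) embeddings of the free-form caption
        scores = self.scorer(tokens).squeeze(-1)        # (batch, seq_len)
        top_idx = scores.topk(self.k, dim=-1).indices   # hard top-k selection
        # Straight-through-style surrogate: the forward value equals `tokens`
        # exactly, but gradients can reach the scorer through `soft`.
        soft = scores.softmax(dim=-1).unsqueeze(-1)
        surrogate = tokens * soft + (tokens * (1.0 - soft)).detach()
        selected = torch.gather(
            surrogate, 1,
            top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)),
        )
        return selected, top_idx  # pass `selected` on to the LLM reasoner


# Toy usage with random caption embeddings (shapes only; no pretrained model).
bottleneck = TokenBottleneck(dim=768, k=8)
caption_tokens = torch.randn(2, 120, 768)
kept, idx = bottleneck(caption_tokens)
print(kept.shape, idx.shape)  # torch.Size([2, 8, 768]) torch.Size([2, 8])

In the paper's framing, the selected tokens (optionally alongside visual embeddings) would then be fed to the LLM reasoner; exposing the selection indices is also what makes test-time intervention possible, since one can inspect or edit which caption tokens are kept.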

Metadata
Title
Vamos: Versatile Action Models for Video Understanding
Authors
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-73254-6_9
