05-09-2024 | Research

Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data

Authors: Peijin Xie, Bingquan Liu

Published in: Cognitive Computation

Abstract

Does a model demonstrate genuine proficiency in “item counting,” “color recognition,” or other Fundamental Visual Comprehension Capabilities (FVCCs)? The multimodal field has advanced remarkably: pretrained general Vision Language Models (VLMs) exhibit strong performance across a range of intricate vision-language (VL) tasks, and Multimodal Large Language Models (MLLMs) display emergent visual reasoning abilities from only a few examples. Yet models tend to struggle when confronted with texts supplemented with specific details expressed by simple visual phrases. Moreover, existing datasets lack the quantity, variety, and composability needed to evaluate each FVCC with statistical metrics. Accordingly, we decomposed complete VL tasks into 9 M simple Visual Phrase Triplets (VPTs) drawn from structural scene graphs, spanning 16 categories that represent 16 distinct FVCCs. We then reconstructed a Multilevel Scene Graph (MLSG) for each image and introduced an unbiased, balanced, and binary Visual Phrase Entailment benchmark with 20 times the data volume of SNLI-VE. The benchmark comprises three exams and evaluates 8 widely used VLMs and 10 MLLMs. The results characterize the performance of each model across the 16 FVCC classes, as well as their lower and upper limits under increased text complexity or unnoised image input. Finally, we improved MLLM efficiency and evoked their in-context learning behavior, without any tuning, by appending multiple VPT-generated QA pairs of the same type to the conversation history. The proposed structural VPTs and MLSG data hold promise for facilitating future explorations of FVCC.
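
To make the construction described above concrete, the following minimal Python sketch shows how a (subject, predicate, object) relationship from a scene graph can be read as a Visual Phrase Triplet and paired with a perturbed variant to form a balanced, binary entailment example. This is not the authors' released pipeline; the scene_graph structure and the helper names vpt_to_sentence and make_entailment_pairs are hypothetical placeholders used only for illustration.

import random

# Toy scene-graph fragment standing in for a Multilevel Scene Graph (MLSG):
# (subject, predicate, object) relations plus simple attributes.
scene_graph = {
    "relations": [("man", "riding", "horse"), ("dog", "next to", "man")],
    "attributes": [("horse", "color", "brown"), ("dog", "count", "2")],
}

def vpt_to_sentence(subj, pred, obj):
    """Render one Visual Phrase Triplet as a short textual hypothesis."""
    return f"A {subj} is {pred} a {obj}."

def make_entailment_pairs(relations, rng=random.Random(0)):
    """Build balanced binary entailment examples: one entailed hypothesis
    taken directly from the graph and one contradicted hypothesis produced
    by a naive object swap (illustration only)."""
    objects = [o for _, _, o in relations]
    pairs = []
    for subj, pred, obj in relations:
        pairs.append((vpt_to_sentence(subj, pred, obj), "entailment"))
        distractors = [o for o in objects if o != obj] or [obj]
        pairs.append((vpt_to_sentence(subj, pred, rng.choice(distractors)),
                      "contradiction"))
    return pairs

if __name__ == "__main__":
    for hypothesis, label in make_entailment_pairs(scene_graph["relations"]):
        print(f"{label:13s} {hypothesis}")

In the same spirit, the in-context learning setup mentioned in the abstract can be pictured as prepending several such (hypothesis, label) pairs of one type to the conversation history before asking about a new hypothesis, with no parameter updates.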

Metadata
Title
Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data
Authors
Peijin Xie
Bingquan Liu
Publication date
05-09-2024
Publisher
Springer US
Published in
Cognitive Computation
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-024-10351-8
