
05.09.2024 | Research

Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data

Authors: Peijin Xie, Bingquan Liu

Published in: Cognitive Computation


Abstract

Does a model demonstrate genuine proficiency in "item counting," "color recognition," or other Fundamental Visual Comprehension Capabilities (FVCCs)? Multimodal research has advanced remarkably: pretrained general Vision Language Models (VLMs) perform strongly across a range of intricate Vision-Language (VL) tasks, and Multimodal Large Language Models (MLLMs) exhibit emergent visual reasoning abilities from only a few examples. Yet models still tend to struggle when texts are supplemented with specific details expressed as simple visual phrases. Moreover, there is a scarcity of datasets with sufficient quantity, variety, and composability to evaluate each FVCC with statistical metrics. Accordingly, we decompose the complete VL task into 9M simple Visual Phrase Triplets (VPTs) drawn from structural scene graphs, spanning 16 categories that represent 16 distinct FVCCs. We then reconstruct a Multilevel Scene Graph (MLSG) for each image and introduce an unbiased, balanced, and binary Visual Phrase Entailment benchmark with 20 times the data volume of SNLI-VE. The benchmark comprises three exams and evaluates 8 widely used VLMs and 10 MLLMs. The results reveal the performance of each model across the 16 FVCC classes, as well as their lower and upper limits under increased text complexity or unnoised image input. Finally, we improve MLLM efficiency and evoke their In-Context Learning behavior by appending multiple VPT-generated QA pairs of the same type to the conversation history, without any tuning. The proposed structural VPTs and MLSG data hold promise for facilitating future explorations of FVCCs.
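To make the construction concrete, the following is a minimal illustrative sketch (in Python) of the kind of pipeline the abstract describes: simple visual phrase triplets are read off a structural scene graph, verbalized as short declarative phrases, and paired with a corrupted counterpart to yield balanced binary entailment labels. The data layout, field names, and corruption strategy here are assumptions made for illustration; this is not the authors' released code or the actual MLSG format.

import random

# Toy scene graph for one image; field names are hypothetical.
scene_graph = {
    "objects": {
        "o1": {"name": "cat", "attributes": ["black"]},
        "o2": {"name": "sofa", "attributes": ["red"]},
    },
    "relations": [("o1", "sitting on", "o2")],
}

def extract_vpts(graph):
    """Extract simple visual phrase triplets from a scene graph."""
    vpts = []
    for subj_id, pred, obj_id in graph["relations"]:
        subj = graph["objects"][subj_id]["name"]
        obj = graph["objects"][obj_id]["name"]
        vpts.append(("relation", subj, pred, obj))
    for obj in graph["objects"].values():
        for attr in obj["attributes"]:
            vpts.append(("attribute", obj["name"], "is", attr))
    return vpts

def to_statement(kind, subj, pred, obj):
    """Verbalize a triplet as a simple declarative visual phrase."""
    if kind == "attribute":
        return f"The {subj} is {obj}."
    return f"The {subj} is {pred} the {obj}."

def make_entailment_pairs(vpts, distractors, seed=0):
    """One true and one false statement per triplet -> balanced binary labels.
    The false statement swaps the triplet's tail for a distractor; in practice
    distractors would be drawn from the same category (object vs. attribute)."""
    rng = random.Random(seed)
    examples = []
    for kind, subj, pred, obj in vpts:
        examples.append((to_statement(kind, subj, pred, obj), 1))    # entailed
        wrong = rng.choice([d for d in distractors if d != obj])
        examples.append((to_statement(kind, subj, pred, wrong), 0))  # contradicted
    return examples

if __name__ == "__main__":
    pairs = make_entailment_pairs(extract_vpts(scene_graph),
                                  distractors=["dog", "table", "blue", "white"])
    for text, label in pairs:
        print(label, text)

Run on the toy graph, the sketch prints, for each triplet, a label-1 statement generated directly from the graph and a label-0 statement whose tail has been replaced by a distractor, which is the balanced true/false structure a binary entailment benchmark needs.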


Metadata
Title
Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data
Authors
Peijin Xie
Bingquan Liu
Publication date
05.09.2024
Publisher
Springer US
Published in
Cognitive Computation
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-024-10351-8